Reading text from html source

Marius M

I want to extract the text from an HTML source.

C#:
using System;
using System.IO;
using System.Net;
using System.Threading;
using System.Runtime.InteropServices;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        public static void Main(string[] args)
        {
            string outputpath = Console.ReadLine();
            using (StreamWriter sw = new StreamWriter(outputpath))
            {
                string inputpathstart = "https://literat.ug.edu.pl/faraon/";
                string inputpath;
                int noofchapters = 0;
                inputpath = inputpathstart + String.Empty;
                try
                {
                    WebRequest request = HttpWebRequest.Create(inputpath);
                    WebResponse response = request.GetResponse();
                    System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-2");
                    using (StreamReader sr = new StreamReader(response.GetResponseStream(), enc))
                    {
                        String text = sr.ReadToEnd();
                        while (text.Contains((noofchapters + 1).ToString("D3") + ".htm"))
                            noofchapters++;
                    }
                }
                catch (Exception e)
                {
                    Console.WriteLine("The file could not be read");
                    Console.WriteLine(e.Message);
                }
                Console.WriteLine("{0} ", noofchapters);
                for (int i = 1; i <= noofchapters; i++)
                {
                    inputpath = inputpathstart + i.ToString("D3") + ".htm";
                    try
                    {
                        WebRequest request = HttpWebRequest.Create(inputpath);
                        WebResponse response = request.GetResponse();
                        System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-2");
                        using (StreamReader sr = new StreamReader(response.GetResponseStream(), enc))
                        {
                            String text = sr.ReadToEnd();
                            text = System.Web.HttpUtility.HtmlDecode(text);
                            text = RemoveHTMLTagsCompiled(text);
                            sw.WriteLine(text);
                            sw.WriteLine();
                        }
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine("The file could not be read");
                        Console.WriteLine(e.Message);
                    }
                }
            }
            Console.WriteLine("Press Enter to exit");
            Console.ReadKey();
        }

        public static string RemoveHTMLTagsCompiled(string html)
        {
            string s;
            Regex htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);
            s = htmlRegex.Replace(html, string.Empty);
            return s;
        }
    }
}

In the resulting text file I have:

<meta name="Keywords"
content="Prus, Bolesław Prus, Głowacki, Aleksander Głowacki, Faraon, powieść, powieść polska, roman, polish roman, literary masterpiece, kultura polska, polish culture, Polska, Poland">

Why did the function with the regular expression not delete this?
How can I fix this?
 
By default, the "." in a regular expression does not match a newline, so a pattern such as <.*?> cannot match a tag that spans multiple lines. Those look to be two lines.
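
If you want to stay with a regular expression for now, here is a minimal sketch of one possible fix. It assumes the only change needed is to let the dot match newlines, which RegexOptions.Singleline does. The class name TagStripper and the sample string are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // RegexOptions.Singleline makes '.' match '\n' as well,
    // so a tag that is split across two lines is still matched.
    private static readonly Regex TagRegex =
        new Regex("<.*?>", RegexOptions.Compiled | RegexOptions.Singleline);

    public static string StripTags(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample: a meta tag broken over two lines, as in the question.
        string html = "<meta name=\"Keywords\"\ncontent=\"Prus, Faraon\">Tekst";
        Console.WriteLine(StripTags(html)); // prints "Tekst"
    }
}
```

A pattern such as <[^>]*> should behave the same way without the extra option, because a negated character class matches newlines too. Either way this only strips tags; it is still not an HTML parser.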

As an aside, the common response you'll get is that if you are using regular expressions to parse HTML, you are doing it wrong. In this case, it looks like you are trying to use regular expressions to remove HTML tags. You might be able to get away with it for simple HTML, but you'll likely run into problems in the general case, where the input is valid HTML that your regular expressions can't handle.

A better solution is to use the HTML Agility Pack which is a much more robust library for parsing HTML.
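
As a rough sketch of what that could look like (assuming the HtmlAgilityPack NuGet package is referenced): load the markup into an HtmlDocument, read InnerText from the document node, and decode entities with HtmlEntity.DeEntitize. The class name HapTextExtractor and the sample markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class HapTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes and leaves out the tags
        // and their attributes.
        string text = doc.DocumentNode.InnerText;

        // Turn entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(text);
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        Console.WriteLine(ExtractText("<p>Hello &amp; <b>world</b></p>"));
    }
}
```

In the original program this would take the place of the HtmlDecode and RemoveHTMLTagsCompiled calls. Script, style, or comment content may still need extra filtering, so treat this as a starting point rather than a finished extractor.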
 
I found this function on the internet.
I do not remember exactly where, but they claimed that it should work.
Is there documentation for the Agility Pack with some examples?
 
You can use the NuGet Package Manager in Visual Studio to install the HTML Agility Pack.

Here's a direct link to the NuGet package: HtmlAgilityPack 1.11.61
From that link you can find a link to the web site that has more details about the library: Html Agility Pack
As well as the source code: GitHub - zzzprojects/html-agility-pack

I suggest starting with the web site.

Most modern projects have links on their GitHub or GitLab repositories to a wiki with documentation, as well as to bugs or issues, where there are sometimes tidbits of information that don't make it into the documentation.
 
Yes, I downloaded it and installed it with NuGet, but the problem is that the program cannot find the DLL unless it is in the same folder as program.exe.
For one program that is acceptable, but when I have multiple programs using the Agility Pack there will be a problem with inefficient memory usage.
Is there a solution other than setting an environment variable?
 
No, it will not be inefficient for memory. Windows knows if it is the same DLL, and so it will keep only one copy for all the programs that use it.

Now, if it is disk space rather than memory that you are worried about, the modern mantra is that "disk space is cheap". Each program can have its own copy installed adjacent to it. If it is really a concern, you can copy it to a common location that all the programs can access. Just modify the assembly search path of your program by putting that location in your program's config file and/or manifest. You don't set an environment variable for the assembly search path.

But know that there is a reason why Microsoft encourages each program to carry along its own dependencies: DLL Hell. Look it up if you didn't live through it.
 
Yes, disk space.
Is it possible to generate the config file / manifest file automatically from the command line, or must I do it manually?
I prefer to compile simple programs like this from the command line.
 
Personally, though, I think this is a case of premature optimization.

For .NET Framework: just over half a megabyte of disk space with debug symbols. You will likely never use the debugging symbols, so you can delete the .PDB and be at less than 400 KB.

For .NET: less than 400 KB of disk space.
 
Actually, I should have asked first whether you are targeting .NET Framework or .NET.
 
.NET Framework.

"To create an assembly, you can use the Al.exe (Assembly Linker) tool with a command such as the following:"
Al.exe /link:asm6.exe.config /out:policy.3.0.asm6.dll /keyfile: compatkey.dat /v:3.0.0.0
but when I type Al on the command line I get

'Al' is not recognized as an internal or external command,
operable program or batch file.

Where can it be found (in what location on the system)?

The system search does not work well.
 
If you launch the "Developer Command Prompt" or the "Developer PowerShell", it will set up the PATH environment variable within the shell to include the location of the primary .NET Framework tool chain.

Alternatively, if you run where.exe /r \ al.exe you can also find it.

And if you don't have Visual Studio installed, you can install the .NET Framework SDK or the Win10 or higher SDK. That will also bring all the tools with it, but again you'll need to set up your PATH environment variable to point to the appropriate place.
 
"If it really a concern, you can copy it to a common location that all the programs can access it from there. Just modify the assembly search path of your program by putting that location in with your program config file and/or manifest. You don't set an environment variable for the assembly search path."

Yes, that is what I want.
Could you show me how to create the config file and manifest file using the command line?
It would also be useful for other programs that use DLLs from NuGet packages.
 
I don't think there is a way to generate one using the command line (for .NET Framework). You will need to create the file using a text editor.

Or, if you insist on the command line, you can use sed (if you have access to it somehow). But you still have to provide the contents that you feed into sed or any other command-line tool anyway.
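
For reference, here is a hedged sketch of what such a hand-written file might look like for .NET Framework. It would be saved as program.exe.config next to program.exe. As far as I know, a probing privatePath entry only covers subfolders of the application's own folder, so for a shared folder elsewhere on the disk a codeBase entry is the usual route, and that requires a strong-named assembly. The folder C:\SharedLibs is hypothetical, and the publicKeyToken and version values are assumptions that must be checked against the actual DLL (for example with sn.exe -T HtmlAgilityPack.dll).

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- Assumption: verify the token against your copy of the DLL. -->
        <assemblyIdentity name="HtmlAgilityPack"
                          publicKeyToken="bd319b19eaf3b43a"
                          culture="neutral" />
        <!-- Assumption: version must equal the assembly version of the DLL;
             C:/SharedLibs is a hypothetical shared folder. -->
        <codeBase version="1.11.61.0"
                  href="file:///C:/SharedLibs/HtmlAgilityPack.dll" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```

Each program that should load the shared copy needs its own .exe.config with the same entry, so the saving is in DLL copies, not in config files.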
 