Reading text from HTML source

Marius M · May 3, 2024

I want to extract the text from an HTML source.

C#:

using System;
using System.IO;
using System.Net;
using System.Threading;
using System.Runtime.InteropServices;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        public static void Main(string[] args)
        {
            string outputpath = Console.ReadLine();
            using (StreamWriter sw = new StreamWriter(outputpath))
            {
                string inputpathstart = "https://literat.ug.edu.pl/faraon/";
                string inputpath;
                int noofchapters = 0;
                inputpath = inputpathstart + String.Empty;
                try
                {
                    WebRequest request = HttpWebRequest.Create(inputpath);
                    WebResponse response = request.GetResponse();
                    System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-2");
                    using (StreamReader sr = new StreamReader(response.GetResponseStream(), enc))
                    {
                        String text = sr.ReadToEnd();
                        while (text.Contains((noofchapters + 1).ToString("D3") + ".htm"))
                            noofchapters++;
                    }
                }
                catch (Exception e)
                {
                    Console.WriteLine("The file could not be read");
                    Console.WriteLine(e.Message);
                }
                Console.WriteLine("{0} ", noofchapters);
                for (int i = 1; i <= noofchapters; i++)
                {
                    inputpath = inputpathstart + i.ToString("D3") + ".htm";
                    try
                    {
                        WebRequest request = HttpWebRequest.Create(inputpath);
                        WebResponse response = request.GetResponse();
                        System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-2");
                        using (StreamReader sr = new StreamReader(response.GetResponseStream(), enc))
                        {
                            String text = sr.ReadToEnd();
                            text = System.Web.HttpUtility.HtmlDecode(text);
                            text = RemoveHTMLTagsCompiled(text);
                            sw.WriteLine(text);
                            sw.WriteLine();
                        }
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine("The file could not be read");
                        Console.WriteLine(e.Message);
                    }
                }
            }
            Console.WriteLine("Press Enter to exit");
            Console.ReadKey();
        }

        public static string RemoveHTMLTagsCompiled(string html)
        {
            string s;
            Regex htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);
            s = htmlRegex.Replace(html, string.Empty);
            return s;
        }
    }
}

In the resulting text file I still have:

<meta name="Keywords"
content="Prus, Bolesław Prus, Głowacki, Aleksander Głowacki, Faraon, powieść, powieść polska, roman, polish roman, literary masterpiece, kultura polska, polish culture, Polska, Poland">

Why did the function with the regular expression not delete this?
How can I fix it?

Skydiver · May 3, 2024

By default, the . in a regular expression does not match a newline, so a pattern like <.*?> cannot match a tag that spans multiple lines. That tag looks to be split across two lines.
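A minimal sketch of one way to let the pattern from the question cross line breaks: in .NET, the . character does not match \n unless RegexOptions.Singleline is set. The class name TagStripper and the sample input are made up for illustration, and this only addresses the multi-line tag; it is still not a general HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Singleline makes '.' match '\n' as well, so a tag that is
    // split across two lines is still matched by "<.*?>".
    private static readonly Regex TagRegex =
        new Regex("<.*?>", RegexOptions.Compiled | RegexOptions.Singleline);

    public static string RemoveHtmlTags(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // Hypothetical input modelled on the two-line <meta> tag from the question.
        string html = "<meta name=\"Keywords\"\ncontent=\"Prus, Faraon\">\n<p>Rozdział 1</p>";
        Console.WriteLine(RemoveHtmlTags(html));
    }
}
```

Keeping the Regex in a static field also avoids recompiling it for every chapter, which the original function does on each call.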

As an aside, the common response you'll get is that if you are using regular expressions to parse HTML, you are doing it wrong. In this case, it looks like you are trying to use regular expressions to remove HTML tags. You might be able to get away with it for simple HTML, but you'll likely run into problems in the general case, where the input file is valid HTML but your regular expressions won't be able to handle it.

A better solution is to use the HTML Agility Pack, which is a much more robust library for parsing HTML.
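As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced: the document is parsed with HtmlDocument.LoadHtml, the text is read from DocumentNode.InnerText, and entities are decoded with HtmlEntity.DeEntitize. The class name HapTextExtractor and the sample input are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // from the HtmlAgilityPack NuGet package

public static class HapTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes of the document;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        // Hypothetical input with a <meta> tag split across two lines.
        string html =
            "<html><head><meta name=\"Keywords\"\ncontent=\"Prus, Faraon\"></head>" +
            "<body><p>Tom 1 &amp; 2</p></body></html>";
        Console.WriteLine(ExtractText(html));
    }
}
```

InnerText on the whole document also includes the contents of elements such as script and style, so for real pages it may be worth selecting a narrower node first, for example with doc.DocumentNode.SelectSingleNode("//body").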

John J Doe · May 4, 2024

I found this function on the internet.
I do not remember exactly where, but they claimed that it should work.
Is there documentation for this Agility Pack with some examples?

Skydiver · May 4, 2024

You can use the NuGet Package Manager in Visual Studio to install the HTML Agility Pack.

Here's a direct link to the NuGet package: HtmlAgilityPack 1.11.61
From that link you can find a link to the web site that has more details about the library: Html Agility Pack
As well as the source code: GitHub - zzzprojects/html-agility-pack: Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.

I suggest starting with the web site.

Most modern projects will have links on their GitHub or GitLab repositories to a wiki with documentation, as well as to bugs or issues, where there are sometimes tidbits of information that don't make it into the documentation.

John J Doe · May 4, 2024

Yes, I downloaded it and installed it with NuGet, but
the problem is that the program cannot find the DLL unless it is in the same folder as program.exe.
For one program that is acceptable, but when I have multiple programs using this Agility Pack, there will be a problem with inefficient memory usage.
Is there a solution other than setting an environment variable?

Skydiver · May 4, 2024

No, it will not be inefficient for memory. Windows knows if it is the same DLL, and so it will keep only one copy in memory for the programs that use it.

Now, if it is disk space rather than memory that you are worried about, the modern mantra is that "disk space is cheap". Each program can have its own copy installed adjacent to it. If it is really a concern, you can copy it to a common location that all the programs can access. Just modify the assembly search path of your program by putting that location in your program's config file and/or manifest. You don't set an environment variable for the assembly search path.

But know that there is a reason why Microsoft encourages each program to carry along its own dependencies: DLL Hell. Look it up if you didn't live through it.

John J Doe · May 4, 2024

Yes, disk space.
Is it possible to generate the config file / manifest file automatically from the command line,
or must I do it manually?
I prefer to compile simple programs like this from the command line.

John J Doe · May 4, 2024

I typed csc -help in the console and found
/win32manifest:<file>
I will play with it and report what I get.

Skydiver · May 4, 2024

Actually, I should have asked first whether you are targeting .NET Framework, or .NET.

If you are targeting .NET Framework: How the Runtime Locates Assemblies - .NET Framework

If you are targeting .NET, this is more relevant: Default probing - .NET Core - .NET

Skydiver · May 4, 2024

Personally, though, I think this is a case of premature optimization.

For .NET Framework:

Just over half a megabyte of disk space with debug symbols. You will likely never use the debugging symbols, so you can delete the .PDB and be at less than 400 KB.

For .NET:

Less than 400 KB of disk space.

John J Doe · May 4, 2024

Skydiver said:
Actually, I should have asked first whether you are targeting .NET Framework, or .NET.

.NET Framework

"To create an assembly, you can use the Al.exe (Assembly Linker) tool with a command such as the following:"
Al.exe /link:asm6.exe.config /out:policy.3.0.asm6.dll /keyfile: compatkey.dat /v:3.0.0.0
but when I type Al at the command line I get

'Al' is not recognized as an internal or external command,
operable program or batch file.

Where can it be found (in what location on the system)?

The system search does not work well.

Skydiver · May 4, 2024

If you launch the "Developer Command Prompt" or the "Developer PowerShell", it will set up the PATH environment variable within the shell to include the location of the primary .NET Framework tool chain.

Alternatively, if you run where.exe /r \ al.exe you can also find it.

And if you don't have Visual Studio installed, you can install the .NET Framework SDK or the Windows 10 (or later) SDK. That will also bring all the tools with it, but again you'll need to set up your PATH environment variable to point to the appropriate place.

John J Doe · May 6, 2024

Skydiver said:
If it is really a concern, you can copy it to a common location that all the programs can access. Just modify the assembly search path of your program by putting that location in your program's config file and/or manifest. You don't set an environment variable for the assembly search path.

Yes, that is what I want.
Could you show me how to generate the config file and manifest file from the command line?
It would also be useful for other programs using DLLs from NuGet packages.

Skydiver · May 6, 2024

I don't think there is a way to generate one using the command line (for .NET Framework). You will need to create the file using a text editor.

Or, if you insist on the command line, you can use sed (if you have access to it somehow). But you would still have to provide the contents to feed into sed or any other command-line tool anyway.
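For reference, a minimal sketch of what such a hand-written config file could look like on .NET Framework. It assumes the file is named after the executable (for example Program.exe.config next to Program.exe), and the folder name lib is made up. Note that privatePath only accepts subdirectories of the application's base directory; pointing at a location outside it would instead need a codeBase element, which for paths outside the application directory requires a strong-named assembly.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <!-- "lib" is a hypothetical subfolder of the folder containing the .exe;
           the runtime also probes it when resolving referenced assemblies. -->
      <probing privatePath="lib" />
    </assemblyBinding>
  </runtime>
</configuration>
```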
