Regular expression to extract data from an HTML table with TD marked with id

indexonindia · Jul 15, 2019

I have a method that returns the HTML content as a string when using the WebBrowser control in a C# Windows application. I need to extract specific data from the HTML table TD, which is identified by an id. Could someone help me extract the data easily, by any method? Thanks.

<table class="userList w990 marginTop10">
<tbody>
<tr>
<th class="w195 whiteFont leftAlign">Status</th>
<td class="even width150" id="status">Active</td>
<th class="w195 whiteFont leftAlign">Name</th>
<td class="even" id="name"> NATESAN</td>
</tr>
</tbody>
</table>

NoUserHere · Jul 15, 2019

You don't need regex. You just need to get the element by tag name. I'm assuming you want the inner text, but since you didn't clarify, let me give you an example that gets the source from csharpforums.net and iterates over it to get the inner text. While I don't think this is the best way to do it, it is what you asked for, just without regex, which isn't necessary. Instead, you could also use WebClient.DownloadString (WebClient.DownloadString Method (System.Net)) and then parse the returned markup yourself.

As an aside, you will need to check that your page has loaded fully before you run the function. You can use the DocumentCompleted and Navigating events to set a bool for this. If the bool is false, the function won't run until the completed event has fired. I've commented the code for you, so I don't need to explain further.

C#:

using System;
using System.Collections.Generic;
using System.Windows.Forms;

namespace TestCSharpApp
{
    public partial class Form1 : Form
    {
        private bool CanExecute; //Checks for navigation later
        private List<string> ListOfElems = new List<string>();

        public Form1()
        {
            InitializeComponent();
            webBrowser1.Navigate("https://csharpforums.net/"); //Navigate first to the page with the values you want
        }

        private void WebBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) //Has doc completed?
        {
            CanExecute = true;
        }
        private void WebBrowser1_Navigating(object sender, WebBrowserNavigatingEventArgs e) //Are we about to navigate?
        {
            CanExecute = false;
        }
        private string RunSearch(string HtmlClass)
        {
            var Element = webBrowser1.Document.GetElementsByTagName("a"); //"a" is the anchor (link) tag
            foreach (HtmlElement Meta in Element) //Iterate and search
            {
                if (Meta.GetAttribute("className") == HtmlClass) //Exact match on the element's class attribute
                //The trailing space in "menu-linkRow u-indentDepth0 js-offCanvasCopy " is part of the value in the page source, so pass it in as-is.
                //Do not change "className", instead pass in your html class value.
                {
                    var result = Meta.InnerText;
                    if (result != null && result.Contains("Current visitors")) //Check if it contains the value you are looking for
                        return result; //Return the first match
                }
            }
            return string.Empty; //No matching element to return
        }

        private void Button1_Click(object sender, EventArgs e)
        {
            if (CanExecute == true) //Check if page completed navigating first
            {
                var retValue = RunSearch("menu-linkRow u-indentDepth0 js-offCanvasCopy "); //Run the function

                if (retValue != string.Empty) //Check that the value was found
                {
                    Console.WriteLine(retValue); //Do as you please with your value
                }
            }
            else
                MessageBox.Show("Wait for the page to finish navigating first.");
        }
    }
}

The above is tested and working, and will return one result for each request you send it. Or you could change it around a little to get all the elements with the HTML class you send to the function. Change the function to this:

C#:

        private List<string> RunSearch(string HtmlClass)
        {
            ListOfElems.Clear(); //Start fresh so repeated clicks don't accumulate duplicates
            var Element = webBrowser1.Document.GetElementsByTagName("a"); //"a" is the anchor (link) tag
            foreach (HtmlElement Meta in Element) //Iterate and search
            {
                if (Meta.GetAttribute("className") == HtmlClass) //Exact match on the element's class attribute
                //The trailing space in "menu-linkRow u-indentDepth0 js-offCanvasCopy " is part of the value in the page source, so pass it in as-is.
                //Do not change "className", instead pass in your html class value.
                {
                    var result = Meta.InnerText;
                    ListOfElems.Add(result); //Add them to a class for reuse later, I've just used a list above
                }
            }
            return ListOfElems; //Every match collected (empty if none were found)
        }

And change the initiating button handler to this, so that it accepts the different return type from the main snippet above, since we will now be returning a list of the elements collected. Like this:

C#:

        private void Button1_Click(object sender, EventArgs e)
        {
            if (CanExecute == true) //Check if page completed navigating first
            {
                RunSearch("menu-linkRow u-indentDepth0 js-offCanvasCopy "); //Run the function; it fills ListOfElems
                ListOfElems.ForEach(delegate (string EachVal) //Print each collected value
                {
                    Console.WriteLine(EachVal);
                });
            }
            else
                MessageBox.Show("Wait for the page to finish navigating first.");
        }

Output:

Current visitors
Current visitors
New posts
Search forums
New posts
New profile posts
Latest activity
Current visitors
New profile posts
Search profile posts
New posts
Search forums
New posts
New profile posts
Latest activity
Current visitors
New profile posts
Search profile posts
New posts
Search forums
New posts
New profile posts
Latest activity
Current visitors
New profile posts
Search profile posts

Html Source Code I Searched:

<a href="/search/?type=profile_post"
        class="menu-linkRow u-indentDepth0 js-offCanvasCopy "
       
       
        data-nav-id="searchProfilePosts">Search profile posts</a>

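To add: in your case the tag would be td (the th cells only hold the labels), and the class value you would pass in is even. Note that the comparison is exact, so "even" matches the name cell but not the status cell, whose class is "even width150".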

Hope this helps.

NoUserHere · Jul 15, 2019

Also cross-posted here: "regular expression to extract data from a html table with TD", and here: "Parsing HTML Table in C#".

JohnH · Jul 15, 2019

indexonindia said:
I need to extract a specific data from the html table TD which has an id to specify

GetElementById will get you that:

C#:

var text = webBrowser1.Document.GetElementById("status").InnerText;

NoUserHere · Jul 15, 2019

That'll work in cases where an ID is actually present, just as our OP's HTML above has an ID from which to get the inner text using GetElementById.

Actually, the only reason I did it the way I did above is that, in all the hundreds of times I've answered this type of question on other boards, the OP would normally want to retrieve more than one ID from a page, and it's easier to loop through the class attributes for that, in my opinion.
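For the OP's table specifically, where each value cell carries its own id, reading several ids in one pass could look roughly like the sketch below. It assumes the WebBrowser control has finished loading the page; the names TableCellReader and ReadCellsById are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

public static class TableCellReader
{
    // Hypothetical helper: maps each td element's id to its trimmed inner text.
    // Cells without an id are skipped; a null document yields an empty map.
    public static Dictionary<string, string> ReadCellsById(HtmlDocument document)
    {
        var cells = new Dictionary<string, string>();
        if (document == null)
            return cells; // Page not loaded yet

        foreach (HtmlElement cell in document.GetElementsByTagName("td"))
        {
            if (string.IsNullOrEmpty(cell.Id))
                continue; // Only cells marked with an id are of interest

            // InnerText can be null for an empty cell, so fall back to ""
            cells[cell.Id] = (cell.InnerText ?? string.Empty).Trim();
        }
        return cells;
    }
}
```

Called as TableCellReader.ReadCellsById(webBrowser1.Document) from the button handler, this should give entries keyed "status" and "name" for the table in the first post.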

For future readers: if you don't have an ID and can't change the HTML source of the page, the way I demonstrated above will also work without an ID being present. But to add, it's advised to guard any code where GetElementById is used to read inner text from an HTML page, either by checking the returned element for null or by putting that section of code in a try/catch block. If you don't, you will be faced with a null reference exception when the ID is not found. You should also note that if you need more than one ID, you will need to be more explicit in your extraction of the text.
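A minimal sketch of guarding the lookup with a null check instead of try/catch, relying on GetElementById returning null when no element has the requested id. The class and method names (BrowserText, GetTextById) are hypothetical.

```csharp
using System.Windows.Forms;

public static class BrowserText
{
    // Hypothetical helper: returns the element's inner text, or null when the
    // document has not loaded yet or no element has the given id.
    public static string GetTextById(HtmlDocument document, string id)
    {
        if (document == null)
            return null; // webBrowser1.Document is null before a page loads

        HtmlElement element = document.GetElementById(id);
        return element == null ? null : element.InnerText;
    }
}
```

Usage would be along the lines of var status = BrowserText.GetTextById(webBrowser1.Document, "status"); followed by a check for null before using the value.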

Skydiver · Jul 15, 2019

If you have a WebBrowserControl, make full use of it since it has already done the parsing. Use GetElementById() if you can.

If you don't have a WebBrowserControl handy, I highly recommend using the HTML Agility Pack. It's a lighter-weight (and more reliable) HTML parser than the web browser control.
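As a rough sketch of the HTML Agility Pack route, assuming the HtmlAgilityPack NuGet package is referenced and the page source is already available as a string (for example from the OP's method). The wrapper names AgilityPackCells and GetCellText are made up for illustration; LoadHtml and GetElementbyId (note the lowercase "b") are the library's own methods.

```csharp
using HtmlAgilityPack;

public static class AgilityPackCells
{
    // Hypothetical wrapper: parses the markup and returns the trimmed text of
    // the element with the given id, or null if no such element exists.
    public static string GetCellText(string html, string id)
    {
        var document = new HtmlDocument(); // HtmlAgilityPack's type, not the WinForms one
        document.LoadHtml(html);

        HtmlNode cell = document.GetElementbyId(id);
        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

With the table from the first post as input, AgilityPackCells.GetCellText(html, "status") would be the call for the status cell.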

Obligatory links:
Parsing Html The Cthulhu Way
which refers to: RegEx match open tags except XHTML self-contained tags:

... Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. ...

and
Regular Expressions: Now You Have Two Problems

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

JohnH · Jul 16, 2019

Sheepings said:
That'll work in cases where an ID is actually present

That was what OP specifically asked for.

Skydiver said:
If you don't have a WebBrowserControl handy, I highly recommend using the HTML Agility Pack.

I have used that a lot myself, but found a newer library, AngleSharp, better, especially since it has CSS query selectors (like jQuery).
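A sketch of the same lookup with a CSS selector, assuming the AngleSharp NuGet package in version 0.10 or later (where the parser method is named ParseDocument) and the page source held in a string. AngleSharpCells and GetCellText are illustrative names only.

```csharp
using AngleSharp.Html.Parser;

public static class AngleSharpCells
{
    // Hypothetical wrapper: parses the markup and returns the trimmed text of
    // the first element matching the CSS selector, or null if nothing matches.
    public static string GetCellText(string html, string cssSelector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        var cell = document.QuerySelector(cssSelector);
        return cell == null ? null : cell.TextContent.Trim();
    }
}
```

For the OP's table the selector would be "td#status", or "table.userList td.even" to target cells by class instead of by id.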

NoUserHere · Jul 16, 2019

JohnH said:
That was what OP specifically asked for.

Indeed, and I wasn't disagreeing. Merely highlighting the point that an ID needs to be present.
