Resolved: Crawling the price (why does an error always appear?)

Noob_The_ Jacky · Dec 21, 2020

method for crawler:

public static void webgrab(string http) {
    WebClient httpins = new WebClient();

    httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
    httpins.Headers.Add("Method", "GET");
    Stream resp = httpins.OpenRead(http);
    StreamReader resstring = new StreamReader(resp);
    string s = resstring.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(resp, Encoding.Default);
    HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
    foreach (HtmlNode i in pricenode)
    {
        Console.WriteLine(i.InnerText.Trim());
    }
    resstring.Close();
}

OK! That's my code. I have to say I don't know much, so I'm hoping to get help here.
I think it doesn't work because I fail to convert the stream to an HtmlDocument, or maybe the XPath can't find the right path in the HtmlNode. But I have checked it many times, and it is just the same as the working examples I found, except for the link and the specific info.
What could be going wrong?

Main:

 static void Main(string[] args)
        {
            webcliclassl.webgrab("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
        }

Here is the Main. It should be OK. The page I want to crawl is: http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300
The attachment shows the little piece of data I want!
Please tell me what is going wrong.

jmcilhinney · Dec 21, 2020

You need to debug the code. Set a breakpoint at the top of the code and then step through it line by line. Before each step, ask yourself what you expect to happen. After the step, check whether what you expected actually did happen. If it didn't then you have found an issue and you can investigate that specifically. If you can't work it out then at least you can provide us with all the relevant information.

Noob_The_ Jacky · Dec 21, 2020

Did you mean the try and catch block? I can't even get through to the end.
I am stuck at the point shown in the attachment. It says my pricenode is null.

jmcilhinney · Dec 21, 2020

If you would like us to help you with an issue and your code throws an exception and you know the error message and where it is thrown, that's the sort of information you should be providing at the outset.

If SelectNodes is returning null then that would suggest that your path doesn't match anything in the document, so you need to reexamine the document and reevaluate your path. Maybe you should grab the HTML from the page and then strip out everything that isn't relevant so that you can compare what's left to your specified path.
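A minimal sketch of guarding against that null, assuming HtmlAgilityPack is referenced. The class name `SafeSelect`, the method `TextOf`, and the inline HTML are hypothetical and only for illustration; they are not taken from the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class SafeSelect
{
    // Returns the trimmed inner text of every node matching xpath,
    // or an empty list when the XPath matches nothing.
    public static List<string> TextOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HtmlAgilityPack returns null (not an empty collection) when nothing matches,
        // so a foreach over the result would throw without this check.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        var result = new List<string>();
        if (nodes == null)
        {
            Console.WriteLine("No node matched: " + xpath);
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical stand-in for the downloaded page.
        string html = "<html><body><div> 123 </div></body></html>";

        foreach (string text in TextOf(html, "/html/body/div"))
        {
            Console.WriteLine(text);
        }

        // This path matches nothing; the method reports it instead of throwing.
        TextOf(html, "/html/body/table");
    }
}
```

With the check in place, a wrong path produces a readable message naming the path rather than a NullReferenceException inside the loop.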

Noob_The_ Jacky · Dec 21, 2020

Main:

using System;
using HtmlAgilityPack;
namespace http
{
    class Program
    {
        static void Main(string[] args)
        {
            webcliclassl.webgrab("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
        }
    }
}

Crawler class and method:

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace http
{
    class webcliclassl
    {
        public static void webgrab(string http) {

            try
            {
                WebClient httpins = new WebClient();

                httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                httpins.Headers.Add("Method", "GET");
                //a.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                Stream resp = httpins.OpenRead(http);
                StreamReader resstring = new StreamReader(resp);
                string s = resstring.ReadToEnd();
                HtmlDocument doc = new HtmlDocument();
                doc.Load(resp, Encoding.Default);    // convert to doc
                //string xpath = "/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]";///html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4] 
                HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
                foreach (HtmlNode i in pricenode)
                {
                    Console.WriteLine(i.InnerText.Trim());
                }
                resstring.Close();
            }
            catch (Exception e) {
                Console.WriteLine(e.Message);
            }
        }
    }
}

The XPath is copied directly from the web inspector, so it should be right (I believe).
I am able to get the stream; see attachment.
So I think the glitch must be in the HtmlDocument. When I try to print the HtmlDocument, I get what the attachment shows. I don't know whether that is bad, or whether C# just doesn't support printing that kind of object to the console.
Then, the next step is to select by XPath; see the result in the attachment. I even tried to keep the console open by adding "Console.ReadLine()", but the console still closes and the error comes up.

Skydiver · Dec 21, 2020

Use a simpler Xpath to grab a different piece of the page. If that also fails, then that may mean the Xpath generated by your tool does not match the Xpath expected by the HtmlDocument object.
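One way to apply this is to test the path one step at a time and see where it stops matching. The sketch below assumes HtmlAgilityPack and a simple absolute path with no "/" inside predicates; `XPathProbe`, `LongestMatchingPrefix`, and the inline HTML are hypothetical names and data used only for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class XPathProbe
{
    // Walks an absolute XPath step by step and returns the longest prefix
    // that still matches a node in the document ("" if even the first step fails).
    public static string LongestMatchingPrefix(HtmlDocument doc, string xpath)
    {
        string[] steps = xpath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        string matched = "";

        foreach (string step in steps)
        {
            string candidate = matched + "/" + step;
            if (doc.DocumentNode.SelectSingleNode(candidate) == null)
            {
                break; // this step is where the path and the document diverge
            }
            matched = candidate;
        }
        return matched;
    }

    static void Main()
    {
        // Hypothetical stand-in for the downloaded page: only two div elements under body.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div>a</div><div><span>x</span></div></body></html>");

        // The path asks for a third div, which does not exist in this document.
        string prefix = LongestMatchingPrefix(doc, "/html/body/div[3]/span");
        Console.WriteLine("Matched up to: " + prefix);
    }
}
```

The step right after the returned prefix is the one to compare against the page source.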

Also, have you considered that the page you are hitting may be dynamically generated in the browser? In that case the DOM you are seeing in the browser is not the DOM initially sent in response to the first HTTP GET.

JohnH · Dec 21, 2020

There are two problems:

1. Your code tries to read the response stream twice, but it can only be read once. Debugging shows that the HtmlDocument is empty.
2. If you use 'View source' (not the inspector) in the browser, you will see that it contains some scripts and what looks like only the HTML for a sitemap. The price value cannot be found when searching the source, which means it was loaded dynamically by script. This also means you must use a browser component to render the source before extracting the value.
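The first problem can be reproduced without any network access. This is a small sketch using only the standard library, with a MemoryStream standing in for the response stream (an assumption: the real stream from WebClient.OpenRead is not seekable, so it cannot be rewound at all). `StreamReadOnceDemo` and `ReadTwice` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamReadOnceDemo
{
    // Reads the same stream twice and returns the length of each result.
    public static (int firstLength, int secondLength) ReadTwice(string html)
    {
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            // The first read consumes the stream and leaves its position at the end.
            string first = new StreamReader(stream).ReadToEnd();

            // The second read starts at the end, so nothing is left to read.
            // This is the situation doc.Load(resp, ...) is in after ReadToEnd().
            string second = new StreamReader(stream).ReadToEnd();

            return (first.Length, second.Length);
        }
    }

    static void Main()
    {
        var lengths = ReadTwice("<html>price</html>");
        Console.WriteLine("first read:  " + lengths.firstLength + " chars");
        Console.WriteLine("second read: " + lengths.secondLength + " chars");
    }
}
```

Since the text is already in the string `s` after ReadToEnd, passing that string to `doc.LoadHtml(s)` avoids the second read entirely.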

Skydiver · Dec 21, 2020

I'll channel my good twin @Sheepings during his absence and say: "Why are you page scraping? You should be using an API offered by the site. If the site does not offer an API even after you contact the site owners, that means that the site owner does not want you pulling their data against their terms of service."

Noob_The_ Jacky · Dec 21, 2020

To be clear about what you mean by dynamically: do you mean the price would change without refreshing the page? No, I must press "F5" to refresh the price.
1. Using "WebClient" to get the link, I can't get anything through the XPath (it seems I can't get anything at all).
2. But when using "HtmlWeb", it seems to go through the whole process. Even though I still can't crawl the price, I am able to reach some useless words (see attachment).
Here is the code:

HtmlWeb:

using System;
using HtmlAgilityPack;
namespace http
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlWeb webClient = new HtmlWeb();
            HtmlDocument doc = webClient.Load("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("/html/body/form/div[5]/div[1]/div[2]/div[2]");        
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
            doc = null;
            nodes = null;
            webClient = null;
            Console.WriteLine("Completed.");
            Console.ReadLine();
        }      
    }
}

WebClient:

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace http
{
    class webcliclassl
    {
        public static void webgrab(string http) {

            try
            {
                WebClient httpins = new WebClient();

                httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                httpins.Headers.Add("Method", "GET");              
                Stream resp = httpins.OpenRead(http);
                StreamReader resstring = new StreamReader(resp);
                string s = resstring.ReadToEnd();
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(http);    // convert to doc
                //[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]
                HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
                foreach (HtmlNode i in pricenode)
                {
                    Console.WriteLine(i.InnerText.Trim());
                }
                resstring.Close();
            }
            catch (Exception e) {
                Console.WriteLine(e.Message);
            }
        }
    }
}

To be very honest, I am looking for someone who can modify my code, since I have no idea how to deal with the DOM.
Let me study your brilliant work and learn from your wisdom.

And a little remark:
It isn't that complex when I am using Python (I am not using XPath there; I don't know what that method is called, see attachment).
Is it possible to use that method to locate what I want in C#?

JohnH · Dec 21, 2020

By dynamically I mean that scripts on the page load data in the background, and only a browser can do that. The price you see is not in the source code of the page until the browser has loaded and executed the scripts.
You can use, for example, a Windows Forms application and the WebBrowser control for this.
Here is an example I found that gets the source from a WebBrowser: C# can I Scrape a webBrowser control for links?

Noob_The_ Jacky · Dec 26, 2020

You are too right. I finally solved this. I looked at the source and rewrote the XPath line by line, instead of copying it from the inspector.

Success!:

WebClient httpins = new WebClient();
httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
httpins.Headers.Add("Method", "GET");
Stream resp = httpins.OpenRead(http);
StreamReader resstring = new StreamReader(resp);
string s = resstring.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
HtmlNode pricenos = doc.DocumentNode.SelectSingleNode("/html/body/form/div[4]/div/div[4]/div[5]/table/tr/td/div[4]");
Console.WriteLine(pricenos.InnerText);
resstring.Close();

Thank you everyone for trying to help me !!!!!
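The working path no longer contains the `tbody` step that the inspector-copied path had. A likely reason, offered here as an assumption about this page: browsers insert a `tbody` element into tables when building the DOM even if the source has none, while HtmlAgilityPack parses the raw source and does not add it. The sketch below shows the difference on a hypothetical inline table; `TbodyCheck` and `Matches` are illustrative names.

```csharp
using System;
using HtmlAgilityPack;

static class TbodyCheck
{
    // Returns true when the XPath matches at least one node in the given HTML.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    static void Main()
    {
        // Hypothetical source with no <tbody>, as many hand-written tables are.
        string html = "<html><body><table><tr><td><div>123</div></td></tr></table></body></html>";

        // Path as a browser inspector would report it (the browser adds tbody to its DOM).
        Console.WriteLine("with tbody:    " + Matches(html, "/html/body/table/tbody/tr/td/div"));

        // Path written against the raw source, which is what HtmlAgilityPack sees.
        Console.WriteLine("without tbody: " + Matches(html, "/html/body/table/tr/td/div"));
    }
}
```

Checking the path against 'View source' rather than the inspector, as was done here, avoids this kind of mismatch.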

Skydiver · Dec 26, 2020

Congratulations!

In the future, please post your code in code tags, not as a screenshot.

JohnH · Dec 26, 2020

Looks like I was wrong about the dynamic content. It is possible the price had changed and I searched for the old value (the HTML source is quite large).
You can also use this simpler XPath:

C#:

var pricenos = doc.DocumentNode.SelectSingleNode("//div[@id='divContentContainer']//table//div[4]");
