[Resolved] Crawling a price (why does an error always appear?)

Joined: Dec 21, 2020 · Messages: 14 · Programming Experience: Beginner
Method for the crawler:
public static void webgrab(string http) {
    WebClient httpins = new WebClient();

    httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
    httpins.Headers.Add("Method", "GET");
    Stream resp = httpins.OpenRead(http);
    StreamReader resstring = new StreamReader(resp);
    string s = resstring.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(resp, Encoding.Default);
    HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
    foreach (HtmlNode i in pricenode)
    {
        Console.WriteLine(i.InnerText.Trim());
    }
    resstring.Close();
}
OK! That's my code. I have to say I don't know much, so I'm hoping to get help here.
I think it didn't work because I failed to convert the stream to an HtmlDocument, or maybe the XPath can't find the right node. But I have really checked so many times; it is the same as the working examples I found, except for the link and the specific info.
What could go wrong?
Main:
 static void Main(string[] args)
        {
            webcliclassl.webgrab("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
        }
Here is the Main; it should be OK. The page I want to crawl is: http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300
The attachment shows the little thing I want!
Tell me what's going wrong, please!
 

Attachments

  • iurgeit.png (32.1 KB)
You need to debug the code. Set a breakpoint at the top of the code and then step through it line by line. Before each step, ask yourself what you expect to happen. After the step, check whether what you expected actually did happen. If it didn't then you have found an issue and you can investigate that specifically. If you can't work it out then at least you can provide us with all the relevant information.
 
Did you mean the try/catch block? I can't even get through to the end.
I am stuck at the point shown in the attachment. It says my pricenode is NULL.
 

Attachments

  • umeanthisone.png (51.8 KB)
If you would like us to help you with an issue and your code throws an exception and you know the error message and where it is thrown, that's the sort of information you should be providing at the outset.

If SelectNodes is returning null, that suggests your path doesn't match anything in the document, so you need to re-examine the document and re-evaluate your path. Maybe you should grab the HTML from the page and then strip out everything that isn't relevant, so that you can compare what's left to your specified path.
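As a way to narrow that down, here is a minimal sketch (assuming the HtmlAgilityPack package) that probes a path from the outside in against a tiny stand-in document; the first step that comes back null is where your expression stops matching the parsed tree. One common mismatch: browser dev tools insert a `tbody` element under `table` that the raw HTML may not contain.

```csharp
using System;
using HtmlAgilityPack;

class XPathProbe
{
    static void Main()
    {
        // Tiny stand-in document; a real page is bigger but the
        // technique is the same.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><form><div><table><tr><td><div>12.3</div></td></tr></table></div></form></body></html>");

        // Probe from the outside in: the first probe that prints NULL
        // is where the copied path diverges from the parsed document.
        string[] probes =
        {
            "/html/body/form",
            "/html/body/form/div/table",
            "/html/body/form/div/table//div"
        };
        foreach (var path in probes)
        {
            var node = doc.DocumentNode.SelectSingleNode(path);
            Console.WriteLine($"{path} -> {(node == null ? "NULL" : "match")}");
        }
    }
}
```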
 
Main:
using System;
using HtmlAgilityPack;
namespace http
{
    class Program
    {
        static void Main(string[] args)
        {   
            webcliclassl.webgrab("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
         }
    }
}
Crawler class and method:
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace http
{
    class webcliclassl
    {
        public static void webgrab(string http) {

            try
            {
                WebClient httpins = new WebClient();

                httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                httpins.Headers.Add("Method", "GET");
                //a.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                Stream resp = httpins.OpenRead(http);
                StreamReader resstring = new StreamReader(resp);
                string s = resstring.ReadToEnd();
                HtmlDocument doc = new HtmlDocument();
                doc.Load(resp, Encoding.Default);    // convert to doc
                //string xpath = "/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]";///html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4] 
                HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
                foreach (HtmlNode i in pricenode)
                {
                    Console.WriteLine(i.InnerText.Trim());
                }
                resstring.Close();
            }
            catch (Exception e) {
                Console.WriteLine(e.Message);
            }
        }
    }
}

The XPath was directly copied from the web inspector, so it should be right (I believe).
I am able to get the stream; see attachment.
So I think the glitch must be in the HtmlDocument. When I try to print the HtmlDocument, it looks like the attachment. I don't know if that's bad, or maybe C# just doesn't support printing that kind of format to the console.
Thus, the next step is to select by XPath; see the result in the attachment. I even tried to keep the console open by adding Console.ReadLine(), but the console still closes and an error comes out.
 

Attachments

  • howigetxpath.png (96 KB)
  • goodwithstream.png (358.3 KB)
  • printhtmldoc.png (146.7 KB)
  • thatwhathappentome.png (215.6 KB)
Use a simpler XPath to grab a different piece of the page. If that also fails, it may mean the XPath generated by your tool does not match the XPath expected by the HtmlDocument object.

Also, have you considered that the page you are hitting is dynamically generated in the browser? The entire DOM you see in the browser may not be the DOM initially sent in response to the first HTTP GET.
 
There are two problems:
  • Your code tries to read the response stream twice, but it can only be read once. Debugging the HtmlDocument shows it is empty.
  • 'View source' (not the inspector) in the browser: you may see it contains some scripts and looks like only the HTML for a sitemap. The price value cannot be found when searching the source, which means it is loaded dynamically by script. This also means you must use a browser component to render the page before extracting the value.
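The first point can be sketched as follows: read the response stream exactly once into a string, then hand that string to LoadHtml instead of the exhausted stream. The `//div[@class='price']` selector below is a made-up placeholder, not the real path on the site.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class SingleRead
{
    static void Main()
    {
        var client = new WebClient();
        client.Headers.Add("User-Agent", "Mozilla/5.0");

        // Read the response exactly once...
        string html;
        using (Stream resp = client.OpenRead("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300"))
        using (var reader = new StreamReader(resp))
        {
            html = reader.ReadToEnd();   // the stream is fully consumed here
        }

        // ...then parse the string, not the already-drained stream.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical selector for illustration only.
        var node = doc.DocumentNode.SelectSingleNode("//div[@class='price']");
        Console.WriteLine(node?.InnerText.Trim() ?? "not found");
    }
}
```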
 
I'll channel my good twin @Sheepings during his absence and say: "Why are you page scraping? You should be using an API offered by the site. If the site does not offer an API even after you contact the site owners, that means that the site owner does not want you pulling their data against their terms of service."
 
To be clear about what you mean by "dynamically": do you mean the price would change without refreshing the page? No, I must press F5 to refresh the price.
1. Using WebClient to get the link, I can't get through the XPath (it seems I can't get anything).
2. But when it comes to using HtmlWeb, it seems to go through the whole process. Even though I still can't crawl the price, I am able to reach some useless words (see attachment).
Here is the code:
HtmlWeb:
using System;
using HtmlAgilityPack;
namespace http
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlWeb webClient = new HtmlWeb();
            HtmlDocument doc = webClient.Load("http://www.aastocks.com/tc/stocks/market/bmpfutures.aspx?future=200300");
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("/html/body/form/div[5]/div[1]/div[2]/div[2]");        
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
            doc = null;
            nodes = null;
            webClient = null;
            Console.WriteLine("Completed.");
            Console.ReadLine();
        }      
    }
}

WebClient:
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace http
{
    class webcliclassl
    {
        public static void webgrab(string http) {

            try
            {
                WebClient httpins = new WebClient();

                httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
                httpins.Headers.Add("Method", "GET");              
                Stream resp = httpins.OpenRead(http);
                StreamReader resstring = new StreamReader(resp);
                string s = resstring.ReadToEnd();
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(http);    // convert to doc (note: this parses the URL string itself, not the downloaded page)
                // /html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]
                HtmlNodeCollection pricenode = doc.DocumentNode.SelectNodes("/html/body/form/div[4]/div[1]/div[4]/div[5]/table/tbody/tr[1]/td[1]/div[4]");
                foreach (HtmlNode i in pricenode)
                {
                    Console.WriteLine(i.InnerText.Trim());
                }
                resstring.Close();
            }
            catch (Exception e) {
                Console.WriteLine(e.Message);
            }
        }
    }
}

To be very honest, I am looking for someone who can modify my code, since I have no idea how to deal with the DOM.
Let me study your brilliant work and learn from your wisdom.


And a little remark:
It isn't this complex when I use Python (I am not using XPath there; I don't know what to call that method, see attachment).
Is it possible to use that method to locate what I want in C#?
 

Attachments

  • uselesswords.png (144.6 KB)
  • applypricexpath.png (231.1 KB)
  • pythonscript.png (64.4 KB)
Last edited:
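On the Python remark: a BeautifulSoup-style find() has a rough C# equivalent in HtmlAgilityPack's Descendants() combined with LINQ, so no XPath is needed. The HTML snippet and the "price" class name below are invented for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class FindLikeSoup
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span class='price'>123.4</span></div>");

        // Rough equivalent of BeautifulSoup's soup.find('span', class_='price'):
        // walk all <span> descendants and keep the first with class "price".
        var node = doc.DocumentNode
                      .Descendants("span")
                      .FirstOrDefault(n => n.GetAttributeValue("class", "") == "price");

        Console.WriteLine(node?.InnerText ?? "not found");
    }
}
```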
By "dynamically" I mean there are scripts on the page that load data in the background, and only a browser can run them; the price you see is not in the source code of the page until the browser has loaded and executed the scripts.
You can use, for example, a Windows Forms application and the WebBrowser control for this.
Here is an example that gets the source from the WebBrowser control: C# can I Scrape a webBrowser control for links?
 
You are quite right. I finally solved this. I looked at the source and rewrote the XPath line by line instead of copying it.
Success!:
WebClient httpins = new WebClient();
httpins.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0");
httpins.Headers.Add("Method", "GET");
Stream resp = httpins.OpenRead(http);
StreamReader resstring = new StreamReader(resp);
string s = resstring.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
HtmlNode pricenos = doc.DocumentNode.SelectSingleNode("/html/body/form/div[4]/div/div[4]/div[5]/table/tr/td/div[4]");
Console.WriteLine(pricenos.InnerText);
resstring.Close();

Thank you, everyone, for trying to help me!
 

Attachments

  • success.png (291.3 KB)
Last edited:
Congratulations!

In the future, please post your code in code tags, not as a screenshot.
 
Looks like I was wrong about the dynamic content; it is possible the price changed and I searched for the old value (the HTML source is quite large).
You can also use this simpler xpath:
C#:
var pricenos = doc.DocumentNode.SelectSingleNode("//div[@id='divContentContainer']//table//div[4]");
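A small sketch of why an id-anchored XPath like the one above tends to survive layout changes better than a copied absolute path. The markup below is invented for illustration; only the `divContentContainer` id comes from the thread.

```csharp
using System;
using HtmlAgilityPack;

class IdAnchor
{
    static void Main()
    {
        // Invented markup: the id matters; the wrapper divs do not.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body><div><div id='divContentContainer'>" +
            "<table><tr><td><div>a</div><div>b</div><div>c</div><div>42.5</div></td></tr></table>" +
            "</div></div></body></html>");

        // Matches the 4th <div> child inside the table, wherever the
        // container sits in the page. An absolute /html/body/... path
        // would break if any wrapper div were added or removed.
        var node = doc.DocumentNode.SelectSingleNode(
            "//div[@id='divContentContainer']//table//div[4]");
        Console.WriteLine(node?.InnerText ?? "not found");
    }
}
```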
 
