Question Cannot Crawl Web Page Data Using ScrapySharp

gcfernando

Hi all,

I am facing a technical issue. I have browsed several articles looking for an answer, but I couldn't find a proper one on any website.
I am using ScrapySharp in my project to crawl web page data. The issue appears when I try to crawl data from http://edition.cnn.com/POLITICS.
First, I loaded the page in IE and opened Developer Tools to inspect the tags. Then I picked the element I need for my code, "//div[@class='cd__content']". However, when I load the same page through ScrapySharp:

C#:
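// Requires: using System; using ScrapySharp.Network; using HtmlAgilityPack;
ScrapingBrowser browser = new ScrapingBrowser();
// NavigateToPageAsync returns a Task<WebPage>, so either await it or use the synchronous call
WebPage rootPage = browser.NavigateToPage(new Uri(url));
HtmlNodeCollection rootNodes = rootPage.Html.SelectNodes("//div[@class='cd__content']");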

The result for rootNodes is null.

When I investigated further, I saw that the cd__content element sits inside a "SECTION" tag, and that this "SECTION" tag is empty when the page loads. When I inspect the page in IE or Chrome, all the tags are filled with content, which is why I was able to pick the element there; when I load the page programmatically, they are not.
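
One way to confirm this diagnosis is to count the XPath matches in the raw markup the server returns, before any script has run. The following is a minimal sketch, assuming the Html Agility Pack NuGet package is referenced; `RawHtmlCheck`, `CountMatches` and `FetchAndCountAsync` are hypothetical names used only for illustration.

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class RawHtmlCheck
{
    // Returns how many nodes in the given HTML string match the XPath expression.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        return nodes?.Count ?? 0;
    }

    // Downloads the page as plain text (no script execution) and counts the matches.
    public static async Task<int> FetchAndCountAsync(string url, string xpath)
    {
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(url);
            return CountMatches(html, xpath);
        }
    }
}
```

If `FetchAndCountAsync("http://edition.cnn.com/POLITICS", "//div[@class='cd__content']")` returns 0 while the browser's developer tools show the element, the content is most likely added by script after the initial response.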

My question is: how can I load the page with all of its content filled in using ScrapySharp?

Experts, please help with this.
 
It is probably loaded dynamically with JavaScript. I don't know ScrapySharp, but as with Html Agility Pack, a common way to handle this is to use a WebBrowser control, let it load and render the page, and then retrieve the source from there, for example from browser.Document.Body.InnerHtml.
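
A rough sketch of that WebBrowser approach, assuming a Windows Forms reference is available; `RenderedHtmlLoader` and `GetRenderedHtml` are hypothetical names, not part of ScrapySharp or Html Agility Pack. Content that scripts inject after the document reports itself complete may still be missing, and the control uses the installed Internet Explorer engine, so a modern site may render differently than in Chrome.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class RenderedHtmlLoader
{
    // Hypothetical helper: loads `url` in an off-screen WebBrowser control and
    // returns the body markup once the control reports the document as complete.
    public static string GetRenderedHtml(string url)
    {
        string html = null;

        // WebBrowser wraps an ActiveX control, so it needs an STA thread with a message loop.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // DocumentCompleted also fires for frames; wait for the whole page.
                    if (browser.ReadyState != WebBrowserReadyState.Complete) return;
                    html = browser.Document?.Body?.InnerHtml;
                    Application.ExitThread();
                };
                browser.Navigate(url);
                Application.Run(); // pump messages until ExitThread is called
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return html;
    }
}
```

The returned string can then be passed to Html Agility Pack (`HtmlDocument.LoadHtml` followed by `DocumentNode.SelectNodes`) with the same XPath as before.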
 
Like @JohnH, I have never worked with ScrapySharp (actually, I had never heard of it), but most decent frameworks have a way of waiting for the page load event to complete. You can also test whether this would work by temporarily putting a thread sleep in before getting your root nodes.
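
A fixed sleep can be swapped for a small polling loop that re-checks a condition until it holds or a timeout expires. This is only a sketch: `Polling.WaitUntil` is a hypothetical helper, not a ScrapySharp API. It helps only when whatever loaded the page keeps updating the document while you wait; a plain HTTP fetch returns the same markup no matter how long you sleep.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Polling
{
    // Re-evaluates `condition` every `interval` until it returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (clock.Elapsed >= timeout) return false;
            Thread.Sleep(interval); // blocks the calling thread between checks
        }
    }
}
```

For example, `Polling.WaitUntil(() => CountNodes() > 0, TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(500))`, where `CountNodes` stands for whatever code re-queries the document. Because the loop blocks its thread, it should not run on the thread that pumps messages for a WebBrowser control.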
 