Question Cannot Crawl Web Page Data Using ScrapySharp

gcfernando · Sep 9, 2017

Hi all,

I am facing a technical issue. I have browsed several articles looking for the answer, but I couldn’t find a proper one on any website.
I am using ScrapySharp in my project to crawl web page data. The issue arises when I try to crawl data from the http://edition.cnn.com/POLITICS website.
First, I loaded the page in IE and used the Developer Tools to inspect the tags. I then picked the tag I need for my code, //div[@class='cd__content']. However, when I load the same page through ScrapySharp:

C#:

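// Requires: using System; using HtmlAgilityPack; using ScrapySharp.Network;
ScrapingBrowser browser = new ScrapingBrowser();
// NavigateToPage is the synchronous call; NavigateToPageAsync returns a Task<WebPage> and would need await.
WebPage rootPage = browser.NavigateToPage(new Uri(url));
HtmlNodeCollection rootNodes = rootPage.Html.SelectNodes("//div[@class='cd__content']");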

rootNodes comes back as null.

Digging deeper, I found that the cd__content div sits inside a SECTION tag, and that SECTION tag is empty when the page first loads. When I inspect the page in IE or Chrome, all the tags are populated, which is why I was able to pick the element there, but when I load the page programmatically they are not.

My question is: how can I load the page with all of its information filled in using ScrapySharp?

Experts, please help with this.

JohnH · Sep 9, 2017

It is probably loaded dynamically with JavaScript. I don't know ScrapySharp, but as with Html Agility Pack, a common way to deal with that is to use a WebBrowser control, let it load and render the page, and then retrieve the source from there, for example from browser.Document.Body.InnerHtml.
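
A minimal sketch of that approach, assuming a Windows Forms reference and the HtmlAgilityPack package are available. The names RenderedPage and GetRenderedBody are made up for illustration and are not part of either library. The WebBrowser control needs an STA thread with a message loop, so the sketch runs it on a dedicated thread. There is no timeout handling, and DocumentCompleted can still fire before late script-injected content arrives, so treat this as a starting point rather than a guaranteed fix.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class RenderedPage
{
    // Loads the URL in a WinForms WebBrowser on its own STA thread and
    // returns the body markup once the document reports ReadyState Complete.
    public static string GetRenderedBody(string url)
    {
        string html = null;
        var worker = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event also fires for frames; wait for the whole page.
                    if (browser.ReadyState != WebBrowserReadyState.Complete) return;
                    html = browser.Document?.Body?.InnerHtml;
                    Application.ExitThread(); // ends the message loop started below
                };
                browser.Navigate(url);
                Application.Run(); // pump messages so the control can load and run scripts
            }
        });
        worker.SetApartmentState(ApartmentState.STA); // the control requires STA
        worker.Start();
        worker.Join();
        return html;
    }

    public static void Main()
    {
        string body = GetRenderedBody("http://edition.cnn.com/POLITICS");

        // Fully qualified because System.Windows.Forms also defines HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(body ?? string.Empty);

        var nodes = doc.DocumentNode.SelectNodes("//div[@class='cd__content']");
        Console.WriteLine(nodes == null ? "no matches" : nodes.Count + " matches");
    }
}
```

Parsing the rendered markup with Html Agility Pack keeps the original XPath unchanged; only the way the HTML is obtained differs from the ScrapySharp version.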

Tosa · Sep 11, 2017

Like @JohnH, I have never worked with ScrapySharp (actually, I had never heard of it), but most decent frameworks have a way of waiting for the page load event to complete. You can also test whether this would work by temporarily putting a thread sleep in before getting your root.
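
If a fixed sleep turns out to help, a polling wait with a timeout is usually less fragile. The sketch below is a generic helper, not something ScrapySharp provides: RenderWait, WaitForHtml, getHtml and isReady are hypothetical names, and getHtml stands in for whatever returns the current markup from a component that actually renders the page. Waiting only helps if that component runs the page's scripts.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class RenderWait
{
    // Polls getHtml until isReady accepts the snapshot or the timeout elapses.
    // Returns the accepted snapshot, or null if the timeout is reached first.
    public static string WaitForHtml(
        Func<string> getHtml,
        Func<string, bool> isReady,
        TimeSpan timeout,
        TimeSpan pollInterval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            string html = getHtml();
            if (html != null && isReady(html))
                return html;

            if (clock.Elapsed >= timeout)
                return null;

            Thread.Sleep(pollInterval); // back off before taking the next snapshot
        }
    }
}
```

A call might look like RenderWait.WaitForHtml(getHtml, h => h.Contains("cd__content"), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(250)), which returns as soon as the target class shows up in the markup instead of always sleeping for the full interval.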
