WebRequest and WebResponse have issues

complete


I wrote a C# program that uses WebRequest and WebResponse to implement a simple web crawler, and I discovered something about web sites. Web browsers such as IE and Firefox let you view the HTML source code, but it seems that the HTML sent to the browser is one thing and what the browser interprets and displays is something else. For example, if you run a Google search in IE and run the same search in Firefox, the source you view in IE will NOT contain the hyperlinks and content from the search results, but the source you view in Firefox will. So my question is this: how do you configure WebRequest and WebResponse to return the content after it has been processed by the browser instead of before?

One possible solution might be to use HttpWebRequest instead of WebRequest and use the UserAgent property to somehow trick the server into thinking I am using the Firefox browser. But this does not seem plausible to me.
 
First up, if you're making requests using HTTP then you're already using HttpWebRequest. The WebRequest class is just a base for other more specific types. The WebRequest.Create method will actually create the appropriate type based on the protocol in the URL you provide. Requests to web servers will produce an HttpWebRequest, requests to FTP servers will produce an FtpWebRequest and so on.
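To see this for yourself, here is a minimal sketch that asks WebRequest.Create for requests with two different URL schemes and reports the concrete type it hands back. The class name RequestKinds, the Describe helper and the example.com URLs are placeholders of mine, not anything from your crawler. Note that newer .NET versions mark these classes as obsolete in favour of HttpClient, so you may get a compiler warning there.

```csharp
using System;
using System.Net;

static class RequestKinds
{
    // Returns the name of the concrete type WebRequest.Create picks for the URL.
    // Creating the request does not contact the server; nothing is sent until
    // GetResponse (or GetRequestStream) is called.
    public static string Describe(string url)
    {
        WebRequest request = WebRequest.Create(url);
        return request.GetType().Name;
    }
}

class Program
{
    static void Main()
    {
        // Placeholder URLs, used only for their scheme.
        Console.WriteLine(RequestKinds.Describe("http://example.com/"));
        Console.WriteLine(RequestKinds.Describe("ftp://example.com/file.txt"));
    }
}
```

So declaring the variable as WebRequest or casting it to HttpWebRequest makes no difference to what is sent. The cast only gives you access to HTTP-specific members such as UserAgent.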

As for mimicking a particular browser, it will make no difference. The HTML that gets sent to each browser is generally pretty much the same and often exactly the same. What you see when you view source is simply what the authors of the browser have chosen to show you. In some cases you'll see the HTML of the original page that was loaded and in some cases you'll see the HTML for the page as it's currently displayed. As far as viewing source in a browser goes, the second option is probably the better of the two, but that doesn't mean that that HTML was actually received from a web server. It's quite possible that the HTML code that produced the page as you're currently viewing it is the result of the original page that was loaded plus the execution of some scripts. Those scripts might include some that run on document load, handlers for events raised by controls, integration of JSON data received from AJAX calls, etc. The page you're currently viewing is constructed by the browser from all those inputs.

Pretending your request came from a particular browser won't magically create that output because it's the browser, not the server, that produced the output. Without the browser to do it, you'd have to do it yourself. You would have to build an engine that will parse the page and load linked script files, parse script and execute it and make the additional AJAX calls. Basically, you'll be building a browser engine.
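If you want to check what the server actually sends, a rough sketch like the following sets a User-Agent and downloads the body exactly as received. The RawFetch class, the URL and the User-Agent string are all placeholders I made up for illustration. Whatever string you put in UserAgent, the text you get back is the page before any script has run, because nothing here executes JavaScript.

```csharp
using System;
using System.IO;
using System.Net;

static class RawFetch
{
    // Builds an HTTP request with a caller-supplied User-Agent header.
    // No network traffic happens here.
    public static HttpWebRequest Build(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        return request;
    }

    // Sends the request and returns the response body as text, exactly as
    // the server sent it. No scripts in the page are executed.
    public static string Download(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}

class Program
{
    static void Main()
    {
        // Placeholder URL and User-Agent string, not real browser values.
        HttpWebRequest request = RawFetch.Build("http://example.com/", "ExampleAgent/1.0");
        string html = RawFetch.Download(request);
        Console.WriteLine(html.Length + " characters of raw HTML received");
    }
}
```

Comparing that text with what each browser shows under view source should tell you which browser is showing the original page and which is showing the page after scripts have modified it.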
 
Just to be clear, my goal is to programmatically get search results from a Google search.

After looking closely at the different HTML source from IE and Firefox and seeing the point at which they start to differ, I can safely conclude that Firefox is showing the HTML before the browser runs the JavaScript and IE is showing the HTML that results after the JavaScript has been processed by the browser.

I think I will take your advice and use the WebRequest.Create method, but the question I now have is this: how do I tell this method that I want to create the WebRequest with the visibility set to hidden? I believe that therein might lie the key to getting this done. It seems that the JavaScript processes the HTML in IE because the visibility is not hidden.
 