scrape data from website with login

nkat · Member · Joined Jun 29, 2022 · Messages: 18 · Programming Experience: 1-3
Hello!
My goal is to scrape some data from the page 192.168.1.21/app/admin/directories.asp?id=username
The site requires a login before the directories can be queried, and that is the problem I'm trying to solve. I've read this post; it answers a lot of my questions but does not solve the problem completely.

Important note: the site at 192.168.1.21 offers three ways to log in:
  1. a button “recognize me automatically” that runs the JavaScript function loginNTLM()
  2. two fields, “username” and “password”, that expect the user's Active Directory credentials
  3. a field “superuser password” that needs only a password, which I know
Following the guidelines from the post, I installed Fiddler, captured the traffic while logging in with method #3, and got the following string in the TextView tab of the POST request: "login_maintenance=%24MAINTENANCE%5Croot&pwd_maintenance=BadDog22&url_redirect=%2Fwatchdoc%2Fadmin%2Fdefault.asp%3Fs%3DDEFAULT"
The post then suggests replicating the request in code. I took the example from the post and edited it as follows:
login into a website:
using System.Net;

var cookieContainer = new CookieContainer();

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://192.168.1.21/app/admin");
request.CookieContainer = cookieContainer;
//set the user agent and accept header values, to simulate a real web browser
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";


//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

Console.WriteLine("FIRST RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

request = (HttpWebRequest)HttpWebRequest.Create("http://192.168.1.21/app/admin/directories.asp?s=DEFAULT");
//set the cookie container object
request.CookieContainer = cookieContainer;
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

//set method POST and content type application/x-www-form-urlencoded
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

//insert your username and password
string data = string.Format("username={0}&password={1}", "root", "BadDog22");
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(data);

request.ContentLength = bytes.Length;

using (Stream dataStream = request.GetRequestStream())
{
    dataStream.Write(bytes, 0, bytes.Length);
    dataStream.Close();
}

Console.WriteLine("LOGIN RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

But both responses return the same login page; I just cannot get past it.
Could someone please suggest a solution?
 
If you have admin access, why do you need to scrape the data? Just get access to the actual data shown by that ASP page. Or are you not supposed to have access to this data, and you just happen to have the admin password?
 
Excellent questions :)
I could do that if the data were stored somewhere, but it is calculated on the fly.
I need thousands of samples, so automation would be very welcome.
 
Get access to those calculations. If you can't get access to them directly, ask the developers to expose an API. Page scraping is never a good idea.
 
Unfortunately, an API does not exist, and getting information from the devs could easily take months.
Everything you said is relevant and valid, but it does not bring me closer to solving the problem at hand.
Do you know how to write code that can get access to the page?
 
If the authentication protocol is well documented, then this is doable. Did your Fiddler traces reveal that it was truly simply Forms Authentication?
 
From Fiddler Classic I got a POST request containing
login_maintenance=%24MAINTENANCE%5Croot&pwd_maintenance=BadDog22&url_redirect=%2Fwatchdoc%2Fadmin%2Fdefault.asp%3Fs%3DDEFAULT
From that I gather that the parameters for the form (if it is a form) are
login_maintenance: $MAINTENANCE\root
pwd_maintenance: BadDog22
But now I don't really know what to do with this.
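
One way to sanity-check a capture like this is to decode it back into field names and values. The following is only a minimal sketch, assuming the body is ordinary application/x-www-form-urlencoded data; the helper name DecodeForm is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Body exactly as captured in Fiddler (copied from the post above).
var captured = "login_maintenance=%24MAINTENANCE%5Croot&pwd_maintenance=BadDog22"
             + "&url_redirect=%2Fwatchdoc%2Fadmin%2Fdefault.asp%3Fs%3DDEFAULT";

foreach (var pair in DecodeForm(captured))
    Console.WriteLine($"{pair.Key} = {pair.Value}");

// Hypothetical helper: splits a urlencoded body into decoded name/value pairs.
static List<KeyValuePair<string, string>> DecodeForm(string body)
{
    var fields = new List<KeyValuePair<string, string>>();
    foreach (var part in body.Split('&'))
    {
        // Split on the first '=' only, in case a value contains '='.
        var pieces = part.Split(new[] { '=' }, 2);
        // In form encoding '+' stands for a space; percent-escapes are decoded afterwards.
        var name = Uri.UnescapeDataString(pieces[0].Replace('+', ' '));
        var value = pieces.Length > 1 ? Uri.UnescapeDataString(pieces[1].Replace('+', ' ')) : "";
        fields.Add(new KeyValuePair<string, string>(name, value));
    }
    return fields;
}
```

Decoded this way, the capture carries three fields rather than two: besides login_maintenance and pwd_maintenance there is url_redirect, with the value /watchdoc/admin/default.asp?s=DEFAULT. Whether the server insists on that third field is not known from the trace alone.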
 
That's still not enough information. How is the authentication token transmitted and retained? Is it sent back as a cookie? Is it a string that you need to put in your subsequent web requests?

If the devs can't provide you an API, they can at least provide documentation of how authentication and authorization work for their site. While you are talking to them:
1) Put in the request for the API;
2) Ask for advice from them on how to best get the data that you require while waiting for the API.
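
On the cookie question: if the session is tracked with a cookie, a CookieContainer attached to the request handler stores it after the login response and resends it on later requests. A rough sketch for checking what ended up in the container follows; the helper name ListCookies and the cookie name and value in the demo are hypothetical, not taken from the real site.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Demo without any network call: pretend the server set one session cookie.
// "SESSIONID" / "abc123" are invented values for illustration only.
var site = new Uri("http://192.168.1.21/app/admin/");
var jar = new CookieContainer();
jar.Add(site, new Cookie("SESSIONID", "abc123", "/"));

foreach (var line in ListCookies(jar, site))
    Console.WriteLine(line);

// Hypothetical helper: lists the cookies the container would send to 'uri'.
static List<string> ListCookies(CookieContainer container, Uri uri)
{
    var lines = new List<string>();
    foreach (Cookie cookie in container.GetCookies(uri))
        lines.Add($"{cookie.Name}={cookie.Value}");
    return lines;
}
```

In real use the same container would be the one assigned to the request (HttpWebRequest.CookieContainer or HttpClientHandler.CookieContainer), and the helper would be called right after the login POST. An empty list at that point would suggest the login did not succeed or that the session is carried some other way.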
 
Thank you for the reply!
If the answers to your 2nd and 3rd questions are "Yes", what would a code example look like?
 
No. That's not how this site works. You present your code and you tell us what problems you are running into. Then we try to guide you towards a solution.

Again, a key piece of information is documentation on how their particular authentication protocol works.
 
Fair enough, thank you!
Here you are. I've switched off NTLM authentication in the app, so the login page now prompts only for the maintenance password, and Fiddler tells me that it is form-based authentication.

[Screenshot: 1658234087221.png]

Then, googling for "HttpClient scrape data from website with login c#", I found this page: test, C# - rextester
The code is below.

login into a website:
//Title of this code
//Rextester.Program.Main is the entry point for your code. Don't change it.
//Compiler version 4.0.30319.17929 for Microsoft (R) .NET Framework 4.5

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Net;
using System.Text;
using System.IO;

namespace Rextester
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var cookieContainer = new CookieContainer();

            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://localhost/app/admin/directories.asp?s=DEFAULT");
            request.CookieContainer = cookieContainer;

            //SET AUTOMATIC DECOMPRESSION
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

            //set the user agent and accept header values, to simulate a real web browser
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            Console.WriteLine("FIRST RESPONSE");
            Console.WriteLine();
            using (WebResponse response = request.GetResponse())
            {
                using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                {
                    Console.WriteLine(sr.ReadToEnd());
                }
            }

            request = (HttpWebRequest)HttpWebRequest.Create("http://localhost/app/admin/login.asp");
            //set the cookie container object
            request.CookieContainer = cookieContainer;
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

            //SET AUTOMATIC DECOMPRESSION
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

            //set method POST and content type application/x-www-form-urlencoded
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";

            //insert your username and password
            string data = string.Format("userNameB={0}&userPassWordB={1}&targetPage=%0D%0Aindex%3Ffromlogin%3D1&goLogin=Einloggen", "navneetjoshi512@gmail.com", "12345678");
            byte[] bytes = System.Text.Encoding.UTF8.GetBytes(data);

            request.ContentLength = bytes.Length;

            using (Stream dataStream = request.GetRequestStream())
            {
                dataStream.Write(bytes, 0, bytes.Length);
                dataStream.Close();
            }

            Console.WriteLine("LOGIN RESPONSE");
            Console.WriteLine();
            using (WebResponse response = request.GetResponse())
            {
                using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                {
                    Console.WriteLine(sr.ReadToEnd());
                }
            }
        }
    }
}

The first problem is on line #55, the string.Format call that builds the POST body. How do I build that string for my site?
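
For reference, the field names on line #55 (userNameB, userPassWordB, targetPage, goLogin) belong to the site the Rextester sample was written for. A hedged sketch of building the equivalent string from the fields seen in the Fiddler capture, assuming the server expects exactly those fields; BuildFormBody is a made-up helper name:

```csharp
using System;

// Field names and values come from the Fiddler capture quoted earlier in the thread.
string data = BuildFormBody(
    ("login_maintenance", @"$MAINTENANCE\root"),
    ("pwd_maintenance", "BadDog22"),
    ("url_redirect", "/watchdoc/admin/default.asp?s=DEFAULT"));
Console.WriteLine(data);

// Hypothetical helper: percent-encodes each name and value, then joins the pairs with '&'.
static string BuildFormBody(params (string Name, string Value)[] fields)
{
    var parts = new string[fields.Length];
    for (int i = 0; i < fields.Length; i++)
        parts[i] = Uri.EscapeDataString(fields[i].Name) + "=" + Uri.EscapeDataString(fields[i].Value);
    return string.Join("&", parts);
}
```

The printed string matches the captured body character for character, so it could replace the string.Format call as long as the request is also sent to the URL the browser posted to.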
 

But that's not your code. That's the code from the link in your first post.

I recommend using the HttpClient version of that code rather than the WebRequest version. Normally it's easier to deal with the HttpClient because it's a higher level API than the WebRequest calls that it eventually uses under the covers.
 
Thank you for the suggestion to use HttpClient!
I'm struggling with it, though. Here it is said: "If you have access to the website, connect to it using the right credentials and capture the traffic using Fiddler. Then, make sure WebClient sends out the right cookies, request headers, query strings, etc exactly same as the browser."
I got the cookies from Fiddler and wrote this:
cookie part:
using System.Text;
using System.Net.Http.Headers;
using System.Net;

var userName = @"$MAINTENANCE\root"; // verbatim string, so \r is not read as a carriage return
var passwd = "changeme";

var baseAddress = new Uri("http://localhost");
var cookieContainer = new System.Net.CookieContainer();
using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
using (var client = new HttpClient(handler) { BaseAddress = baseAddress })
{
    var content = new FormUrlEncodedContent(new[]
    {
        new KeyValuePair<string, string>("SERVER", "DEFAULT"),
        new KeyValuePair<string, string>("last%5Flang", "fr%2DFR"),
        new KeyValuePair<string, string>("lang%5F%24admin", "fr%2DFR"),
        new KeyValuePair<string, string>("SERVER%5FCLI", "DEFAULT"),
    });
    cookieContainer.Add(baseAddress, new Cookie("CookieName", "cookie_value"));
    var result = await client.PostAsync("/app/admin/login.asp?act=login&s=DEFAULT", content);
    Console.WriteLine(result);
    Console.ReadKey();
}

Next, I should add the request headers, and this is where it all goes crazy :) I suppose they should be the ones the browser sends to the server, right? If so, Fiddler displays six categories of headers: Cache, Client, Entity, Miscellaneous, Security, Transport. How do I make sure the C# code sends them exactly as the web browser does? I've been googling in vain for an hour and only find how to add custom headers or authorization headers, with no complete example.
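
For what it's worth, a sketch of how browser-style headers can be attached to a single request with HttpClient's types. It only builds the request and does not send it. The helper name BuildLoginRequest is hypothetical, the Accept-Language and Referer values are assumptions, and which headers (if any) the server actually checks is unknown.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

var message = BuildLoginRequest();
Console.Write(message.Headers);            // request headers
Console.Write(message.Content?.Headers);   // entity headers, owned by the content

// Hypothetical helper: builds the login POST with browser-like headers.
static HttpRequestMessage BuildLoginRequest()
{
    var request = new HttpRequestMessage(HttpMethod.Post,
        "http://localhost/app/admin/login.asp?act=login&s=DEFAULT");

    // Name/value pairs copied by hand from a capture; the last two are assumed values.
    var browserHeaders = new Dictionary<string, string>
    {
        ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        ["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        ["Accept-Language"] = "fr-FR",
        ["Referer"] = "http://localhost/app/admin/login.asp",
    };
    foreach (var header in browserHeaders)
        request.Headers.TryAddWithoutValidation(header.Key, header.Value);

    // Content-Type and Content-Length are not request headers in this API:
    // the content object sets them itself.
    request.Content = new FormUrlEncodedContent(new[]
    {
        new KeyValuePair<string, string>("pwd_maintenance", "changeme"),
    });
    return request;
}
```

The request would be sent with client.SendAsync(message). Cookies are best left out of the header list: when a CookieContainer is attached to the handler, it adds the Cookie header on its own.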
 
Good progress!

I recommend trying without setting any additional headers first. Just change your lines 15-18 to send the actual content that you want to send (e.g. the keys and values from your post #11). See what result you get from line 22. If you get back HTML that looks like the page you were hoping to scrape, then your next step will be to get the HTML Agility Pack from NuGet and start parsing that HTML.
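
A rough sketch of that change, assuming the form fields are the ones from the Fiddler capture and that the login and directory URLs are the ones already used in this thread. The helper names are made up, and the network part is untested against the real app.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Shows the body that would be sent; no network traffic happens here.
var form = BuildLoginForm(@"$MAINTENANCE\root", "changeme");
Console.WriteLine(await form.ReadAsStringAsync());

// To actually log in and fetch the page:
// var html = await LoginAndFetchAsync(new Uri("http://localhost"), form);

// Hypothetical helper: the three fields seen in the captured POST.
static FormUrlEncodedContent BuildLoginForm(string login, string password) =>
    new FormUrlEncodedContent(new[]
    {
        // Raw, unencoded values: FormUrlEncodedContent does the percent-encoding itself.
        new KeyValuePair<string, string>("login_maintenance", login),
        new KeyValuePair<string, string>("pwd_maintenance", password),
        new KeyValuePair<string, string>("url_redirect", "/watchdoc/admin/default.asp?s=DEFAULT"),
    });

// Hypothetical helper: posts the form, then requests the page with the same client.
static async Task<string> LoginAndFetchAsync(Uri baseAddress, HttpContent loginForm)
{
    var cookies = new CookieContainer();
    using var handler = new HttpClientHandler { CookieContainer = cookies };
    using var client = new HttpClient(handler) { BaseAddress = baseAddress };

    var response = await client.PostAsync("/app/admin/login.asp?act=login&s=DEFAULT", loginForm);
    response.EnsureSuccessStatusCode();

    // Same client and handler, so any session cookie set by the login is sent along.
    return await client.GetStringAsync("/app/admin/directories.asp?s=DEFAULT");
}
```

Two details worth noting. Keys that are already percent-encoded, such as "last%5Flang", would be encoded a second time (the % becomes %25), so names and values should be passed in plain form. And since the four pairs in the earlier code were taken from the cookies Fiddler showed, they may not be form fields at all.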
 
An important part of the picture is still missing, namely the authentication credentials.
In Fiddler I got this:
[Screenshot: 1658260694975.png]

It looks different from all the headers mentioned above, which are name: value pairs.
How should I add that line to the POST request?
 