C# does not encode Uri properly/Web client does not load page when uri has a unicode character

drone

New member
Joined
Oct 29, 2022
Messages
1
Programming Experience
Beginner
Hi,

I was trying to get data from the URL https://hanziyuan.net/#字 . In percent-encoding, this is

https://hanziyuan.net/#%E5%AD%97

No matter what I do, the data that loads is from the default page:
https://hanziyuan.net/#%E8%BD%A6
https://hanziyuan.net/#车

The code I used is given below. It seems the encoded part is not getting passed on to the server
by the C# client.

C#:
// Online C# Editor for free
// Write, Edit and Run your C# code using C# Online Compiler

using System;

public class HelloWorld
{
    public static void Main(string[] args)
    {


        System.Net.WebClient wc = new System.Net.WebClient();


        byte[] raw = wc.DownloadData(new System.Uri("Chinese Etymology 字源"));

        string webData = System.Text.Encoding.UTF8.GetString(raw);

        Console.WriteLine(webData);
    }
}

The data that loads is from the default page: Chinese Etymology 字源
https://hanziyuan.net/#车

While **the expected data from that code** is from:
https://hanziyuan.net/#字


I have tried with the string "https://hanziyuan.net/#字" as well. Nothing seems to work!
 
You might need to post a follow-up reply here or edit your post #1 above. I moved your code into CODE tags, and tried to protect some of the URLs in ICODE tags, but there are still some places that need help. For example, line 15 in the code should have some kind of URL but the forum editor tried to unfurl the URL because you didn't post in code tags.

The forum uses BBCode, not Markdown.
 
I get the same result as you do when I use the up-to-date HttpClient instead of the obsolete WebClient.

C#:
using System;
using System.Net.Http;

var client = new HttpClient();
var expectedData = await client.GetStringAsync("https://hanziyuan.net/#%E5%AD%97");
var resultingData = await client.GetStringAsync("https://hanziyuan.net/#%E8%BD%A6");

if (expectedData == resultingData)
    Console.WriteLine("Same page");
else
    Console.WriteLine("Different pages.");

The code above ends up printing out "Same page".

This suggests to me that the content on various pages for that site is dynamically generated/downloaded within the browser. You'll need to let a web browser retrieve the data (or figure out how to execute the JavaScript on the downloaded page) to get the effect that you want.

This is not an issue with the URI encoding of Unicode characters, or with the web client.
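
To rule out the encoding side, here is a minimal sketch, assuming .NET 6 or later with top-level statements. It percent-encodes the character with Uri.EscapeDataString and compares the result with the encoded URL from post #1. The variable name encoded is just for illustration.

C#:
```csharp
using System;

// Percent-encode the character as UTF-8 and compare with the URL in post #1.
string encoded = Uri.EscapeDataString("字");
Console.WriteLine(encoded); // %E5%AD%97

// Round-trip back to the original character to confirm nothing is lost.
string decoded = Uri.UnescapeDataString(encoded);
Console.WriteLine(decoded == "字"); // True
```

If that matches the URL you are using, the character is being encoded the way you expect, and the cause has to be somewhere else.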
 
I didn't see it originally, but the issue is actually due to your URL containing a URL fragment. Notice the #.

When an agent (such as a web browser) requests a web resource from a web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.

In an HTML web page, the agent will look for an anchor identified with an HTML tag that includes an id= or name= attribute equal to the fragment identifier.

from

So effectively, both URLs are identical up to the fragment. The server receives the same request for each, so you get the same contents.
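
You can see this from the Uri class itself. The following is a rough sketch, assuming .NET 6 or later with top-level statements. The variable names expected, fallback and sameRequest are made up for this example. It only parses the two URLs from this thread and does not make any network request.

C#:
```csharp
using System;

// The two URLs from this thread; they differ only after the '#'.
var expected = new Uri("https://hanziyuan.net/#%E5%AD%97");
var fallback = new Uri("https://hanziyuan.net/#%E8%BD%A6");

// The fragment is available on the client side of the Uri object...
Console.WriteLine(expected.Fragment);
Console.WriteLine(fallback.Fragment);

// ...but the path and query, which are what go into the request, are the same.
Console.WriteLine(expected.PathAndQuery);
Console.WriteLine(fallback.PathAndQuery);

// Everything up to (and excluding) the fragment is identical for both URLs.
bool sameRequest =
    expected.GetLeftPart(UriPartial.Query) == fallback.GetLeftPart(UriPartial.Query);
Console.WriteLine(sameRequest); // True
```

Since the part before the fragment is all the server ever sees, any code that wants the page for a specific character has to obtain it some other way than changing the fragment.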
 