C# does not encode Uri properly/Web client does not load page when uri has a unicode character

drone · Oct 29, 2022

Hi,

I was trying to get data from the URL https://hanziyuan.net/#字. In percent encoding, this is https://hanziyuan.net/#%E5%AD%97.

No matter what I do, the data that loads is from the default page, https://hanziyuan.net/#车 (percent-encoded: https://hanziyuan.net/#%E8%BD%A6).

The code I used is given below. It seems the encoded part is not getting passed on to the server
by the C# client.

C#:

// Online C# Editor for free
// Write, Edit and Run your C# code using C# Online Compiler

using System;

public class HelloWorld
{
    public static void Main(string[] args)
    {


        System.Net.WebClient wc = new System.Net.WebClient();


        byte[] raw = wc.DownloadData(new System.Uri("Chinese Etymology 字源"));

        string webData = System.Text.Encoding.UTF8.GetString(raw);

        Console.WriteLine(webData);
    }
}

The data that loads is from the default page:
https://hanziyuan.net/#车

While the expected data for that code is from:
https://hanziyuan.net/#字

I have tried with the string "https://hanziyuan.net/#字" as well. Nothing seems to work!

Skydiver · Oct 29, 2022

You might need to post a follow-up reply here or edit your post #1 above. I moved your code into CODE tags, and tried to protect some of the URLs in ICODE tags, but there are still some places that need help. For example, line 15 in the code should have some kind of URL but the forum editor tried to unfurl the URL because you didn't post in code tags.

The forum uses BBCode, not Markdown.

Skydiver · Oct 29, 2022

I get the same result as you do using the up-to-date HttpClient instead of the obsolete WebClient.

C#:

using System;
using System.Net.Http;

var client = new HttpClient();
var expectedData = await client.GetStringAsync("https://hanziyuan.net/#%E5%AD%97");
var resultingData = await client.GetStringAsync("https://hanziyuan.net/#%E8%BD%A6");

if (expectedData == resultingData)
    Console.WriteLine("Same page");
else
    Console.WriteLine("Different pages.");

The code above ends up printing out "Same page".

This suggests to me that the content on various pages for that site is dynamically generated/downloaded within the browser. You'll need to let a web browser retrieve the data (or figure out how to execute the JavaScript on the downloaded page) to get the effect that you want.

This is not an issue with how the URI encodes Unicode characters, or with the web client.
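As a quick sanity check on the encoding side, here is a minimal sketch (my own illustration, not code from the original post) that percent-encodes the two characters with Uri.EscapeDataString. The variable names are made up for the example, and the characters are written as \u escapes so the snippet does not depend on the source file's encoding.

```csharp
using System;

// "\u5B57" is 字 and "\u8F66" is 车, written as escapes on purpose.
// Uri.EscapeDataString percent-encodes each character as its UTF-8 bytes.
string encodedZi = Uri.EscapeDataString("\u5B57");
string encodedChe = Uri.EscapeDataString("\u8F66");

Console.WriteLine(encodedZi);  // %E5%AD%97
Console.WriteLine(encodedChe); // %E8%BD%A6
```

Both values match the percent-encoded URLs quoted in post #1, so the encoding step itself looks fine.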

Skydiver · Oct 29, 2022

I didn't see it originally, but the issue is actually due to your URL containing a URL fragment. Notice the #.

When an agent (such as a web browser) requests a web resource from a web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.

In an HTML web page, the agent will look for an anchor identified with an HTML tag that includes an id= or name= attribute equal to the fragment identifier.

(from "URI fragment - Wikipedia", en.m.wikipedia.org)

So effectively, because both URLs are identical before the fragment, the server receives the same request for both and you get the same contents.
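To see this without touching the network, here is a rough sketch using System.Uri to split the two URLs from this thread. The variable names are just for illustration. It only shows how System.Uri separates the fragment from the rest of the URL; it does not show what HttpClient does internally.

```csharp
using System;

var expected = new Uri("https://hanziyuan.net/#%E5%AD%97");
var actual = new Uri("https://hanziyuan.net/#%E8%BD%A6");

// PathAndQuery is the part that goes into the HTTP request line.
// The fragment is not included in it.
Console.WriteLine(expected.PathAndQuery); // /
Console.WriteLine(actual.PathAndQuery);   // /

// The fragment is kept separately on the client side.
Console.WriteLine(expected.Fragment);
Console.WriteLine(actual.Fragment);

// Scheme + authority + path, with the fragment dropped.
string expectedRequest = expected.GetLeftPart(UriPartial.Path);
string actualRequest = actual.GetLeftPart(UriPartial.Path);

Console.WriteLine(expectedRequest);                  // https://hanziyuan.net/
Console.WriteLine(expectedRequest == actualRequest); // True
```

Since everything before the # is identical, a request made from either URL asks the server for the same resource. Whatever the site does with the fragment has to happen in the browser after the page loads.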
