Problem with concatenating numbers with Hebrew

orentu
New member · Joined Nov 4, 2019 · Messages: 2 · Programming Experience: 5-10
Hi,
When I try to concatenate several strings into one, the order changes.
For example:
str1 = number
str2 = Hebrew
str3 = number
str3 appears to be attached to str1 rather than to str2, I guess. (If I concatenate with "," between them it should be OK, but I want fixed-length fields.)
Some explanation:
I loop over a dictionary to get each field's length and tag <> (from the XML).
Then I loop over the XML (treating the XML as plain text).
For each dictionary entry whose tag also exists in the text/XML, I take the value between the opening and closing tags and append it, until the end of the dictionary.
The problem:
The concatenation does not come out in order, I guess because Hebrew is RTL.
Any suggestions?
Thanks
C#:
for (int index = 0; index < dict.Count; index++)
{
    var item = dict.ElementAt(index);
    var itemKey = item.Key;
    var itemValue = item.Value;
    // int x = Int32.Parse(itemKey);
    StringBuilder builder = new StringBuilder(itemValue);
    builder.Replace("<", "</");
    int lengthh = builder.Length;
    StringBuilder builderWO = new StringBuilder(itemValue);
    builderWO.Replace(">", "/>");
    foreach (string line in lines)
    {
        int theFirstLen = line.Trim().IndexOf(itemValue);
        int theLastLen = line.Trim().IndexOf(builder.ToString());
        int theLastLenWO_OPEN = line.Trim().IndexOf(builderWO.ToString());
        if (theFirstLen >= 0 && theLastLen > 0 || theLastLenWO_OPEN >= 0)
        {
            if (theLastLenWO_OPEN >= 0) //mean that we need to put spaces only
            {
                SR_LEFT = SR_LEFT + "{" + i + ",-" + itemKey.Substring(0, itemKey.IndexOf(".")) + "}";
                Console.WriteLine(itemKey.Length - itemKey.IndexOf("."));
                SR_RIGHT = SR_RIGHT + new string(' ', Int32.Parse(itemKey.Substring(0, itemKey.IndexOf("."))));
                i += 1;
                break;
            }
            else
            {
                // Console.WriteLine(line.Trim().Substring(theFirstLen + lengthh, theLastLen - theFirstLen - lengthh));
                SR_LEFT = SR_LEFT + "{" + i + ",-" + itemKey.Substring(0, itemKey.IndexOf(".")) + "}";
                SR_RIGHT = SR_RIGHT + line.Trim().Substring(theFirstLen + lengthh, theLastLen - theFirstLen - lengthh);
                i += 1;
                break;
            }
        }
    }
}
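using (StreamWriter sw = new StreamWriter("C:\\TST.TEXT", false))
{
    // SR_LEFT is used as a composite format string ("{0,-N}{1,-N}...") and SR_RIGHT as its only argument.
    // Note: if SR_LEFT contains {1} or a higher index, this call throws a FormatException.
    sw.WriteLine(SR_LEFT, SR_RIGHT);
}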
 
Can you give us some sample values in your dictionary dict and lines list lines?
 
Anyway, the string is correct. I think that you are just being tricked by the RTL rendering being performed by RTL aware text editors and renderers.

Here is my test program:
Program.cs:
using System;
using System.Linq;
using System.Windows.Forms;

namespace SimpleCS
{
    class Program
    {
        static void DumpString(string s)
        {
            foreach (char ch in s)
                Console.Write("{0:x4} ", (int)ch);
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            var str1 = "123";
            var str2 = "שלום";
            var str3 = "456";

            DumpString(str1);
            DumpString(str2);
            DumpString(str3);

            var concat = str1 + str2 + str3;
            DumpString(concat);

            MessageBox.Show(concat);
        }
    }
}

Here's the output to console:
Output:
0031 0032 0033
05e9 05dc 05d5 05dd
0034 0035 0036
0031 0032 0033 05e9 05dc 05d5 05dd 0034 0035 0036

Notice that the code units for each character are in the right order.

It's just the rendering that switches over to RTL:
Capture.PNG
 
I believe this might be what the OP was trying to do. The important part is "\u200E":

C#:
            var str1 = "123";
            var str2 = "כַּף סוֹפִית";
            var str3 = "456";
            var c = string.Concat(str1, "\u200E" + str2 + "\u200E" , str3);
            Console.WriteLine(c);
            MessageBox.Show(c);

See whether this fully or partially resolves your problem. This is actually a rendering issue in Windows and has never been addressed.

Screenshot_.jpg
 
Looks like not only Windows has a rendering problem. Chrome's rendering engine has a different issue:
HTML:
<html>
<body>
    <div>123</div>
    <div>שלום</div>
    <div>456</div>
    <div>123שלום456</div>
</body>
</html>

Notice above that the last div shows as 123456 followed by Hebrew text, but if you see the attached file, the data is stored as: 123 hebrew text 456. (Rename the .TXT to .HTML since this forum won't allow attaching .HTML files.)
 

Attachments

  • test.txt
    128 bytes · Views: 45
That rendering issue occurs because the Hebrew doesn't have the formatting mark on both sides.
Run this:
C#:
            var str1 = "123";
            var str2 = "כַּף סוֹפִית";
            var str3 = "456";
            var c = string.Concat(str1, "\u200E" + str2 , str3);
            Console.WriteLine(c);
            MessageBox.Show(c);
 
Let me elaborate.

If you remove the code point from the right side of the Hebrew word, all the numbers are aligned to the left and all the Hebrew ends up on the right side of the string. If you move the code point to the right side only, the Hebrew ends up on the left. Only when there is a "\u200E" mark on both sides is the Hebrew centered.

Screenshot_33.jpg


And

Screenshot_34.jpg


As far as I can tell, these code points were created to deal with this exact problem.
 
Sort of. I feel that the marks are being misused, but I'm still trying to grok the Unicode BiDi Algorithm.

\u200E is the LRM (Left-to-Right Mark) code point. Why would you wrap the Hebrew text, which is already RTL, with LRM code points? Wouldn't it make more sense to use the RLM (Right-to-Left Mark: \u200F) before the Hebrew text, and then put the LRM before the "456"? E.g.: 123 \u200F hebrew text \u200E 456. Doing it this way doesn't work on Windows, though.
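Since the marks are invisible, it can help to print them as visible labels when comparing the variants discussed here. A minimal sketch: the names `BidiMarks` and `Reveal` are made up for illustration, and only the code points themselves (U+200E for LRM, U+200F for RLM) come from Unicode.

```csharp
using System;
using System.Text;

static class BidiMarks
{
    public const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK
    public const char Rlm = '\u200F'; // RIGHT-TO-LEFT MARK

    // Hypothetical helper: returns a copy of s in which the invisible marks are
    // spelled out and every other non-ASCII char is shown as a \uXXXX escape.
    public static string Reveal(string s)
    {
        var sb = new StringBuilder();
        foreach (char ch in s)
        {
            if (ch == Lrm) sb.Append("<LRM>");
            else if (ch == Rlm) sb.Append("<RLM>");
            else if (ch < 0x80) sb.Append(ch);
            else sb.AppendFormat("\\u{0:X4}", (int)ch);
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        string hebrew = "\u05E9\u05DC\u05D5\u05DD"; // the word used in the test program above

        // Variant 1: LRM on both sides of the Hebrew.
        string lrmBothSides = "123" + BidiMarks.Lrm + hebrew + BidiMarks.Lrm + "456";
        // Variant 2: RLM before the Hebrew, LRM before the trailing digits.
        string rlmThenLrm = "123" + BidiMarks.Rlm + hebrew + BidiMarks.Lrm + "456";

        Console.WriteLine(BidiMarks.Reveal(lrmBothSides));
        Console.WriteLine(BidiMarks.Reveal(rlmThenLrm));
    }
}
```

Because the output is plain ASCII, it looks the same in any console or editor, regardless of how that program renders bidirectional text.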

And even more interesting: why does the following work without requiring the LRMs?
C#:
            var str1 = "ABC";
            var str2 = "כַּף סוֹפִית";
            var str3 = "DEF";
            var c = string.Concat(str1, str2 , str3);
            Console.WriteLine(c);
            MessageBox.Show(c);
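One possible explanation, going by the Unicode Bidirectional Algorithm (UAX #9): Latin letters such as ABC are strong left-to-right characters, while the digits in 123 and 456 are weak (European Number) and take their direction from context, so digits that directly follow Hebrew can get pulled into the right-to-left run. An LRM placed before the digits gives them a strong left-to-right neighbour. The sketch below only illustrates the strong/weak distinction. `RoughBidi` and its methods are hypothetical, and the classifier hard-codes a few ranges rather than using the real Bidi_Class property data.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class RoughBidi
{
    // Hypothetical, heavily simplified stand-in for the Unicode Bidi_Class property.
    public static string Classify(char ch)
    {
        if (ch >= '0' && ch <= '9') return "EN"; // European number: weak
        if (ch >= '\u0590' && ch <= '\u05FF')
        {
            // Vowel points and other combining marks are weak; the rest of the block is strong RTL.
            return CharUnicodeInfo.GetUnicodeCategory(ch) == UnicodeCategory.NonSpacingMark
                ? "NSM"
                : "R";
        }
        if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) return "L"; // strong LTR
        if (ch == '\u200E') return "L"; // LRM counts as a strong LTR character
        if (ch == '\u200F') return "R"; // RLM counts as a strong RTL character
        return "other";
    }

    // True if the string contains at least one strong (L or R) character.
    public static bool HasStrong(string s)
    {
        return s.Any(ch => Classify(ch) == "L" || Classify(ch) == "R");
    }
}

class Program
{
    static void Main()
    {
        foreach (string s in new[] { "123", "ABC", "\u200E456" })
        {
            string classes = string.Join(" ", s.Select(RoughBidi.Classify));
            Console.WriteLine("{0} strong={1}", classes, RoughBidi.HasStrong(s));
        }
    }
}
```

In this simplified view, "123" and "456" contain no strong character of their own, whereas "ABC" and "DEF" do, which would be consistent with the letters staying put while the digits move.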
 
"I feel that the marks are being misused"
I can assure you, the LTR mark is not being misused. I have researched this for the last two days, precisely because I knew only a little about it and about the problems with text positioning, especially with numbers as in the example above. I've been more intrigued since, and have read quite a bit on the subject. I am not claiming to be an expert on this. Lol! But I am also aware that Hebrew is read right to left.

Because of the ordering of the Hebrew text, I believe that any rendering algorithm that detects Hebrew or Arabic characters somehow binds, or hard-codes, the RTL direction to them, forcing that text to be ordered as it is. So I believe there is a probable bug, and that bug -may- be this: when you place ordinary text alongside Hebrew or Arabic, the only way to sort out the ordering of the words is to wrap the Hebrew text in LTR marks, which act as a separator from the Hebrew text that is hard-coded as RTL. And because of how the rendering is done (however it -may- be hard-coded to be RTL), the Hebrew text itself is not fazed by any LTR marks, which seems to lend weight to my theory based on my observations. Maybe I am wrong?

Am I right? I don't really know. But from what I can see, it looks just as I've described it. I'm trying my best to explain how I think it works, based on observing the few things I've tried, as Hebrew/Arabic text certainly seems to be unaffected by the LTR marks surrounding it. Does that make sense? lol
"Why would you wrap the Hebrew text which is already RTL, with LRM code points"
I partially answered this in the last paragraph, but it seems evident that this is the only way to position non-Hebrew or non-Arabic text with the Hebrew in the center. Regardless, all I know is that it works, and that's the main thing.
 
Here is how I picture it working; I think this is a good example. If RTL chars are detected, an external function is called which encloses them in some kind of self-contained mark that is not fazed by whatever is on either side of it. (It's like using !important in CSS.) So even if you add LTR marks before and after the returned RTL text (the Hebrew word), it won't change, because it's already hard-coded and can't be edited. From tests I've done in Opera, Chrome, Mozilla, and C#, they all seem to have different ordering problems with Hebrew. So it must be something to do with the Bidi algorithm.
CONCAT.png
 
I was just wondering if anyone would be willing to do some testing and research on this subject, which has piqued my curiosity about the bidi algorithm. When I first started looking into this character alignment issue, I thought .NET was likely responsible for how some values display.

But you can obviously rule that out once you apply the same strings from the C# code above to HTML and try them in different web browsers: the results are all different. Would this have anything to do with the rendering engine used in each browser?

Browsers display different results from each other, including browsers that use the same engine but are on different versions (see screenshots), and I am curious whether anyone has a theory on why.

For the sake of learning something new and unknown to me, I've taken a shine to this issue while trying to understand how the algorithm works. Since the only replies to the topic so far are from Skydiver and myself, I'm assuming you guys are just as in the dark as I am about how the various browser engines each render different results, and that's why you haven't replied?

I would have thought that each browser, given that it's using the same functionality, would output the same results, but it does not.

Screenshot_39.jpgScreenshot_40.jpgScreenshot_41.jpgScreenshot_42.jpg

And to test this, copy the text file into an HTML doc and save it. Ref for html - Char range for Hebrew Chars - Ref for LTR OR Ref RTL.
 

Attachments

  • test.txt
    1.2 KB · Views: 30
Handy article. I am aware of the different markup options for HTML, through to v5 etc., but the article doesn't explain source code usage in C#. Nor does it show how Hebrew is aligned automatically, or explain why in C# the Hebrew is unaffected (rightly or wrongly) when enclosed with LTR marks on both sides. I didn't know there was a BiDi override in HTML, but I did learn that from your article. Useful for future use, I guess. What I have learned is that WPF is a hell of a lot smoother than WinForms in how BiDi is manipulated and controlled. I also learned that chars are not checked individually, but in blocks of words instead. This eventually made sense to me when I ran certain words through a char array and checked for Hebrew chars:
C#:
        /// <summary>
        /// Checks whether a single char falls inside one of the two inclusive ranges supplied in charRange.
        /// </summary>
        /// <param name="eChar">The char to test, supplied by the calling method while it iterates a char array.</param>
        /// <param name="hasHebrewChar">Overwritten inside the method: true if eChar lies within charRange[0]..charRange[1].</param>
        /// <param name="hasOtherHebrewChar">Overwritten inside the method: true if eChar lies within charRange[2]..charRange[3].</param>
        /// <param name="charRange">Four chars holding two inclusive ranges: { start1, end1, start2, end2 }.</param>
        /// <returns>True if eChar lies in either range; otherwise false.</returns>
        private static bool HasHebrew(char eChar, bool hasHebrewChar, bool hasOtherHebrewChar, char[] charRange)
        {
            hasHebrewChar = eChar >= charRange[0] && eChar <= charRange[1];
            hasOtherHebrewChar = eChar >= charRange[2] && eChar <= charRange[3];
            return hasHebrewChar || hasOtherHebrewChar;
        }
I noticed that some chars slip through the net and pass for non-Hebrew/Arabic, but I am unsure why that happens. For example, str2 in the following snippet contains the word כַּף סוֹפִית, and the letter וֹ (waw in Arabic, vav in Hebrew) seems to slip through the filter. Maybe I am missing additional char ranges for the Hebrew vowels?
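One way to check which characters fall outside the ranges is to print every char of the string with its code point and Unicode category. A minimal sketch, assuming the Hebrew block starts at U+0590 and that the escaped literal below matches str2 (the order of the vowel points in the original string may differ). `InHebrewRanges` is a made-up name. With this spelling, the only char outside both ranges is the space (U+0020) between the two words; the vav (U+05D5) and its holam point (U+05B9) are both inside the Hebrew block.

```csharp
using System;
using System.Globalization;

class Program
{
    // Same kind of range test as HasHebrew, with the ranges written out (assumed values).
    public static bool InHebrewRanges(char ch)
    {
        return (ch >= '\u0590' && ch <= '\u05FF')   // Hebrew block
            || (ch >= '\uFB1D' && ch <= '\uFB4F');  // Hebrew presentation forms
    }

    static void Main()
    {
        // Assumed equivalent of str2, written with escapes so the code points are explicit.
        string str2 = "\u05DB\u05B7\u05BC\u05E3 \u05E1\u05D5\u05B9\u05E4\u05B4\u05D9\u05EA";

        foreach (char ch in str2)
        {
            Console.WriteLine("U+{0:X4} {1,-16} {2}",
                (int)ch,
                CharUnicodeInfo.GetUnicodeCategory(ch),
                InHebrewRanges(ch) ? "in range" : "NOT in range");
        }
    }
}
```

If the space is what trips the filter, skipping whitespace before the range test would be one option.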
C#:
        /// <summary>
        /// This Button1_Click method is designed to create one tuple of three strings at a time.
        /// </summary>
        /// <param name="sender">Base object</param>
        /// <param name="e">Event args for the button</param>
        private void Button1_Click(object sender, EventArgs e)
        {
            var str1 = "123";
            var str2 = "כַּף סוֹפִית";
            var str3 = "456";


            Tuple<string, string, string> strTuple = Tuple.Create(str1, str2, str3);
            string result = BidiHelper.GetHebrewConcat(BidiHelper.CharRange, BidiHelper.CodeMark, strTuple, string.Empty);
            if (!string.IsNullOrEmpty(result))
                MessageBox.Show(result);
        }
The method I used to iterate over the char array is as follows. Note that I've commented inline in the code to save writing another exasperatingly lengthy post, as I sometimes do, and I've included the class as a whole below the snippet:
C#:
        public static string GetHebrewConcat(char[] charRange, string codeToPoint, Tuple<string, string, string> tupleOfStrings, string separator)
        {
            var charArr = tupleOfStrings.Item2.ToCharArray();
            int spins = 0; // number of chars of Item2 visited so far
            foreach (char eChar in charArr)
            {
                switch (HasHebrew(eChar, false, false, charRange))
                {
                    case true:
                        spins++;
                        // Only the final char reaches this return: wrap Item2 in the code mark on both sides.
                        if (tupleOfStrings.Item2.Length.Equals(spins))
                        { return string.Join(separator, tupleOfStrings.Item1, string.Concat(codeToPoint, tupleOfStrings.Item2, codeToPoint), tupleOfStrings.Item3); }
                        break;
                    case false:
                        spins++;
                        // The final char is outside the ranges: join the three strings unchanged.
                        if (tupleOfStrings.Item2.Length.Equals(spins))
                        { return string.Join(separator, tupleOfStrings.Item1, tupleOfStrings.Item2, tupleOfStrings.Item3); }
                        break;
                }
            }
            return string.Empty; // reached only when Item2 is empty
        }
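Note that the loop in GetHebrewConcat only returns on its last iteration, so in effect the final char of Item2 decides whether the marks are added. If the intent is to wrap the middle string whenever it contains any Hebrew char, a shorter sketch might look like the following. `BidiSketch`, `IsHebrew`, and `JoinAroundHebrew` are hypothetical names, and the Hebrew block is assumed to start at U+0590.

```csharp
using System;
using System.Linq;

static class BidiSketch
{
    const string Lrm = "\u200E"; // LEFT-TO-RIGHT MARK, the same value as CodeMark

    public static bool IsHebrew(char ch)
    {
        return (ch >= '\u0590' && ch <= '\u05FF')   // Hebrew block
            || (ch >= '\uFB1D' && ch <= '\uFB4F');  // Hebrew presentation forms
    }

    // Joins left, middle, right; wraps middle in LRMs if any of its chars is Hebrew.
    public static string JoinAroundHebrew(string left, string middle, string right, string separator = "")
    {
        string wrapped = middle.Any(IsHebrew) ? Lrm + middle + Lrm : middle;
        return string.Join(separator, left, wrapped, right);
    }
}

class Program
{
    static void Main()
    {
        string s = BidiSketch.JoinAroundHebrew("123", "\u05E9\u05DC\u05D5\u05DD", "456");

        // Dump the code units, as in the DumpString test earlier in the thread.
        foreach (char ch in s)
            Console.Write("{0:x4} ", (int)ch);
        Console.WriteLine();
    }
}
```

Unlike the original, this version still returns the joined strings when the middle string is empty, rather than string.Empty.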
C#:
    /// <summary>
    /// The BidiHelper class is designed to detect Hebrew characters in a string (Arabic is not
    /// covered by the ranges in CharRange) and to concatenate three strings while keeping
    /// the second item of the tuple positioned between Item1 and Item3.
    /// </summary>
    public static class BidiHelper
    {
        /// <summary>
        /// CharRange holds two inclusive ranges: the Hebrew block (U+0590 to U+05FF) and the Hebrew
        /// presentation forms (U+FB1D to U+FB4F). Arabic (U+0600 to U+06FF) is not covered.
        /// See for more info : https://en.m.wikipedia.org/wiki/Hebrew_(Unicode_block)
        /// </summary>
        public static readonly char[] CharRange = { (char)0x0590, (char)0x05ff, (char)0xfb1d, (char)0xfb4f };
        /// <summary>
        /// The code mark (CodeMark) is the left-to-right mark (LRM). See for more info : http://unicode.org/reports/tr9/#Directional_Formatting_Codes
        /// </summary>
        public static readonly string CodeMark = "\u200E";
        /// <summary>
        /// Concatenates the three strings of the tuple, wrapping Item2 in the code mark when it is detected as Hebrew.
        /// </summary>
        /// <param name="charRange">This parameter takes the char values from the CharRange char array.</param>
        /// <param name="codeToPoint">This parameter is responsible for the directional order of the text and takes its value from the CodeMark string.
        /// See the summary for CodeMark for additional info.</param>
        /// <param name="tupleOfStrings">The tuple holds the three values we want to concatenate together. These are the
        /// three values we used to create the Tuple with above.</param>
        /// <param name="separator">The separator is used to add an optional symbol for separation. To use none, use string.Empty</param>
        /// <returns>A concatenated string of the three values, or string.Empty if Item2 is empty.</returns>
        public static string GetHebrewConcat(char[] charRange, string codeToPoint, Tuple<string, string, string> tupleOfStrings, string separator)
        {
            var charArr = tupleOfStrings.Item2.ToCharArray();
            int spins = 0;
            foreach (char eChar in charArr)
            {
                switch (HasHebrew(eChar, false, false, charRange))
                {
                    case true:
                        spins++;
                        if (tupleOfStrings.Item2.Length.Equals(spins))
                        { return string.Join(separator, tupleOfStrings.Item1, string.Concat(codeToPoint, tupleOfStrings.Item2, codeToPoint), tupleOfStrings.Item3); }
                        break;
                    case false:
                        spins++;
                        if (tupleOfStrings.Item2.Length.Equals(spins))
                        { return string.Join(separator, tupleOfStrings.Item1, tupleOfStrings.Item2, tupleOfStrings.Item3); }
                        break;
                }
            }
            return string.Empty;
        }
        /// <summary>
        /// Checks whether a single char falls inside one of the two inclusive ranges supplied in charRange.
        /// </summary>
        /// <param name="eChar">The char to test, supplied by the calling method while it iterates a char array.</param>
        /// <param name="hasHebrewChar">Overwritten inside the method: true if eChar lies within charRange[0]..charRange[1].</param>
        /// <param name="hasOtherHebrewChar">Overwritten inside the method: true if eChar lies within charRange[2]..charRange[3].</param>
        /// <param name="charRange">Four chars holding two inclusive ranges: { start1, end1, start2, end2 }.</param>
        /// <returns>True if eChar lies in either range; otherwise false.</returns>
        private static bool HasHebrew(char eChar, bool hasHebrewChar, bool hasOtherHebrewChar, char[] charRange)
        {
            hasHebrewChar = eChar >= charRange[0] && eChar <= charRange[1];
            hasOtherHebrewChar = eChar >= charRange[2] && eChar <= charRange[3];
            return hasHebrewChar || hasOtherHebrewChar;
        }
    }
I've spent a lot of time on this in the last few days, mostly researching C++ repos and scouring other Git repositories for example functionality, to get an idea of how each word is grouped before being ordered by the algorithm. It is a very deep but interesting algorithm to study. The main trouble is that there is not enough documented C# source code on this subject, and I am not much further along in understanding how Hebrew and Arabic text is automatically grouped together, or how one would go about overriding the default alignment for Hebrew/Arabic-only words, just like you can in HTML. My most recent find is Reference Source, and I am currently only reading up on it. :)

Anyway, maybe this code will help someone who is looking to concatenate strings with numbers in C#.
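Since the original question asked for fixed-length fields, here is a minimal sketch of building such a record with PadRight, which pads the same way as the "{0,-N}" alignment in the OP's format string. `FixedWidth` and `BuildRecord` are hypothetical names and the widths are made up. One caveat: PadRight counts UTF-16 code units, so Hebrew vowel points and any added LRM count toward the width even though they take up no column on screen.

```csharp
using System;
using System.Text;

static class FixedWidth
{
    // Hypothetical helper: left-aligns each value in its own column width.
    // Values longer than their width are not truncated.
    public static string BuildRecord(params (string Value, int Width)[] fields)
    {
        var sb = new StringBuilder();
        foreach (var (value, width) in fields)
            sb.Append(value.PadRight(width));
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        string record = FixedWidth.BuildRecord(
            ("123", 5),
            ("\u05E9\u05DC\u05D5\u05DD", 6), // Hebrew word without vowel points
            ("456", 5));

        // The stored (logical) order is fixed, whatever a viewer shows on screen.
        Console.WriteLine(record.Length);
        Console.WriteLine(record.IndexOf("456", StringComparison.Ordinal));
    }
}
```

Checking offsets with IndexOf, rather than looking at the file in an RTL-aware editor, shows where each field really starts in the data.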

Edit: fixed inline comment.
 