Resolved Regex Failures beyond my understanding

ConsKa

Well-known member
Joined
Dec 11, 2020
Messages
140
Programming Experience
Beginner
I am trying to match a word and replace with a capitalised version, and I don't understand why the word is not being seen?

I have used:

C#:
@"\bor\b", "OR"

Which I add to a dictionary as key and value, with the regex being the key.

My understanding, and all my reading seems to suggest, that this will search through a string looking for a non-word character, then an 'o' and then an 'r' and if it then finds a non-word character after it matches those two, it will consider that a full match.
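
One detail worth pinning down here: \b is a zero-width word boundary rather than a non-word character, so it consumes nothing and also matches at the very start or end of the string. A minimal sketch of the difference, assuming a hypothetical helper class WordBoundaryDemo that is not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: upper-cases "or" only where it stands alone as a word.
    public static string CapitaliseOr(string input) =>
        Regex.Replace(input, @"\bor\b", "OR");

    // Same replacement without the \b anchors, for comparison.
    public static string CapitaliseOrAnywhere(string input) =>
        Regex.Replace(input, "or", "OR");
}
```

With the input "or else, more or less", CapitaliseOr gives "OR else, more OR less" (the leading "or" still matches although no character precedes it), while CapitaliseOrAnywhere gives "OR else, mORe OR less".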

I do a check to see if the key exists in the string so as to avoid an exception error:

C#:
foreach (KeyValuePair<string, string> entry in dict)
{
    if (item.Contains(entry.Key))
    {
        var outPut2 = Regex.Replace(item, string.Join("|", dict.Keys.Select(k => k.ToString()).ToArray()), m => dict[m.Value]);
        strOut.Text += outPut2 + Environment.NewLine;
    }
}

Item is the phrase: "ConsKa or Conske"

The KeyValuePair does not match the @"\bor\b"; it just skips right over it.

I don't really understand why it isn't matching?

There is no issue with the dictionary, nor the Linq, as those terms where I have not had to use Regex (characters that never appear in the middle of a word) are found and replaced with no issue. It is when I come to deal with characters that could appear in the middle of a word where I want to rely on the regex that this issue arises.
 

ConsKa

So as I do, I worked on this some more.

I thought maybe it was this failure:

C#:
if (item.Contains(entry.Key))

Changed the if statement by doing the following:

C#:
Regex foundWords = new Regex(@"\bor\b");

Match ted = foundWords.Match(item);
if (ted.Success)

This matches no problem, and enters into the loop.

However, the dictionary still says Key not found. Despite my entering the key exactly as I entered above for foundWords.
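
A likely cause of the "Key not found" error, offered as a sketch rather than a confirmed diagnosis of the full program: in the original call, the m passed to m => dict[m.Value] carries the text that was matched ("or"), not the pattern that matched it (\bor\b), so the lookup uses a key the dictionary does not contain. One way around this is to key the dictionary by the plain word and build the pattern from the keys. The class and method names below are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordReplacer
{
    // Hypothetical helper: the dictionary keys are plain words, not regex patterns.
    public static string ReplaceWords(string input, IReadOnlyDictionary<string, string> words)
    {
        // Nothing to replace; also avoids building an empty alternation.
        if (words.Count == 0) return input;

        // Escape each word, join with |, and wrap the whole group in word boundaries.
        var pattern = @"\b(?:" + string.Join("|", words.Keys.Select(Regex.Escape)) + @")\b";

        // m.Value is the matched word itself, which is exactly a dictionary key here.
        return Regex.Replace(input, pattern, m => words[m.Value]);
    }
}
```

Called with a dictionary containing "or" => "OR", ReplaceWords("daore or daore", dict) changes only the standalone word. The sketch assumes case-sensitive matching, which is the default for both Regex and Dictionary<string, string>.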
 

ConsKa

This:

C#:
Regex foundWords = new Regex(@"\bor\b");

Comes up colour coded, the \b is in bright pink, to indicate that it is a Regex I suspect, a bit of IntelliSense at work and it works.

But this:

C#:
dict.Add("\bor\b", "OR");

Is what is required for the dictionary key to recognise the regex. Why no @ sign? Why an @ sign sometimes, but not other times?

"string pattern = @"\w+ # Matches all the characters in a word.";"

From here:

 

jmcilhinney

C# Forum Moderator
Staff member
Joined
Apr 23, 2011
Messages
3,921
Location
Sydney, Australia
Programming Experience
10+
The @ preceding a string literal indicates that it is a verbatim string literal. In a verbatim string literal, the backslash (\) character is treated as a literal character, rather than as an escape character. How you want backslashes treated is what decides whether you use the @ symbol or not.

In the first code snippet in post #3, you want the backslash characters treated literally because it is the Regex itself that will turn them into escape characters when it parses the pattern provided. In the second code snippet, without the @, the compiler treats each \b as an escape sequence, so the string contains backspace characters (U+0008) rather than a backslash followed by b; that is only what you want if the string is actually meant to contain backspaces.

@"\bor\b" is equivalent to "\\bor\\b" and that's what we had to write before verbatim string literals were a thing. It makes code harder to read, which is why the new feature was introduced.
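
To make the difference concrete, here is a small sketch comparing the three spellings. LiteralDemo and its members are hypothetical names used only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // Regular literal: the compiler turns each \b into a backspace character (U+0008).
    public const string Regular = "\bor\b";   // 4 characters: backspace, o, r, backspace

    // Verbatim literal: backslashes are kept, so Regex sees the \b word-boundary token.
    public const string Verbatim = @"\bor\b"; // 6 characters: \, b, o, r, \, b

    // Doubling the backslashes in a regular literal produces the same six characters.
    public const string Escaped = "\\bor\\b";

    // Tests a pattern against the sample phrase used in this thread.
    public static bool Matches(string pattern) => Regex.IsMatch("dave or dave", pattern);
}
```

Here Verbatim and Escaped are the same string and match "dave or dave", whereas Regular asks Regex to look for literal backspace characters around "or" and so finds nothing.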
 

ConsKa

Thanks JMC, but I was aware of what the @ and \\ do for string literals.

My issue is that when you read on this stuff, the @ is usually included in the regex.

I have done a little testing and other regex appear to show an unrecognised character - if you do not include the @ or \\ before a \ for example:

"\-|\,"

Will throw an IntelliSense problem, and you need to @ this regex.

I am wondering whether \b is recognised by C# - whereas other regex are not. Now, the articles I am finding of people using this are like 9 years old, and they are saying No, it is not and you need to @ or \\ the regex \b - but things change.

I am going to put my code below, because I do not understand why it simply isn't working.

There are two problems:

1. It does not recognise "dave or dave" as containing a Key.

2. On the odd occasion when I get it to recognise that "dave or dave" has a key - by using different code, it doesn't change to the value and the output is "dave or dave"

C#:
Dictionary<string, string> dict = new Dictionary<string, string>();
dict.Add("\bor\b", "OR"); // I have tried @"\bor\b" and @"\\bor\\b" none of them work

string[] test = strInput.Text.Split('\r', '\n');

// the previous line creates entries of "" this removes them
test = test.Where(x => !string.IsNullOrEmpty(x.Trim())).ToArray();

foreach (var result in test) // result is dave or dave
{
    foreach (var entry in dict) // the entry appears as {[or, OR]}
    {
        if (result.Contains(entry.Key)) // entry.Key view = \bor\b in text visualiser or in html visualiser or
        {
            var outPut = Regex.Replace(result, entry.Key, entry.Value);
            strOut.Text += outPut + Environment.NewLine;
        }
    }
}

I can't even get this to enter the loop anymore. I do not know why, I am assuming that this: or is simply the way that VS visualises the test whitespace or whitespace - which is the regex expression.

Any help here? As I thought my understanding of the regex and of the test above was correct, and break pointing through it appears to show me exactly what I would expect to see. There is no error, it simply skips it as not containing the Key.

I have noticed that the code on the page has changed what I typed slightly.

"in text visualiser or"

The or here has a box on either side that I do not appear to be able to replicate.
 

JohnH

C# Forum Moderator
Staff member
Joined
Apr 23, 2011
Messages
1,197
Location
Norway
Programming Experience
10+
Keep it simple when trying to figure things out:
C#:
var input = "dave or dave";
var pattern = @"\bor\b";
var replacement = "OR";
var output = Regex.Replace(input, pattern, replacement);
 

ConsKa

Yep that works, as I kind of expected it to.

What I don't understand is why putting that into a dictionary, so the pattern is the key, and the replacement is the value doesn't work.

Even simplifying it right down:

C#:
Dictionary<string, string> dict = new Dictionary<string, string>();
dict.Add(@"\bor\b", "OR"); // doing ("\\bor\\b", "OR") or ("\bor\b", "OR") makes no difference here

var output = Regex.Replace(input, dict.Keys.ToString(), dict.Values.ToString());

Break points show the key as \\bor\\b and the value as OR.

Output is dave or dave.

All the reading I have done seems to suggest the above should work.
 

JohnH

What does dict.Keys.ToString() return?

By the way, actually using an entry Key/Value produces same result as in post 6. I would say 'of course', because a string is a string, and the argument to the Replace function is just a string.
C#:
var entry = dict.ElementAt(0);
var output = Regex.Replace(input, entry.Key, entry.Value);
 

ConsKa

This works:

C#:
foreach (var d in dict)
{
    var matches = Regex.Matches(input, d.Key);

    foreach (Match match in matches)
    {
        var output = Regex.Replace(input, match.Value, d.Value);
        strOut.Text = output;
    }
}

Which seems to suggest that Match can match a Regex in a dictionary, but Replace cannot match a Regex when in a dictionary?

That doesn't seem right, as it isn't like I just decided to do this in a dictionary, I did a lot of reading on people doing it in dictionaries.
 

ConsKa

JohnH said:
What does dict.Keys.ToString() return?

By the way, actually using an entry Key/Value produces same result as in post 6. I would say 'of course', because a string is a string, and the argument to the Replace function is just a string.
C#:
var entry = dict.ElementAt(0);
var output = Regex.Replace(input, entry.Key, entry.Value);

Break points show the dict.Keys.ToString() as \\bor\\b

Seems to show that regardless of how you enter into the dictionary (@, \, \\)

The output is just dave or dave - as it doesn't recognise the or in the dict.Key part of the function, so doesn't do anything.

If I wrap it in a If statement - dave or dave contains dict.Key - then it just skips it as being false.
 

Skydiver

Staff member
Joined
Apr 6, 2019
Messages
3,386
Location
Chesapeake, VA
Programming Experience
10+
ConsKa said:
Even simplifying it right down:

C#:
Dictionary<string, string> dict = new Dictionary<string, string>();
dict.Add(@"\bor\b", "OR"); // doing ("\\bor\\b", "OR") or ("\bor\b", "OR") makes no difference here

var output = Regex.Replace(input, dict.Keys.ToString(), dict.Values.ToString());

Break points show the key as \\bor\\b and the value as OR.

You're inspecting the wrong thing in your breakpoint. You may be inspecting dict, but recall that what you are passing into the Replace() call is dict.Keys.ToString(). Let's go see what dict.Keys.ToString() returns:
Test code:
C#:
var dict = new Dictionary<string, string>();
dict.Add(@"\bor\b", "OR");
Console.WriteLine(dict.Keys.ToString());

Output:
Code:
System.Collections.Generic.Dictionary`2+KeyCollection[System.String,System.String]

So how are you expecting "dave or dave" to match "System.Collections.Generic.Dictionary`2+KeyCollection[System.String,System.String]" ?
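
For comparison, here is a short sketch of ways to get at the key strings themselves. Dictionary.KeyCollection does not override ToString(), which is why the type name comes back; the KeysDemo class is a hypothetical name for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeysDemo
{
    // Builds the single-entry dictionary used in this thread.
    public static Dictionary<string, string> Build() =>
        new Dictionary<string, string> { { @"\bor\b", "OR" } };

    // The first key as a string: this is an actual regex pattern.
    public static string FirstKey(Dictionary<string, string> dict) => dict.Keys.First();

    // All keys combined into one alternation pattern.
    public static string JoinedKeys(Dictionary<string, string> dict) => string.Join("|", dict.Keys);
}
```

Either result is a usable pattern string for Regex.Replace, whereas dict.Keys.ToString() only describes the collection's type.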
 

JohnH

ConsKa said:
Which seems to suggest that Match can match a Regex in a dictionary, but Replace cannot match a Regex when in a dictionary?

No, the problem in post 5 is due to this:
if (result.Contains(entry.Key))
Key is \bor\b and result does not contain that.

Dictionary has nothing to do with this, the dictionary just stores the strings.

ConsKa said:
Break points show the dict.Keys.ToString() as \\bor\\b

Then you're not looking in the right place.
Immediate Window said:
?dict.Keys.ToString()
"System.Collections.Generic.Dictionary`2+KeyCollection[System.String,System.String]"
Anyway, checking with Contains or Matches first is pointless, because if the regex doesn't match, Replace simply returns the input unchanged.
Assigning to strOut.Text inside the loop is also not a good idea; only the result of the last replace is shown.
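
A minimal sketch of the point about Contains, using a hypothetical ContainsDemo class: string.Contains looks for the pattern's characters literally, Regex.IsMatch interprets them as a pattern, and Regex.Replace needs no guard at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContainsDemo
{
    private const string Pattern = @"\bor\b";

    // Literal substring search: looks for the six characters \bor\b in the input.
    public static bool LiteralSearch(string input) => input.Contains(Pattern);

    // Regex search: interprets \b as a word boundary.
    public static bool RegexSearch(string input) => Regex.IsMatch(input, Pattern);

    // No guard needed: when nothing matches, the input comes back unchanged.
    public static string Replace(string input) => Regex.Replace(input, Pattern, "OR");
}
```

For "dave or dave", LiteralSearch is false while RegexSearch is true, and Replace("dave and dave") returns "dave and dave" without throwing.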
 

JohnH

If the regex expressions don't depend on line boundaries, you can do this to replace all matching expressions in the text:
C#:
var input = InputTextbox.Text;
foreach (var entry in dict)
    input = Regex.Replace(input, entry.Key, entry.Value);
OutputTextbox.Text = input;
 
Solution

ConsKa

Understood, but look back at the original code I wrote.

It doesn't have ToString().

I added that to get it to run as IntelliSense was saying you couldn't have dict.Value in the code that was developed without a ToString();

The reason it is wrapped in an IF statement is that an exception pops up if the Key is not found in the word (a "key doesn't exist" error), though I accept this is likely due to the way in which the loops were being written.

Lastly, yes, if you look at the original code it was += I didn't bother doing the += when we simplified it down to a single entry.

The problem with not using \b is:

daore or daore = daORe OR daORe

Which I want to hopefully avoid.

Honestly, I do not understand why this doesn't work:

C#:
foreach (var result in test)
{
    foreach (var entry in dict)
    {
        if (result.Contains(entry.Key))
        {
            var outPut = Regex.Replace(result, entry.Key, entry.Value.ToUpper());
            strOut.Text += outPut + Environment.NewLine;
        }
    }
}

Everything I have read says that should work. I added ToUpper(), just in case it was replacing or with or and being case insensitive. It makes no difference, as it simply isn't finding a match.
 

Skydiver

Well, that's because result has the value "ConsKa or Conske", but on line 5 you are checking whether it contains "\\bor\\b" (that is, the literal characters \bor\b). Since Contains() is going to return false, lines 7-8 will not execute.
 

ConsKa

So "Contains" cannot match a Regex?

Could I then, reverse that so that

C#:
if (result.Contains(entry.Value.ToLower()))

So that it is checking the value, as the only change here is capitalisation - I can then send it into the Regex.Replace that will find the regex.

Given it is the same word.

Yep, that gets me into the loop and annoyingly....I have to do @"\bor\b" in the dictionary for it to work.....

man, what an absolutely mind bending treat this was...
 

JohnH

ConsKa said:
So "Contains" cannot match a Regex?

No, that is a string function.
And as I said, you don't need a Contains or a Match either.
 

ConsKa

JohnH said:
No, that is a string function.
And as I said, neither do you need a Contains or Match.

I think I do because otherwise wouldn't my output be full of the same string untouched for every other regex test that I am doing in the Dictionary which doesn't apply?

So if I do 4 Regex tests through the foreach var entry in dict...I would have:

dave or dave
dave OR dave
dave or dave
dave or dave

As my output? One would be the one the Regex replace acted on, the other 3 would be ones it just passed through.

The match/contains test, only puts the string through the regex when it matches the dictionary regex and therefore only adds it to the output, once it has been changed?

I can add an else statement to pass a single untouched string if no Regex applies.

Unless you had another thought on this?
 

JohnH

Look at post 13, text is processed through all regex expressions, and finally shown in UI control.
 

ConsKa

The input text is an array, so I need to foreach item in that array, test it against the Regex, and test it against each item in the Regex Dictionary then create an output.

This does it:

C#:
foreach (var result in test)
{
    foreach (var item in dict)
    {
        if (Regex.IsMatch(result, item.Key))
        {
            outPut = Regex.Replace(result, item.Key, item.Value);
        }
    }
    strOut.Text += outPut + Environment.NewLine;
}

Creates a 4 string output, which is the 4 strings that were input that were changed. This doesn't help me though when I add the 5th string which doesn't need changing, but would still like to keep in the list. Will think on that.

Just swapping out the Contains for the proper Regex.IsMatch gets me where I think I need to be.

This is tested against a 4 string array....I have to test it against a 15,000 string array and see what type of performance hit it takes to do all these loops.

So if you have a better way? I am here to learn.
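
One possible shape for this, following the approach from post 13 but applied per line: run every line through every pattern without any IsMatch check, so unchanged lines pass through as they are, and build the output once instead of appending to the textbox on each pass. This is only a sketch; LineProcessor and ProcessLines are hypothetical names, and it has not been measured against a 15,000-line input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class LineProcessor
{
    // Hypothetical helper: applies every pattern/replacement pair to every line.
    public static string ProcessLines(IEnumerable<string> lines,
                                      IReadOnlyDictionary<string, string> patterns)
    {
        // Construct each Regex once up front rather than once per line.
        var rules = patterns
            .Select(p => (Pattern: new Regex(p.Key), Replacement: p.Value))
            .ToList();

        var sb = new StringBuilder();
        foreach (var line in lines)
        {
            var current = line;

            // Feed the result of one replacement into the next; a pattern that
            // does not match leaves the line unchanged.
            foreach (var rule in rules)
                current = rule.Pattern.Replace(current, rule.Replacement);

            // Exactly one output line per input line, changed or not.
            sb.AppendLine(current);
        }
        return sb.ToString();
    }
}
```

The result could then be assigned in a single step, for example strOut.Text = LineProcessor.ProcessLines(test, dict);, which avoids repeated string concatenation on the control.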
 