Answered Regex Split

etl2016

Hi,

I am trying to split a line into an array of strings while honouring two things: first, certain columns are text-qualified with multi-character boundaries; second, the delimiter itself may be multi-character. Things get complicated when the qualifier and the delimiter share characters, and metacharacters such as $ and ^ add further difficulty. Regex seems best suited to this. The implementation below works for most cases, but breaks when metacharacters are chosen as the text-qualifier and/or the delimiter.

C#:
using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter,
            string qualifier, bool ignoreCase)
{
    string _Statement = String.Format
        ("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

The above works for the majority of scenarios, but not where metacharacters like $ are involved, especially as part of the text-qualifier. It looks like a particular way of escaping is needed.

C#:
string input = "*|This is an ..  example*|..Am2..Cool!";
string input2 = "*|This is an $  example*|$Am2$Cool!";
string input3 = "$|This is an $  example$|$Am2$Cool!";
string input4 = "|$This is an $  example|$$Am2$Cool!";

foreach (string _Part in Split(input, "..", "*|", true))
    Console.WriteLine(_Part);

foreach (string _Part in Split(input2, "$", "*|", true))
    Console.WriteLine(_Part);

foreach (string _Part in Split(input3, "$", "$|", true)) // doesn't work correctly
    Console.WriteLine(_Part);

foreach (string _Part in Split(input4, "$", "|$", true)) // doesn't work correctly
    Console.WriteLine(_Part);

Could you please let me know how to handle all situations, including the ones where metacharacters are part of the text-qualifier and/or the delimiter?

Thank you
 
You know that it would help dramatically if you showed us your expected output, and the current output that you are getting.

Also, is this really your data, or is this going to be like your other threads in the past, where you suddenly change the data or the requirements because you are being so oblique about the way you ask your question?
 
It may also help if you explain what each part of the regex on line 7 is trying to achieve.
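
As a starting point for that explanation, here is a minimal sketch that only prints the pattern the Split method from the first post builds for the input3 case (delimiter $, qualifier $|). The class and method names PatternDemo and BuildPattern are made up for illustration; the format string is copied from the first post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Hypothetical helper: rebuilds the pattern string the same way
    // the Split method in the first post does.
    public static string BuildPattern(string delimiter, string qualifier)
    {
        return String.Format(
            "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter), Regex.Escape(qualifier));
    }

    public static void Main()
    {
        // Delimiter "$" with qualifier "$|" (the input3 case).
        Console.WriteLine(BuildPattern("$", "$|"));
        // prints: \$(?=(?:[^\$\|]*\$\|[^\$\|]*\$\|)*(?![^\$\|]*\$\|))
    }
}
```

Note that [^\$\|] is a character class: it excludes $ and | as individual characters, not the two-character sequence $|. That is probably why the pattern misbehaves once the delimiter character also appears in the qualifier, while a qualifier like *| that shares nothing with the data or the delimiter happens to work.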
 
Thank you for the reply.

The program is expected to be versatile enough to handle any incoming feed, and each feed is free to choose its own text-qualifier and delimiter. Some feeds may not have text-qualifiers at all. All these choices are passed to the program as parameters so that it can interpret the feed accordingly.

C#:
string input5 = "|$This is an $  example|$$Am2$Cool!|$$|";

In the above example, if the text-qualifier is |$ and the delimiter is $, then the expected array of strings is as follows:

a[0] is |$This is an $ example|$
a[1] is Am2
a[2] is Cool!
a[3] is null

The regex on line 7 of the first post is a sample found while researching; it tries to escape the qualifier and the delimiter. A non-regex approach is proving to take a lot more lines of code, while regex, though cryptic, seems to be the general choice for complex string processing and offers a variety of functions. Both options are being explored.

Thank you.
 
What differentiates a "text-qualifier" from a "delimiter" ?
 
If a delimiter appears inside a text-qualified field, it is not treated as a split point. For example, " good | morning"|40|50|60 produces " good | morning" as the first array element if | is the delimiter and " is the text-qualifier; 40, 50 and 60 are the next array elements. Thanks.
 
Are "text-qualifiers" always paired? Can there be an unpaired text qualifier like in:
C#:
"good|morning"|"40"|"50"|"60
Notice that the 60 has an opening double quote, but no closing double quote.

How are "text-qualifiers" escaped in your data? What if you really want that character sequence to be in the data, and not be treated as "text-qualifiers"?
 
Yes, text-qualifiers are always paired. Both lines below are valid representations.
C#:
"good|morning"|"40"|"50"|"60"
"good|morning"|40|50|60

Text-qualifiers are chosen by the upstream system, so its programs have to pick a text-qualifier that is not expected to appear in their data.

However, many thanks for raising this real-life possibility. As a more robust solution, a text-qualifier preceded by a backslash (\) may be treated as data rather than as a qualifier.

For instance, the line below would be valid in that case.

C#:
"hello \"$ good|morning"|40|50|60

In the above example, array[0] should be:
C#:
"hello \"$ good|morning"

The \ needs to be carried forward as-is because that is how the source sent it.
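
The rule above could be sketched roughly as follows. This is only an illustration under the assumption that a single backslash directly before the qualifier is the only escape form; the names EscapedQualifier and FindUnescaped are hypothetical, and an escaped backslash (\\) in front of a qualifier is deliberately not handled.

```csharp
using System;

public static class EscapedQualifier
{
    // Hypothetical helper: returns the index of the first occurrence of
    // 'qualifier' at or after 'start' that is not immediately preceded by
    // a backslash, or -1 if there is none.
    public static int FindUnescaped(string line, string qualifier, int start)
    {
        int index = line.IndexOf(qualifier, start, StringComparison.Ordinal);
        while (index > 0 && line[index - 1] == '\\')
        {
            // This occurrence is escaped, so keep looking further right.
            index = line.IndexOf(qualifier, index + 1, StringComparison.Ordinal);
        }
        return index;
    }

    public static void Main()
    {
        string q = "\"";
        string line = "\"hello \\\"$ good|morning\"|40|50|60";

        int open = FindUnescaped(line, q, 0);
        int close = FindUnescaped(line, q, open + q.Length);

        // The backslash is carried through unchanged.
        Console.WriteLine(line.Substring(open, close + q.Length - open));
        // prints: "hello \"$ good|morning"
    }
}
```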

thank you
 
Okay. Next silly question: Are you married to the idea of using Split(), or is using Match() or just plain IndexOf() sufficient, as long as in the end you end up with an array of strings that look like they have been split using the delimiter and protected by the "text-qualifiers"?

And a normal question: What happens if the chosen delimiter or "text-qualifier" is the "\"?
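
To make the IndexOf() option concrete, here is a minimal sketch of a scanner that needs no regex escaping at all. It assumes qualifiers are always paired, checks for the qualifier before the delimiter at each position (so shared characters resolve in favour of the qualifier), and does not deal with escaped qualifiers. QualifiedSplitter and StartsAt are hypothetical names, not anything from the original code.

```csharp
using System;
using System.Collections.Generic;

public static class QualifiedSplitter
{
    // True if 'token' occurs in 'text' exactly at 'index' (ordinal comparison).
    private static bool StartsAt(string text, int index, string token)
    {
        return index + token.Length <= text.Length
            && String.CompareOrdinal(text, index, token, 0, token.Length) == 0;
    }

    // Splits 'line' on 'delimiter'; text between a pair of 'qualifier'
    // sequences is skipped as a unit. Qualifiers are kept in the output.
    public static string[] Split(string line, string delimiter, string qualifier)
    {
        if (String.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", "delimiter");

        var fields = new List<string>();
        bool hasQualifier = !String.IsNullOrEmpty(qualifier);
        int fieldStart = 0;
        int pos = 0;

        while (pos < line.Length)
        {
            if (hasQualifier && StartsAt(line, pos, qualifier))
            {
                // Jump past the matching closing qualifier.
                int close = line.IndexOf(qualifier, pos + qualifier.Length,
                                         StringComparison.Ordinal);
                // Assumption: an unpaired qualifier swallows the rest of the line.
                pos = close < 0 ? line.Length : close + qualifier.Length;
            }
            else if (StartsAt(line, pos, delimiter))
            {
                fields.Add(line.Substring(fieldStart, pos - fieldStart));
                pos += delimiter.Length;
                fieldStart = pos;
            }
            else
            {
                pos++;
            }
        }

        fields.Add(line.Substring(fieldStart));
        return fields.ToArray();
    }

    public static void Main()
    {
        // input4 from the first post: qualifier "|$", delimiter "$".
        foreach (string part in Split("|$This is an $  example|$$Am2$Cool!", "$", "|$"))
            Console.WriteLine(part);
        // prints:
        // |$This is an $  example|$
        // Am2
        // Cool!
    }
}
```

Because the delimiter and qualifier are compared as plain strings, characters such as $, ^ or | have no special meaning here.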
 
Also another question, what happens to "text-qualifiers" that are not at the beginning and end of the data field:
C#:
Pete "Maverick" Mitchell|Lieutenant|Pilot
 
Thank you.

C#:
Pete "Maverick" Mitchell|Lieutenant|Pilot
/* invalid. Reason 1: assuming the quote is the text-qualifier, it appears in the middle of the data field and is not escaped. If it really had to be in the middle, it needed an escape character.
Reason 2: this column and the other string columns are not text-qualified.
*/
"Pete \"Maverick\" Mitchell"|"Lieutenant"|"Pilot"|40|50|60 // valid
"Pete \"Maverick\" Mitchell"|"Lieutenant"|"Pilot"|"40"|"50"|"60" // valid

On the other hand, \ is universally interpreted as an escape metacharacter, so using it as a delimiter or text-qualifier is rejected and pushed back to the source for review.
 
Hi, the regular expression approach from the first post is holding up across a wide range of scenarios. However, it is still being improved to handle an escaped text-qualifier appearing inside a data field (as in post #11).

Thank you
 
And hence my question in post #9 whether he is really married to the idea of using a regex. Maybe time to consider a divorce? :)

 