Answered Regex Split

etl2016

Hi,

I am trying to split a line into an array of strings, subject to two requirements. First, certain columns are text-qualified with multi-character boundaries. Second, the delimiter is also multi-character. Things get complicated when the qualifier and the delimiter share characters, and metacharacters such as $ and ^ add further difficulty. Regex seems best suited to this. The implementation below works for most cases, but breaks when metacharacters are used in the text qualifier and/or the delimiter.

C#:
using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter,
            string qualifier, bool ignoreCase)
{
    string _Statement = String.Format
        ("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

The above works for the majority of scenarios, but not where metacharacters like $ are involved, especially as part of the text qualifier. It looks like a particular way of escaping is needed.

C#:
string input = "*|This is an ..  example*|..Am2..Cool!";
string input2 = "*|This is an $  example*|$Am2$Cool!";
string input3 = "$|This is an $  example$|$Am2$Cool!";
string input4 = "|$This is an $  example|$$Am2$Cool!";

foreach (string _Part in Split(input, "..", "*|", true))
    Console.WriteLine(_Part);

foreach (string _Part in Split(input2, "$", "*|", true))
    Console.WriteLine(_Part);

foreach (string _Part in Split(input3, "$", "$|", true)) // doesn't work correctly
    Console.WriteLine(_Part);

foreach (string _Part in Split(input4, "$", "|$", true)) // doesn't work correctly
    Console.WriteLine(_Part);
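
A minimal sketch of what may be going wrong in the two failing cases: printing the pattern that String.Format builds for delimiter $ and qualifier $| shows where the escaped qualifier ends up. The class name PatternDemo is only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class PatternDemo
{
    static void Main()
    {
        string delimiter = "$";
        string qualifier = "$|";

        // Same format string as in Split above; only the inputs are fixed here.
        string pattern = String.Format(
            "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter), Regex.Escape(qualifier));

        // Prints: \$(?=(?:[^\$\|]*\$\|[^\$\|]*\$\|)*(?![^\$\|]*\$\|))
        Console.WriteLine(pattern);
    }
}
```

Regex.Escape does its job here, so escaping is probably not the real problem. The escaped qualifier is placed inside a character class, and [^\$\|] means "any single character other than $ or |", not "anything other than the sequence $|". In addition, the leading \$ can match the $ that belongs to the qualifier itself. That appears to be why the lookahead stops recognising qualifier boundaries reliably once the delimiter and qualifier share a character.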

Could you please let me know how to handle all of these situations, including the ones where metacharacters are part of the text qualifier and/or the delimiter?

Thank you
 
Well, here's my non-Regex approach:
LineParser.cs:
using System.Collections.Generic;
using System.Linq;

namespace LineParser
{
    public abstract class LineParser
    {
        public string Delimiter { get; }
        public string BeginQuote { get; }
        public string EndQuote { get; }
        public char Escape { get; }

        public LineParser(string delimiter, string beginQuote, string endQuote, char escape = '\\')
        {
            Delimiter = delimiter;
            BeginQuote = beginQuote;
            EndQuote = endQuote;
            Escape = escape;
        }

        public LineParser(string delimiter, string quote, char escape = '\\')
            : this(delimiter, quote, quote, escape)
        {
        }

        public abstract IEnumerable<string> Parse(string input);

        public virtual string[] Split(string input) => Parse(input).ToArray();
    }
}

StringLineParser.cs:
using System;
using System.Collections.Generic;
using System.IO;

namespace LineParser
{
    public class StringLineParser : LineParser
    {

        public StringLineParser(string delimiter, string beginQuote, string endQuote, char escape = '\\')
            : base(delimiter, beginQuote, endQuote, escape)
        {
        }

        public StringLineParser(string delimiter, string quote, char escape = '\\')
            : this(delimiter, quote, quote, escape)
        {
        }

        int IndexOfUnescaped(ReadOnlySpan<char> input, string value)
        {
            var left = 0;
            while (!input.IsEmpty)
            {
                int index = input.IndexOf(value);
                if (index < 0)
                    return index;

                if (index == 0 || input[index - 1] != Escape)
                    return left + index;

                left += index + value.Length;
                input = input.Slice(index + value.Length);
            }
            return -1;
        }

        int GetEndOfQuotedString(ReadOnlySpan<char> input)
        {
            int end = IndexOfUnescaped(input.Slice(BeginQuote.Length), EndQuote);
            if (end < 0)
                throw new InvalidDataException($"Unmatched {BeginQuote}");
            end += BeginQuote.Length + EndQuote.Length;

            if (end < input.Length && !input.Slice(end).StartsWith(Delimiter))
                throw new InvalidDataException($"Expected {Delimiter} after {EndQuote}");

            return end;
        }

        int GetEndOfField(ReadOnlySpan<char> input)
        {
            int end = IndexOfUnescaped(input, Delimiter);
            if (end < 0)
                end = input.Length;

            var field = input[0..end];
            ValidateNoQuote(field, BeginQuote);
            ValidateNoQuote(field, EndQuote);

            return end;

            void ValidateNoQuote(ReadOnlySpan<char> field, string quote)
            {
                if (field.IndexOf(quote) >= 0)
                    throw new InvalidDataException($"Unexpected {quote} in field data.");
            }
        }

        IEnumerable<string> GetTokens(string input)
        {
            var inputMem = input.AsMemory();

            while (!inputMem.IsEmpty)
            {
                int end = Delimiter.Length;
                var span = inputMem.Span;

                if (span.StartsWith(BeginQuote))
                    end = GetEndOfQuotedString(span);
                else if (!span.StartsWith(Delimiter))
                    end = GetEndOfField(span);

                yield return inputMem[0..end].ToString();
                inputMem = inputMem.Slice(end);
            }
        }

        public override IEnumerable<string> Parse(string input)
        {
            if (input == null)
                yield break;

            string lastValue = null;
            foreach(var token in GetTokens(input))
            {
                var value = token;
                if (value == Delimiter)
                {
                    yield return lastValue ?? "";
                    value = null;
                }
                lastValue = value;
            }
            yield return lastValue ?? "";
        }
    }
}

StringLineParserTests.cs:
using System;
using Xunit;
using LineParser;
using System.IO;

namespace LineParser.Tests
{
    public class StringLineParserTests
    {
        [Fact]
        public void HandlesNull()
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split(null);
            Assert.Empty(results);
        }

        [Theory]
        [InlineData("hello<world>")]
        [InlineData("<hello>world")]
        public void HandlesBadQuotedData(string input)
        {
            var parser = new StringLineParser("|", "<", ">");
            Assert.Throws<InvalidDataException>(() => parser.Split(input));
        }

        [Fact]
        public void HandlesEmptyString()
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split("");
            Assert.Collection(results, v => Assert.Equal("", v));
        }

        [Fact]
        public void HandlesSingleDelimiter()
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split("|");
            Assert.Collection(results,
                              v => Assert.Equal("", v),
                              v => Assert.Equal("", v));
        }

        [Fact]
        public void HandlesTwoDelimiters()
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split("||");
            Assert.Collection(results,
                              v => Assert.Equal("", v),
                              v => Assert.Equal("", v),
                              v => Assert.Equal("", v));
        }

        [Theory]
        [InlineData("abc", "abc")]
        [InlineData("<abc>", "<abc>")]
        [InlineData("<a|b>", "<a|b>")]
        [InlineData(@"<a\|b>", @"<a\|b>")]
        [InlineData("<a|b|c>", "<a|b|c>")]
        [InlineData(@"<a\|b\|c>", @"<a\|b\|c>")]
        public void HandlesSingleValue(string input, string a)
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split(input);
            Assert.Collection(results, v => Assert.Equal(a, v));
        }

        [Theory]
        [InlineData("abc|123", "abc", "123")]
        [InlineData("<abc>|123", "<abc>", "123")]
        [InlineData("<abc>|<123>", "<abc>", "<123>")]
        [InlineData("abc|<123>", "abc", "<123>")]
        [InlineData("<abc>|", "<abc>", "")]
        [InlineData("|<123>", "", "<123>")]
        public void HandlesTwoValues(string input, string a, string b)
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split(input);
            Assert.Collection(results,
                              v => Assert.Equal(a, v),
                              v => Assert.Equal(b, v));
        }

        [Fact]
        public void HandlesThreeValues()
        {
            var parser = new StringLineParser("|", "<", ">");
            var results = parser.Split("abc|123|ghi");
            Assert.Collection(results,
                              v => Assert.Equal("abc", v),
                              v => Assert.Equal("123", v),
                              v => Assert.Equal("ghi", v));
        }
    }
}

and some test output:
C#:
hello|world|28: [hello] [world] [28]
"hello"|world|28: ["hello"] [world] [28]
hello|"world|28": [hello] ["world|28"]
: []
|: [] []
world: [world]
"hello": ["hello"]
"hello\"test"|world|28: ["hello\"test"] [world] [28]
produced by:
Program.cs:
using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace LineParser
{
    class Program
    {
        static void DoIt(LineParser parser, string input)
        {
            Console.Write($"{input}: ");
            Console.WriteLine(string.Join(" ", parser.Parse(input).Select(s => $"[{s}]")));
        }


        static void Main(string[] args)
        {
            var parser = new StringLineParser("|", "\"");

            DoIt(parser, "hello|world|28");
            DoIt(parser, "\"hello\"|world|28");
            DoIt(parser, "hello|\"world|28\"");
            DoIt(parser, "");
            DoIt(parser, "|");
            DoIt(parser, "world");
            DoIt(parser, "\"hello\"");
            DoIt(parser, "\"hello\\\"test\"|world|28");
        }
    }
}
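
For the inputs from the original question, here is a minimal sketch. It assumes LineParser.cs and StringLineParser.cs above are in the same project. MetacharDemo and its namespace are illustrative names. A project can have only one entry point, so this would replace Program.Main or be moved into a test.

```csharp
using System;
using LineParser;

namespace LineParserDemo
{
    static class MetacharDemo
    {
        // Prints each field in brackets, like DoIt in Program.cs.
        static void Show(StringLineParser parser, string input)
        {
            foreach (string field in parser.Split(input))
                Console.Write($"[{field}] ");
            Console.WriteLine();
        }

        static void Main()
        {
            // Arguments are (delimiter, qualifier), as in the Split calls from the question.
            Show(new StringLineParser("..", "*|"), "*|This is an ..  example*|..Am2..Cool!");
            Show(new StringLineParser("$", "*|"), "*|This is an $  example*|$Am2$Cool!");
            Show(new StringLineParser("$", "$|"), "$|This is an $  example$|$Am2$Cool!");
            Show(new StringLineParser("$", "|$"), "|$This is an $  example|$$Am2$Cool!");
        }
    }
}
```

Because the parser compares strings literally instead of building a pattern, $ and | need no escaping. Hand-tracing the posted code suggests each input yields three fields, with the qualifiers kept on the first one, for example [$|This is an $  example$|] [Am2] [Cool!] for the third input.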
 
Thank you, Skydiver. I moved away from Regex and implemented a non-Regex solution, and it has tested OK in many scenarios with no issues noticed so far. Yes, non-Regex is more readable and easier to debug. Thanks for the above; your approach looks more robust than mine.
 
Hi Skydiver, thanks. Is the above implementation specific to a particular version of .NET? Under VS2017 I tried the code against both .NET Framework 4.6.1 and .NET Core 2.1, and each gave different compilation errors.

Under 4.6.1, the error was on ReadOnlySpan: "type or namespace could not be found".
Under Core 2.1, there were 3 errors in the method GetEndOfField:
1) The closing } on line 68 seems slightly displaced, which puts ValidateNoQuote inside GetEndOfField. I fixed this by realigning the closing }.
2) At var field = input[0..end], the error is: identifier expected.
3) At ValidateNoQuote(field, BeginQuote), the error is: cannot convert from char to ReadOnlySpan<char>. Casting to (ReadOnlySpan<char>) didn't fix it.

Please let me know whether any supporting NuGet packages or .NET settings are needed.

Many thanks once again
 
.NET Core 3.1
 
I see that StringLineParser line 57 uses the range syntax input[0..end], which is only supported in C# 8.0. That means .NET Core 3.x, which in turn requires VS 2019.
 
I think the code in post 16 could work with .NET Core 2.1 (in VS 2017) if the two occurrences of range syntax were replaced with .Slice(0, end). In addition, the .NET Core 2.1 compiler complains about the parameter name field of the local function ValidateNoQuote (line 63); renaming it, for example to field2, fixes that (it compiles with field in .NET Core 3.1).
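
The two changes can be sketched as follows. This is a standalone illustration, not the posted class. Compat is a hypothetical name, the delimiter and quotes are passed as parameters, and escape handling (IndexOfUnescaped in the posted code) is left out to keep it short.

```csharp
using System;
using System.IO;

static class Compat
{
    // Sketch of GetEndOfField without C# 8 features.
    // Assumption: plain IndexOf stands in for the posted IndexOfUnescaped.
    public static int GetEndOfField(ReadOnlySpan<char> input, string delimiter,
                                    string beginQuote, string endQuote)
    {
        int end = input.IndexOf(delimiter.AsSpan());
        if (end < 0)
            end = input.Length;

        // Slice(0, end) replaces the range expression input[0..end].
        var field = input.Slice(0, end);
        ValidateNoQuote(field, beginQuote);
        ValidateNoQuote(field, endQuote);

        return end;

        // The parameter is named field2 so it does not shadow the local "field".
        void ValidateNoQuote(ReadOnlySpan<char> field2, string quote)
        {
            if (field2.IndexOf(quote.AsSpan()) >= 0)
                throw new InvalidDataException($"Unexpected {quote} in field data.");
        }
    }

    static void Main()
    {
        // The first field "abc" ends at index 3, so this prints 3.
        Console.WriteLine(GetEndOfField("abc|123".AsSpan(), "|", "<", ">"));
    }
}
```

The same substitution would apply to inputMem[0..end] in GetTokens, which becomes inputMem.Slice(0, end).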
 
