Answered: Scanning DataTable - a performant approach

etl2016

Active member
Joined
Jun 29, 2016
Messages
39
Programming Experience
3-5
hi,

Is DataTable a good fit for large-scale data processing, on the order of a hundred thousand rows?

I have a scenario, prototyped below, that effectively has to achieve two requirements, illustrated in the if-else.

Scenario:

There is a DataTable, and a one-dimensional list/array of the same count. The values held in a particular column of the DataTable are to be replaced with the corresponding positional values from the list. Where nulls are found, the replacement values need on-the-fly computation, and the computed results are also collected in a new list of {key, value} pairs, parked aside.

Scanning the DataTable row by row is proving to be very slow. Is DataTable a good fit for such purposes? If yes, is there a more performant way to achieve the same objective? If not, could you please advise which alternative .NET programming features would do the trick here? Thank you.

C#:
using System;
using System.Data;
using System.Collections.Generic;

namespace UpdateDataTable
{
    class Program
    {
        
        static void Main(string[] args)
        {

            DataTable table = GetTable();
            string[] CarNames = { "Volvo", "Tesla", null, null, "Ford" };
            DataTable returnedTable = UpdateTable(table, CarNames);

            for (int i = 0; i < returnedTable.Rows.Count; i++)
            {
                DataRow row = returnedTable.Rows[i];

                Console.WriteLine(row["Car"]);

            }
        }

        static DataTable GetTable()
        {
            DataTable table = new DataTable();

            table.Columns.Add("ID",   typeof(int));
            table.Columns.Add("Car",  typeof(string));
            table.Columns.Add("Name", typeof(string));
            table.Columns.Add("Date", typeof(DateTime));

            table.Rows.Add(25,  "Car A", "A", DateTime.Now);
            table.Rows.Add(50,  "Car B", "B", DateTime.Now);
            table.Rows.Add(10,  "Car C", "C", DateTime.Now);
            table.Rows.Add(21,  "Car D", "D", DateTime.Now);
            table.Rows.Add(100, "Car E", "E", DateTime.Now);
            
            return table;
        }
 
        static DataTable UpdateTable(DataTable table, string[] CarNames)
        {

            var list = new List<KeyValuePair<int, string>>();

            for (int i = 0; i < table.Rows.Count; i++)
            {
                DataRow row = table.Rows[i];
                if ( !string.IsNullOrEmpty (CarNames[i] ) )
                {
                    row["Car"] = CarNames[i];  //  Requirement-1 : to update datatable with non-nulls
                }
                else
                {
                    row["Car"] = "hello";      //Requirement-2:  Construct new value and pile them up in a list of pairs
                    list.Add(new KeyValuePair<int, string>(i, "hello"));
                }

            } // DataTable is updated row-by-row and is found very slow for large volumes

            return table;
        } // end of UpdateTable
     } // end of class
}
 
Welcome to O(log n) data access. The C# DataTable rows are stored as a tree. It's not an array, so accessing a row by index needs to walk the tree. I don't know if the enumerator for the DataRowCollection is more efficient about walking the tree and remembering where it last was.

And now I know, G.I. Joe! At least the enumerator is smart.
 
And it looks like the DataTable rows collection is a bit smarter now compared to the .NET Framework 1.1 days: it does a bunch of math to figure out where the node is by index instead of walking the tree the hard way.
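For illustration, here's a hypothetical order-statistic lookup (a sketch, not the actual framework source) showing the kind of math involved: if each node caches its subtree size, the row at a given index can be found with a single O(log n) descent rather than a full in-order walk:
C#:
using System;

// Hypothetical sketch: each node caches the size of its subtree, so the
// node at a given index is found by one descent instead of an in-order walk.
class Node
{
    public Node Left, Right;
    public int SubtreeSize = 1;   // this node plus all of its descendants
}

static class TreeIndex
{
    public static Node GetNodeByIndex(Node root, int index)
    {
        var node = root;
        while (node != null)
        {
            int leftSize = node.Left?.SubtreeSize ?? 0;
            if (index < leftSize)
            {
                node = node.Left;             // target lies in the left subtree
            }
            else if (index == leftSize)
            {
                return node;                  // this node is the index-th in order
            }
            else
            {
                index -= leftSize + 1;        // skip the left subtree and this node
                node = node.Right;
            }
        }
        throw new ArgumentOutOfRangeException(nameof(index));
    }
}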
 
And again, we would really ask you to post your real code rather than the dummy and pseudocode that you've been using consistently.

In past threads, I've questioned whether you really need a DataTable. My understanding is that you are going from CSV into multiple DataTables, and then from DataTables into Redis. Do you do anything with the DataTables afterwards? Or is what's really important that list from line 47 that you seem to be just discarding?

As I've also alluded to in past threads, if the computation for new values depends only on the current row being processed, then there is really no need for the DataTable: each row from the CSV can be processed in a pipeline and be done with. Since you started exploring TPL Dataflow, you know that you can parallelize parts of that pipeline.
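As a rough sketch of that shape (the block names and the ComputeValue placeholder are illustrative, not your real code, and it needs the System.Threading.Tasks.Dataflow package):
C#:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class PipelineSketch
{
    // Placeholder for the on-the-fly computation of missing values.
    static string ComputeValue(string raw) => raw ?? "hello";

    static async Task ProcessAsync(IEnumerable<string> csvLines)
    {
        var computed = new ConcurrentBag<KeyValuePair<int, string>>();

        // Fan the per-row computation out across cores.
        var compute = new TransformBlock<(int Index, string Line), (int Index, string Value)>(
            row => (row.Index, ComputeValue(row.Line)),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        // Collect the results; the concurrent bag tolerates parallel producers upstream.
        var collect = new ActionBlock<(int Index, string Value)>(
            row => computed.Add(new KeyValuePair<int, string>(row.Index, row.Value)));

        compute.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        int index = 0;
        foreach (var line in csvLines)
            compute.Post((index++, line));

        compute.Complete();
        await collect.Completion;
    }
}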
 
Thank you. Yes, these individual DataTables will be merged before being written to the filesystem. The list from line 47 will be used as input to a library async method that updates Redis in a performant batched fashion, carrying out hundreds of thousands of Redis calls in a fraction of a minute. Each row of the CSV, if processed in a pipeline, has proven very slow, as it has to use individual Get/Set library calls, each of which takes a quarter to half a second round trip. The response times offered by the intermediate library methods play a key role in the overall approach. I will explore the TPL Dataflow parallelization option you mentioned, thank you.
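(For reference, assuming a client like StackExchange.Redis — the actual library isn't named in the thread — a batched update of that list might look like the sketch below, where an IBatch queues the commands so they are sent together rather than costing one round trip each; the key format is made up:)
C#:
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

class RedisBatchSketch
{
    // Queue every Set on an IBatch so the commands are sent together,
    // instead of paying a network round trip per call.
    static async Task UpdateAsync(IDatabase db, IReadOnlyList<KeyValuePair<int, string>> pairs)
    {
        IBatch batch = db.CreateBatch();
        var pending = pairs
            .Select(p => batch.StringSetAsync($"car:{p.Key}", p.Value))  // hypothetical key format
            .ToList();                   // ToList() forces all commands onto the batch now
        batch.Execute();                 // release the queued commands to the server
        await Task.WhenAll(pending);     // observe results and surface any failures
    }
}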
 
Each row of the CSV, if processed in a pipeline, has proven very slow, as it has to use individual Get/Set library calls, each of which takes a quarter to half a second round trip.
Huh? You could still process each row of the CSV and put the new values into a list, like you are doing with the DataTable.
 
Yes, these individual DataTables will be merged before being written to the filesystem.
Merged, or the data table rows merely concatenated? If merged, how exactly are things merged together? How was it determined which rows of the CSV went to which particular DataTable? Or was the partitioning of the CSV into DataTables just a function of the order the rows were read from the CSV?
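To make the distinction concrete, here's a small sketch: DataTable.Merge matches rows up by primary key and combines them, whereas concatenation just appends everything:
C#:
using System;
using System.Data;

class MergeVsConcat
{
    static DataTable MakeTable()
    {
        var table = new DataTable();
        var id = table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Car", typeof(string));
        table.PrimaryKey = new[] { id };
        return table;
    }

    static void Main()
    {
        var a = MakeTable();
        a.Rows.Add(1, "Volvo");

        var b = MakeTable();
        b.Rows.Add(1, "Tesla");   // same key as a row already in 'a'
        b.Rows.Add(2, "Ford");

        // Merge: rows with matching primary keys are combined, so 'a'
        // ends up with two rows and Id 1 now reads "Tesla".
        a.Merge(b);
        Console.WriteLine(a.Rows.Count);   // 2

        // Concatenation: every row is appended regardless of keys.
        var c = new DataTable();           // no primary key, duplicates allowed
        c.Columns.Add("Id", typeof(int));
        c.Columns.Add("Car", typeof(string));
        foreach (DataRow row in a.Rows) c.Rows.Add(row.ItemArray);
        foreach (DataRow row in b.Rows) c.Rows.Add(row.ItemArray);
        Console.WriteLine(c.Rows.Count);   // 4
    }
}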
 
I couldn't sleep, so here are the results I have while running on battery on my laptop for one million rows:
C#:
DataTable index: 00:00:04.8185246
DataTable foreach: 00:00:04.6474218
RowByRow: 00:00:00.9127712

The first two use a DataTable. They read from a "csv" and populate the data table, and both iterate over it to do the processing: the first one indexes into the data table, while the second one uses foreach to get the enumerator. The last one reads a row from the "csv", does the processing, and then appends the result to a records list. All of them also create and populate a list of "created" key-value pairs.

Code used to generate those timings:
C#:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;

struct CarRecord
{
    public int Id;
    public string Car;
    public string Name;
    public DateTime Date;
}

class Csv
{
    static DateTime _baseDate = DateTime.Now;

    int _count = 0;

    public IEnumerable<CarRecord> GetRecords()
    {
        while (true)
        {
            _count++;
            yield return new CarRecord()
            {
                Id = _count,
                Car = $"Car {_count:D8}",
                Name = $"{_count:D8}",
                Date = _baseDate.AddMilliseconds(_count)
            };
        }
    }
}

class NameGenerator
{
    static string[] _names =
    {
        "Volvo",
        "Tesla",
        "Ford",
    };

    Random _random = new Random(123456);

    public IEnumerable<string> GetNames()
    {
        while (true)
        {
            var index = _random.Next(_names.Length + 1);
            yield return index < _names.Length ? _names[index] : null;
        }
    }
}

class Program
{
    IEnumerable UseDataTableIndex(int rowCount, Csv csv, NameGenerator generator)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Car", typeof(string));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Date", typeof(DateTime));

        int count = 0;
        foreach (var record in csv.GetRecords())
        {
            table.Rows.Add(record.Id, record.Car, record.Name, record.Date);
            count++;
            if (count >= rowCount)
                break;
        }

        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        for(int index = 0; index < rowCount; index++)
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }
            var row = table.Rows[index];
            row["Name"] = name;
        }

        return table.Rows;
    }

    IEnumerable UseDataTableForEach(int rowCount, Csv csv, NameGenerator generator)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Car", typeof(string));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Date", typeof(DateTime));

        int count = 0;
        foreach(var record in csv.GetRecords())
        {
            table.Rows.Add(record.Id, record.Car, record.Name, record.Date);
            count++;
            if (count >= rowCount)
                break;
        }

        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        int index = 0;
        foreach (DataRow row in table.Rows)
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }
            row["Name"] = name;

            index++;
        }

        return table.Rows;
    }

    IEnumerable UseRowByRow(int rowCount, Csv csv, NameGenerator generator)
    {
        var records = new List<CarRecord>(rowCount);
        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        int index = 0;
        foreach (var record in csv.GetRecords())
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }

            records.Add(new CarRecord() { Id = record.Id, Car = record.Car, Name = name, Date = record.Date });

            index++;
            if (index >= rowCount)
                break;
        }

        return records;
    }

    const int RowCount = 1000000;

    void TimeIt(string caption, Func<int, Csv, NameGenerator, IEnumerable> func)
    {
        Console.Write($"{caption}: ");
        var stopwatch = new Stopwatch();

        var csv = new Csv();
        var generator = new NameGenerator();
        stopwatch.Start();
        func(RowCount, csv, generator);
        stopwatch.Stop();
        Console.WriteLine($"{stopwatch.Elapsed}");
    }

    void Run()
    {
        TimeIt("DataTable index", UseDataTableIndex);
        TimeIt("DataTable foreach", UseDataTableForEach);
        TimeIt("RowByRow", UseRowByRow);
    }

    static void Main()
    {
        new Program().Run();
    }
}
 
I was curious about how much of that 4 seconds for the first two is time used to load up the DataTable and how much was actually the cost of iterating and processing, so I got these results for processing:
C#:
DataTable index: 00:00:01.1930239
DataTable foreach: 00:00:00.5725378
RowByRow: 00:00:00.8157491

Note that this is not an apples-to-apples comparison anymore. The first two get to fill up their data structures off the clock and are only timing processing, while the last one is doing its processing and filling up its data structure on the clock. The only thing that is off the clock for the last one is the initial construction of a list with room for a million items.

I used this code:
C#:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;

struct CarRecord
{
    public int Id;
    public string Car;
    public string Name;
    public DateTime Date;
}

class Csv
{
    static DateTime _baseDate = DateTime.Now;

    int _count = 0;

    public IEnumerable<CarRecord> GetRecords()
    {
        while (true)
        {
            _count++;
            yield return new CarRecord()
            {
                Id = _count,
                Car = $"Car {_count:D8}",
                Name = $"{_count:D8}",
                Date = _baseDate.AddMilliseconds(_count)
            };
        }
    }
}

class NameGenerator
{
    static string[] _names =
    {
        "Volvo",
        "Tesla",
        "Ford",
    };

    Random _random = new Random(123456);

    public IEnumerable<string> GetNames()
    {
        while (true)
        {
            var index = _random.Next(_names.Length + 1);
            yield return index < _names.Length ? _names[index] : null;
        }
    }
}

class Program
{
    DataTable PrepareDataTable(int rowCount, Csv csv)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Car", typeof(string));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Date", typeof(DateTime));

        int count = 0;
        foreach (var record in csv.GetRecords())
        {
            table.Rows.Add(record.Id, record.Car, record.Name, record.Date);
            count++;
            if (count >= rowCount)
                break;
        }
        return table;
    }

    DataTable UseDataTableIndex(DataTable table, int rowCount, Csv csv, NameGenerator generator)
    {
        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        for(int index = 0; index < rowCount; index++)
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }
            var row = table.Rows[index];
            row["Name"] = name;
        }

        return table;
    }

    DataTable UseDataTableForEach(DataTable table, int rowCount, Csv csv, NameGenerator generator)
    {
        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        int index = 0;
        foreach (DataRow row in table.Rows)
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }
            row["Name"] = name;

            index++;
        }

        return table;
    }

    List<CarRecord> UseRowByRow(List<CarRecord> records, int rowCount, Csv csv, NameGenerator generator)
    {
        var list = new List<KeyValuePair<int, string>>();
        var enumerator = generator.GetNames().GetEnumerator();
        int index = 0;
        foreach (var record in csv.GetRecords())
        {
            if (!enumerator.MoveNext())
                break;

            var name = enumerator.Current;
            if (name == null)
            {
                name = "hello";
                list.Add(new KeyValuePair<int, string>(index, name));
            }

            records.Add(new CarRecord() { Id = record.Id, Car = record.Car, Name = name, Date = record.Date });

            index++;
            if (index >= rowCount)
                break;
        }

        return records;
    }

    const int RowCount = 1000000;

    void TimeIt<T>(string caption, Func<int, Csv, T> prepare, Func<T, int, Csv, NameGenerator, T> process)
    {
        Console.Write($"{caption}: ");
        var stopwatch = new Stopwatch();

        var csv = new Csv();
        var generator = new NameGenerator();

        T data = prepare(RowCount, csv);

        stopwatch.Start();
        process(data, RowCount, csv, generator);
        stopwatch.Stop();
        Console.WriteLine($"{stopwatch.Elapsed}");
    }

    void Run()
    {
        TimeIt("DataTable index", PrepareDataTable, UseDataTableIndex);
        TimeIt("DataTable foreach", PrepareDataTable, UseDataTableForEach);
        TimeIt("RowByRow", (count, csv) => new List<CarRecord>(count), UseRowByRow);
    }

    static void Main()
    {
        new Program().Run();
    }
}
 
I will point out that the test code I had in posts #8 and #9 is already handling 1,000,000 rows. You said in your other threads that you are only handling 160,000 (e.g. 16 DataTables of 10,000 rows each). Notice that even with 1,000,000 rows, it took only a little over 1 second using indexing. I don't know why you are saying that accessing each row of the DataTable is slow.
 
Many thanks, Skydiver. Sorry for the delay in following up; I was trying to redesign and replace the DataTable approach with Lists etc., the performance of which was no better. Digging further, I could localise the bottleneck to a user library method that is used to construct the new value at line 90 in post #8. To examine the impact, defaulting this step resulted in processing a few hundred thousand rows in under a minute, confirming it as the time-consumer (up to half a second turnaround per call). So the DataTable approach in itself is not causing delays, nor is iterating through it. Thank you once again.
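(For anyone following along: isolating a suspect call like that can be as simple as timing it with a Stopwatch and comparing against a stubbed-out run. ConstructNewValue below is a stand-in for the user library method, which isn't shown in the thread:)
C#:
using System;
using System.Diagnostics;

class BottleneckProbe
{
    // Stand-in for the user library call suspected of ~0.5 s per-call turnaround.
    static string ConstructNewValue(int key) => $"value-{key}";

    static void Main()
    {
        const int rows = 1000;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < rows; i++)
        {
            // Swap this for a constant (the "defaulted" step) to see how much
            // of the total time the library call is responsible for.
            _ = ConstructNewValue(i);
        }
        stopwatch.Stop();
        Console.WriteLine($"{rows} calls: {stopwatch.Elapsed} " +
                          $"({stopwatch.Elapsed.TotalMilliseconds / rows:F3} ms per call)");
    }
}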
 