Count of Duplicates

WebKill
New member · Joined: Mar 19, 2018 · Messages: 2 · Programming Experience: 10+
I have a process that parses a rather ugly text file (lots of them, actually) and populates records into a list of a custom class whose properties match the fields I will SqlBulkCopy into a table after converting the list to a DataTable. It works GREAT. However, after reviewing the data, it appears we have a new requirement to count duplicate records. These records indicate actions, and duplicates are OK, but each duplicate is a genuine duplicate, with nothing at all separating it from the others. So instead of listing the same record twice (or more), I want to indicate how many times it occurred. This will also let me set up a primary key on the table, as there will be no duplicates to contend with.

As I read the lines of the file, I stop at a footer record, populate a class instance with the data needed, and add it to the list. When the file is done processing, the completed list of class records is converted to a DataTable.

My question is this: is there a simple way to populate the "Count" field in my class with the number of times the record appears, either after the list is complete or after it is converted to a DataTable?

I have something like this:
List<MyRecord> records = new List<MyRecord>();

var read = File.ReadAllLines(FILE);
var lines = new List<string>(read);

foreach (string line in lines)
{
    if (line.Contains("ENDREC"))
    {
        records.Add(new MyRecord(Data1, Data2, Data3, 1)); // the 1 is the count, defaulting to 1
    }
}

DataTable table = ConvertToDataTable(records); //uses a function I wrote to convert the list to a datatable


public class MyRecord
{
    public string Data1 { get; set; }
    public string Data2 { get; set; }
    public string Data3 { get; set; }
    public int Count { get; set; }

    //constructor here
}
 
You can use a HashSet instead of a List to ensure that no duplicates are added to the collection. You can specify your own test for equality so as to use only the DataN properties. When a duplicate is rejected, you can get the existing item and increment its count. This code is untested but should give you an idea of how to implement those steps:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            var things = new HashSet<Thing>(new ThingEqualityComparer());

            foreach (var line in File.ReadLines("file path here"))
            {
                // Parse the actual values out of the line; the literals here are just placeholders.
                var record = new Thing {Data1 = "Data1", Data2 = "Data2", Data3 = "Data3", Count = 1};

                if (!things.Add(record))
                {
                    // This record is a duplicate so get the existing record and increment the count.
                    record = things.Single(t => things.Comparer.Equals(t, record));
                    record.Count++;
                }
            }
        }
    }

    class Thing
    {
        public string Data1 { get; set; }
        public string Data2 { get; set; }
        public string Data3 { get; set; }
        public int Count { get; set; }
    }

    class ThingEqualityComparer : IEqualityComparer<Thing>
    {
        public bool Equals(Thing x, Thing y)
        {
            return x.Data1 == y.Data1 &&
                   x.Data2 == y.Data2 &&
                   x.Data3 == y.Data3;
        }

        public int GetHashCode(Thing obj)
        {
            return string.Join(Environment.NewLine, obj.Data1, obj.Data2, obj.Data3).GetHashCode();
        }
    }
}

Note also that I used ReadLines instead of ReadAllLines. The former is preferable unless you really need all the lines in an array for random or repeated access. If single-pass, sequential access is all you need, then ReadLines is more efficient, especially for large files.
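As an alternative, since you mentioned being happy to do this "after the list is complete": a LINQ GroupBy over your existing List would also work, so you wouldn't have to change the parsing loop at all. This is just an untested sketch using the class and property names from your post (with a parameterless constructor assumed for brevity):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MyRecord
{
    public string Data1 { get; set; }
    public string Data2 { get; set; }
    public string Data3 { get; set; }
    public int Count { get; set; }
}

class Program
{
    static void Main()
    {
        // Sample data standing in for the parsed file.
        var records = new List<MyRecord>
        {
            new MyRecord { Data1 = "A", Data2 = "B", Data3 = "C", Count = 1 },
            new MyRecord { Data1 = "A", Data2 = "B", Data3 = "C", Count = 1 },
            new MyRecord { Data1 = "X", Data2 = "Y", Data3 = "Z", Count = 1 },
        };

        // Group on the data properties (anonymous types compare by value),
        // then keep one record per group with Count set to the group size.
        List<MyRecord> deduped = records
            .GroupBy(r => new { r.Data1, r.Data2, r.Data3 })
            .Select(g =>
            {
                var first = g.First();
                first.Count = g.Count();
                return first;
            })
            .ToList();

        foreach (var r in deduped)
            Console.WriteLine($"{r.Data1} {r.Data2} {r.Data3} x{r.Count}");
        // A B C x2
        // X Y Z x1
    }
}
```

The HashSet approach avoids holding the duplicates in memory at all, while this one is less code if the lists aren't huge; you'd then pass `deduped` to your ConvertToDataTable function as before.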
 
Perfect, just what I was looking for, thanks!

Did you mean to put String.concat for GetHashCode?
 
Did you mean to put String.concat for GetHashCode?

No, I meant Join. It's more likely to produce a unique value that way. Using Concat, "A", "BCD" and "E" would produce the same hash code as "AB", "C" and "DE", whereas they would not using Join with a line break as the delimiter.
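To illustrate the point (a quick sketch, not from the original post): with Concat the field boundaries are lost, so the two combined strings are identical and must hash the same, while Join's delimiter keeps them distinct:

```csharp
using System;

class HashDemo
{
    static void Main()
    {
        // Concat loses the field boundaries: both produce "ABCDE".
        string concat1 = string.Concat("A", "BCD", "E");
        string concat2 = string.Concat("AB", "C", "DE");
        Console.WriteLine(concat1 == concat2); // True -> identical hash codes

        // Join preserves the boundaries via the delimiter.
        string join1 = string.Join(Environment.NewLine, "A", "BCD", "E");
        string join2 = string.Join(Environment.NewLine, "AB", "C", "DE");
        Console.WriteLine(join1 == join2); // False -> the strings differ, so a collision is unlikely
    }
}
```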
 