Remove duplicate strings from CSV

tdignan87 · Oct 8, 2019

Hi,
I have CSV files that come from our ERP system although there is a bug with the system ( waiting for it to be fixed) where occasionally it sends down a file that contains CSV files with a duplicate row which is then causing chaos with our other system that it imports into.
I should have a fix for the ERP in a few weeks, however;

How can I create a simple console file that checks the folder for any CSV and reads the file and checks to see if the second row matches the first two text rows; if so it just deletes the last row.
Example of csv below
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"

I basically want the application to remove the second row as the first 4 values within the "" match.

Any help would be appreciated or examples.

NoUserHere · Oct 8, 2019

I don't follow... Can you explain with more detail before I start popping unnecessary/mistaken answers at you? Will there be one or more files in this folder? This needs clarifying :

tdignan87 said:
to see if the second row matches the first two text rows;

What do you define as a row?
If you have this :

tdignan87 said:
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"

What should the output be after the application removes the second row?

tdignan87 · Oct 8, 2019

Ah the output should just be the first row and place the second row in a new file if possible.
I should have said that;
if i cant place it in another file then just delete the second row.

There will be two or three files max in this folder. The system takes the CSV and moves it again after it's been processed.
Rows are the rows of data within the CSV file.
Normally we would have about 5 max rows in one file.

NoUserHere · Oct 8, 2019

Ok, you can have it either way. Would you prefer a new file or delete the second row?
I think deleting the second row would be easier, and cleaner.

So basically you want to remove the second row if "P0755","R190830022" is listed twice? So essentially you're removing duplicates?

Regular Expression would probably be recommended here, if so..

tdignan87 · Oct 8, 2019

Yeah delete is fine please.

yeah exactly that if "P0755","R190830022" is listed twice.

NoUserHere · Oct 8, 2019

It's commented for you so you should be able to follow what's going on in the code. All you need to do is stick one of them inside a button or method and it will start looking for a CSV file in your desktop called csvfile.csv, but I suggest changing this to suit your needs. Take note, that this code will overwrite your current csv file you are reading from and replace it with a new file with non duplicated lines. There is room for improvement here! If "P0755" is not relevant, you could opt from using hashset to using the dictionary to hold <key,value> as <line, R190830022> making line your unique key. The code works as tested and executes at 6ms, which can be improved by optimising the code.

Method 1:

            /* Combines the desktop folder with the file name of my CSV file */
            string pathToFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "csvfile.csv");
            Dictionary<string, string> kvp = new Dictionary<string, string>();
            HashSet<string> eachLine = new HashSet<string>();
            /* Above are self explanatory, while below csvReader creates an array of strings from all lines in the file */
            string[] csvReader = File.ReadAllLines(pathToFile);
            /* Loop the string array of lines */
            foreach (string line in csvReader)
            {
                /* Split at first comma */
                var partA = line.Split(',').FirstOrDefault();
                /* Split at second comma, by skipping the first one */
                var partB = line.Split(',').Skip(1).FirstOrDefault();
                /* TryGetValue returns a bool, so we set the bool if the key exists */
                bool kvpHasValue = kvp.TryGetValue(partB, out string kvpValue);
                /* If value isn't returned, this will be false */
                if (kvpHasValue == false)
                /* If it isn't added, we will add it below */
                {
                    /* partB seems to be unique, so I used this for the key, and partA as the value since it's not unique according to CSV */
                    kvp.Add(partB, partA);
                    /* Next we add partB and partA to the dictionary, next we add the line to the hashset */
                    eachLine.Add(line);
                }
            }
            File.WriteAllText(pathToFile, string.Empty);
            /* Start with an empty file, then write to it */
            eachLine.ToList().ForEach(func_line => File.AppendAllText(pathToFile, string.Concat(func_line, Environment.NewLine)));
            /* Lastly write the file back with only the entries we added, and no duplicates */

This is what I started out with, but was able to make it 3ms quicker by using the below instead :

Method 2:

            /* Combines the desktop folder with the file name of my CSV file */
            string pathToFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "csvfile.csv");
            Dictionary<string, string> kvp = new Dictionary<string, string>();
            StringBuilder sb = new StringBuilder();
            /* Above are self explanatory, while below csvReader creates an array of strings from all lines in the file */
            string[] csvReader = File.ReadAllLines(pathToFile);
            /* Loop the string array of lines */
            foreach (string line in csvReader)
            {
                /* Split at first comma */
                var partA = line.Split(',').FirstOrDefault();
                /* Split at second comma, by skipping the first one */
                var partB = line.Split(',').Skip(1).FirstOrDefault();
                /* TryGetValue returns a bool, so we set the bool if the key exists */
                bool kvpHasValue = kvp.TryGetValue(partB, out string kvpValue);
                /* If value isn't returned, this will be false */
                if (kvpHasValue == false)
                /* If it isn't added, we will add it below */
                {
                    /* partB seems to be unique, so I used this for the key, and partA as the value since it's not unique according to CSV */
                    kvp.Add(partB, partA);
                    /* Next we add partB and partA to the dictionary, next we add the line to the string builder */
                    sb.Insert(sb.Length, string.Concat(line, Environment.NewLine));
                }
            }
            /* Lastly write the file back with only the entries we added, and no duplicates */
            File.WriteAllText(pathToFile, sb.ToString());

The difference is at the end of the file I believe, where method 1 opens the file twice before writing to it.

<<< Method 1

<<< Method 2
In both methods, the code runs at the same speed up until this point :

C#:

            File.WriteAllText(pathToFile, string.Empty);
            eachLine.ToList().ForEach(func_line => File.AppendAllText(pathToFile, string.Concat(func_line, Environment.NewLine)));

I know you said you won't be executing many lines, but at a later point if that changes and you are executing more lines, take this into account, as the above line will cost you 3ms in execution time. It really doesn't seem like a lot, but overtime it adds up. Hope this is what you're looking for? Tested and working but it can be greatly improved. If you want to find all files in a directory, I suggest you iterate a directory path and for each file pass in the file location to pathToFile and let it execute the method. This will also give you something to do. Post back if you're stuck

JuggaloBrotha · Oct 9, 2019

Do you have access to Sql Server or similar?
What you could do is load all of the data into a table then use a query to pull out a single record from the dups, the nice part about doing it this way is you can add a more granular criteria for which one of the duplicates you grab instead of the first one or the last one in the file. You'd use the Row_Number() Over (Partition By [column with P0755}, [column with R190830022] Order By [column with P0755}, [column with R190830022], [column with 57.000]) As [RowNum] And your where clause around that would have [RowNum] = 1.

tdignan87 · Oct 9, 2019

Sheepings it worked great thank you.

tdignan87 · Oct 9, 2019

JuggaloBrotha said:
Do you have access to Sql Server or similar?
What you could do is load all of the data into a table then use a query to pull out a single record from the dups, the nice part about doing it this way is you can add a more granular criteria for which one of the duplicates you grab instead of the first one or the last one in the file. You'd use the Row_Number() Over (Partition By [column with P0755}, [column with R190830022] Order By [column with P0755}, [column with R190830022], [column with 57.000]) As [RowNum] And your where clause around that would have [RowNum] = 1.

Hi JuggaloBrotha
Sorry its a AS400 old PRMS system that has the data. I don't have access to the DB.
Thanks though

NoUserHere · Oct 9, 2019

tdignan87 said:
Sheepings it worked great thank you.

I thought it might. You're welcome. Happy to help.

tdignan87 · Oct 9, 2019

If you want box of goodies let me know

The code is for a food manufacturer in Scotland (shortbread)

tdignan87 · Oct 9, 2019

C#:

string pathToFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), @"C:\Users\tdignan\Documents\CSV Script", " *.csv");

Sorry one last question. How do i get the code to look at csv files within a directory and not just a specific file?
Thanks

NoUserHere · Oct 9, 2019

I am on the road for work right now. So I'm not at a PC to type it out. If someone else doesn't reply to you by this evening, I will show you when I return home. Meanwhile, can you post the location(s) of your CSV files?

Meanwhile, you can try do it yourself while waiting, as I've explained how to do that part above.

tdignan87 · Oct 9, 2019

I will attempt it myself mate and post if I have been successful. If not by the time you are home then you know i need help

\\192.168.7.33\Vantage_Hist

Thats the path above where the file sits

NoUserHere · Oct 9, 2019

I'll do it a different way... I will have it so you can input the location into the console yourself. Try :

Silent Version:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace TestConsoleApp
{
    internal class Program
    {
        static string pathToFolder = string.Empty;
        private static void Main(string[] args)
        {
            Console.WriteLine("Press Ctrl+V to paste in your path and press enter key :");
            /* After you Ctrl+V, and hit enter, pathToFile will be set to the directory of your CSV files */
            pathToFolder = Console.ReadLine();
            DirectoryInfo directory = new DirectoryInfo(pathToFolder);
            foreach (FileInfo file in directory.GetFiles("*.csv"))
            {
                RemoveDuplicates_InEachFile(file.FullName, file.Name);
            }
        }
        private static void RemoveDuplicates_InEachFile(string pathToFile, string filename)
        {
            HashSet<string> hashSet = new HashSet<string>();
            /* Above are self explanatory, while below csvReader creates an array of strings from all lines in the file */
            string[] csvReader = File.ReadAllLines(pathToFile);
            /* Loop the string array of lines */
            foreach (string line in csvReader)
            {
                /* Split at second comma, by skipping the first one */
                string partB = line.Split(',').Skip(1).FirstOrDefault();
                bool hasText = hashSet.Any(Func_Partial => Func_Partial.Contains(partB));
                if (hasText == false)
                {
                    /* If it isn't added, we will add it below */
                    hashSet.Add(line);
                    /* Next we add line to the hash set */
                }
            }
            /* Delete the file, and recreate it by appending it */
            File.Delete(pathToFile);
            hashSet.ToList().ForEach(func_line => File.AppendAllText(pathToFile, string.Concat(func_line, Environment.NewLine)));
            /* Lastly write the file back with only the entries we added, and no duplicates */
        }
    }
}

If you want to read what the code is doing, you can use this verbose mode. All It adds is come comments to write out on screen what it is doing...

Verbose Version:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace TestConsoleApp
{
    internal class Program
    {
        static string pathToFolder = string.Empty;

        private static void Main(string[] args)
        {
            Console.WriteLine("Press Ctrl+V to paste in your path and press enter key :");
            /* After you Ctrl+V, and hit enter, pathToFile will be set to the directory of your CSV files */
            pathToFolder = Console.ReadLine();
            DirectoryInfo directory = new DirectoryInfo(pathToFolder);
            foreach (FileInfo file in directory.GetFiles("*.csv"))
            {
                Console.WriteLine($"REPORT: Console found file {file.Name} : Processing file content....");
                RemoveDuplicates_InEachFile(file.FullName, file.Name);
            }
            Console.WriteLine("Console is finished processing CSV files");
            Console.WriteLine("Press any key to exit. Are you still looking for the any key? :)");
            Console.ReadKey();
        }
        private static void RemoveDuplicates_InEachFile(string pathToFile, string filename)
        {
            Console.WriteLine($"Processing file: {filename}");
            HashSet<string> hashSet = new HashSet<string>();
            /* Above are self explanatory, while below csvReader creates an array of strings from all lines in the file */
            string[] csvReader = File.ReadAllLines(pathToFile);
            Console.WriteLine($"{filename} contains {csvReader.Length} lines");
            int cycle = 1;
            /* Loop the string array of lines */
            foreach (string line in csvReader)
            {
                Console.WriteLine($"Processing line ({cycle}) : Outputting line content: {line}");
                /* Split at second comma, by skipping the first one */
                string partB = line.Split(',').Skip(1).FirstOrDefault();
                Console.WriteLine($"Getting keyword from line ({cycle}) : Searching line for keyword ({partB})");
                bool hasText = hashSet.Any(Func_Partial => Func_Partial.Contains(partB));
                if (hasText == false)
                {
                    Console.WriteLine($"({partB}) keyword was not found on the hashset, continuing process...");
                    /* If it isn't added, we will add it below */
                    hashSet.Add(line);
                    Console.WriteLine($"Adding line ({cycle}) to the hashset");
                    /* Next we add line to the hash set */
                }
                else
                {
                    Console.WriteLine($"Duplicate line has been identified and skipped");
                }
                cycle++;
            }
            /* Delete the file, and recreate it by appending it */
            Console.WriteLine($"Deleting old file : {filename}");
            File.Delete(pathToFile);
            Console.WriteLine($"({filename}) deleted successfully, standby while compiling new hashset file...");
            hashSet.ToList().ForEach(func_line => File.AppendAllText(pathToFile, string.Concat(func_line, Environment.NewLine)));
            Console.WriteLine($"({filename}) successfully rebuilt!");
            cycle = 1;
            /* Lastly write the file back with only the entries we added, and no duplicates */
        }
    }
}

The verbose mode outputs the following on screen :

Console Output:

Press Ctrl+V to paste in your path and press enter key :
C:\Users\user\Desktop
REPORT: Console found file csvfile - Copy (2).csv : Processing file content....
Processing file: csvfile - Copy (2).csv
csvfile - Copy (2).csv contains 7 lines
Processing line (1) : Outputting line content: "P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
Getting keyword from line (1) : Searching line for keyword ("R190830022")
("R190830022") keyword was not found on the hashset, continuing process...
Adding line (1) to the hashset
Processing line (2) : Outputting line content: "P0755","R190830088","2021-08-30","POSITIVE RELEASE",18884.500,"TRUE"
Getting keyword from line (2) : Searching line for keyword ("R190830088")
("R190830088") keyword was not found on the hashset, continuing process...
Adding line (2) to the hashset
Processing line (3) : Outputting line content: "P0755","R190830089","2021-08-30","POSITIVE RELEASE",58887.000,"TRUE"
Getting keyword from line (3) : Searching line for keyword ("R190830089")
("R190830089") keyword was not found on the hashset, continuing process...
Adding line (3) to the hashset
Processing line (4) : Outputting line content: "P0755","R190830777","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
Getting keyword from line (4) : Searching line for keyword ("R190830777")
("R190830777") keyword was not found on the hashset, continuing process...
Adding line (4) to the hashset
Processing line (5) : Outputting line content: "P0755","R190830777","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
Getting keyword from line (5) : Searching line for keyword ("R190830777")
Duplicate line has been identified and skipped
Processing line (6) : Outputting line content: "P0755","R190830777","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"
Getting keyword from line (6) : Searching line for keyword ("R190830777")
Duplicate line has been identified and skipped
Processing line (7) : Outputting line content: "P0755","R190830777","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
Getting keyword from line (7) : Searching line for keyword ("R190830777")
Duplicate line has been identified and skipped
Deleting old file : csvfile - Copy (2).csv
(csvfile - Copy (2).csv) deleted successfully, standby while compiling new hashset file...
(csvfile - Copy (2).csv) successfully rebuilt!

To have your folder location hard coded change pathToFolder = Console.ReadLine(); - to equal the folder location in double quotation marks and remove line 13 too.

Remove duplicate strings from CSV

tdignan87

Well-known member

NoUserHere

Well-known member

tdignan87

Well-known member

NoUserHere

Well-known member

tdignan87

Well-known member

NoUserHere

Well-known member

JuggaloBrotha

tdignan87

Well-known member

tdignan87

Well-known member

NoUserHere

Well-known member

tdignan87

Well-known member

tdignan87

Well-known member

NoUserHere

Well-known member

tdignan87

Well-known member

NoUserHere

Well-known member

Similar threads

Share this page

Latest posts