Remove duplicate strings from CSV

tdignan87 · Well-known member · Joined: Jul 8, 2019 · Messages: 95 · Programming Experience: Beginner
Hi,
I have CSV files that come from our ERP system. There is a bug with the system (we're waiting for it to be fixed) where it occasionally sends down a CSV file containing a duplicate row, which then causes chaos with the other system it imports into.
I should have a fix for the ERP in a few weeks; however, in the meantime:

How can I create a simple console application that checks the folder for any CSV files, reads each file, and checks whether the second row matches the first row on its text values; if so, it just deletes the later row.
Example of the CSV below:
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"

I basically want the application to remove the second row, as the first 4 values within the "" match.

Any help or examples would be appreciated.
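(For illustration, a minimal sketch of that rule, assuming the first four quoted fields form the duplicate key, no field contains an embedded comma, and the folder path is just a placeholder:)
C#:
using System.Collections.Generic;
using System.IO;
using System.Linq;

class DedupSketch
{
    static void Main()
    {
        // Placeholder folder; point this at wherever the ERP drops the CSV files.
        foreach (string path in Directory.GetFiles(@"C:\Temp\csv", "*.csv"))
        {
            var seenKeys = new HashSet<string>();
            var keptLines = new List<string>();

            foreach (string line in File.ReadAllLines(path))
            {
                if (string.IsNullOrWhiteSpace(line))
                    continue;

                // Key = the first four comma-separated values, e.g.
                // "P0755","R190830022","2021-08-30","POSITIVE RELEASE"
                string key = string.Join(",", line.Split(',').Take(4));

                // HashSet.Add returns false when the key was already seen,
                // which is exactly the "second row matches" case to drop.
                if (seenKeys.Add(key))
                    keptLines.Add(line);
            }

            File.WriteAllLines(path, keptLines);
        }
    }
}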
 
That's odd, I guess. I don't know if classes can reside inside of methods, can they?

This is where understanding the basic building blocks of an application is a must.
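For what it's worth, C# lets you nest a class inside another class, but not declare one inside a method body; a quick illustration of the difference:
C#:
using System;

class Outer
{
    // A nested class like this is allowed.
    class Inner { }

    void DoWork()
    {
        // class Local { }  // Not allowed: a class cannot be declared inside a method body.

        // Local functions, on the other hand, are fine:
        int Twice(int x) => x * 2;
        Console.WriteLine(Twice(21));
    }
}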
 
Sorry - I am just learning and I am really struggling! :(
I need the duplicate row to be deleted and placed into its own separate CSV file, one for each duplicate. I've been trying pretty much all day but I just can't get it.

This is my code:
C#:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace CSVRemoveDuplicates
{
    class Program
    {
        static string pathToFolder = @"C:\Users\tdignan\Documents\CSV TEST";

        private static void Main(string[] args)
        {
            // Optionally prompt for the folder instead of hard-coding it:
            // Console.WriteLine("Press Ctrl+V to paste in your path and press the enter key:");
            // pathToFolder = Console.ReadLine();
            DirectoryInfo directory = new DirectoryInfo(pathToFolder);
            foreach (FileInfo file in directory.GetFiles("*.csv"))
            {
                RemoveDuplicates_InEachFile(file.FullName, file.Name);
            }
        }

        private static void RemoveDuplicates_InEachFile(string pathToFile, string filename)
        {
            // Path for the copy, in case the CSV should go to a separate directory.
            // It is easier if it stays within the same directory.
            string pathtoFile2 = @"C:\Users\tdignan\Documents\CSV TEST\STOCKRD" + DateTime.Now.ToFileTime() + ".csv";

            HashSet<string> hashSet = new HashSet<string>();

            // Strip blank lines from the file, then read it back as an array of lines.
            IEnumerable<string> eachBlankLine = File.ReadAllLines(pathToFile).Where(line => !string.IsNullOrWhiteSpace(line));
            File.WriteAllLines(pathToFile, eachBlankLine);
            string[] csvReader = File.ReadAllLines(pathToFile);

            // Loop the string array of lines.
            foreach (string line in csvReader)
            {
                // Take the second comma-separated value by skipping the first one.
                string partB = line.Split(',').Skip(1).FirstOrDefault();
                bool hasText = hashSet.Any(kept => kept.Contains(partB));
                if (hasText == false)
                {
                    // Not seen yet, so add the line to the hash set.
                    hashSet.Add(line);
                }
            }

            // Delete the file, write the copy, then recreate the original
            // with only the entries we added (no duplicates).
            File.Delete(pathToFile);
            File.WriteAllLines(pathtoFile2, eachBlankLine);
            hashSet.ToList().ForEach(keptLine => File.AppendAllText(pathToFile, string.Concat(keptLine, Environment.NewLine)));
        }
    }
}
 
I've managed to get it to create a separate CSV file, but if there are no duplicates it still creates a new CSV file anyway, containing the data from the original CSV regardless of whether anything was duplicated.
The code copies the duplicate, but keeps the duplicate in the original file as well.
I need it to delete the duplicate from the original file, and not create any new CSV files at all when there are no duplicates.
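In other words, something along these lines (just a sketch of the intent; originalPath and duplicatesPath are made-up names):
C#:
            var kept = new List<string>();        // rows to keep in the original file
            var duplicates = new List<string>();  // rows removed because their key was already seen
            var seenKeys = new HashSet<string>();

            foreach (string line in File.ReadAllLines(originalPath))
            {
                if (string.IsNullOrWhiteSpace(line))
                    continue;

                string key = string.Join(",", line.Split(',').Take(4));
                if (seenKeys.Add(key))
                    kept.Add(line);
                else
                    duplicates.Add(line);
            }

            // Rewrite the original without the duplicates...
            File.WriteAllLines(originalPath, kept);

            // ...and only create the extra file when something was actually removed.
            if (duplicates.Count > 0)
                File.WriteAllLines(duplicatesPath, duplicates);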

Cheers!
Thanks for your patience!
 
I don't see how you managed to screw up that code, even with the comments telling you what was what... Anyway, try this out. I just quickly wrote it and I have not tested it, but it should work, and it will also store your copies in a Copies folder inside the original directory it reads the files from. Note that if the copy files already exist, they will be overwritten:
C#:
        const string pathToFolder = @"C:\Users\user\Downloads\CSV Script\";
        const string pathNewDir = "Copies";

        private static void Main(string[] args)
        {
            DirectoryInfo directory = new DirectoryInfo(pathToFolder);
            FileInfo[] array = directory.GetFiles("*.csv");
            for (int i = 0; i < array.Length; i++)
            {
                FileInfo file = array[i];
                RemoveBlanks_InEachFile(file.FullName, Path.Combine(pathToFolder, pathNewDir, file.Name));
            }
        }

        private static void RemoveBlanks_InEachFile(string pathToFile, string pathOfCopies)
        {
            // Make sure the Copies folder exists before writing into it.
            var existingPath = Path.GetDirectoryName(pathOfCopies);
            if (!Directory.Exists(existingPath))
                Directory.CreateDirectory(existingPath);

            // Drop blank lines, then hand the rest over for de-duplication.
            IEnumerable<string> nonBlankLine = File.ReadAllLines(pathToFile).Where(nonEmptyLine => !string.IsNullOrWhiteSpace(nonEmptyLine));
            AddNon_Duplicated(nonBlankLine, pathOfCopies);
        }

        private static void AddNon_Duplicated(IEnumerable<string> nonBlankLine, string pathOfCopies)
        {
            // Keep only the first line seen for each second-column value.
            HashSet<string> hashSet_Filter = new HashSet<string>();
            foreach (var non_Duplicate in from string line in nonBlankLine
                                          let partB = line.Split(',').Skip(1).FirstOrDefault()
                                          let hasText = hashSet_Filter.Any(Func_Partial => Func_Partial.Contains(partB))
                                          where hasText == false
                                          select line)
            {
                hashSet_Filter.Add(non_Duplicate);
            }
            Write_IEnumerableValues(pathOfCopies, hashSet_Filter);
        }

        private static void Write_IEnumerableValues(string writeTo, IEnumerable<string> newValues)
        {
            File.WriteAllLines(writeTo, newValues);
        }
In whatever other application you are using, I would assume that after you import these new CSV files you would iterate the directory and delete each file.
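Something along these lines would do it, assuming the same pathToFolder and Copies folder as the code above (sketch only, untested):
C#:
        // Remove every CSV copy once the other system has imported it.
        string copiesDir = Path.Combine(pathToFolder, pathNewDir);
        foreach (string copiedFile in Directory.GetFiles(copiesDir, "*.csv"))
        {
            File.Delete(copiedFile);
        }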
 
Thanks
I need it to delete the duplicate records from the original file as well, otherwise the ERP system will process the transaction from the original file and also from the new one created.
I also need it not to create a second file if there are no duplicates in the file.
 
I did it like this because adding the duplicates to a new file doesn't make sense. Why would you need the duplicate values if you're importing them into another system?

You started out asking how to remove/delete the duplicates, and later on asked instead to add them to a new file. State exactly what you want in your opening thread post, instead of changing your mind.
OK, you can have it either way. Would you prefer a new file, or to delete the second row?
Yeah delete is fine please.
As quoted, this is not what you started out asking for. I am sorry if I initially misunderstood, but that's the trouble with changing your mind, or not being descriptive enough from the get-go.
I need the duplicate row to be deleted and placed into its own separate CSV file
I deliberately compartmentalised the latest code into methods, since your last attempt was all messed up, so you can see where the work is being done. If you want the duplicates only, you can acquire them at foreach (var non_Duplicate in from string line in nonBlankLine, and pass them on to your writing method private static void Write_IEnumerableValues with the path(s) you want to use. You can extend the method to accept a second path string for your duplicates file, and add an additional IEnumerable<string> for your collection of duplicates.

Consider this: if what I gave you already takes all the non-duplicated lines from your CSV, what do you think you now need to alter to get only the duplicate values instead?
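(For illustration, one possible shape for that, reusing the same comparison and helpers from the code above; pathForDuplicates would be the second path parameter mentioned earlier:)
C#:
            // One pass that separates kept lines from duplicates,
            // using the same second-column comparison as AddNon_Duplicated above.
            var hashSet_Filter = new HashSet<string>();
            var duplicates = new List<string>();

            foreach (string line in nonBlankLine)
            {
                string partB = line.Split(',').Skip(1).FirstOrDefault();
                bool hasText = hashSet_Filter.Any(kept => kept.Contains(partB));

                if (hasText)
                    duplicates.Add(line);     // already seen: this row belongs in the duplicates file
                else
                    hashSet_Filter.Add(line); // first occurrence: keep it
            }

            Write_IEnumerableValues(pathOfCopies, hashSet_Filter);
            if (duplicates.Count > 0)                                   // no extra file when nothing was duplicated
                Write_IEnumerableValues(pathForDuplicates, duplicates);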
 