Remove duplicate strings from CSV

tdignan87

Hi,
I have CSV files that come from our ERP system. There is a bug in the ERP (I'm waiting for it to be fixed) where it occasionally sends down a CSV file containing a duplicate row, which then causes chaos in the other system that imports the file.
I should have a fix for the ERP in a few weeks; however:

How can I create a simple console application that checks a folder for any CSV files, reads each file, and checks whether the second row matches the first row on its leading text values? If it does, the application should just delete the second row.
Example of the CSV below:
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"

I basically want the application to remove the second row, because the first 4 values (the ones in double quotes) match.

Any help or examples would be appreciated.
 
I am getting a "Value cannot be null" error?

1570701648147.png
1570701674764.png
 
That's perfectly normal when your CSV contains blank lines. When the code iterates over or parses the contents and hits a blank line, the exception is thrown. Add this after line 25, above the comment:
C#:
            // Rewrite the file without its blank or whitespace-only lines.
            IEnumerable<string> nonBlankLines = File.ReadAllLines(pathToFile).Where(line => !string.IsNullOrWhiteSpace(line));
            File.WriteAllLines(pathToFile, nonBlankLines);
 
All working great now. I changed it to read from the path directly rather than prompt in CMD for the path.
Thanks very much for taking time to help me out.
 
That's no bother. Glad to help. If you want to work towards simplifying the code, you can use the two lines I last gave you, change string[] csvReader to a HashSet<string>, and use line 44 with a little tweak to remove the duplicates. You'd do the whole lot in about 5 lines or so. ;)
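
A rough sketch of what that shorter HashSet version might look like. The names CsvDedupe, RemoveDuplicates, KeyOf and CleanFile are made up for illustration, and keying on the first four fields is an assumption taken from the example rows in the first post (whole-line comparison would not catch them, because the quantities differ). The naive Split(',') also assumes no field contains an embedded comma.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvDedupe
{
    // Assumption: a row is identified by its first four comma-separated fields.
    static string KeyOf(string line) => string.Join(",", line.Split(',').Take(4));

    // Drops blank lines and keeps only the first row seen for each key.
    public static List<string> RemoveDuplicates(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Where(line => seen.Add(KeyOf(line))) // Add returns false for a key already seen
            .ToList();
    }

    // Reads the whole file into memory, then overwrites it with the cleaned rows.
    public static void CleanFile(string pathToFile) =>
        File.WriteAllLines(pathToFile, RemoveDuplicates(File.ReadAllLines(pathToFile)));
}
```

Calling CsvDedupe.CleanFile(pathToFile) for each CSV in the folder would then replace both the blank-line filter and the duplicate check.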
 
If you believe in LINQ, here's an alternative approach:

C#:
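using System.Collections.Generic;
using System.Linq;

class RowComparer : IEqualityComparer<string>
{
    // Compare the keys themselves; comparing hash codes alone would treat
    // two different keys that happen to collide as equal.
    public bool Equals(string x, string y) => GetKey(x) == GetKey(y);

    public int GetHashCode(string obj) => GetKey(obj)?.GetHashCode() ?? 0;

    //$ TODO: need to replace this parsing with more robust parsing to handle quotes
    // The key is the second column; null when the row is null or has only one column.
    static string GetKey(string row) => row?.Split(',').Skip(1).FirstOrDefault();
}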
:

File.WriteAllLines(tempFileName,
                  File.ReadLines(originalFileName)
                      .Distinct(new RowComparer()));
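// File.Replace moves the temp file over the original, so the temp file is normally gone already.
// File.Delete does not throw when the file is missing, so this call is only a safety net.
File.Replace(tempFileName, originalFileName, null);
File.Delete(tempFileName);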

:

Note: Untested code above, it's just me designing on the keyboard while waiting for a build to finish.
 
Cool, I will have a play with it.
Now say I wanted to remove the duplicates but move them into a separate CSV file :) (just the duplicates). What would be the best approach?
 
Or, if it's easier (sounds harder though), take the quantity from the duplicate row and add it to the quantity of the other matching row. ;)
 
Use what you have and learn to edit it. It's well commented, so you can see where duplicates are handled. It's then a matter of reading up on File.WriteAllText.
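
For reference, here is a hedged sketch of how both variants asked about above could look. It is not the code from earlier in the thread: CsvDuplicates, Split, MergeQuantities, SplitFile and KeyOf are hypothetical names, the first four fields are assumed to identify a row, the fifth field is assumed to be the quantity, and the plain Split(',') assumes no field contains a comma. The "0.000" output format is also an assumption based on the example rows.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

static class CsvDuplicates
{
    // Assumption: the first four fields identify a row.
    static string KeyOf(string[] fields) => string.Join(",", fields.Take(4));

    // Option 1: keep the first row per key and collect the later ones separately.
    public static (List<string> Unique, List<string> Duplicates) Split(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        var unique = new List<string>();
        var duplicates = new List<string>();
        foreach (var line in lines.Where(l => !string.IsNullOrWhiteSpace(l)))
        {
            // HashSet.Add returns false when the key is already present.
            (seen.Add(KeyOf(line.Split(','))) ? unique : duplicates).Add(line);
        }
        return (unique, duplicates);
    }

    // Option 2: one row per key, with the quantity (fifth field) summed over the group.
    // Assumes every non-blank row has at least five fields and a parseable quantity.
    public static List<string> MergeQuantities(IEnumerable<string> lines) =>
        lines.Where(l => !string.IsNullOrWhiteSpace(l))
             .Select(l => l.Split(','))
             .GroupBy(fields => KeyOf(fields))
             .Select(group =>
             {
                 var merged = (string[])group.First().Clone();
                 var total = group.Sum(f => decimal.Parse(f[4], CultureInfo.InvariantCulture));
                 merged[4] = total.ToString("0.000", CultureInfo.InvariantCulture);
                 return string.Join(",", merged);
             })
             .ToList();

    // Hypothetical file wrapper for option 1; both paths are supplied by the caller.
    public static void SplitFile(string csvPath, string duplicatesPath)
    {
        var (unique, duplicates) = Split(File.ReadAllLines(csvPath));
        File.WriteAllLines(csvPath, unique);
        if (duplicates.Count > 0) File.WriteAllLines(duplicatesPath, duplicates);
    }
}
```

With the two example rows from the first post, Split would put the 57.000 row in Unique and the 14.500 row in Duplicates, while MergeQuantities would produce a single row with a quantity of 71.500.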

@Skydiver not sure if it's because I am on mobile, but that looks like a lot of text. Looking again, it can be shortened: read the file's lines into a list, excluding the duplicate lines and empty lines, with LINQ.

Reply from mobile
 
No need to read into a list. File.ReadLines() returns an IEnumerable<string>. File.WriteAllLines() also takes an IEnumerable<string>. All that is needed is a way to find the unique lines. That is where the LINQ Distinct() extension method comes in. It does all the magic you did with your HashSet in your original code (in fact, if you look at the reference sources, it also uses a HashSet). All that is missing is a way to tell Distinct() how to compare two lines to see whether they are the same or different. That is where the RowComparer, which implements IEqualityComparer<string>, comes in. It does the line parsing, pulls out the second column value, and checks it for equality.
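
To make the Distinct() plus comparer idea concrete, here is a small in-memory sketch. It is a variation, not the RowComparer above: LeadingFieldsComparer and DistinctDemo are hypothetical names, and the key is assumed to be the first four fields (as in the original question) rather than the second column. The simple Split(',') still assumes no field contains a comma.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two rows count as equal when their first four fields match.
sealed class LeadingFieldsComparer : IEqualityComparer<string>
{
    static string KeyOf(string row) => string.Join(",", (row ?? "").Split(',').Take(4));

    public bool Equals(string x, string y) => KeyOf(x) == KeyOf(y);

    // Equal rows must produce equal hash codes, so hash the same key that Equals compares.
    public int GetHashCode(string row) => KeyOf(row).GetHashCode();
}

static class DistinctDemo
{
    // Distinct() consults the comparer to decide which rows it has already seen.
    public static List<string> DistinctRows(IEnumerable<string> rows) =>
        rows.Distinct(new LeadingFieldsComparer()).ToList();
}
```

Passing the two example rows from the first post through DistinctDemo.DistinctRows leaves one row. In the current LINQ to Objects implementation that is the first occurrence, although the documentation describes the result as unordered.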
 
After seeing my comparer, as well as Jon Skeet's comparer, I've come up with something like this hybrid:
C#:
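class Comparer<T, TKey> : IEqualityComparer<T>
{
    readonly Func<T, TKey> _getKey;

    public Comparer(Func<T, TKey> getKey) => _getKey = getKey;

    // == is not defined for an unconstrained TKey, so defer to the default comparer for the key type.
    public bool Equals(T x, T y) => EqualityComparer<TKey>.Default.Equals(_getKey(x), _getKey(y));
    public int GetHashCode(T obj) => EqualityComparer<TKey>.Default.GetHashCode(_getKey(obj));
}

:
// Constructors cannot infer generic type arguments, so they are spelled out here.
var comparer = new Comparer<string, string>(r => r.Split(',')
                                                  .ElementAtOrDefault(1)
                                                  ?? "");
File.WriteAllLines(tempFileName,
                   File.ReadLines(originalFileName)
                       .Distinct(comparer));
File.Replace(tempFileName, originalFileName, null);
File.Delete(tempFileName);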
:

Again, untested code. Just doodling at the keyboard.
 
Hi Skydiver,
I'm pasting that into the console app but it's all red.
Where am I going wrong? Excuse my stupidity.

1570786308185.png
 