Remove duplicate strings from CSV

tdignan87 · Oct 8, 2019

Hi,
I have CSV files that come from our ERP system. There is a bug in the system (I'm waiting for it to be fixed) where it occasionally sends down a CSV file containing a duplicate row, which then causes chaos in the other system the file is imported into.
I should have a fix for the ERP in a few weeks. In the meantime:

How can I create a simple console application that checks a folder for any CSV files, reads each file, and checks whether the second row matches the first row; if so, it just deletes the second row?
Example of the CSV below:
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",57.000,"TRUE"
"P0755","R190830022","2021-08-30","POSITIVE RELEASE",14.500,"TRUE"

I basically want the application to remove the second row as the first 4 values within the "" match.

Any help would be appreciated or examples.

JuggaloBrotha · Oct 9, 2019

tdignan87 said:
Hi JuggaloBrotha
Sorry, it's an old AS400 PRMS system that has the data. I don't have access to the DB.
Thanks though

No, I was asking if you had access to SQL Server or similar, not whether you have access to the system where the data comes from.

tdignan87 · Oct 10, 2019

I am getting a "value cannot be null" error?

NoUserHere · Oct 10, 2019

Upload your CSV files here please

tdignan87 · Oct 10, 2019

Attached mate

NoUserHere · Oct 10, 2019

That's perfectly normal when your CSV contains blank lines. When the code iterates over or parses the contents and hits a blank line, it throws the exception. Add this after line 25, above the comment:

C#:

            // Keep only the non-blank lines, then write them back over the original file.
            IEnumerable<string> nonBlankLines = File.ReadAllLines(pathToFile).Where(line => !string.IsNullOrWhiteSpace(line));
            File.WriteAllLines(pathToFile, nonBlankLines);

tdignan87 · Oct 10, 2019

Ah, I didn't realise it contained blank lines. I should have noticed this!
Thanks very much, mate.

tdignan87 · Oct 10, 2019

All working great now. I changed it to read from the path directly rather than prompt in CMD for the path.
Thanks very much for taking time to help me out.

NoUserHere · Oct 10, 2019

That's no bother. Glad to help. If you want to work towards simplifying the code, you can use the two lines I last gave you, change string[] csvReader to a HashSet, and use line 44 with a little tweak to remove the duplicates. You'd do the whole lot in about 5 lines or so.
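A minimal sketch of that HashSet idea, not the code from earlier in the thread. The names `CsvDedupe`, `KeyOf` and `RemoveDuplicates` are made up for illustration, and it assumes the key is the first four comma-separated fields and that no quoted value contains a comma:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvDedupe
{
    // Hypothetical helper: the key is the first four comma-separated fields.
    // Assumption: no commas inside the quoted values.
    public static string KeyOf(string line) =>
        string.Join(",", line.Split(',').Take(4));

    // Skips blank lines and keeps only the first row seen for each key.
    public static List<string> RemoveDuplicates(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        return lines.Where(l => !string.IsNullOrWhiteSpace(l))
                    .Where(l => seen.Add(KeyOf(l))) // Add returns false for a repeated key
                    .ToList();
    }
}
```

It could be called as `File.WriteAllLines(path, CsvDedupe.RemoveDuplicates(File.ReadAllLines(path)));`, where `ReadAllLines` loads the whole file before anything is written back to the same path.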

Skydiver · Oct 10, 2019

If you believe in LINQ, here's an alternative approach:

C#:

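class RowComparer : IEqualityComparer<string>
{
    // Compare the keys themselves; equal hash codes alone do not guarantee equal keys.
    public bool Equals(string x, string y) => string.Equals(GetKey(x), GetKey(y), StringComparison.Ordinal);

    public int GetHashCode(string obj) => GetKey(obj)?.GetHashCode() ?? 0;

    // The key is the second column; null if the row is null or has no second column.
    static string GetKey(string row)
    {
        //$ TODO: need to replace this parsing with more robust parsing to handle quotes
        return row?.Split(',')
                  .Skip(1)
                  .FirstOrDefault();
    }
}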
:

File.WriteAllLines(tempFileName,
                  File.ReadLines(originalFileName)
                      .Distinct(new RowComparer()));
File.Replace(tempFileName, originalFileName, null);
File.Delete(tempFileName);

:

Note: Untested code above, it's just me designing on the keyboard while waiting for a build to finish.

tdignan87 · Oct 10, 2019

Cool, I will have a play with it.
Now say I wanted to remove the duplicates but move them (just the duplicates) into a separate CSV file. What would be the best approach?
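One possible shape for this, as a rough sketch only. `CsvSplitter` and `Split` are hypothetical names, the key is assumed to be the first four comma-separated fields, and quoted values are assumed to contain no commas:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvSplitter
{
    // Returns the first row for each key in Unique and every later repeat in Duplicates.
    public static (List<string> Unique, List<string> Duplicates) Split(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        var unique = new List<string>();
        var duplicates = new List<string>();

        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // ignore blank lines

            // Assumption: the first four fields identify a row.
            var key = string.Join(",", line.Split(',').Take(4));
            (seen.Add(key) ? unique : duplicates).Add(line);
        }

        return (unique, duplicates);
    }
}
```

Each list can then be written out with `File.WriteAllLines`, the unique rows back to the original file and the duplicates to whatever second file name suits.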

tdignan87 · Oct 10, 2019

Or, if it's easier (sounds harder though):
Take the quantity from the duplicate row and add it to the quantity of the other matching row.
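A hedged sketch of how the quantity merge might look with LINQ grouping. `CsvMerger` and `MergeQuantities` are hypothetical names. It assumes the quantity is the fifth field, is unquoted with three decimal places as in the sample rows, and that no quoted value contains a comma; a row with fewer than five fields would throw:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class CsvMerger
{
    // Groups rows by their first four fields and sums the quantity (field index 4).
    // The remaining fields are taken from the first row of each group.
    public static List<string> MergeQuantities(IEnumerable<string> lines)
    {
        return lines
            .Where(l => !string.IsNullOrWhiteSpace(l))
            .Select(l => l.Split(','))
            .GroupBy(f => string.Join(",", f.Take(4)))
            .Select(g =>
            {
                var fields = (string[])g.First().Clone();
                var total = g.Sum(f => decimal.Parse(f[4], CultureInfo.InvariantCulture));
                // Assumption: three decimal places, matching the sample data.
                fields[4] = total.ToString("0.000", CultureInfo.InvariantCulture);
                return string.Join(",", fields);
            })
            .ToList();
    }
}
```

With the two sample rows from the first post, this gives a single row with a quantity of 71.500.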

NoUserHere · Oct 10, 2019

Use what you have and learn to edit it. It's well commented, so you can see where duplicates are handled. It's then a matter of reading up on File.WriteAllText.

@Skydiver not sure if it's because I am on mobile, but that looks like a lot of text. Looking again, it can be shortened: read the file's lines into a list, excluding the duplicate lines and empty lines, with LINQ.

Reply from mobile

Skydiver · Oct 10, 2019

No need to read into a list. File.ReadLines() returns an IEnumerable<string>. File.WriteAllLines() also takes an IEnumerable<string>. All that is needed is a way to find the unique lines. That is where the LINQ Distinct() extension method comes in. It does all the magic you did with your HashSet in your original code (in fact, if you look at the reference sources, it also uses a HashSet). All that is missing is a way to tell Distinct() how to compare two lines to see if they are the same or different. That is where the RowComparer, which implements IEqualityComparer<string>, comes in. It does the line parsing, pulls out the second column value, and checks for equality.
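To see that behaviour without touching any files, here is a small in-memory sketch. `SecondColumnComparer` and `DistinctDemo` are hypothetical names, and the simple `Split(',')` assumes no commas inside quoted values:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Treats two rows as equal when their second column matches.
sealed class SecondColumnComparer : IEqualityComparer<string>
{
    static string Key(string row) => row?.Split(',').ElementAtOrDefault(1) ?? "";

    public bool Equals(string x, string y) => Key(x) == Key(y);
    public int GetHashCode(string obj) => Key(obj).GetHashCode();
}

static class DistinctDemo
{
    // Distinct() yields the first row for each key and skips later matches.
    public static List<string> UniqueRows(IEnumerable<string> rows) =>
        rows.Distinct(new SecondColumnComparer()).ToList();
}
```

Distinct() calls GetHashCode first and only falls back to Equals when two hash codes match, which is why both methods need to agree on the same key.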

Skydiver · Oct 10, 2019

After seeing my comparer, as well as Jon Skeet's comparer, I've come up with something like this hybrid:

C#:

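class Comparer<T, TKey> : IEqualityComparer<T>
{
    readonly Func<T, TKey> _getKey;

    public Comparer(Func<T, TKey> getKey) => _getKey = getKey;
    // == is not defined for an unconstrained TKey, so compare keys via the default equality comparer.
    public bool Equals(T x, T y) => EqualityComparer<TKey>.Default.Equals(_getKey(x), _getKey(y));
    public int GetHashCode(T obj) => _getKey(obj).GetHashCode();
}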

:
// Constructors do not infer type arguments, so they are spelled out here.
var comparer = new Comparer<string, string>(r => r.Split(',')
                                                  .ElementAtOrDefault(1)
                                                  ?? "");
File.WriteAllLines(tempFileName,
                   File.ReadLines(originalFileName)
                       .Distinct(comparer));
File.Replace(tempFileName, originalFileName, null);
File.Delete(tempFileName);
:

Again, untested code. Just doodling at the keyboard.

tdignan87 · Oct 11, 2019

Skydiver said:
class Comparer<T, TKey> : IEqualityComparer<T>
{
Func<T, TKey> _getKey;

public Comparer(Func<T, TKey> getKey) => _getKey = getKey;
public bool Equals(T x, T y) => _getKey(x) == _getKey(y);
public int GetHashCode(T obj) => _getKey(obj).GetHashCode();
}

:
var comparer = new Comparer(r => r.Split(',')
.ElementAtOrDefault(1)
?? "");
File.WriteAllLines(tempFileName,
File.ReadLines(originalFileName)
.Distinct(comparer));
File.Replace(tempFileName, originalFileName, null);
File.Delete(tempFileName);
:

Hi Skydiver,
I'm pasting that into the console app but it's all red.
Where am I going wrong? Excuse my stupidity.
