Optimizing Memory Usage when Processing Data in C#

devhisu_vn (Programming Experience: Beginner)
Hello everyone, I'm currently tasked with processing 10 text files, each containing 500,000 lines. Each line consists of 10 keywords separated by ';'. My job is to read these files and find the top 10 most frequently appearing keywords across all files (essentially the top 10 keywords out of 50 million keywords).

A strict requirement is to limit memory usage to no more than 1GB. I've tried using StreamReader to read the files and storing counts of each keyword in a Dictionary, but this approach doesn't meet the memory requirement. Is there a way to reduce the memory footprint for this task? Thank you.
 
Is the 1GB limit just for your data, or is it your data plus your program code to run?

If it's just your data, and there are 50 million possible unique keywords, then using a round 1 GB (as opposed to 1 GiB): 1,000,000,000 / 50,000,000 = 20.

That means you are only allowed 20 bytes per keyword on average. Given that C# stores strings as UTF-16, at 2 bytes per character, each keyword can only average about 10 characters. If that 1GB limit is data plus code, then you are pretty much going to be over your limit.

If these keywords are stored in an array, there is also the overhead of the array itself: 50,000,000 string references * 8 bytes each ≈ 400 MB. That will easily take you over a data-only 1GB limit.

If you also use an array to store the counts of the keywords, then you need an additional 50,000,000 * 4 bytes per integer count ≈ 200 MB, over and above that list of keywords. Again, this takes you even further over the 1GB data limit.

But all of the above assumes you are using naive arrays to keep track of things. If you use a trie data structure, my gut feel is that you will easily fit within your 1GB data limit, but that means you'll need to break out your computer science data structures book.
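
For illustration, here is a minimal sketch of the kind of counting trie I mean. The class names are just placeholders, and whether it actually beats a Dictionary on memory depends on how much prefix sharing your data has and how compactly you store the child links:

C#:
using System.Collections.Generic;

// Counting trie sketch: shared prefixes are stored once, and the count for a
// keyword lives on the node where that keyword ends, so there is no separate
// array of strings or counts.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int Count; // occurrences of the keyword ending at this node
}

class KeywordTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string keyword)
    {
        var node = _root;
        foreach (char c in keyword)
        {
            if (!node.Children.TryGetValue(c, out var child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        node.Count++; // one more occurrence of this keyword
    }
}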

 
Can you share your code?

My understanding is that the C# Dictionary implementation grows as needed, so it's not as inefficient as a naive array implementation. But if you truly have 50M unique keywords spread across those 10 text files, and a majority of them are over 10 characters, then yes, I can see the memory limit being exceeded.

See "How is the c#/.net 3.5 dictionary implemented?" to read about how the Dictionary is implemented.

Also recall that the framework holds on to memory until a garbage collection occurs. Microsoft's recommendation is to let garbage collection happen naturally, but if you are willing to go against that recommendation and force collections, perhaps your measured memory usage would also be lower.

How exactly are you measuring your memory usage? Are you just looking at Task Manager? Are you running a profiler to get more accurate measurements?
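
As a rough in-process sanity check (not a substitute for a profiler), something along these lines can be used to compare the managed heap with what Task Manager reports. Note that GC.GetTotalMemory(true) forces a collection, which is the "against the recommendation" case mentioned above:

C#:
using System;
using System.Diagnostics;

class MemoryCheck
{
    static void Main()
    {
        // Managed heap size after forcing a full garbage collection.
        long managedBytes = GC.GetTotalMemory(forceFullCollection: true);

        // Whole-process working set, roughly what Task Manager shows.
        long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine($"Managed heap: {managedBytes / (1024.0 * 1024.0):F1} MB");
        Console.WriteLine($"Working set:  {workingSetBytes / (1024.0 * 1024.0):F1} MB");
    }
}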
 
Also a better explanation of how the Dictionary works, and a better way to get a feel for the overhead involved in using the Dictionary:
 
Personally, I would just implement a trie, but I'm a data structures geek.
 
The memory consumption must be less than 1GB. That means I need to design the algorithm so that the program does not take up more than 1GB of RAM while processing that number of files. I used the hash code of each keyword to divide the data into 1,000 temporary files. The division reduces the load on the Dictionary, and the number of iterations over the hash table is also greatly reduced. What remains is to read each temporary file back in and put its contents into a Dictionary.
 

I tried the Visual Studio debugger, the garbage collector, and Task Manager to measure memory usage. They all give the same result. It's not exactly 100% identical, but quite similar.
 
Here's my code:

C#:
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
public IEnumerable<string> GetTop10Strings(string path)
{
 
    int numTempFiles = 1000;
    string tempDirectory = @"C:\Users\hoang\source\repos\Homework_1\Homework_1\tempData";

    Directory.CreateDirectory(tempDirectory);

    StreamWriter[] tempWriters = new StreamWriter[numTempFiles];
    for (int i = 0; i < numTempFiles; i++)
    {
        tempWriters[i] = new StreamWriter(System.IO.Path.Combine(tempDirectory, $"temp_{i}.txt"));
    }

    
    // Pass 1: partition every keyword into one of the temp files chosen by its hash code.
    foreach (var filePath in Directory.GetFiles(path, "*.dat"))
    {
        using (StreamReader reader = new StreamReader(filePath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var strings = line.Split(';');
                foreach (var str in strings)
                {
                    if (string.IsNullOrWhiteSpace(str))
                        continue;

                    // The same keyword always maps to the same bucket file.
                    int hash = Math.Abs(str.GetHashCode() % numTempFiles);
                    tempWriters[hash].WriteLine(str);
                }
            }
        }
    }

    
    // Flush and close all bucket files before reading them back.
    foreach (var writer in tempWriters)
    {
        writer.Close();
    }

    
    // Pass 2: count each bucket file independently and keep its top 10
    // (keyword, count) pairs. A keyword always hashes to the same bucket,
    // so these per-bucket counts are already its global counts.
    List<KeyValuePair<string, int>> candidates = new List<KeyValuePair<string, int>>();
    foreach (var tempFile in Directory.GetFiles(tempDirectory, "temp_*.txt"))
    {
        Dictionary<string, int> frequencyMap = new Dictionary<string, int>();
        using (StreamReader reader = new StreamReader(tempFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (frequencyMap.ContainsKey(line))
                {
                    frequencyMap[line]++;
                }
                else
                {
                    frequencyMap[line] = 1;
                }
            }
        }

        var top10 = frequencyMap.OrderByDescending(kv => kv.Value)
                                .ThenBy(kv => kv.Key)
                                .Take(10);
        candidates.AddRange(top10);
    }

    // Merge: pick the overall top 10 from the per-bucket candidates,
    // using the counts carried over from pass 2.
    var overallTop10 = candidates.OrderByDescending(kv => kv.Value)
                                 .ThenBy(kv => kv.Key)
                                 .Take(10)
                                 .Select(kv => kv.Key);

    
    // Clean up the temporary bucket files.
    Directory.Delete(tempDirectory, true);

    return overallTop10;
}
 

They're not unique keywords. The keywords are random and can be duplicated across the 10 files. I need to find the top 10 most repeated keywords from these files.
I solved it by using the hash code to split the data into 1,000 temp files with a StreamWriter. Then I use a Dictionary to process each temp file and delete it.
 
They're not unique keywords.

I understand that across the 10 files there could be more than 50 million words in total, but there can be at most 50 million different words. If the number of unique words within those 10 files is less than 50 million, then you have more margin for how much longer than 10 characters each of those words can be.

Also, if you recall your computer science classes: the larger the divisor you divide your hash by (i.e. the fewer buckets you have), the higher the chance of collisions. Additionally, that divisor should be a prime number so that the hash values distribute more evenly.

Anyway, with your division strategy, you are setting yourself up for collisions, and therefore mistaken counts. Let's say the word "supercollider" hashes to 7777000, and the word "atom" hashes to 7777999. If you divide both of their hashes by 1000, they will both result in 7777. Later, when you are counting words based on that divided hash, when you encounter "supercollider" you count it as one hit for bucket 7777, and when you encounter "atom" you count that as a hit for bucket 7777 again. So if "supercollider" appears 5 million times and "atom" appears 9 million times, how do you know which one was used more frequently?

Anyway, your approach of using external hashing (as opposed to all internal hashing) is a viable strategy to reduce memory usage. A lot of old computer systems/programs from the 50s through the early 70s used this strategy.

On another topic: using the LINQ OrderBy() method is going to undo all your work to reduce memory usage, because it does all of its sorting in memory. So effectively all the contents of your external hashes are brought into memory.
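
For what it's worth, here is a rough sketch (the names are just illustrative) of how the top N entries of a frequency map can be selected without sorting the entire map: only the best N candidates seen so far are kept, so the extra memory for the selection step is O(N) instead of proportional to the map size. The dictionary being scanned is of course still in memory; this only avoids the additional sort buffer:

C#:
using System;
using System.Collections.Generic;
using System.Linq;

static class TopNExample
{
    // Select the top n (keyword, count) pairs without a full sort.
    public static List<KeyValuePair<string, int>> TopN(Dictionary<string, int> counts, int n)
    {
        // Candidates ordered by (count, keyword) so the weakest one can be evicted cheaply.
        var best = new SortedSet<KeyValuePair<string, int>>(
            Comparer<KeyValuePair<string, int>>.Create((a, b) =>
            {
                int byCount = a.Value.CompareTo(b.Value);
                return byCount != 0
                    ? byCount
                    : string.Compare(a.Key, b.Key, StringComparison.Ordinal);
            }));

        foreach (var entry in counts)
        {
            best.Add(entry);
            if (best.Count > n)
                best.Remove(best.Min); // evict the current smallest count
        }

        return best.Reverse().ToList(); // highest counts first
    }
}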
 

Is that because the hash code is limited to 32 bits? So with big data it will get collisions? Sorry, I studied development by myself, so no one taught me this. Why does a prime number make the hashing more efficient? Please teach me. Thank you very much.
 

And in my data, every keyword has 10 characters. Can the hash codes still collide when hashing?
 