Question Hashing files

Jason Phang

Programming Experience
Beginner
I am currently working on a project that uses the Google Drive REST API alongside ASP.NET. The project performs basic data deduplication: files that duplicate something already in Google Drive are detected and not allowed to be uploaded. To do this, I need to hash the files, and if a file's hash is identical to the hash of an existing file, it is treated as a duplicate. I have been looking through tutorials and documentation, but I am somewhat lost. How should I approach hashing the files and comparing the hash values, in something like a hash table? What methods are there for this? I just need a way to kickstart this problem.
 
What is the difficulty with hashing files? It is simply a matter of reading the file data, feeding that into your hash algorithm, and getting back the hash value. Did you look at the documentation? It even has sample code:
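For illustration, here is a minimal sketch of the read, feed, and get-back steps using the MD5 class in System.Security.Cryptography. The class name FileHasher and the method name ComputeMd5Hex are made up for this example, and the hex-string output format is just one convenient choice.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of any library.
public static class FileHasher
{
    // Returns the MD5 hash of the file at 'path' as a lowercase hex string.
    public static string ComputeMd5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // ComputeHash reads the stream to the end, so the whole file
            // never has to be loaded into memory at once.
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Calling FileHasher.ComputeMd5Hex on two files with identical bytes should give the same string, which is what makes the value usable as a duplicate check.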
 
Then once you have a hash value, you throw that into a data structure that lets you do quick look ups. You could use the HashSet that comes built into .NET Framework, or you could build your own data structure. If you have tons and tons of files already in Google Drive, then implementing a Bloom filter in front of your lookup may help speed things up -- but that is an optimization for later.
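As a rough sketch of the quick-lookup idea with the built-in HashSet, assuming the hashes are kept as hex strings. HashIndexDemo, BuildIndex, and IsDuplicate are illustrative names only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class HashIndexDemo
{
    // Builds a set of known hashes. The case-insensitive comparer is an
    // assumption here so that "ABCD" and "abcd" count as the same hex hash.
    public static HashSet<string> BuildIndex(IEnumerable<string> hexHashes)
    {
        return new HashSet<string>(hexHashes, StringComparer.OrdinalIgnoreCase);
    }

    // True when the hash has been seen before.
    public static bool IsDuplicate(HashSet<string> index, string hexHash)
    {
        return index.Contains(hexHash);
    }
}
```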

Get it working first. Then get it working right. And then finally consider getting it working fast.
 
Okay, thanks for the link. I managed to get MD5 working, which is something for now. Now I am wondering how I should integrate this file hashing with the Google Drive REST API.
 
He already told you in #3 - Bloom filters are not something you will hear being thrown around in the C# arena all too much, although they're more commonly referred to in C++. Think of them as a probabilistic version of a .Contains() check: they can tell you that an item is definitely not in the set, or that it might be, so a positive answer still needs confirming against the real lookup. First do a quick google on them so you know what they are, and then I'd probably advise scouring CodePlex, CodeProject or GitHub for some code, but be careful not to infringe on any licences, as most Bloom filters written in C# are licensed by their original authors.
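To give a feel for what such a filter does, here is a bare-bones sketch. It is not taken from any of those libraries; SimpleBloomFilter, the choice of FNV-1a and djb2 as the two string hashes, and the double-hashing scheme for deriving bit positions are all assumptions made for this example.

```csharp
using System;
using System.Collections;

// Hypothetical minimal Bloom filter: may report false positives,
// never false negatives.
public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Sets one bit per hash function.
    public void Add(string item)
    {
        for (int i = 0; i < _hashCount; i++)
            _bits[IndexFor(item, i)] = true;
    }

    // False means "definitely not added"; true means "possibly added".
    public bool MightContain(string item)
    {
        for (int i = 0; i < _hashCount; i++)
            if (!_bits[IndexFor(item, i)]) return false;
        return true;
    }

    // Double hashing: position i is (h1 + i * h2) mod bit count.
    private int IndexFor(string item, int i)
    {
        uint h1 = Fnv1a(item);
        uint h2 = Djb2(item) | 1u; // force odd so h2 is never zero
        ulong combined = h1 + (ulong)i * h2;
        return (int)(combined % (ulong)_bits.Length);
    }

    private static uint Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s) { hash ^= c; hash *= 16777619; }
            return hash;
        }
    }

    private static uint Djb2(string s)
    {
        unchecked
        {
            uint hash = 5381;
            foreach (char c in s) hash = hash * 33 + c;
            return hash;
        }
    }
}
```

A filter like this would sit in front of the real lookup: when MightContain returns false the file is certainly new, and only a true result needs the slower exact check.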
 
My recommendation is try things with a HashSet first. See if it is fast enough.

I'm not seeing the issue with how to integrate this with the Google Drive REST API. It feels like you want us to design your project for you. If so, here's a naive design:
C#:
Init:
while (files available in Google Drive)
    download file data from Google Drive.
    compute hash of file data
    put hash in data structure

OnFileUploadAttempt:
compute hash of file to be uploaded
if data structure does not contain hash
    upload file
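
The naive design above could look roughly like this in C#. This is only a sketch: DedupUploader is a made-up class, the existing Drive files are passed in as plain byte arrays, and the actual upload is a caller-supplied callback, so no real Google Drive API calls appear here. Recording the hash after a successful upload is an addition to the pseudocode, so that the same file cannot be uploaded twice in one session.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical wrapper around the Init / OnFileUploadAttempt steps.
public sealed class DedupUploader
{
    private readonly HashSet<string> _knownHashes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    private readonly Action<string, byte[]> _upload; // stand-in for the real upload call

    // Init: hash every file that is already stored.
    public DedupUploader(IEnumerable<byte[]> existingFiles, Action<string, byte[]> upload)
    {
        _upload = upload;
        foreach (byte[] data in existingFiles)
            _knownHashes.Add(Md5Hex(data));
    }

    // OnFileUploadAttempt: returns false when the content is a duplicate.
    public bool TryUpload(string name, byte[] data)
    {
        string hash = Md5Hex(data);
        if (_knownHashes.Contains(hash)) return false;

        _upload(name, data);
        _knownHashes.Add(hash); // remember the new file (not in the pseudocode)
        return true;
    }

    private static string Md5Hex(byte[] data)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(data)).Replace("-", "");
    }
}
```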
 
The Dictionary in the .NET Framework is built on the hash table concept, so with that in mind, if it were me, I'd likely use a Dictionary<TKey, TValue>, such as <hash, filename> or something to that effect. This is a good way to keep an accumulation of your hashes.
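A small sketch of that <hash, filename> idea, in case it helps. HashToNameIndex and TryRegister are hypothetical names, and the hashes are assumed to be hex strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class HashToNameIndex
{
    public static Dictionary<string, string> Create()
    {
        return new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    }

    // Registers hash -> fileName if the hash is new and returns true.
    // If the hash already exists, returns false and reports the file
    // name that was stored first via existingName.
    public static bool TryRegister(Dictionary<string, string> index, string hash,
                                   string fileName, out string existingName)
    {
        if (index.TryGetValue(hash, out existingName)) return false;
        index.Add(hash, fileName);
        return true;
    }
}
```

The extra value only pays off if the user needs to be told which existing file the upload matched.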
 
True, but he doesn't really need the name of the duplicated file. All he stated in his original post is that he needed to prevent the upload if the hashes matched. He didn't say that he had to tell the user which file it already matched in Google Drive.
 
I am not looking for direct answers but rather just want to understand how I should approach this problem. The Google Drive API documentation is not that helpful and I am fairly new to C#, which is why I am still unfamiliar with it. Based on the pseudocode, it looks like I should first generate the hash value of each file, and in this case I am using MD5. Those hashes will be stored in a HashSet. Before a hash is stored, the program checks the HashSet to determine whether that hash value already exists. If not, the file is uploaded to Google Drive and its hash value is stored in the HashSet. If there is a duplicate, the program just informs the user that the file is a duplicate and that he/she needs to upload another file. Is this how I should approach my problem? I am still reading up on and trying to understand the HashSet.
 
So from what I can deduce so far about the HashSet, two identical elements are not possible in a HashSet, yes? Meaning that if the hash of a file already exists in the HashSet, it is not possible to add a duplicate hash?
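One quick way to see this behaviour for yourself: HashSet<T>.Add returns false and leaves the set unchanged when the element is already present. The helper below (HashSetAddDemo and CountNew are illustrative names) just counts how many hashes in a sequence were actually new.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class HashSetAddDemo
{
    // Adds each hash to 'seen' and returns how many were not already there.
    public static int CountNew(HashSet<string> seen, IEnumerable<string> hashes)
    {
        int added = 0;
        foreach (string h in hashes)
        {
            // Add returns true only when the element was not in the set.
            if (seen.Add(h)) added++;
        }
        return added;
    }
}
```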
 
If you are still on the learning curve for C#, this project is likely not the project you want to tackle first because you have multiple learning curves to climb simultaneously:
1) data structures and algorithms
2) C#
3) C# built in data structures
4) Google Drive API.
5) OAuth
 
This project was assigned to me, so the only way is to learn it the hard way, which is why I am trying to break the problem into sub-problems. How would I store the MD5 hash values of the files in a HashSet so that matching hash values can be compared? I am well aware that a HashSet cannot contain duplicate hashes.
 
How much of a Computer Science or Computer Engineering background do you have? If none, have you at least taught yourself some rudimentary data structures and algorithms?
 
Take a look at the documentation for HashSet. You'll see all kinds of methods and properties. You may even find some sample code there. :)
 