Question Hashing files

Jason Phang · Aug 13, 2019

I am currently working on a project that uses the Google Drive REST API alongside ASP.NET. The project performs basic data deduplication: files that duplicate ones already in Google Drive should be detected and not allowed to be uploaded. To do this, I need to hash the files, and if a file's hash is identical to the hash of an existing file, the file is treated as a duplicate. I have been looking around for tutorials and documentation, but I am somewhat lost. How should I approach this if I want to hash the files and compare for identical hash values in something like a hash table? What methods are there for this? I just need a way to kickstart this problem.

Skydiver · Aug 13, 2019

What is the difficulty with hashing files? It is simply a matter of reading the file data, feeding that into your hash algorithm, and getting back the hash value. Did you look at the documentation? It even has sample code:

HashAlgorithm.ComputeHash Method (System.Security.Cryptography)

Computes the hash value for the input data.

docs.microsoft.com
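A minimal sketch of the "read the file, feed it to the hash algorithm, get the hash back" step, assuming MD5 as the algorithm (any other HashAlgorithm subclass, such as SHA256, is used the same way). FileHasher and ComputeMd5Hex are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Hashes everything from the stream's current position to its end
    // and returns the result as a lowercase hex string.
    public static string ComputeMd5Hex(Stream stream)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Convenience overload: opens the file at `path` and hashes its content.
    public static string ComputeMd5Hex(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            return ComputeMd5Hex(stream);
        }
    }
}
```

Because ComputeHash reads from a stream, the file does not have to be loaded into memory all at once.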

Skydiver · Aug 13, 2019

Then once you have a hash value, you throw that into a data structure that lets you do quick lookups. You could use the HashSet<T> that comes built into the .NET Framework, or you could build your own data structure. If you have tons and tons of files already in Google Drive, then implementing a Bloom filter in front of your lookup may help speed things up -- but that is an optimization for later.

Get it working first. Then get it working right. And then finally consider getting it working fast.

Jason Phang · Aug 13, 2019

Okay, thanks for the link. I managed to get MD5 working, which is something for now. Now I am wondering how I should integrate this file hashing with the Google Drive REST API.

NoUserHere · Aug 13, 2019

He already told you in post #3. Bloom filters are not something you will hear being thrown around in the C# arena all that much, although they're more commonly referred to in C++. A Bloom filter is a probabilistic membership test: it can tell you that an item is definitely not in the set, or that it possibly is, so it works as a cheap pre-check in front of a real .Contains() lookup. First do a quick Google search on them so you know what they are, and then I'd probably advise scouring CodePlex, CodeProject or GitHub for some code. Be careful not to infringe on any licences, as most Bloom filters written in C# are licensed by their original authors.
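For readers who want to see the idea rather than hunt for a library, here is a rough sketch of a Bloom filter. It is not taken from any existing project: SimpleBloomFilter, its parameters and the choice of two seeded FNV-1a hashes combined by double hashing are all assumptions made for illustration, and the sizes would need tuning for real data.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException(nameof(bitCount));
        if (hashCount <= 0) throw new ArgumentOutOfRangeException(nameof(hashCount));
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Sets one bit per hash function for the given item.
    public void Add(string item)
    {
        foreach (int index in Indexes(item)) _bits[index] = true;
    }

    // false: the item was definitely never added.
    // true:  the item was possibly added (false positives can happen).
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
        {
            if (!_bits[index]) return false;
        }
        return true;
    }

    // Derives _hashCount bit positions from two base hashes (double hashing).
    private IEnumerable<int> Indexes(string item)
    {
        byte[] data = Encoding.UTF8.GetBytes(item);
        uint h1 = Fnv1a(data, 2166136261u);
        uint h2 = Fnv1a(data, 0x9747b28cu) | 1u; // force an odd step
        for (int k = 0; k < _hashCount; k++)
        {
            uint combined = unchecked(h1 + (uint)k * h2);
            yield return (int)(combined % (uint)_bits.Length);
        }
    }

    // 32-bit FNV-1a over the bytes, starting from the given seed value.
    private static uint Fnv1a(byte[] data, uint seed)
    {
        uint hash = seed;
        foreach (byte b in data)
        {
            hash ^= b;
            hash = unchecked(hash * 16777619u);
        }
        return hash;
    }
}
```

Since a "possibly present" answer can be wrong, a positive result would still need to be confirmed against the real lookup structure; only a negative answer lets you skip it.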

Skydiver · Aug 13, 2019

My recommendation is try things with a HashSet first. See if it is fast enough.

I'm not seeing the issue with how to integrate this with the Google Drive REST API. It feels like you want us to design your project for you. If so, here's a naive design:

C#:

Init:
while (files available in Google Drive)
    download file data from Google Drive
    compute hash of file data
    put hash in data structure

OnFileUploadAttempt:
compute hash of file to be uploaded
if data structure does not contain hash
    upload file
    put hash in data structure
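The naive design above could be sketched in C# roughly as follows. This is only an illustration under some assumptions: DuplicateGuard is a hypothetical class, MD5 is assumed as the hash, and the Google Drive calls are deliberately left out. The existing file contents are passed in as streams and the actual upload is a delegate supplied by the caller, so no Drive API is invented here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public sealed class DuplicateGuard
{
    private readonly HashSet<string> _knownHashes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Init: hash the content of every file that is already stored.
    public void Init(IEnumerable<Stream> existingFiles)
    {
        foreach (Stream file in existingFiles)
        {
            _knownHashes.Add(Hash(file));
        }
    }

    // OnFileUploadAttempt: returns true if the file was uploaded,
    // false if a file with the same hash is already known.
    public bool TryUpload(Stream file, Action<Stream> upload)
    {
        string hash = Hash(file);
        if (_knownHashes.Contains(hash)) return false;

        file.Position = 0;      // rewind after hashing; assumes a seekable stream
        upload(file);           // caller-supplied upload step (hypothetical)
        _knownHashes.Add(hash); // remember it so a second attempt is rejected
        return true;
    }

    private static string Hash(Stream stream)
    {
        using (var md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }
}
```

The set lives in memory only, so in this sketch Init would have to run again whenever the application restarts.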

NoUserHere · Aug 13, 2019

The Dictionary in the .NET Framework is built on the concept of a hash table. With that in mind, if it were me, I'd likely use a Dictionary<TKey, TValue> such as <hash, filename> or something to that effect. This is a good way to keep an accumulation of your hashes.
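A small sketch of the dictionary variant, assuming the hash is kept as a hex string and mapped to the name of the first file seen with that hash. HashIndex and FindDuplicateOrAdd are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public sealed class HashIndex
{
    // Maps a file hash (hex string) to the name of the file that produced it.
    private readonly Dictionary<string, string> _fileNameByHash =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Returns the name of the already-known file with this hash,
    // or null if the hash is new (in which case it is recorded).
    public string FindDuplicateOrAdd(string hash, string fileName)
    {
        if (_fileNameByHash.TryGetValue(hash, out string existing))
        {
            return existing;
        }
        _fileNameByHash[hash] = fileName;
        return null;
    }
}
```

Compared with a plain set, this keeps enough information to tell the user which existing file the upload collides with.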

Skydiver · Aug 13, 2019

True, but he doesn't really need the name of the duplicated file. All he stated in his original post is that he needed to prevent the upload if the hashes matched. He didn't say that he had to tell the user which file it already matched in Google Drive.

Jason Phang · Aug 13, 2019

I am not looking for direct answers; I just want to understand how I should approach this problem. Since the Google Drive API documentation is not that helpful and I am fairly new to C#, I am still unfamiliar with it. Based on the pseudocode, it looks like I should approach my problem like this: first, I generate the hash value of the file (in this case I am using MD5). Before the hash is stored in a HashSet, the program checks the HashSet to determine whether that hash value already exists. If not, the file is uploaded to Google Drive and its hash value is stored in the HashSet. If there is a duplicate, the program just informs the user that the file is a duplicate and that he/she needs to upload another file. Is this how I should approach my problem? I am still reading about and trying to understand the HashSet.

Jason Phang · Aug 13, 2019

So from what I can deduce so far about the HashSet, two identical elements cannot both be in a HashSet, yes? Meaning that if the hash of a file already exists in the HashSet, it is not possible to add a duplicate of that hash?
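The behaviour asked about here can be seen directly in HashSet<T>.Add, which returns false instead of storing a second copy when the element is already present. A short sketch, with DuplicateFinder and FindDuplicates as hypothetical names:

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Returns every hash that had already been seen earlier in the sequence.
    public static List<string> FindDuplicates(IEnumerable<string> hashes)
    {
        var seen = new HashSet<string>();
        var duplicates = new List<string>();
        foreach (string hash in hashes)
        {
            // Add returns false when the set already contains the element.
            if (!seen.Add(hash))
            {
                duplicates.Add(hash);
            }
        }
        return duplicates;
    }
}
```

So a separate Contains check before Add is optional: the return value of Add already says whether the hash was new.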

Skydiver · Aug 13, 2019

Jason Phang said:
since the google drive api documentation is not that helpful and I am fairly new with C#

Seems pretty helpful to me, since it even has sample code for various languages and platforms. For example, how to download a file if you know its ID:

Download and export files | Google Drive | Google for Developers

Explore detailed instructions for performing several types of download and export actions.

developers.google.com

Skydiver · Aug 13, 2019

If you are still on the learning curve for C#, this project is likely not the project you want to tackle first because you have multiple learning curves to climb simultaneously:
1) data structures and algorithms
2) C#
3) C# built-in data structures
4) Google Drive API
5) OAuth

Jason Phang · Aug 14, 2019

This project was assigned to me, so the only way is to learn it the hard way, which is why I am trying to break the problem into sub-problems. How would I store the MD5 hash values of the files in a HashSet so that I can check for matching hash values? I am well aware that a HashSet cannot hold duplicate hashes.
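One practical detail when storing hashes in a HashSet: ComputeHash returns a byte[], and arrays in .NET are compared by reference by default, so a HashSet<byte[]> would not recognise two arrays with equal content as the same element. A common workaround, sketched here with the hypothetical names HashKeys and ToHex, is to convert the bytes to a hex string and store that in a HashSet<string>.

```csharp
using System;
using System.Collections.Generic;

public static class HashKeys
{
    // Converts raw hash bytes to an uppercase hex string, e.g. { 0xAB, 0x01 } to "AB01".
    public static string ToHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "");
    }

    // Demonstrates the pitfall: a second array with equal content is NOT found,
    // because the default comparer for byte[] uses reference equality.
    public static bool ByteArraySetFindsEqualContent()
    {
        var set = new HashSet<byte[]>();
        set.Add(new byte[] { 1, 2 });
        return set.Contains(new byte[] { 1, 2 });
    }

    // The same check with hex strings works, because strings compare by value.
    public static bool StringSetFindsEqualContent()
    {
        var set = new HashSet<string>();
        set.Add(ToHex(new byte[] { 1, 2 }));
        return set.Contains(ToHex(new byte[] { 1, 2 }));
    }
}
```

An alternative would be to pass a custom IEqualityComparer<byte[]> to the HashSet constructor, but the string form is simpler to log and debug.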

Skydiver · Aug 14, 2019

How much of a Computer Science or Computer Engineering background do you have? If none, have you at least taught yourself some rudimentary data structures and algorithms?

Skydiver · Aug 14, 2019

Take a look at the documentation for HashSet. You'll see all kinds of methods and properties. You may even find some sample code there.
