Comparing files over FTP

Beau · New member · Joined Oct 3, 2021 · Messages: 4 · Programming Experience: Beginner
Hi there! I've developed a program that compares the byte sizes of files on an FTP server against the copies on the client's machine. It then tells the user which files need updating and lets them update those files. However, some of the files have minor differences (e.g. "Version 1.2" vs "Version 1.3") that don't change the byte count, so the size comparison sees them as equal even though the contents differ. How would you recommend comparing the files over FTP more accurately, without having to download all of the bytes to the client's machine to compare them? Thanks in advance!
 
Solution
You should store a list of MD5 hashes on the FTP server. You can then download that list, hash each local file, and download only the files whose hashes don't match. Two files with the same hash aren't guaranteed to be identical, but the chance of two different files producing the same hash is vanishingly small, and two files that differ by even one byte will produce completely different hashes.
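If it helps, hashing a file in .NET is only a few lines. A minimal sketch, assuming C# (the class and method names here are just for illustration; the manifest format on the server is up to you):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Hashing
{
    // Stream the file through MD5 so large files aren't loaded into memory.
    public static string ComputeMd5(string path)
    {
        using var md5 = MD5.Create();
        using var stream = File.OpenRead(path);
        byte[] hash = md5.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", ""); // uppercase hex, e.g. "9E107D..."
    }
}
```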
 
As I recall, the known MD5 hash collisions involved objects of different sizes, so it would be good to have both the hashes and the file sizes in that index file. Check the file sizes first; if they're the same, then check the hashes.
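A sketch of that size-first check, again assuming C# (NeedsUpdate and its parameters are hypothetical; the size and hash would come from your index file on the server):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class UpdateCheck
{
    // Compare the size first (free via FileInfo); hash only when the sizes match.
    public static bool NeedsUpdate(string localPath, long serverSize, string serverMd5)
    {
        var info = new FileInfo(localPath);
        if (!info.Exists) return true;              // missing locally: download it
        if (info.Length != serverSize) return true; // cheap check catches most changes

        using var md5 = MD5.Create();
        using var stream = File.OpenRead(localPath);
        string localMd5 = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        return !string.Equals(localMd5, serverMd5, StringComparison.OrdinalIgnoreCase);
    }
}
```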
 
Hi, thanks for your responses! I've implemented this and it seems to be working in my tests so far.

The only problem I'm facing now is that when the program scans for files, it calculates the folder sizes first and, if they're the same, doesn't bother checking the individual files. Is there a way to do a similar process for whole folders, or do you recommend I just scan each file during the initial scan? (By the way, the total size of all the files is about 50 GB+.)
 
Yes, you can compute the hash of an entire folder by scanning through all the files in the folder and computing a hash as if all the files were concatenated together. Order matters, though: scan the same files in a different order and you get a different hash. Alternatively, you could compute a hash of the filenames, modified times, and file sizes for each folder.
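Here's a sketch of the metadata variant in C# (FolderMetadataHash is a hypothetical helper; sorting the paths gives a stable order so equal folders hash equally — note the modified-time caveat below):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class FolderHashing
{
    // Hash the folder's metadata (relative path, size, modified time) instead
    // of its contents, so 50 GB of data never has to be read.
    public static string FolderMetadataHash(string folder)
    {
        var sb = new StringBuilder();
        var files = Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories)
                             .OrderBy(f => f, StringComparer.Ordinal); // fixed order
        foreach (var file in files)
        {
            var info = new FileInfo(file);
            sb.Append(Path.GetRelativePath(folder, file)) // .NET Core / .NET 5+ API
              .Append('|').Append(info.Length)
              .Append('|').Append(info.LastWriteTimeUtc.Ticks)
              .Append('\n');
        }
        using var md5 = MD5.Create();
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
```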

Ouch, computing hashes over 50 GB will take a while, though. My recommendation is to maximize the information you can get from the OS. Use the last modified time (unless your users are hackers who like to twiddle last modified times to make files look unchanged). Chances are this will narrow down the modified files quickly.
 
This may be a stupid question on my behalf, but wouldn't the last modified time differ from client to client compared with the server, since a freshly downloaded file has its modified date set to whenever it was created on that computer?
 
Yes, but like I alluded to above, you can tweak the last modified time to make it match the server.
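In .NET that's a single call after each download. A sketch, with placeholder names (getting serverTimeUtc from the server is covered in the next post):

```csharp
using System;
using System.IO;

static class Timestamps
{
    // Stamp the freshly downloaded file with the server's modified time so
    // future scans can compare timestamps directly.
    public static void MatchServerTimestamp(string localPath, DateTime serverTimeUtc)
        => File.SetLastWriteTimeUtc(localPath, serverTimeUtc);
}
```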
 
And presumably, you already have a way of getting the timestamps from the server. If not, there are some interesting discussions on StackOverflow about that topic. Apparently there is some sample code in MSDN, but it requires you to get the timestamp one file at a time.
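For reference, a sketch of that one-file-at-a-time query using the built-in FtpWebRequest (the URL, user, and password are placeholders):

```csharp
using System;
using System.Net;

static class FtpTimestamps
{
    // Asks the server for a single file's modified time (the FTP MDTM command);
    // as noted above, this costs one round trip per file.
    public static DateTime GetTimestampUtc(string fileUrl, string user, string password)
    {
        var request = (FtpWebRequest)WebRequest.Create(fileUrl); // e.g. "ftp://host/dir/file.txt"
        request.Method = WebRequestMethods.Ftp.GetDateTimestamp;
        request.Credentials = new NetworkCredential(user, password);
        using var response = (FtpWebResponse)request.GetResponse();
        return response.LastModified.ToUniversalTime();
    }
}
```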
 
