Processing a large file with database lookups

etl2016

Hi,

I have a large file with many millions of rows per day. I need to look up a column from the input file, fetch the related information from the database, and regenerate the same file with all the same column values as the original, but with that one column transformed. There is no database load involved, just lookups.

Could you please suggest performance considerations, such as splitting the file into smaller chunks and processing them in parallel, given that a database is involved? Which C# .NET programming features should I be exploring (multithreading?)

thank you
 
Is the replacement data looked up from the database exactly the same length? E.g. the old value is "1234567" and the replacement will be "Letters". I'm asking to see whether it is worth doing an in-place replacement. Less data to rewrite.
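
If the lengths do match, you can skip the rewrite entirely: open the file read-write, seek to the field, and overwrite just those bytes. A minimal sketch, assuming a single-byte encoding like ASCII, a known byte offset for the field, and a replacement of exactly the same byte length (OverwriteField and the offset handling are illustrative, not from this thread):

```csharp
using System.IO;
using System.Text;

class InPlacePatch
{
    // Overwrite a field in place. Only safe when the new value occupies
    // exactly the same number of bytes as the old one.
    static void OverwriteField(string path, long byteOffset, string newValue)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(newValue); // assumes single-byte data

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            fs.Seek(byteOffset, SeekOrigin.Begin); // jump straight to the field
            fs.Write(bytes, 0, bytes.Length);      // patch it, leave the rest untouched
        }
    }
}
```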
 
Yes, this definitely sounds like you could get some perf gains by partitioning the input file and processing the partitions in parallel.
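
Before the threading, though, the cheapest win is usually on the database side: don't issue one query per row. If the set of distinct keys is manageable, pull the whole key-to-value mapping into a Dictionary up front, and the per-row work becomes a pure in-memory transform that PLINQ can spread across cores. A rough sketch; LoadLookupTable, the comma delimiter, and the column index are placeholders for your actual schema:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ParallelTransform
{
    static void Main()
    {
        // One round trip: fetch the whole key -> replacement mapping up front
        // instead of querying the database once per row.
        Dictionary<string, string> lookup = LoadLookupTable();

        var transformed = File.ReadLines("input.txt")
            .AsParallel()
            .AsOrdered()                       // keep the original row order
            .Select(line => TransformLine(line, lookup));

        File.WriteAllLines("output.txt", transformed);
    }

    static string TransformLine(string line, Dictionary<string, string> lookup)
    {
        string[] cols = line.Split(',');
        if (lookup.TryGetValue(cols[2], out string replacement)) // third column, say
            cols[2] = replacement;
        return string.Join(",", cols);
    }

    static Dictionary<string, string> LoadLookupTable()
    {
        // e.g. SELECT key, value FROM mapping - details depend on your provider
        return new Dictionary<string, string>();
    }
}
```

AsOrdered() keeps the output rows in the same order as the input, at a small cost in throughput.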

I'm doubtful about actually splitting the input file into smaller files, though. The time spent reading the input and writing it back out as smaller files could be better spent on the actual processing. If you have any kind of antivirus system or data loss prevention nannyware running on the machine, your write times will be much higher than your read times.
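
The alternative to a physical split is a logical one: compute N byte ranges over the single input file and let each worker read only its own range. A sketch of the offset computation, assuming '\n' line endings; boundaries are snapped forward past the next newline so no row straddles two partitions:

```csharp
using System;
using System.IO;

class LogicalPartitions
{
    // Compute byte ranges [start, end) that together cover the whole file.
    static (long start, long end)[] ComputePartitions(string path, int count)
    {
        long length = new FileInfo(path).Length;
        var ranges = new (long start, long end)[count];

        using (var fs = File.OpenRead(path))
        {
            long start = 0;
            for (int i = 0; i < count; i++)
            {
                // Tentative boundary at an even fraction of the file...
                long end = (i == count - 1)
                    ? length
                    : Math.Max(start, length * (i + 1) / count);

                // ...then scan forward so we never cut a line in half.
                if (end < length)
                {
                    fs.Seek(end, SeekOrigin.Begin);
                    int b;
                    while ((b = fs.ReadByte()) != -1 && b != '\n')
                        end++;
                    end++;                        // include the newline itself
                    if (end > length) end = length;
                }

                ranges[i] = (start, end);
                start = end;
            }
        }
        return ranges;
    }
}
```

Each worker then opens its own read-only FileStream over its range, so nothing extra is ever written to disk.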

But yes, for generating the output, if you cannot replace in place, then writing into separate smaller files will have less contention for file locks than having each thread wait for another thread to finish writing its line before it can write out its own. Just reassemble one big file at the end. Yes, there will still be the AV and nannyware overhead, but it's a price that you have to pay.
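
The reassembly step at the end is cheap, since it's a raw sequential byte copy. A sketch, assuming each worker wrote its partition to a hypothetical part-N.txt in input order:

```csharp
using System.IO;

class Reassemble
{
    // Concatenate part-0.txt, part-1.txt, ... into one output file, in order.
    static void ConcatenateParts(string outputPath, int partCount)
    {
        using (var output = File.Create(outputPath))
        {
            for (int i = 0; i < partCount; i++)
            {
                string part = $"part-{i}.txt";
                using (var input = File.OpenRead(part))
                {
                    input.CopyTo(output); // raw byte copy, no re-parsing of lines
                }
                File.Delete(part);        // clean up the temp partition
            }
        }
    }
}
```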

Out of curiosity: presumably this operation is meant to be a batch job that runs overnight, so I can see that the absolute run-time limit has to be less than 8 hours. But does it really matter whether the file is generated in 2 hours vs. 7 hours if nobody is going to read that data until 6 AM the next day anyway? I'm wondering if you are prematurely optimizing.
 
My last question about time constraints above may be moot, now that I've noticed your username. All the ETL folks I've encountered seem to be obsessed with data throughput performance, regardless of whether the final consumer really needs things that fast or not.
 