Sort -u Large Files

on Jan. 31, 2019, 8:06 p.m.

I know this is a huge issue the world is dying to know how to solve, so why not write about it here. Recently I acquired several large text files that have... words in them. Herein lies the problem: for this to be usable, I have to get it down to a single text file containing only the unique... words. After many failed attempts, I found a solution. First, you have to split the large file into smaller text files using hashcat-utils. Among these utils is a tool called splitlen, which splits a file into smaller files according to line length. Use it to break the 600+ GB file apart:

./splitlen.bin outdir < infile
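If you don't have hashcat-utils handy, the split-by-line-length idea itself is simple. Here is a rough awk stand-in on toy data (the file and directory names are made up for illustration, and the real splitlen may name its output files differently):

```shell
# Toy stand-in for splitlen: route each input line into a file named
# after its length. The file and directory names here are made up.
printf 'cat\ndog\nhorse\ncat\n' > toy_in
mkdir -p toydir
awk '{ print > ("toydir/" length($0)) }' toy_in
# toydir/3 now holds cat, dog, cat; toydir/5 holds horse
```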

The above command will write into "outdir" one file per line length, up to 64 of them. From there, run the following on those files.

mkdir -p stmp
for i in {01..64}; do [ -f outdir/$i ] && sort -u -T stmp/ --parallel=8 outdir/$i > ${i}.txt ; done
for i in {01..64}; do [ -f ${i}.txt ] && cat ${i}.txt >> sorted_dedupped.txt ; done
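One related knob worth knowing: GNU sort's -S flag caps the in-memory buffer, so you can control when it starts spilling to the -T temp directory. A tiny sketch on made-up file names:

```shell
# Cap GNU sort's memory buffer with -S so spills go to the -T directory
# early and predictably. sample.txt is an illustrative stand-in.
mkdir -p stmp
printf 'b\na\nb\nc\n' > sample.txt
sort -u -S 64M -T stmp/ --parallel=8 sample.txt > sample.sorted.txt
# sample.sorted.txt: a, b, c
```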

Now, if you're still interested, let me explain what this does. From my understanding, sort performs an external merge sort: it writes sorted chunks to temp files on disk instead of eating all your RAM, then merges those chunks back together until it has your end result. The -T flag tells it where to store the temp files, and --parallel tells it to run 8 sort threads at once (one per CPU core is what I'm doing). Without enough space for those temp files, sort will just crash. The final concatenation is safe because splitlen grouped the lines by length, and two lines of different lengths can never be duplicates of each other. So once the 01-64 files are each sorted and deduplicated, you can combine them all together and have a unique list.
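To convince yourself that the concatenation step is safe, here's a toy run of the whole pipeline, with awk standing in for splitlen and made-up file names throughout: deduping each length bucket and concatenating yields exactly the same set of lines as one big sort -u.

```shell
# Dedup per length bucket, concatenate, then compare (as a set) against
# a single global sort -u. All names here are illustrative.
printf 'aa\nbbb\naa\nccc\nbbb\nd\n' > toy.txt
mkdir -p buckets btmp
awk '{ print > ("buckets/" length($0)) }' toy.txt
for f in buckets/*; do sort -u -T btmp/ "$f"; done > combined.txt
sort -u toy.txt > global.txt
# combined.txt is unique but not globally sorted; as a set it matches:
sort combined.txt | diff - global.txt
```

The diff comes back empty, which shows the per-bucket approach loses nothing compared to sorting the whole file at once.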