I came across a post on the Cloudflare blog where the author described his experience tackling the problem of removing duplicate lines from a large file. The following text is a comment sharing my experience and an alternative approach to the task that yields performance similar to the author’s, but without going down the route of coding in C. Funnily enough, the comment was rejected by the Cloudflare blog moderators, so, not to lose a couple of hours’ worth of work, I am putting it out here.
The challenge
Note: In the interest of coherence, I include the starting point of the problem here. You can also read the original post on the Cloudflare blog.
The author was analysing a log file containing IP addresses and datacenter codes. The data looks as follows:
192.0.2.1 107
198.51.100.0 20
203.0.113.0 189
192.0.2.1 107
...
The data needed some pre-processing, including line de-duplication. The very first thing to try is obviously `sort -u`, or, what the author tried, `sort | uniq`. This pattern is handy on small datasets but becomes a false friend as soon as the size increases to even a modest few hundred megs. Here’s the result that made the author look for other solutions:
$ time (cat logs-popcount-org.txt | sort | uniq >/dev/null)
real 2m2.701s
user 2m10.667s
sys 0m2.767s
Two minutes was way too long for the task at hand, so after trying various options with `sort`, such as `--parallel`, `--buffer-size` and `--unique`, the author went down the route of writing C code to implement first a Bloom filter and then a hash-based deduplication.
The Bloom-filter version made it possible to dedup the above dataset in 12s, and the hash-based dedup managed it in 7s. Finally, the optimized hash-based dedup, which required profiling and disassembling, brought the timing down to a very impressive 2s.
It so happened that I used to deal with large text files in one of my jobs, and I got interested in how far I could push the timing from the naive `sort | uniq` without resorting to C. Unfortunately, the author did not mention the percentage of duplicates in the dataset, but it makes a significant difference. I have assumed 40% duplicates (which can be high or low, depending on the situation) and generated a dataset of IP addresses so that on my hardware `sort | uniq` performance roughly matches the author’s; this came out to 25M lines and approximately 350MB:
$ perl -e '
> while ($i < 15000000) {
> printf("%d.%d.%d.%d\n",
> int(rand(255)),int(rand(255)),int(rand(255)),int(rand(255)));
> $i++
> }' > logs-popcount-org-15M.txt
$ cp logs-popcount-org-15M.txt logs-popcount-org.txt
$ head -10000000 logs-popcount-org-15M.txt >> logs-popcount-org.txt
$ time (cat logs-popcount-org.txt | sort | uniq >/dev/null)
real 1m53.781s
user 2m13.406s
sys 0m1.312s
$ time wc -l logs-popcount-org.txt
25000000 logs-popcount-org.txt
real 0m0.182s
user 0m0.137s
sys 0m0.045s
Note: A keen eye might spot that generating 15M random IP addresses does not mean they are all going to be unique. This is a correct observation, but given that there are about 4B possible IP addresses, the number of duplicates produced by randomly generating octets is going to be exceedingly low. Indeed, in the above example the resulting file contained about 26k duplicates, which is approximately 0.2%. I chose to ignore this for the sake of simplicity.
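If you want to check the actual number of duplicates in a generated file, one way (a quick sketch I used for this sanity check, not part of the timed experiments) is to subtract the unique line count from the total:
$ # duplicates = total lines minus unique lines
$ echo $(( $(wc -l <logs-popcount-org-15M.txt) - $(sort -u logs-popcount-org-15M.txt | wc -l) ))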
So, I have 25M lines in the file, and the 1m53.781s that `sort | uniq` takes is roughly the same amount of time as in the author’s first test (2m2s). Let’s see how much we can speed it up.
One-liner approach
The author correctly observes that de-duplication does not require sorting the dataset. Indeed, the usual approach to removing duplicates in an unsorted file is the `awk '!a[$0]++'` pattern: the expression is true only the first time a given line is seen, so each line is printed exactly once. Let’s see how well it works:
$ time awk '!a[$0]++' logs-popcount-org.txt >/dev/null
real 0m21.624s
user 0m20.328s
sys 0m1.296s
Simply by switching to `awk` we can quickly cut the time to just over 20s (6 times faster!). Usually, `awk` is faster than `perl` at such tasks, but in this instance, probably because of more efficient hash memory management, `perl` lets us shave off a few more seconds (8 times faster now, 15s):
$ time perl -ne 'print $_ unless($a{$_}++)' <logs-popcount-org.txt >/dev/null
real 0m15.344s
user 0m14.332s
sys 0m1.013s
I believe this is not bad at all for a one-liner as it stands, and we are already in the vicinity of the result the author’s Bloom filter showed, but we can take a few extra steps to improve it.
Parallelizing de-duplication
Looking at CPU utilization, we can see that the process is CPU-bound and runs on just a single core. A natural way to share the load would be to split the file into chunks and run `uniq` in parallel for each chunk. The trick here is to make each chunk self-sufficient, i.e. the same IP address should not end up in different chunks. Linear file reading is fast, so we can simply read the file, look at each line, determine which chunk it belongs to, and send it to the respective `uniq` process (let’s call it a worker). It is important to keep the chunks well balanced so that each `uniq` worker gets a roughly equal number of lines to process.
A tempting choice is to split by the first character of the line (`0..9`), but that has a natural bias because there are more IP addresses starting with `1` and `2` than with any other digit. Instead, let’s look at the first 2 characters (`0..99`; IP addresses with a single digit in the first octet will have a dot as the second character and will still successfully convert to a number) and use a residue class to decide which worker to send each line to:
$ mkfifo fifo{0..4} # create 5 named pipes to communicate with uniq workers
$ # start 5 uniq workers in the background:
$ for i in {0..4}; do
> perl -ne 'print $_ unless($a{$_}++)' >> fifo$i.txt <fifo$i &
> done
$ # read the dataset and dispatch each line to corresponding worker
$ time <logs-popcount-org.txt perl -ne '
> BEGIN{
> $wrks=5; # number of uniq workers
> # open fifos to workers and store handles in an array
> for($i=0;$i<$wrks;$i++){open($fifos[$i],">./fifo$i")}
> }
> $class=substr($_,0,2)%$wrks; # figure out residue class by first two chars of IP
> print {$fifos[$class]} $_;' # write to corresponding fifo (uniq worker)
real 0m9.518s
user 0m8.065s
sys 0m0.280s
OK, we are 30% faster now and under 10 seconds. The above fragment creates 5 FIFOs and starts `uniq` workers on them, then runs a perl one-liner[1] that reads lines from the file, looks at the first 2 characters of each line, determines the residue class, and sends the IP address to the appropriate worker. We’ll need to combine the resulting files, but that takes a few tenths of a second:
$ ls fifo*txt
fifo0.txt fifo1.txt fifo2.txt fifo3.txt fifo4.txt
$ time cat fifo*txt > logs-popcount-org-wrk1.txt
real 0m0.482s
user 0m0.002s
sys 0m0.237s
Note: If the dataset fits into memory, then right after deduping all these files will be in the OS cache and the concatenation can take as little as 0.1s!
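By the way, it is easy to sanity-check that the residue classes really are reasonably balanced. The one-liner below is my own quick check (output omitted); on this randomly generated dataset the five buckets come out roughly equal:
$ # count how many lines fall into each of the 5 residue classes
$ perl -ne '$cnt[substr($_,0,2)%5]++; END{print "$_: $cnt[$_]\n" for 0..4}' <logs-popcount-org.txt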
This relatively simple construct puts us ahead of the Bloom-filter implementation in C! However, we can still do better.
Parallelizing line classification
Looking again at CPU utilization, we will find that the bottleneck is now reading and splitting the data file, not removing duplicates in the workers. Can we do something about it? How about running multiple splitters, each working on its own part of the file? I.e. we split the 25M lines into `N` parts and start `N` readers, each processing just its part of the file:
$ # start 5 uniq workers in the background:
$ for i in {0..4}; do
> perl -ne 'print $_ unless($a{$_}++)' >> fifo$i.txt <fifo$i &
> done
$ # start 6 readers
$ time (
> for i in {0..5}; do
> <logs-popcount-org.txt perl -sne '
> BEGIN{
> use Fcntl qw(:DEFAULT :flock);
> $wrks=5; # uniq workers cnt
> $step=int(25000000/6); # number of lines per reader
> $st=$p*$step; # lineno this reader starts at
> $end=$st+$step; # lineno this reader ends at
>
> for $i (0..$wrks-1){open($fifos[$i],">./fifo$i");}
> }
>
> $ln++; # line cnt
> exit if($ln>$end); # exit if past the end
> next unless($ln>$st); # skip until start
>
> $a[substr($_,0,2)%$wrks] .= $_; # add IP address to buffer
>
> if($ln%10000==0){ # flush buffers to uniq workers every 10k lines
> for $c (0..$wrks-1){
> # have to lock to avoid overlapping btw readers
> flock($fifos[$c],LOCK_EX);
> print {$fifos[$c]} $a[$c];
> flock($fifos[$c],LOCK_UN);
> $a[$c]="";
> }
> }
> END { # flush remaining buffers
> for $c (0..$wrks-1){
> flock($fifos[$c],LOCK_EX);
> print {$fifos[$c]} $a[$c];
> flock($fifos[$c],LOCK_UN);
> }
> }' -- -p=$i &
> done; wait)
real 0m7.177s
user 0m26.559s
sys 0m0.455s
This one is a little bit trickier because we have to handle concurrent writes to the FIFOs, hence the need for flock() and buffering: writes larger than PIPE_BUF (4KB on Linux) are not guaranteed to be atomic, so without locking the buffers flushed by different readers could interleave mid-line. But hey: 7.2 sec with just perl and the command line! This is the same performance as the hash-based deduplication implemented in C. This is definitely not too shabby.[2]
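One practical note from running these experiments (my own housekeeping, not something from the original post): the workers append to `fifo$i.txt`, so before re-running the pipeline on another dataset the previous outputs need to be cleared, e.g.:
$ # workers append (>>) to their output files, so remove stale results between runs
$ rm -f fifo{0..4}.txt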
Percentage of duplicates
As I mentioned earlier, the percentage of duplicates affects the timing noticeably. Let’s see how exactly. First, let’s generate two datasets, one with 80% duplicates and the other with 20%:
$ # generating dataset with 80% dups
$ perl -e 'while($i<5000000){printf("%d.%d.%d.%d\n",int(rand(255)),int(rand(255)),int(rand(255)),int(rand(255)));$i++}' >logs-popcount-org-5M.txt
$ for i in {0..4}; do cat logs-popcount-org-5M.txt >>logs-popcount-org-80pct-dups.txt; done
$ # generating dataset with 20% dups
$ perl -e 'while($i<20000000){printf("%d.%d.%d.%d\n",int(rand(255)),int(rand(255)),int(rand(255)),int(rand(255)));$i++}' >logs-popcount-org-20M.txt
$ cp logs-popcount-org-20M.txt logs-popcount-org-20pct-dups.txt
$ head -5000000 logs-popcount-org-20M.txt >>logs-popcount-org-20pct-dups.txt
And now let’s time the de-duplication using the simple and the most advanced filters:
$ time perl -ne 'print $_ unless($a{$_}++)' <logs-popcount-org-80pct-dups.txt >/dev/null
real 0m10.772s
user 0m10.398s
sys 0m0.374s
$ time perl -ne 'print $_ unless($a{$_}++)' <logs-popcount-org-20pct-dups.txt >/dev/null
real 0m17.931s
user 0m16.613s
sys 0m1.320s
As a reminder, on the dataset with 40% duplicates the timing was 15s. As you can see, the more duplicates are present in the dataset, the more effective deduplication becomes, and the spread is quite significant: about 30%.
Let’s see how our advanced deduping behaves:
$ time (for i in {0..5}; do <logs-popcount-org-80pct-dups.txt perl -sne 'BEGIN...')
real 0m6.266s
user 0m26.280s
sys 0m0.457s
$ time (for i in {0..5}; do <logs-popcount-org-20pct-dups.txt perl -sne 'BEGIN...')
real 0m7.826s
user 0m26.796s
sys 0m0.458s
Here the spread is more modest, about 10%, but the pattern repeats: the more duplicates, the faster the processing.
Conclusion
I hope this exercise was interesting for you; it certainly was fun and enjoyable for me. I often feel, however, that the standard tooling offered by Linux, such as `awk`, `sed`, `grep`, etc., is overlooked, and custom solutions are built instead, which takes more time and effort.
Update
After getting in touch with the author of the original blog post to share my experience, he reminded me of the effect of the locale setting on string processing performance, which I had completely forgotten about! The catch is that these days most systems have their default locale supporting UTF-8, a variable-width character encoding.
As you can imagine, comparing characters potentially represented by multiple bytes is expensive. However, when dealing with just numbers and standard ASCII characters, we can tell the tools to treat them as single-byte characters via a special environment variable. Let’s see how this affects the performance of `sort`:
$ time (cat logs-popcount-org.txt | LC_COLLATE=C sort -u | wc -l)
14973169
real 0m18.664s
user 0m17.804s
sys 0m1.183s
18s vs the almost 2 minutes of our very first try! That’s the price of UTF-8 support! But this is still running on just a single core; let’s try to get parallel workers involved and see how it fares:
$ time (cat logs-popcount-org.txt | LC_COLLATE=C sort -u -S4G | wc -l)
14973169
real 0m8.671s
user 0m34.327s
sys 0m2.140s
Even more impressive: `sort` managed to come very close to our advanced multi-reader, multi-worker solution.
Note: `sort`’s `--parallel` option creates threads only when the input is large enough to be worth the overhead. As it can’t get the size of data coming in via a pipe, it treats it like a small file and does not spin up multiple threads (a nasty surprise!). To overcome this, you need to direct `sort` to use a large buffer with `-S`.[3]
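An alternative I did not time here (so treat the command below as a sketch rather than a measured result) is to skip `cat` and the pipe entirely and pass the file name to `sort`, so it can see the input size and decide on parallelism itself:
$ # with a regular file as input, sort knows the data size and can parallelize on its own
$ LC_COLLATE=C sort -u --parallel=6 logs-popcount-org.txt | wc -l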
Would this `LC_COLLATE=C` (or the less subtle `LANG=C`) setting improve the performance of the `awk`- or `perl`-based solutions? Not really, as we do not rely on comparing strings; still, we remain faster in both the single-core and multi-core settings:
| Threads | sort  | Simple | Advanced |
|---------|-------|--------|----------|
| single  | 18.6s | 15s    | —        |
| multi   | 8.6s  | —      | 7.1s     |