At Milo, I pretty frequently need to pull data down from production to my workstation to test some new code. That's what happens when you raise a Series A round - you can't live-edit production data anymore. I think it's in the term sheet somewhere.

Anyhow, I was pulling down a 14GB MySQL database dump today. Compressing it with plain-Jane gzip was pretty slow, so I looked for some parallel options. The server I was pulling from has 16 cores, so I figured I could put them to use. Here's what I found:

  • pbzip2 - Parallel BZIP2: Parallel implementation of BZIP2. BZIP2 is well known for being balls slow, so speed it up using multiple CPUs.
  • pigz - Parallel GZIP: Parallel implementation of GZIP written by Mark Adler (guy who co-authored zlib and gzip, so you can be reasonably confident he has his shit together).

On the 14GB database dump, both are faster than vanilla GZIP. Because Hacker News and Reddit both love this shit, here are the timing stats:

  • Plain gzip, default compression level: 11 minutes, 58 seconds. Resultant file is 2.3GB.
  • pbzip2, default compression level: 8 minutes, 48 seconds. Resultant file is 1.7GB.
  • pigz, default compression level: 1 minute, 33 seconds. Resultant file is 2.3GB.

Again, this was on a 14GB database dump file, on a 16-core machine, with Intel solid state disks.
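
If you want to run the same kind of comparison on your own dump, here is a rough Python sketch of a timing harness. It is not the script I used for the numbers above. The function name `time_compressor` and the output file names are made up for illustration, and it assumes each tool accepts `-c` to write the compressed stream to stdout, which gzip, pigz, and pbzip2 all do. Tools that aren't installed are skipped.

```python
import os
import shutil
import subprocess
import sys
import time


def time_compressor(cmd, src_path, out_path):
    """Run `cmd + [src_path]`, sending its stdout to out_path.

    Returns (elapsed_seconds, compressed_bytes), or None if the
    tool named in cmd[0] is not installed. Hypothetical helper.
    """
    if shutil.which(cmd[0]) is None:
        return None
    start = time.perf_counter()
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the compressor fails.
        subprocess.run(cmd + [src_path], stdout=out, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(out_path)


# Each tool at its default compression level; "-c" means "write to stdout".
TOOLS = {
    "gzip": (["gzip", "-c"], ".gz"),
    "pbzip2": (["pbzip2", "-c"], ".bz2"),
    "pigz": (["pigz", "-c"], ".pigz.gz"),
}

if __name__ == "__main__" and len(sys.argv) > 1:
    src = sys.argv[1]
    for name, (cmd, suffix) in TOOLS.items():
        result = time_compressor(cmd, src, src + suffix)
        if result is None:
            print(f"{name}: not installed, skipped")
        else:
            seconds, size = result
            print(f"{name}: {seconds:.1f}s, {size / 1e9:.2f}GB")
```

Save it under whatever name you like and pass the path to the dump as the only argument. Wall-clock time is what matters here, since the whole point of the parallel tools is to burn more CPU in exchange for finishing sooner.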

If any readers know of other parallel compression schemes I can try, e-mail me and let me know. I will post stats here.