Compressing Intermediate Map Output in Hadoop

It is generally recommended to always compress intermediate map output. This is because IO and network transfer are big bottlenecks in Hadoop, and compression can help with both of these issues.

Map output is written to local disk, and then transferred (shuffled) across the network to reducer nodes. At this point in a MapReduce job, we are no longer concerned with data being splittable. Therefore a non-splittable compression type will work fine. One thing to consider is that increased compression also means increased processing time, so an fast compressor like Snappy or LZO is usually a good choice for compressing intermediate map output. This way you can get increased performance by simply reducing the amount of data sent over the network. In fact, Amazon EMR enables intermediate compression with the Snappy codec by default.

In order to enable intermediate data compression you must adjust the parameters you pass to your MapReduce job. Simply set mapreduce.map.output.compress to true, and mapreduce.map.output.compress.codec to the compression codec of your choice.

Here is an example of enabling intermediate (Snappy) compression in our Java MapReduce job class:

Here is an example of enabling intermediate (Snappy) compression when from the command line:

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">