Compressing Intermediate Map Output in Hadoop

It is generally recommended to always compress intermediate map output. This is because IO and network transfer are big bottlenecks in Hadoop, and compression can help with both of these issues. Map output is written to local disk, and then transferred (shuffled) across the network to reducer nodes. At this point in a MapReduce job,

read more

How to Decode URLs in Hive

Decoding URLs and strings can be a common task, especially when working with web data. This is easy to do in a language like Java or Python, but what about in Hive? Luckily, this is fairly easy as well. Decoding URLs in Hive with Reflection The first and easiest approach is to use the reflect()

read more

How to Set Number of Hadoop Reducers on Command Line

Setting the number of reducers for a Hadoop MapReduce job can be very important. Fortunately, there is an easy way to to this from the command line using the -D <property=value> option. Using -D Option on the Command Line A simple example of the -D option to set the number of reducers to 10: -D

read more

How to Create a Custom Writable for Hadoop

If you have gone through other Hadoop MapReduce examples, you will have noticed the use of “Writable” data types such as LongWritable, IntWritable, Text, etc… All values in used in Hadoop MapReduce must implement the Writable interface. Although we can do a lot with the primitive Writables already available with Hadoop, there are often times

read more

How to get Distinct Values with Hadoop MapReduce

Getting the distinct values from a dataset is a very common task, and actually very easy to do in MapReduce. In psuedo code your mapper and reducer will look something like this:

The mapper above will emit each record as the key, and null as the value. The reducer will take the key and

read more

Hadoop – Setting Configuration Parameters on Command Line

Often when running MapReduce jobs, people prefer setting configuration parameters from the command line. This helps avoid the need to hard code settings such as number of mappers, number of reducers, or max split size. Parsing options from the command line can be done easily by implementing Tool and extending Configured. Below is a simple

read more

Hadoop MapReduce Example – Aggregating Text Fields

Below is a simple Hadoop MapReduce example. This example is a little different than the standard “Word Count” example in that it takes (tab) delimited text, and counts the occurrences of values in a certain field. More details about the implementation are included below as well.

You can see above in the Map class

read more