How to get Distinct Values with Hadoop MapReduce

Getting the distinct values from a dataset is a common task, and it is easy to do in MapReduce.

In pseudocode, your mapper and reducer will look something like this:
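As a rough sketch of that shape in plain Java rather than pseudocode (no Hadoop dependency; the class name `DistinctSketch`, the `map`/`reduce`/`distinct` helpers, and the `TreeMap` standing in for the shuffle phase are all illustrative assumptions, not Hadoop APIs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DistinctSketch {

    // Map step: emit the whole record as the key. The value carries no
    // information, so null stands in for it.
    static void map(String record, Map<String, List<Object>> shuffle) {
        shuffle.computeIfAbsent(record, k -> new ArrayList<>()).add(null);
    }

    // Reduce step: invoked once per distinct key. The grouped values are
    // ignored; emitting only the key yields one output entry per distinct record.
    static void reduce(String key, List<Object> values, List<String> output) {
        output.add(key);
    }

    // Hypothetical local driver: the TreeMap imitates the shuffle/sort phase
    // by grouping all values emitted for the same key, in sorted key order.
    static List<String> distinct(List<String> records) {
        Map<String, List<Object>> shuffle = new TreeMap<>();
        for (String record : records) {
            map(record, shuffle);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Object>> entry : shuffle.entrySet()) {
            reduce(entry.getKey(), entry.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("apple", "pear", "apple", "fig", "pear");
        System.out.println(distinct(records)); // prints [apple, fig, pear]
    }
}
```

The grouping by key is what removes the duplicates; the reduce step itself does no comparison work.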

The mapper above will emit each record as the key, and null as the value. The reducer will receive each key together with the collection of values the mapper emitted for it. Since the reduce function is called exactly once per distinct key, emitting each key and no values will output the distinct values.

Below is an example of how to get distinct values with Hadoop MapReduce. In the example we are taking tab-delimited records from files in HDFS and counting the distinct values in the second column.
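The core logic of such a job can be sketched locally in plain Java, without Hadoop on the classpath. This is a stand-in under stated assumptions, not a full job: the class `DistinctSecondColumn` and its helper methods are hypothetical names, the input is an in-memory list instead of HDFS files, and a `TreeSet` plays the role of the shuffle and reduce phases. In a real job, the logic in `mapSecondColumn` would typically live in a `Mapper` subclass that writes the column as a `Text` key with a `NullWritable` value, and the reducer would write each key once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DistinctSecondColumn {

    // Mapper logic: split a tab-delimited record and return the second
    // column, or null when the record has fewer than two columns.
    static String mapSecondColumn(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        return fields.length >= 2 ? fields[1] : null;
    }

    // Local stand-in for shuffle + reduce: the set keeps each key once,
    // mirroring a reducer that emits every key exactly one time.
    static List<String> distinctSecondColumn(List<String> lines) {
        TreeSet<String> keys = new TreeSet<>();
        for (String line : lines) {
            String key = mapSecondColumn(line);
            if (key != null) { // skip malformed records (an assumption of this sketch)
                keys.add(key);
            }
        }
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1\tred\tx",
                "2\tblue\ty",
                "3\tred\tz",
                "malformed-line");
        List<String> distinct = distinctSecondColumn(lines);
        System.out.println(distinct);        // prints [blue, red]
        System.out.println(distinct.size()); // prints 2
    }
}
```

Skipping records with fewer than two columns is a design choice of this sketch; a production job might instead count them or fail, depending on how clean the input is expected to be.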

