Hadoop MapReduce Example – Aggregating Text Fields

Below is a simple Hadoop MapReduce example. It differs from the standard “Word Count” example in that it takes tab-delimited text and counts the occurrences of each value in a chosen field. More details about the implementation are included below.

In the Map class, each line of text is split using split("\t"). The second field in the resulting array is used as the map key, and 1 is used as its value. These pairs are then aggregated in the combiner (optional) and the reducer, until the final result, a count by first name, is available in HDFS. Code for this example is available in the Big Datums GitHub repo.
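To make the data flow concrete, here is a minimal plain-Java sketch of the same map, combine, and reduce logic. It is not the code from the repo and uses no Hadoop classes: the class name FieldCountSketch, the KEY_FIELD constant, and the sample rows are all made up for illustration. A real job would typically implement these steps by extending Hadoop's Mapper and Reducer classes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> (combine) -> reduce; no Hadoop dependency.
class FieldCountSketch {

    // Assumption: the field of interest is the second one (index 1).
    static final int KEY_FIELD = 1;

    // "Map" step: emit a (field value, 1) pair for every line with enough fields.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length > KEY_FIELD) { // skip lines that lack the field
                pairs.add(new AbstractMap.SimpleEntry<>(fields[KEY_FIELD], 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the values for each key. Because addition is
    // associative, the same function can also play the combiner role.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input, divided into two splits as Hadoop might do.
        List<String> splitA = Arrays.asList("1\tAlice\tSmith", "2\tBob\tJones");
        List<String> splitB = Arrays.asList("3\tAlice\tBrown", "no tabs here");

        // Without a combiner: the reducer sees every (key, 1) pair.
        List<Map.Entry<String, Integer>> all = new ArrayList<>(map(splitA));
        all.addAll(map(splitB));
        Map<String, Integer> direct = sumByKey(all);

        // With a combiner: each split is pre-summed before the reduce step.
        List<Map.Entry<String, Integer>> partials =
                new ArrayList<>(sumByKey(map(splitA)).entrySet());
        partials.addAll(sumByKey(map(splitB)).entrySet());
        Map<String, Integer> combined = sumByKey(partials);

        System.out.println(direct);   // {Alice=2, Bob=1}
        System.out.println(combined); // {Alice=2, Bob=1}
    }
}
```

Both paths print the same totals, which is why the combiner is optional here: it only reduces the number of pairs passed between the map and reduce steps, not the result.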

Here is how to set things up to run the above MapReduce job:

1. Create an Executable Jar containing your MapReduce classes

This can be done in a variety of ways. This example assumes Maven is being used.

2. Create a working Hadoop instance

You need a working Hadoop installation to run the job on. I personally like to create a Docker container using the sequenceiq/docker-spark image.

3. Create an HDFS directory for your input data

If you do not have an HDFS directory containing the data you want to aggregate, create one.

4. Add data to your HDFS directory

Add text file(s) to your newly created HDFS directory.

5. Run program from the command line

6. Print output from HDFS

Great documentation about MapReduce, as well as the standard “Word Count” example, can be found in this MapReduce Tutorial.
