How to do Total Order Sorting in Hadoop MapReduce

Being able to sort by all keys in a data set is a common need in the world of big data. Those familiar with Hive or relational databases know that this can easily be done with a simple SQL statement. For example, sorting an entire data set by “first_name” would look something like this: SELECT

read more

How to Create a Custom Writable for Hadoop

If you have gone through other Hadoop MapReduce examples, you will have noticed the use of “Writable” data types such as LongWritable, IntWritable, Text, etc. All values used in Hadoop MapReduce must implement the Writable interface. Although we can do a lot with the primitive Writables already available with Hadoop, there are often times

read more

How to get Distinct Values with Hadoop MapReduce

Getting the distinct values from a dataset is a very common task, and actually very easy to do in MapReduce. In pseudocode, your mapper and reducer will look something like this:

The mapper above will emit each record as the key, and null as the value. The reducer will take the key and

read more
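
The distinct-values pattern described above can be sketched without a cluster. The following is a plain-Java simulation, not Hadoop code: the class `DistinctSketch` and its methods are hypothetical names, and a `TreeMap` stands in for the shuffle phase that groups identical keys. It assumes the reducer simply writes each key once, which is the usual form of this pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java simulation of the distinct-values pattern (no Hadoop dependency).
class DistinctSketch {

    // "Map" step: emit each record as the key with a null value.
    // "Shuffle" step: group values by key (the TreeMap also sorts the keys).
    static Map<String, List<Object>> mapAndShuffle(List<String> records) {
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(null);
        }
        return grouped;
    }

    // "Reduce" step: write each key once and ignore its grouped values.
    static List<String> reduce(Map<String, List<Object>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    static List<String> distinct(List<String> records) {
        return reduce(mapAndShuffle(records));
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("b", "a", "b", "c", "a")));
    }
}
```

In a real job the grouping is done by the framework, so the mapper and reducer each reduce to a single write call.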

Hadoop – Setting Configuration Parameters on Command Line

Often when running MapReduce jobs, people prefer setting configuration parameters from the command line. This helps avoid the need to hard code settings such as number of mappers, number of reducers, or max split size. Parsing options from the command line can be done easily by implementing Tool and extending Configured. Below is a simple

read more

How to GZip a File in Java

One of the most common compression algorithms out there is gzip. Therefore you are likely to need to compress files using gzip at some time or another. Below is an example of doing this in Java. First create a FileInputStream from the file to be compressed. The data is read, and a compressed version of

read more
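
A minimal sketch of the steps outlined above, using only `java.util.zip` from the standard library. The class name `GzipSketch`, the method `gzipFile`, and the 8 KB buffer size are illustrative assumptions, not taken from the original post.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compresses one file into a gzip file.
class GzipSketch {

    static void gzipFile(String sourcePath, String targetPath) throws IOException {
        // try-with-resources closes both streams; closing the GZIPOutputStream
        // also writes the gzip trailer.
        try (FileInputStream in = new FileInputStream(sourcePath);
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(targetPath))) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            // Read chunks from the source and write them through the compressor.
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: GzipSketch <source> <target.gz>");
            return;
        }
        gzipFile(args[0], args[1]);
    }
}
```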

How to Sum Array of Ints in Java

Aggregating data in an array is a common programming task. This can easily be done in Java by initializing a variable to hold the summed value, looping over the elements in the array, and adding these values to the total. The sumArray method below is a good example of how to sum the values in

read more
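
The loop described above might look like the sketch below. The method name `sumArray` comes from the excerpt; the wrapping class `SumArraySketch` and the `int` return type are assumptions. Note that an `int` total can overflow for large inputs, so a `long` accumulator may be preferable in practice.

```java
// Hypothetical wrapper class for the sumArray method mentioned in the excerpt.
class SumArraySketch {

    static int sumArray(int[] values) {
        int total = 0;              // holds the running sum
        for (int value : values) {  // visit every element
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumArray(new int[] {1, 2, 3, 4}));
    }
}
```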

Creating JSON with JSON.simple (Java)

JSON is a popular way to represent and transfer data. Creating JSON with JSON.simple (a Java library originally hosted on Google Code) is very easy. JSON.simple also performs very well compared to other Java JSON libraries when parsing a variety of file sizes (see results of performance tests here). Below is a simple example of building a JSON

read more

Hadoop MapReduce Example – Aggregating Text Fields

Below is a simple Hadoop MapReduce example. This example is a little different from the standard “Word Count” example in that it takes tab-delimited text and counts the occurrences of values in a certain field. More details about the implementation are included below as well.

You can see above in the Map class

read more

How to Reverse a Linked List

A common algorithm question is “How to Reverse a Linked List”. The SimpleLinkedList (Java) class below contains a simple example to follow. The class SimpleLinkedList contains one field, head, which is used to keep track of the head (first) node of the linked list.

The nested class Node is the class used to create

read more
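
One possible shape for such a class is sketched below. The names `SimpleLinkedList`, `head`, and `Node` come from the excerpt; the `int` payload and the methods `addFirst`, `reverse`, and `toList` are assumptions made for illustration. The reversal here is the common iterative approach, which may differ from the one in the full post.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a singly linked list with an in-place, iterative reverse.
class SimpleLinkedList {

    // Nested class used to create the nodes of the list.
    static class Node {
        int value;
        Node next;

        Node(int value) {
            this.value = value;
        }
    }

    Node head; // first node of the list, or null when the list is empty

    // Hypothetical helper: insert a new node at the front.
    void addFirst(int value) {
        Node node = new Node(value);
        node.next = head;
        head = node;
    }

    // Walk the list once, pointing each node back at its predecessor.
    void reverse() {
        Node previous = null;
        Node current = head;
        while (current != null) {
            Node next = current.next; // remember the rest of the list
            current.next = previous;  // flip the link
            previous = current;
            current = next;
        }
        head = previous; // the old tail is the new head
    }

    // Hypothetical helper: copy the values into a List for inspection.
    List<Integer> toList() {
        List<Integer> values = new ArrayList<>();
        for (Node node = head; node != null; node = node.next) {
            values.add(node.value);
        }
        return values;
    }
}
```

The loop uses constant extra space and visits each node exactly once.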

Recursive Binary Search Example

Binary Search is a classic algorithm used to find an item in an ordered list/array of items. This list/array of items must be ordered for binary search to work. The basic idea of Binary Search is to: Take the midpoint between the smallest and largest elements. Determine if the item being searched for is smaller or

read more
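
The idea outlined above can be sketched recursively as follows. The class name `BinarySearchSketch`, the `int[]` input, and the convention of returning -1 when the item is absent are assumptions for this illustration. The array must already be sorted in ascending order.

```java
// Hypothetical recursive binary search over a sorted int array.
class BinarySearchSketch {

    // Returns the index of target in sorted, or -1 if it is not present.
    static int search(int[] sorted, int target) {
        return search(sorted, target, 0, sorted.length - 1);
    }

    static int search(int[] sorted, int target, int low, int high) {
        if (low > high) {
            return -1; // empty range: the item is not in the array
        }
        // Midpoint of the current range, written to avoid int overflow.
        int mid = low + (high - low) / 2;
        if (sorted[mid] == target) {
            return mid;
        }
        if (target < sorted[mid]) {
            return search(sorted, target, low, mid - 1);  // search the lower half
        }
        return search(sorted, target, mid + 1, high);     // search the upper half
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11};
        System.out.println(search(data, 7));
    }
}
```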