Hadoop – Setting Configuration Parameters on Command Line

Often when running MapReduce jobs, people prefer setting configuration parameters from the command line. This helps avoid the need to hard-code settings such as the number of mappers, the number of reducers, or the maximum split size. Parsing options from the command line can be done easily by implementing Tool and extending Configured. Below is a simple
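
A minimal sketch of such a driver (the class and job names here are hypothetical, not the post's actual code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns a Configuration already populated with any
        // -D key=value options passed on the command line
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper, reducer, and input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (e.g. -D mapreduce.job.reduces=5)
        // before handing the remaining args to run()
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}
```

With this in place, settings can be passed at submit time, for example:

```
hadoop jar myjob.jar MyJobDriver -D mapreduce.job.reduces=5 input output
```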

read more

Hadoop MapReduce Example – Aggregating Text Fields

Below is a simple Hadoop MapReduce example. This example is a little different from the standard “Word Count” example in that it takes tab-delimited text and counts the occurrences of values in a certain field. More details about the implementation are included below as well.
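
As a rough sketch of what such a job can look like (the field index and class names are illustrative, not the post's exact code):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FieldCount {

    public static class FieldMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 2) {       // guard against short rows
                outKey.set(fields[2]);     // count values of the third field
                context.write(outKey, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```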

You can see above in the Map class

read more

How to Extract Nested JSON Data in Spark

JSON is a very common way to store data. But JSON can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (examples here done with Spark 1.6.0). Our sample.json file:

Assuming you already have a SQLContext object created, the examples
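
For illustration, here is a sketch of the kinds of queries involved, assuming Spark 1.6, an existing sqlContext, and hypothetical nested fields (an address struct and an orders array) in sample.json:

```scala
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._   // enables the $"col" syntax

val df = sqlContext.read.json("sample.json")

// Dot notation reaches into a nested struct
df.select("name", "address.city").show()

// explode() turns an array-of-structs column into one row per element
df.select($"name", explode($"orders").as("order"))
  .select("name", "order.id", "order.total")
  .show()
```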

read more

How to Read / Write JSON in Spark

Reading and writing JSON data is a common big data task. Thankfully this is very easy to do in Spark using Spark SQL DataFrames. Spark SQL can automatically infer the schema of a JSON dataset, and use it to load data into a DataFrame object. A DataFrame’s schema is used when writing JSON
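
A minimal sketch of round-tripping JSON this way (paths are placeholders, assuming an existing sqlContext on Spark 1.6):

```scala
// Reading: Spark SQL infers the schema from the JSON records
val df = sqlContext.read.json("input/people.json")
df.printSchema()

// Writing: the DataFrame's schema drives the output, one JSON object per line
df.write.json("output/people-json")
```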

read more

Simple Apache Avro Example using Java

Apache Avro is a popular data serialization system that relies on schemas. The official Avro documentation can be found here: http://avro.apache.org/docs/current/. This post walks through an example of serializing and deserializing data using Avro in Java. Maven is not necessary for working with Avro in Java, but we will be using it in this post.
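
For a rough idea of what this looks like, here is a sketch using Avro's generic API (the schema and file name are invented for illustration; the post itself may use code generation instead):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical two-field record schema, defined inline
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        File file = new File("users.avro");

        // Serialize: the schema is embedded in the data file
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(user);
        writer.close();

        // Deserialize: records are read back using the embedded schema
        DataFileReader<GenericRecord> reader =
            new DataFileReader<>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
            System.out.println(reader.next());
        }
        reader.close();
    }
}
```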

read more

How to Load a Text File into Spark

Loading text files in Spark is a very common task, and luckily it is easy to do. Below are a few examples of loading a text file (located on the Big Datums GitHub repo) into an RDD in Spark. If you have looked at the Spark documentation, you will notice that they do not include
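
A minimal sketch (the path is a placeholder, assuming an existing SparkContext sc):

```scala
// textFile() returns an RDD[String] with one element per line
val lines = sc.textFile("path/to/sample-file.txt")

println(lines.count())          // number of lines
lines.take(5).foreach(println)  // peek at the first few lines
```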

read more

Set Empty Fields to Null in Hive

Depending on the data you load into Hive/HDFS, some of your fields might be empty. Having Hive interpret those empty fields as nulls can be very convenient. It is easy to do this in the table definition using the serialization.null.format table property. Here is an example from the Big Datums GitHub repo:
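
A sketch of what such a table definition can look like (the table and column names are made up, not the repo's exact DDL):

```sql
-- Empty fields in the underlying files are surfaced as NULL
CREATE TABLE users (
  name  STRING,
  email STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ('serialization.null.format' = '');
```

read more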

Get Input File Name in Hive Query

Oftentimes it is useful to know the name of the input file you are processing in a Hive query. This is common when useful metadata is stored in the file name. For example, logs from many different servers can be stored in S3, and these files’ names could contain the names or IP addresses of
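
For illustration, Hive exposes this through the INPUT__FILE__NAME virtual column; a minimal sketch (the table name is hypothetical):

```sql
-- INPUT__FILE__NAME is a Hive virtual column holding the current input file's path
SELECT INPUT__FILE__NAME AS source_file, count(*) AS row_count
FROM server_logs   -- hypothetical table
GROUP BY INPUT__FILE__NAME;
```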

read more

Get List of Objects in S3 Bucket with Java

Often when working with files in S3, you need information about all the items in a particular S3 bucket. Below is an example class that extends the AmazonS3Client class to provide this functionality. For the most part this class has been adapted from the sample in this AWS post. Aside from some additional methods, one
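
A rough sketch of the underlying pagination pattern with the AWS SDK for Java (a plain client rather than the post's extended class; the bucket name is a placeholder):

```java
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListBucket {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client();  // uses the default credential chain

        // Results come back in pages; keep fetching until the listing is complete
        ObjectListing listing = s3.listObjects("my-bucket");
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                System.out.println(summary.getKey() + "\t" + summary.getSize());
            }
            if (!listing.isTruncated()) {
                break;  // no more pages
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}
```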

read more

Using UNIX Wildcards with AWS S3 (AWS CLI)

Currently the AWS CLI doesn’t provide support for UNIX wildcards in a command’s “path” argument. However, it is quite easy to replicate this functionality using the --exclude and --include parameters available on several aws s3 commands. The wildcards available for use are:

“*” - Matches everything
“?” - Matches any single character
“[]” - Matches any
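
For example, a sketch of copying only .log files, emulating a *.log wildcard (the bucket and paths are placeholders):

```bash
# Exclude everything first, then re-include only the keys matching *.log
aws s3 cp s3://my-bucket/ . --recursive --exclude "*" --include "*.log"
```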

read more