Simple Apache Avro Example using Java

Apache Avro is a popular data serialization system that relies on schemas. The official Avro documentation can be found here: http://avro.apache.org/docs/current/. This post walks through an example of serializing and deserializing data using Avro in Java. Maven is not necessary for working with Avro in Java, but we will be using Maven in this post.

read more

How to Break from Nested Loop in Java

In Java, we can break/exit from the current loop with the break statement. But what if we want to break an outer loop from a nested loop? In Java we can name our loops using labels. By using labels we can specify which loop we would like to break out of (also called “breaking to

read more
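As a minimal sketch of the labeled break described above: the class name `LabeledBreakExample`, the method `findFirst`, and the label `outer` are hypothetical names chosen for illustration, not taken from the original post.

```java
class LabeledBreakExample {

    /** Returns {row, col} of the first cell equal to target, or null if absent. */
    static int[] findFirst(int[][] grid, int target) {
        int[] found = null;
        outer: // label that names the outer loop
        for (int row = 0; row < grid.length; row++) {
            for (int col = 0; col < grid[row].length; col++) {
                if (grid[row][col] == target) {
                    found = new int[] {row, col};
                    break outer; // leaves both loops, not just the inner one
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int[][] grid = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        int[] pos = findFirst(grid, 5);
        System.out.println(pos[0] + "," + pos[1]); // prints "1,1"
    }
}
```

A plain `break` in the inner loop would only end the scan of the current row; the label is what lets the search stop entirely at the first match.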

Git – How to Push Local Branch to Remote Repo (like GitHub)

Branching is a common task when working with Git. However, new branches on your local machine aren’t automatically added to your remote repository. You will need to explicitly push your local branch to your remote repository. You can do this on the command line using:

Remote repositories often have the alias “origin”. So if

read more

How to Load a Text File into Spark

Loading text files in Spark is a very common task, and luckily it is easy to do. Below are a few examples of loading a text file (located on the Big Datums GitHub repo) into an RDD in Spark. If you have looked at the Spark Documentation you will notice that they do not include

read more

Get the MD5 Hash Code of a File with Java

Getting the hash code of a file is a common programming task. MD5 is a very popular and commonly used hashing algorithm. Getting the MD5 hash code of a file with Java can be easily done, and is shown in the code below:
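A minimal sketch of one way to do this with the standard `java.security.MessageDigest` API, not necessarily the exact listing from the original post. The class name `FileMd5Example`, the method `md5Hex`, and the 8 KB buffer size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FileMd5Example {

    /** Streams the file through an MD5 MessageDigest and returns lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192]; // buffer size is an arbitrary choice
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read); // feed only the bytes actually read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b)); // two hex digits per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Demo on a temporary file so the example does not depend on any fixed path.
        Path tmp = Files.createTempFile("md5-demo", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(md5Hex(tmp)); // prints 5d41402abc4b2a76b9719d911017c592
        Files.delete(tmp);
    }
}
```

Reading in chunks keeps memory use constant, so the same method should work for files too large to load into a single byte array.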

The code above does several things: Creates a MessageDigest object that

read more

Set Empty Fields to Null in Hive

Depending on the data you load into Hive/HDFS, some of your fields might be empty. Having Hive interpret those empty fields as nulls can be very convenient. It is easy to do this in the table definition using the serialization.null.format table property. Here is an example from the Big Datums GitHub repo:

Get Input File name in Hive Query

It is often useful to know the name of the input file you are processing in a Hive query. This is common when useful metadata is stored in the file name. For example, logs from many different servers can be stored in S3, and these files’ names could contain the names or IP addresses of

read more

Add a Shared Directory (Data Volume) to your Docker Container

Adding a data volume to your Docker container creates a shared directory between the container and your host file system. Data in volumes can be read and written by any number of containers. Data in volumes is designed to persist regardless of a container’s life cycle, so deleting a container will not delete or change the

read more

Create an MD5 Hash Code from a String in Java

Creating hash codes from strings is a common programming task. MD5 is a very popular and commonly used hashing algorithm. Creating an MD5 hash code from a String in Java can be easily done, and is shown in the code below:
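A minimal sketch of one way to do this with `java.security.MessageDigest`, not necessarily the exact listing from the original post. The class name `StringMd5Example`, the method `md5Hex`, and the choice of UTF-8 encoding are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StringMd5Example {

    /** Returns the MD5 digest of the input string as lowercase hex. */
    static String md5Hex(String input) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        // Assumption: hash the UTF-8 bytes; a different charset gives a different hash
        // for non-ASCII input.
        byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b)); // two hex digits per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("hello")); // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

Fixing the charset explicitly avoids depending on the platform default, which can otherwise make the same string hash differently on different machines.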

The code above does several things: Creates a StringBuilder object to

read more