Creating a Hadoop Docker Image

Apache Hadoop is a core big data technology. Running Hadoop on Docker is a great way to get up and running quickly. Below are the basic steps to create a simple Hadoop Docker image.

Pick an OS

Hadoop runs great on a variety of Linux distros. In this post we use Ubuntu 16.04.

Install Required Packages

Hadoop depends on several software packages, including ssh and Java. These must be installed before Hadoop can run.
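As a rough sketch of this step on Ubuntu 16.04: the package names below (openssh-server, openssh-client, rsync, openjdk-8-jdk) are an assumption, since the post only names ssh and Java. The commands are wrapped in a hypothetical install_packages function so that nothing is installed until you call it as root inside the container.

```shell
#!/bin/sh
# Assumed package set: SSH server and client, rsync, and a Java 8 JDK.
PACKAGES="openssh-server openssh-client rsync openjdk-8-jdk"

install_packages() {
    apt-get update &&
    # DEBIAN_FRONTEND=noninteractive suppresses prompts during an image build.
    DEBIAN_FRONTEND=noninteractive apt-get install -y $PACKAGES
}

# Inside the container, as root:
#   install_packages
```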

Install Hadoop

Installing Hadoop can be done by downloading and extracting the binary package within your Docker container. There are many mirrors from which this package can be downloaded. Here is an example of downloading from a specific mirror, and extracting Hadoop into the /opt/hadoop/ directory.
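A minimal sketch of the download and extract step might look like the following. It assumes wget is installed and uses the Apache archive URL rather than a specific mirror; the install_hadoop function name and the temporary file path are illustrative choices.

```shell
#!/bin/sh
HADOOP_VERSION="${HADOOP_VERSION:-2.8.1}"
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
# Assumed download location: the Apache archive, which keeps older releases.
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

install_hadoop() {
    mkdir -p "$HADOOP_HOME" &&
    wget -q -O /tmp/hadoop.tar.gz "$HADOOP_URL" &&
    # --strip-components=1 drops the top-level hadoop-<version>/ directory
    # so the files land directly in $HADOOP_HOME.
    tar -xzf /tmp/hadoop.tar.gz -C "$HADOOP_HOME" --strip-components=1 &&
    rm /tmp/hadoop.tar.gz
}

# Inside the container, as root:
#   install_hadoop
```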

Make sure to update this URL with the version of Hadoop you are interested in. In this example we use version 2.8.1. See http://hadoop.apache.org/releases.html for a list of Hadoop releases to download.

Configure SSH

Running Hadoop in pseudo-distributed mode requires ssh. Add the following to ~/.ssh/config to avoid having to manually confirm the connection.
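One plausible version of that configuration is sketched below. It writes the settings to a local file named ssh_config (an assumed name) so the file can later be copied to ~/.ssh/config in the image. The exact options are an assumption, not necessarily the original ones.

```shell
#!/bin/sh
# Write SSH client settings that skip the interactive host key prompt.
cat > ssh_config <<'EOF'
Host *
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR
EOF
```

Disabling host key checking is only reasonable here because the container connects to itself; it is not a setting to carry over to real hosts.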

You will also need to set up SSH keys, which can be done like this:
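The sketch below follows the usual passphrase-less key setup for a single-node Hadoop install. The setup_ssh_keys function and its optional directory argument are hypothetical helpers, added so the snippet has no side effects until it is called.

```shell
#!/bin/sh
setup_ssh_keys() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    # Generate an RSA key pair with an empty passphrase unless one already exists.
    [ -f "$ssh_dir/id_rsa" ] || ssh-keygen -q -t rsa -N '' -f "$ssh_dir/id_rsa" || return 1
    # Authorize the new public key for logins to this same account.
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys" &&
    chmod 600 "$ssh_dir/authorized_keys"
}

# Inside the container:
#   setup_ssh_keys
```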

Configure Hadoop

Various Hadoop configuration files need to be created or updated in order for Hadoop to run correctly. These config files can be found in $HADOOP_HOME/etc/hadoop/. The following are examples of various config files needed:

core-site.xml
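A minimal core-site.xml for pseudo-distributed mode, following the standard single-node setup, might look like this. The snippet writes the file into a local conf/ directory (an assumed location in the build context) so it can be copied into the image later.

```shell
#!/bin/sh
CONF_DIR="${CONF_DIR:-conf}"   # assumed local directory for config files
mkdir -p "$CONF_DIR"

# fs.defaultFS points clients at the NameNode running in this container.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
```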

hdfs-site.xml
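For a single-node setup, hdfs-site.xml typically only needs to lower the replication factor to 1, since there is just one DataNode. The sketch below writes it to the same assumed local conf/ directory.

```shell
#!/bin/sh
CONF_DIR="${CONF_DIR:-conf}"   # assumed local directory for config files
mkdir -p "$CONF_DIR"

# With a single DataNode, blocks can only be stored once.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```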

mapred-site.xml
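Hadoop 2.x ships only a mapred-site.xml.template, so this file has to be created. A minimal version that tells MapReduce to run on YARN could be written like this, again into an assumed local conf/ directory.

```shell
#!/bin/sh
CONF_DIR="${CONF_DIR:-conf}"   # assumed local directory for config files
mkdir -p "$CONF_DIR"

# Run MapReduce jobs on YARN instead of the local job runner.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
```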

yarn-site.xml
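A minimal yarn-site.xml enables the shuffle service that MapReduce jobs need on each NodeManager. This sketch follows the standard single-node setup and writes to the assumed local conf/ directory.

```shell
#!/bin/sh
CONF_DIR="${CONF_DIR:-conf}"   # assumed local directory for config files
mkdir -p "$CONF_DIR"

# The mapreduce_shuffle auxiliary service moves map output to reducers.
cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```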

Set Environment Variables

Export the HADOOP_HOME and JAVA_HOME environment variables in the .bashrc and $HADOOP_HOME/etc/hadoop/hadoop-env.sh files.
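One way to do this is sketched below. The JAVA_HOME path assumes the openjdk-8-jdk package on 64-bit Ubuntu, the PATH line is an optional convenience that the post does not require, and append_env is a hypothetical helper name.

```shell
#!/bin/sh
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
# Assumed JDK location for openjdk-8-jdk on 64-bit Ubuntu.
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}"

# Append the export lines to the file given as the first argument.
append_env() {
    cat >> "$1" <<EOF
export JAVA_HOME=$JAVA_HOME
export HADOOP_HOME=$HADOOP_HOME
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
}

# Inside the container:
#   append_env "$HOME/.bashrc"
#   append_env "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
```

Setting JAVA_HOME in hadoop-env.sh matters because the daemons are launched over ssh, and those sessions do not inherit the variables of the shell that started them.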

Expose Ports

If you want to view Hadoop's web interfaces from the host, expose the related ports in your Dockerfile with the EXPOSE instruction, for example port 8088 for the Resource Manager.

Starting Hadoop

At this point all the pieces should be in place, and Hadoop can be started. The remaining steps are to start the SSH server, format the namenode, run start-dfs.sh, and run start-yarn.sh.

Sample Dockerfile

This Dockerfile shows an example of installing Hadoop on Ubuntu 16.04 into /opt/hadoop. The start-hadoop.sh script is used to start SSH and Hadoop (contents shown below). The Hadoop and SSH configuration files shown above are copied from the local filesystem using the ADD command.

Dockerfile
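The snippet below writes out one possible Dockerfile for the steps described in this post. It is a sketch, not the original file: the package list, the archive download URL, the JAVA_HOME path, the exposed ports (8088 for the Resource Manager, 50070 for the NameNode web interface in Hadoop 2.x), and the build-context names ssh_config and conf/ are all assumptions.

```shell
#!/bin/sh
# Generate a Dockerfile in the current directory (the build context).
cat > Dockerfile <<'EOF'
FROM ubuntu:16.04

ENV HADOOP_VERSION=2.8.1 \
    HADOOP_HOME=/opt/hadoop \
    JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Required packages: SSH, rsync, wget (for the download), and Java 8.
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        openssh-server rsync wget openjdk-8-jdk && \
    rm -rf /var/lib/apt/lists/*

# Download Hadoop and extract it into /opt/hadoop.
RUN mkdir -p $HADOOP_HOME && \
    wget -q -O /tmp/hadoop.tar.gz \
        https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz && \
    tar -xzf /tmp/hadoop.tar.gz -C $HADOOP_HOME --strip-components=1 && \
    rm /tmp/hadoop.tar.gz

# Passphrase-less SSH key so the start scripts can ssh to localhost.
RUN mkdir -p /root/.ssh && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && \
    chmod 0600 /root/.ssh/authorized_keys

# Configuration files and start script from the local build context.
ADD ssh_config /root/.ssh/config
ADD conf/*.xml $HADOOP_HOME/etc/hadoop/
ADD start-hadoop.sh /start-hadoop.sh

# Export the environment variables in .bashrc and hadoop-env.sh.
RUN for f in /root/.bashrc $HADOOP_HOME/etc/hadoop/hadoop-env.sh; do \
        echo "export JAVA_HOME=$JAVA_HOME" >> $f && \
        echo "export HADOOP_HOME=$HADOOP_HOME" >> $f; \
    done && \
    chmod +x /start-hadoop.sh

EXPOSE 8088 50070

CMD ["/start-hadoop.sh"]
EOF
```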

start-hadoop.sh
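A sketch of such a start script is written out below. It follows the four steps named earlier (start SSH, format the namenode, run start-dfs.sh, run start-yarn.sh). The -nonInteractive flag and the final tail command, which keeps the container's main process alive, are assumptions rather than the original contents.

```shell
#!/bin/sh
# Generate start-hadoop.sh in the current directory (the build context).
cat > start-hadoop.sh <<'EOF'
#!/bin/bash
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"

# Start the SSH server that the Hadoop start scripts connect through.
service ssh start

# Format HDFS; -nonInteractive aborts instead of prompting
# if a name directory already exists (for example after a restart).
"$HADOOP_HOME/bin/hdfs" namenode -format -nonInteractive

# Start the HDFS and YARN daemons.
"$HADOOP_HOME/sbin/start-dfs.sh"
"$HADOOP_HOME/sbin/start-yarn.sh"

# The daemons run in the background, so block here to keep the container up.
tail -f /dev/null
EOF
chmod +x start-hadoop.sh
```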

Building the Hadoop Docker Image

Running docker build -t my-hadoop . from the directory containing your Dockerfile creates a Docker image named my-hadoop.

Creating & Running Docker Container

The command docker run -p 8088:8088 --name my-hadoop-container -d my-hadoop can now be used to create a Docker container from this image. The -p option maps port 8088 inside the container to port 8088 on the host machine, and -d runs the container in the background. The CMD instruction in the Dockerfile runs start-hadoop.sh by default when the container starts.

Accessing Hadoop in Docker Container

Hadoop should now be running in a Docker container. Below is an example of starting an interactive shell in the Docker container, and running a sample MapReduce job.
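As a sketch, the two steps could look like the following. The container name comes from the docker run command in this post. The /opt/hadoop path and the 2.8.1 examples jar match the versions used here, and the choice of the bundled pi example and its arguments is an assumption. Both commands are wrapped in hypothetical functions so nothing runs until you call them.

```shell
#!/bin/sh
CONTAINER="${CONTAINER:-my-hadoop-container}"
HADOOP_DIR="/opt/hadoop"   # install location inside the container
EXAMPLES_JAR="$HADOOP_DIR/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar"

# Open an interactive shell in the running container.
hadoop_shell() {
    docker exec -it "$CONTAINER" bash
}

# Run the bundled "pi" example with 2 map tasks and 5 samples per map.
run_pi_job() {
    docker exec "$CONTAINER" "$HADOOP_DIR/bin/hadoop" jar "$EXAMPLES_JAR" pi 2 5
}

# On the host:
#   hadoop_shell
#   run_pi_job
```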

You can also take a look at the web interface of the Resource Manager at http://localhost:8088.

[Screenshot: Hadoop Resource Manager]

The original Docker image used in this example can be found at https://github.com/nsonntag/docker-images/.
