Creating a Hadoop Docker Image

Apache Hadoop is a core big data technology. Running Hadoop on Docker is a great way to get up and running quickly. Below are the basic steps to create a simple Hadoop Docker image. Pick an OS: Hadoop runs well on a variety of Linux distros. In this post we use Ubuntu 16.04. Install Required

Writing Data from Apache Kafka to Text File

When working with Apache Kafka you might want to write data from a Kafka topic to a local text file. This is actually very easy to do with Kafka Connect. Kafka Connect is a framework that provides scalable and reliable streaming of data to and from Apache Kafka. With Kafka Connect, writing a topic’s content

Writing Text File contents to Kafka with Kafka Connect

When working with Kafka you might need to write data from a local file to a Kafka topic. This is actually very easy to do with Kafka Connect. Kafka Connect is a framework that provides scalable and reliable streaming of data to and from Apache Kafka. With Kafka Connect, writing a file’s content to a

Sending Key Value Messages with the Kafka Console Producer

When working with Kafka you might find yourself using the kafka-console-producer (kafka-console-producer.sh). The kafka-console-producer is a program included with Kafka that creates messages from command line input (STDIN). However, simply sending lines of text will result in messages with null keys. In order to send messages with both keys and values you must set the
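To make the key/value idea concrete, here is a minimal sketch in plain Java of splitting one line of input into a key and a value at a separator. The class name `KeyedLineSketch`, the `:` separator, and the choice to return a null key when no separator is present are all assumptions made for illustration; this is not the console producer's own code, and the real tool may handle a missing separator differently.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: illustrates keyed vs. keyless lines, not Kafka's implementation.
class KeyedLineSketch {

    // Split a line at the first occurrence of the separator.
    // In this sketch, a line without the separator becomes a value with a null key.
    static Map.Entry<String, String> parse(String line, String separator) {
        int idx = line.indexOf(separator);
        if (idx < 0) {
            return new SimpleEntry<>(null, line);
        }
        String key = line.substring(0, idx);
        String value = line.substring(idx + separator.length());
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> keyed = parse("user42:clicked", ":");
        System.out.println(keyed.getKey() + " -> " + keyed.getValue());

        Map.Entry<String, String> unkeyed = parse("just a line", ":");
        System.out.println(unkeyed.getKey() + " -> " + unkeyed.getValue());
    }
}
```

Splitting at the first separator only means a value may itself contain the separator character.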

Creating a Simple Kafka Consumer

Apache Kafka is a fault-tolerant publish-subscribe streaming platform that lets you process streams of records as they occur. If you haven’t installed Kafka yet, see our Kafka Quickstart Tutorial to get up and running quickly. In this post we will talk about creating a simple Kafka consumer in Java. Kafka Consumer Code: The example

Creating a Simple Kafka Producer in Java

Apache Kafka is a fault-tolerant publish-subscribe streaming platform that lets you process streams of records as they occur. If you haven’t installed Kafka yet, see our Kafka Quickstart Tutorial to get up and running quickly. In this post we discuss how to create a simple Kafka producer in Java. Kafka Producer Java Code: The

Apache Kafka Docker Image Example

Apache Kafka is a fault-tolerant publish-subscribe streaming platform that lets you process streams of records as they occur. This post is a step-by-step guide to building a simple Apache Kafka Docker image. The original Dockerfile can be found here: https://github.com/nsonntag/docker-images/tree/master/kafka-quickstart. The Dockerfile: This Dockerfile is very simple. It installs Java

Apache Kafka Quickstart Tutorial

Apache Kafka is a fault-tolerant publish-subscribe streaming platform that lets you process streams of records as they occur. This Kafka Quickstart Tutorial walks through the steps needed to get Apache Kafka up and running on a single Linux/Unix machine. In this tutorial we use Ubuntu and Kafka 0.10.2.0. Installing Java: Running Kafka requires Java.

Compressing Intermediate Map Output in Hadoop

It is generally recommended to always compress intermediate map output. This is because IO and network transfer are big bottlenecks in Hadoop, and compression can help with both of these issues. Map output is written to local disk, and then transferred (shuffled) across the network to reducer nodes. At this point in a MapReduce job,
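As a rough illustration of why compression helps here, the sketch below builds a repetitive, map-output-like buffer of `word<TAB>1` records and compresses it with the JDK's GZIP stream. This is only a stand-in: the record layout, the class name `MapOutputCompressionSketch`, and the use of GZIP are assumptions for the example, not Hadoop's shuffle format or any particular Hadoop codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo: compares raw and compressed sizes of map-output-like records.
class MapOutputCompressionSketch {

    // Compress a byte array in memory with GZIP.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Build word-count style records: a small vocabulary repeated many times.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("word").append(i % 50).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", gzip bytes: " + packed.length);
    }
}
```

Fewer bytes to write to local disk and to move across the network is the benefit the paragraph describes; the actual ratio depends on the data and the codec.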

How to Decode URLs in Hive

Decoding URLs and strings can be a common task, especially when working with web data. This is easy to do in a language like Java or Python, but what about in Hive? Luckily, this is fairly easy as well. Decoding URLs in Hive with Reflection: The first and easiest approach is to use the reflect()
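Since a reflection-based approach in Hive ends up invoking a Java method, it can help to see what URL decoding looks like in Java itself. The sketch below uses `java.net.URLDecoder.decode` with UTF-8; picking that class and charset is an assumption for illustration, because the excerpt is cut off before it names the method it uses. The class name `UrlDecodeSketch` is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: decodes a percent-encoded string using the JDK's URLDecoder.
class UrlDecodeSketch {

    // Decode percent-escapes (and '+' as a space) using UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("https%3A%2F%2Fexample.com%2Fa%20b%3Fq%3D1"));
    }
}
```

Note that `URLDecoder` follows form-encoding rules, so a literal `+` in the input is decoded as a space.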
