Writing to a Database from Spark

One of the great features of Spark is the variety of data sources it can read from and write to. If you already have a database to write to, connecting to that database and writing data from Spark is fairly simple. This example shows how to write to a database that supports JDBC connections. Databases Supporting
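
As a rough sketch, a JDBC write from Spark's Java API might look like the following; the connection URL, table name, and credentials are placeholders, and the database's JDBC driver jar must be on the classpath.

    // Minimal sketch: write a DataFrame to a JDBC database (Spark Java API).
    // The URL, table, and credentials below are placeholders.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class JdbcWriteExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("JdbcWriteExample")
                    .getOrCreate();

            // Any DataFrame will do; a tiny one-column example here.
            Dataset<Row> df = spark.range(5).toDF("id");

            df.write()
                    .format("jdbc")
                    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
                    .option("dbtable", "my_table")
                    .option("user", "username")
                    .option("password", "password")
                    .mode(SaveMode.Append)
                    .save();

            spark.stop();
        }
    }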

read more

Loading Data from a Database into Spark

One of the great features of Spark is the variety of data sources it can read from. Loading data from a database into Spark using JDBC requires three major steps. First, you need a running database that supports JDBC connections. Next, you will need to download and use the JDBC driver for that database. Finally
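
A minimal read sketch in Spark's Java API, again with placeholder connection details and the JDBC driver jar assumed to be on the classpath:

    // Load a database table into a DataFrame over JDBC.
    // Connection details below are placeholders.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JdbcReadExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("JdbcReadExample")
                    .getOrCreate();

            Dataset<Row> df = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
                    .option("dbtable", "my_table")
                    .option("user", "username")
                    .option("password", "password")
                    .load();

            df.show();
            spark.stop();
        }
    }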

read more

Set AWS Credentials in Cloudera Quickstart Docker Container

Cloudera’s Quickstart Image is a fantastic way to get started quickly with the big data ecosystem. With software such as Hadoop, Spark, Hive, Pig, Impala, and Hue already set up, this Docker image is a must in your big data toolkit. One thing the Cloudera Quickstart container is lacking, however, is an easy way to
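
One simple approach (a sketch, not necessarily the post's exact method) is to pass the standard AWS credential environment variables when starting the container; the key values are placeholders.

    # Sketch: pass AWS credentials as environment variables when
    # starting the Cloudera Quickstart container (key values are placeholders).
    docker run -it \
        -e AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY \
        -e AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY \
        cloudera/quickstart /usr/bin/docker-quickstart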

read more

What is a UUID?

UUID stands for Universally Unique Identifier. UUIDs are used as IDs to identify unique objects or records. These are very common in big data environments, where coordinating unique IDs in a central location is difficult to do. Most values (if not all) in a UUID are generated randomly, depending on the UUID version. UUID Format
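
In Java, for example, generating a random (version 4) UUID is a one-liner:

    import java.util.UUID;

    public class UuidExample {
        public static void main(String[] args) {
            // Prints a random (version 4) UUID: 32 hex digits
            // grouped 8-4-4-4-12, separated by hyphens.
            UUID id = UUID.randomUUID();
            System.out.println(id);
        }
    }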

read more

How to Set Number of Hadoop Reducers on Command Line

Setting the number of reducers for a Hadoop MapReduce job can be very important. Fortunately, there is an easy way to do this from the command line using the -D <property=value> option. Using -D Option on the Command Line A simple example of the -D option to set the number of reducers to 10: -D
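
Put together, a job launch setting ten reducers might look like this; the jar, driver class, and paths are placeholders, and note that older Hadoop releases spell the property mapred.reduce.tasks rather than mapreduce.job.reduces.

    # Placeholder jar, driver class, and paths. The -D option must
    # appear before the job's own arguments.
    hadoop jar my-job.jar com.example.MyDriver \
        -D mapreduce.job.reduces=10 \
        input/ output/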

read more

How to Copy local files to S3 with AWS CLI

AWS CLI has made working with S3 very easy. Once you have AWS CLI installed, you might ask “How do I start copying local files to S3?” The syntax for copying files to/from S3 in AWS CLI is: aws s3 cp <source> <destination> The “source” and “destination” arguments can either be local paths or S3
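
For instance, copying a single local file up to a bucket (the bucket name is a placeholder):

    # Copy a local file into an S3 bucket.
    aws s3 cp myfile.txt s3://my-bucket/myfile.txt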

read more

Copy all Files in S3 Bucket to Local with AWS CLI

The AWS CLI makes working with files in S3 very easy. However, the file globbing available on most Unix/Linux systems is not quite as easy to use with the AWS CLI. S3 doesn’t have folders, but it does use the concept of folders by using the “/” character in S3 object keys as a folder
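
With the --recursive flag, every object under a prefix can be pulled down in one command; the bucket name is a placeholder.

    # Copy every object in the bucket to the current directory,
    # recreating the "/"-delimited key prefixes as local folders.
    aws s3 cp s3://my-bucket/ . --recursive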

read more

How to do Total Order Sorting in Hadoop MapReduce

Being able to sort by all keys in a data set is a common need in the world of big data. Those familiar with Hive or relational databases know that this can easily be done with a simple SQL statement. For example, sorting an entire data set by “first_name” would look something like this: SELECT
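
On the MapReduce side, total order sorting is typically done with Hadoop's TotalOrderPartitioner plus an input sampler. A driver sketch, with job details and paths as placeholders (the default identity mapper and reducer are used, so the job simply sorts by key):

    // Driver sketch for a totally ordered sort using Hadoop's
    // TotalOrderPartitioner; formats, counts, and paths are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "total order sort");
            job.setJarByClass(TotalSortDriver.class);

            // Input keys must match the map output key type,
            // since the sampler reads them directly.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("input"));
            FileOutputFormat.setOutputPath(job, new Path("output"));

            // Sample the input to build partition boundaries so that
            // reducer N receives only keys below those of reducer N+1.
            job.setNumReduceTasks(10);
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path("partition.lst"));
            InputSampler.writePartitionFile(job,
                    new InputSampler.RandomSampler<>(0.1, 10000));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }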

read more

How to Create a Custom Writable for Hadoop

If you have gone through other Hadoop MapReduce examples, you will have noticed the use of “Writable” data types such as LongWritable, IntWritable, Text, etc. All values used in Hadoop MapReduce must implement the Writable interface. Although we can do a lot with the primitive Writables already available with Hadoop, there are often times
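
A custom Writable only has to implement write() and readFields() so Hadoop can serialize and deserialize it; here is a minimal sketch with two made-up fields. If the type will be used as a key, it should implement WritableComparable instead so it can be sorted.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Minimal custom Writable with two hypothetical fields.
    public class PersonWritable implements Writable {
        private String firstName = "";
        private int age;

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize the fields in a fixed order...
            out.writeUTF(firstName);
            out.writeInt(age);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // ...and deserialize them in exactly the same order.
            firstName = in.readUTF();
            age = in.readInt();
        }
    }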

read more

How to get Distinct Values with Hadoop MapReduce

Getting the distinct values from a dataset is a very common task, and actually very easy to do in MapReduce. In pseudocode, your mapper and reducer will look something like this:
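
    map(key, record):
        emit(record, null)

    reduce(record, values):
        emit(record, null)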

The mapper above will emit each record as the key, and null as the value. The reducer will take the key and

read more