What is Predicate Pushdown?

The basic idea of predicate pushdown is that certain parts of SQL queries (the predicates) can be “pushed” to where the data lives. This optimization can drastically reduce query and processing time by filtering out data earlier rather than later. Depending on the processing framework, predicate pushdown can optimize your query by doing things like filtering data

read more
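As a rough illustration of the idea, the sketch below models a data source in plain Python and counts how many rows leave it with and without the filter pushed down. The `Source` class, the `country` column, and the sample rows are all hypothetical; this is not how any particular framework implements pushdown, only a minimal model of "filter where the data lives".

```python
# Hypothetical in-memory "storage layer"; real systems would be files or a database.
ROWS = [{"id": i, "country": "US" if i % 4 == 0 else "DE"} for i in range(1000)]


class Source:
    """Toy data source that can optionally apply a predicate while scanning."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_returned = 0  # rows handed to the query engine

    def scan(self, predicate=None):
        for row in self._rows:
            # With a predicate, non-matching rows never leave the source.
            if predicate is None or predicate(row):
                self.rows_returned += 1
                yield row


def query_without_pushdown(source):
    # The engine pulls every row, then filters.
    return [row for row in source.scan() if row["country"] == "US"]


def query_with_pushdown(source):
    # The filter is handed to the source and applied during the scan.
    return list(source.scan(predicate=lambda row: row["country"] == "US"))


plain, pushed = Source(ROWS), Source(ROWS)
result_plain = query_without_pushdown(plain)
result_pushed = query_with_pushdown(pushed)
print(plain.rows_returned, pushed.rows_returned)
```

Both queries return the same rows; the difference is only in how much data crosses the boundary between the source and the engine.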

How to Decode URLs in Hive

Decoding URLs and strings can be a common task, especially when working with web data. This is easy to do in a language like Java or Python, but what about in Hive? Luckily, this is fairly easy as well. Decoding URLs in Hive with Reflection: The first and easiest approach is to use the reflect()

read more
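For comparison, here is what the Python side of that statement might look like, using only the standard library. The sample URL is made up for illustration. `unquote_plus` is used because, like Java's `java.net.URLDecoder.decode`, it also turns `+` into a space; plain `unquote` leaves `+` untouched.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical percent-encoded URL, as it might appear in a web log.
encoded = "https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dbig+data%26page%3D2"


def decode_url(value):
    """Decode percent-escapes and treat '+' as a space (form-style decoding)."""
    return unquote_plus(value)


decoded = decode_url(encoded)
decoded_keep_plus = unquote(encoded)  # same escapes decoded, '+' preserved

print(decoded)
print(decoded_keep_plus)
```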

How to do Total Order Sorting in Hadoop MapReduce

Being able to sort by all keys in a data set is a common need in the world of big data. Those familiar with Hive or relational databases know that this can easily be done with a simple SQL statement. For example, sorting an entire data set by “first_name” would look something like this: SELECT

read more
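One common way to get a total order across parallel workers is range partitioning: sample the keys, pick split points, route each key to the partition whose range contains it, and sort each partition independently. The plain-Python sketch below models that idea under the assumption that this is the approach in play (it is the idea behind Hadoop's `TotalOrderPartitioner`, though the excerpt does not say which technique the post uses). All function names and the sample data are hypothetical.

```python
import bisect
import random


def choose_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Sample the keys and pick num_partitions - 1 sorted boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_by_range(keys, split_points):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(split_points, key)].append(key)
    return partitions


def total_order_sort(keys, num_partitions=4):
    split_points = choose_split_points(keys, num_partitions)
    partitions = partition_by_range(keys, split_points)
    # Each partition is sorted on its own (one "reducer" each);
    # concatenating them in partition order yields a globally sorted result.
    return [key for part in partitions for key in sorted(part)]


first_names = ["maria", "adam", "zoe", "liam", "chen", "olga", "bob", "yuki", "dev"]
sorted_names = total_order_sort(first_names, num_partitions=3)
print(sorted_names)
```

Because every key in one partition is smaller than every key in the next, no merge step is needed after the per-partition sorts; the quality of the sample only affects how evenly the work is balanced.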

Set Empty Fields to Null in Hive

Depending on the data you load into Hive/HDFS, some of your fields might be empty. Having Hive interpret those empty fields as nulls can be very convenient. It is easy to do this in the table definition using the serialization.null.format table property. Here is an example from the Big Datums GitHub repo:

Get Input File name in Hive Query

It is often useful to know the name of the input file you are processing in a Hive query. This is common when useful metadata is stored in the file name. For example, logs from many different servers can be stored in S3, and these files’ names could contain the names or IP addresses of

read more

Add a Shared Directory (Data Volume) to your Docker Container

Adding a data volume to your Docker container creates a shared directory between the container and your host file system. Data in volumes is readable and writable by any number of containers. Data in volumes is designed to persist regardless of a container's life cycle, so deleting a container will not delete or change the

read more