What is Predicate Pushdown?

The basic idea of predicate pushdown is that certain parts of a SQL query (the predicates) can be “pushed” down to where the data lives. This optimization can drastically reduce query/processing time by filtering out data earlier rather than later. Depending on the processing framework, predicate pushdown can optimize your query by doing things like filtering data
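For concreteness, here is a minimal sketch of what this looks like with Spark DataFrames, in the spirit of the examples on this site. The dataset path and column name are made up for illustration; when the underlying source (Parquet here) supports pushdown, the filter should appear under "PushedFilters" in the physical plan.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PredicatePushdownExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PredicatePushdownExample"))
    val sqlContext = new SQLContext(sc)

    // Read a Parquet dataset (the path is a placeholder).
    val events = sqlContext.read.parquet("hdfs:///data/events")

    // This filter can be pushed down to the Parquet reader, so row groups whose
    // column statistics rule out country = 'US' are skipped instead of being
    // read into Spark and filtered afterwards.
    val usEvents = events.filter("country = 'US'")

    // The physical plan should list the condition under "PushedFilters".
    usEvents.explain()

    sc.stop()
  }
}
```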

read more

Writing to a Database from Spark

One of the great features of Spark is the variety of data sources it can read from and write to. If you already have a database to write to, connecting to that database and writing data from Spark is fairly simple. This example shows how to write to a database that supports JDBC connections. Databases Supporting
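As a minimal sketch of the idea, using the Spark 1.6-era DataFrameWriter.jdbc() call (the MySQL URL, table name, credentials, and driver class below are placeholders):

```scala
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object WriteToJdbcExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WriteToJdbcExample"))
    val sqlContext = new SQLContext(sc)

    // A small DataFrame to write; the column names are made up for this sketch.
    val people = sqlContext.createDataFrame(Seq(
      (1, "alice"),
      (2, "bob")
    )).toDF("id", "name")

    // JDBC connection details -- URL, credentials, and driver are placeholders.
    val jdbcUrl = "jdbc:mysql://localhost:3306/test_db"
    val connectionProperties = new Properties()
    connectionProperties.put("user", "db_user")
    connectionProperties.put("password", "db_password")
    connectionProperties.put("driver", "com.mysql.jdbc.Driver")

    // Append the DataFrame's rows to the target table over JDBC.
    people.write.mode(SaveMode.Append).jdbc(jdbcUrl, "people", connectionProperties)

    sc.stop()
  }
}
```

The JDBC driver jar for the target database also needs to be on the classpath, for example via the --jars option of spark-submit.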

read more

Loading Data from a Database into Spark

One of the great features of Spark is the variety of data sources it can read from. Loading data from a database into Spark using JDBC requires three major steps. First, you need a running database that supports JDBC connections. Next, you will need to download and use the JDBC driver for that database. Finally
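A minimal sketch of those steps with the Spark 1.6-era DataFrameReader (the connection details are placeholders, and the driver jar is assumed to be on the classpath via --jars or similar):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadFromJdbcExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadFromJdbcExample"))
    val sqlContext = new SQLContext(sc)

    // Step 1: a running database that accepts JDBC connections (the URL below is a placeholder).
    // Step 2: that database's JDBC driver on the Spark classpath (the driver class below is a placeholder).
    // Step 3: load the table into a DataFrame.
    val people = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test_db")
      .option("dbtable", "people")
      .option("user", "db_user")
      .option("password", "db_password")
      .option("driver", "com.mysql.jdbc.Driver")
      .load()

    people.printSchema()
    people.show()

    sc.stop()
  }
}
```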

read more

How to Extract Nested JSON Data in Spark

JSON is a very common way to store data, but it can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples here were done with Spark 1.6.0). Our sample.json file:

Assuming you already have a SQLContext object created, the examples
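As a rough sketch of the kind of thing the examples cover (the inline record below is an invented stand-in, since sample.json itself is not reproduced here):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

object NestedJsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NestedJsonExample"))
    val sqlContext = new SQLContext(sc)

    // An invented nested JSON record standing in for sample.json.
    val json = """{"user": {"name": "alice", "age": 30}, "purchases": [{"item": "book", "price": 12.5}, {"item": "pen", "price": 1.2}]}"""
    val df = sqlContext.read.json(sc.parallelize(Seq(json)))

    // Nested struct fields can be selected with dot notation.
    df.select("user.name", "user.age").show()

    // Arrays of structs can be flattened with explode(), one row per array element.
    df.select(explode(df("purchases")).as("purchase"))
      .select("purchase.item", "purchase.price")
      .show()

    sc.stop()
  }
}
```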

read more

How to Read / Write JSON in Spark

Reading and writing JSON data is a common big data task. Thankfully this is very easy to do in Spark using Spark SQL DataFrames. Spark SQL can automatically infer the schema of a JSON dataset and use it to load data into a DataFrame object. A DataFrame’s schema is used when writing JSON
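A minimal sketch, with placeholder input and output paths:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadWriteJsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadWriteJsonExample"))
    val sqlContext = new SQLContext(sc)

    // Spark SQL infers the schema from the JSON input (one JSON object per line).
    val df = sqlContext.read.json("hdfs:///data/input.json")
    df.printSchema()

    // The DataFrame's schema is used when the data is written back out as JSON.
    df.write.json("hdfs:///data/output_json")

    sc.stop()
  }
}
```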

read more

How to Load a Text File into Spark

Loading text files in Spark is a very common task, and luckily it is easy to do. Below are a few examples of loading a text file (located in the Big Datums GitHub repo) into an RDD in Spark. If you have looked at the Spark documentation, you will notice that they do not include
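As a minimal sketch (the path below is a placeholder for the sample file):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadTextFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoadTextFileExample"))

    // textFile() accepts local paths, HDFS paths, and other Hadoop-supported URIs.
    val lines = sc.textFile("/path/to/sample_file.txt")

    // Each element of the RDD is one line of the file.
    println(s"number of lines: ${lines.count()}")
    lines.take(5).foreach(println)

    sc.stop()
  }
}
```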

read more