How to Extract Nested JSON Data in Spark

JSON is a very common way to store data, but nested JSON can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples were run with Spark 1.6.0).

Our sample.json file:
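
A hypothetical stand-in for the file, consistent with the description in this post (a “content” array of structs and a “dates” array of integers), can be generated with the sketch below. The field names user, status, foo, and bar, and all of the values, are assumptions made for illustration, not the original data. Spark's JSON reader expects one complete JSON object per line, so the record is kept on a single line.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical record: only "content" (array of structs) and "dates"
// (array of integers) come from the post; every other name and value is assumed.
val sampleJson =
  """{"user":"u1","status":"OK","dates":[20160129,20160128],"content":[{"foo":123,"bar":"val1"},{"foo":456,"bar":"val2"}]}"""

// Write the record as a single line into sample.json in the working directory.
Files.write(Paths.get("sample.json"), sampleJson.getBytes(StandardCharsets.UTF_8))
```

The relative path resolves against the local working directory, which matches how Spark resolves it when running in local mode.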

Assuming you already have a SQLContext object created, the examples below will demonstrate how to parse the nested data from the JSON above.
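
If no SQLContext exists yet, for example in a standalone application rather than in spark-shell, one can be created roughly as follows. This is a minimal sketch for Spark 1.6; the application name and the local[*] master are assumptions chosen for local experimentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// In spark-shell, sc and sqlContext are already defined, so skip this step there.
// The app name and master below are placeholders for a local test run.
val conf = new SparkConf().setAppName("NestedJsonExample").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```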

Read the JSON file into a Spark DataFrame:
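
A minimal sketch of this step, assuming a SQLContext named sqlContext is in scope and sample.json is reachable from the working directory:

```scala
// Spark infers the schema from the JSON records (one JSON object per line).
val df = sqlContext.read.json("sample.json")

// Print the inferred schema, including nested arrays and structs,
// then print the rows themselves.
df.printSchema()
df.show()
```

printSchema is the quickest way to see which fields are arrays and which array elements are structs.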

We can see in our output that the “content” field contains an array of structs, while our “dates” field contains an array of integers. To access the data inside these structures, the first step is to extract the column and “explode” it into a new DataFrame using the explode function, which produces one row per array element.

Extracting “dates” into a new DataFrame:
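
A sketch of this step, assuming sqlContext is in scope and sample.json has a “dates” array column. Note that explode lives in org.apache.spark.sql.functions; without that import the compiler reports that the value explode is not found.

```scala
import org.apache.spark.sql.functions.explode

// Re-read the file so the snippet stands on its own.
val df = sqlContext.read.json("sample.json")

// explode emits one row per element of the "dates" array.
// toDF renames the generated column, which is otherwise called "col".
val dfDates = df.select(explode(df("dates"))).toDF("dates")
dfDates.show()
```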

Our “content” field contains an array of structs. To access the data in each of these structs, we must use the dot operator.

Extracting data in array of structs:
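
A sketch of this step, assuming sqlContext is in scope and that each struct in “content” has fields named foo and bar. Those two field names are hypothetical; substitute the names shown in your own schema.

```scala
import org.apache.spark.sql.functions.explode

// Re-read the file so the snippet stands on its own.
val df = sqlContext.read.json("sample.json")

// First explode the array so that each struct gets its own row.
val dfContent = df.select(explode(df("content"))).toDF("content")

// Then reach into each struct with the dot operator.
// "foo" and "bar" are assumed field names.
val dfFooBar = dfContent.select("content.foo", "content.bar")
dfFooBar.show()
```

The resulting DataFrame has one row per struct and one flat column per selected field.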

5 thoughts on “How to Extract Nested JSON Data in Spark”

  1. skypiece

    Looks nice, but it does not work for me:
    Error: not found: value explode
    val dfi = dfj.select(explode(dfj("load_portion")))
    ^
    (select and explode are marked red in the IDEA IDE; such classes are included)
