How to Extract Nested JSON Data in Spark

JSON is a very common way to store data, but it can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples were run with Spark 1.6.0).

Our sample.json file:
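As a hypothetical stand-in for the file, the sketch below writes a one-record sample.json from Scala. Only the field names (“content”, “dates”, “foo”, “bar”) and the values 123 and “val1” come from this post and its comments; the second struct and the date integers are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical sample record: "dates" is an array of integers and
// "content" is an array of structs. The concrete values are assumptions.
val sampleJson =
  """{"dates": [20150101, 20150102], "content": [{"foo": 123, "bar": "val1"}, {"foo": 456, "bar": "val2"}]}"""

// The whole JSON object is written on a single line.
Files.write(Paths.get("sample.json"), sampleJson.getBytes(StandardCharsets.UTF_8))
```

The record is kept on one line on purpose: Spark 1.6 reads JSON files line by line and expects each line to hold one complete JSON object.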

Assuming you already have a SQLContext object created, the examples below will demonstrate how to parse the nested data from the JSON above.

Read the JSON file into a Spark DataFrame:
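A minimal sketch of this step, assuming it is run in the Spark 1.6 shell (which predefines `sqlContext`) and that sample.json sits in the working directory with one JSON object per line:

```scala
// Assumption: `sqlContext` is the SQLContext provided by spark-shell.
val df = sqlContext.read.json("sample.json")

// Print the inferred schema to see which fields are arrays or structs.
df.printSchema()

// Show the raw rows.
df.show()
```

If the resulting DataFrame has only a `_corrupt_record` column, Spark could not parse the lines as JSON, which typically happens when a single object is spread over several lines.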

We can see in our output that the “content” field contains an array of structs, while our “dates” field contains an array of integers. To access the data inside these structures, the first step is to “explode” the column into a new DataFrame using the explode function, which produces one row per array element.

Extracting “dates” into new DataFrame:
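One way this step might look, again assuming the Spark 1.6 shell and a sample.json with a “dates” array; the name `dfDates` is chosen here for illustration:

```scala
// explode lives in the functions object and must be imported explicitly;
// without this import the shell reports "not found: value explode".
import org.apache.spark.sql.functions.explode

// Assumption: `sqlContext` is the SQLContext provided by spark-shell.
val df = sqlContext.read.json("sample.json")

// One output row per element of the "dates" array; toDF renames the
// generated column back to "dates".
val dfDates = df.select(explode(df("dates"))).toDF("dates")
dfDates.show()
```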

Our “content” field contains an array of structs. To access the data in each of these structs, we first explode the array and then use the dot operator to select individual fields.

Extracting data in array of structs:
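A sketch of this step under the same assumptions (Spark 1.6 shell, a “content” array whose structs have fields named “foo” and “bar”, as in the comments below); `dfFooBar` is an illustrative name:

```scala
import org.apache.spark.sql.functions.explode

// Assumption: `sqlContext` is the SQLContext provided by spark-shell.
val df = sqlContext.read.json("sample.json")

// Turn each struct in the "content" array into its own row.
val dfContent = df.select(explode(df("content"))).toDF("content")

// Use the dot operator to pull struct fields out as top-level columns.
val dfFooBar = dfContent.select("content.foo", "content.bar")
dfFooBar.show()
```

Selecting nested fields this way names the resulting columns after the last path segment, so the output columns are `foo` and `bar`.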

11 thoughts on “How to Extract Nested JSON Data in Spark”

  1. skypiece

    Looks nice, but it does not work for me:
    Error: not found: value explode
    val dfi =“load_portion”)))
    (select and explode marked red in IDEA IDE, such classes are included)

  2. vss

    Can we have the same explanation in Java …

    I have to convert the following JSON to a Spark SQL DataFrame


    Please help

  3. sueshi

    When I try to load the sample file I get this error in the Spark shell:
    DF_2: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

  4. sueshi

    Very helpful. As a side note, it might be useful to mention that each JSON object should be on a single line when the file is read.

  5. ramu

    For my requirement I also need to explode columns from nested JSON data, without naming the columns manually as in:
    scala> val dfContent = df.select(explode(df("content"))).toDF("content")

    I need to keep the column names from the JSON data.
    e.g. “foo”: 123,
    “bar”: “val1”
    foo and bar have to come out as columns.
