How to Read / Write JSON in Spark

Reading and writing JSON data is a common big data task. Thankfully, this is very easy to do in Spark using Spark SQL DataFrames.

Spark SQL can automatically infer the schema of a JSON dataset and use it to load the data into a DataFrame. The DataFrame’s schema is then used when writing the JSON back out to a file.

The (Scala) examples below, which read in and write out a JSON dataset, were run on Spark 1.6.0. If you are using the spark-shell, you can skip the import and sqlContext creation steps.
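The original code listing did not survive extraction; a minimal sketch of the read step for Spark 1.6, assuming an existing SparkContext `sc` and a hypothetical input file `names.json`, looks like this:

```scala
import org.apache.spark.sql.SQLContext

// In spark-shell, sc and sqlContext already exist and these
// two steps can be skipped.
val sqlContext = new SQLContext(sc)

// Read the JSON file; Spark SQL infers the schema automatically.
val df = sqlContext.read.json("names.json")

df.printSchema()
df.show()
```

Each line of the input file is expected to contain one complete JSON object (the JSON Lines convention Spark's JSON source uses), rather than a single pretty-printed document.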

Sample Output:
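The original output block was not preserved. As an illustration only, for a hypothetical `names.json` containing records with `name` and `age` fields, `printSchema()` and `show()` would print output of roughly this shape:

```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+---+-----+
|age| name|
+---+-----+
| 30|Alice|
| 25|  Bob|
+---+-----+
```

Note that JSON schema inference orders the fields alphabetically, which is why `age` appears before `name`.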

This example writes the DataFrame to multiple “part” files inside a newly created “names” directory. All the “part” files combined contain the full contents of the DataFrame.
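The write step described above can be sketched in one line, assuming `df` is the DataFrame loaded earlier:

```scala
// Writes a "names" directory containing one part-* file per
// partition of the DataFrame, plus a _SUCCESS marker file.
df.write.json("names")
```

To reduce the number of part files, you can repartition before writing (e.g. `df.repartition(1).write.json("names")`), at the cost of collapsing the write onto fewer workers.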

One thought on “How to Read / Write JSON in Spark”

  1. Arjun

    Is there a way to specify the sampling value? My PySpark job reads an array of structs (array:[{col:val1, col2:val2}]) as a string when the data is empty (array:[]). Is there a way to specify a higher sampling value so that it reads the data values as well? I tried specifying the schema, but ended up with all columns in the DataFrame as null.
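Two avenues worth trying for the question above (not part of the original post, and the field names below are hypothetical): the JSON source accepts a `samplingRatio` option controlling how much of the data is scanned during schema inference, and an explicit schema can be supplied so that empty arrays still parse as arrays. In Scala:

```scala
import org.apache.spark.sql.types._

// Option 1: scan all rows when inferring the schema
// ("samplingRatio" is a JSON data source option; 1.0 = full scan).
val df = sqlContext.read
  .option("samplingRatio", "1.0")
  .json("data.json")

// Option 2: declare the schema explicitly so an empty array
// is still typed as an array of structs, not a string.
val schema = StructType(Seq(
  StructField("array", ArrayType(StructType(Seq(
    StructField("col1", StringType),
    StructField("col2", StringType)
  ))))
))
val df2 = sqlContext.read.schema(schema).json("data.json")
```

If an explicit schema yields all-null columns, the declared field names and types usually do not match the actual JSON structure; Spark silently returns null for fields it cannot match.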
