How to Read / Write JSON in Spark

Reading and writing JSON data is a common big data task. Thankfully, this is very easy to do in Spark using Spark SQL DataFrames.

Spark SQL can automatically infer the schema of a JSON dataset and use it to load the data into a DataFrame object. When writing JSON out to a file, the DataFrame’s schema determines the fields of each record.

The (Scala) examples below of reading in and writing out a JSON dataset were done in Spark 1.6.0. If you are using the spark-shell, you can skip the import and sqlContext creation steps.
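
As a minimal sketch of the read step, assuming the Spark 1.6 API: the input path "input/names.json" and the application name are placeholders chosen for illustration, not values from the original post. Spark 1.6 expects the input to contain one complete JSON object per line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Setup steps; skip these in spark-shell, which already provides sc and sqlContext.
// "JsonReadExample" and local[*] are illustrative settings.
val conf = new SparkConf().setAppName("JsonReadExample").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = SQLContext.getOrCreate(sc)

// Hypothetical input path; each line of the file must be one JSON object.
val df = sqlContext.read.json("input/names.json")

// Show the schema Spark inferred from the data, then the first rows.
df.printSchema()
df.show()
```

Calling printSchema() before doing any further work is a quick way to check that the inferred types match what you expect.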

Sample Output:

This example will write the DataFrame to multiple “part” files inside of a newly created “names” directory. All the “part” files combined will contain all the data from the DataFrame.
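
The write described above might look like the following sketch, again assuming the Spark 1.6 API. The two-row DataFrame and its column names are made up so that the block stands on its own; only the "names" output directory comes from the text.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Setup steps; skip these in spark-shell, which already provides sc and sqlContext.
val conf = new SparkConf().setAppName("JsonWriteExample").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Illustrative data only: a small DataFrame with hypothetical columns.
val df = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")

// Writes one JSON object per line into part files under the "names" directory.
// With the default save mode, this fails if "names" already exists.
df.write.json("names")
```

If the directory may already exist, `df.write.mode("overwrite").json("names")` replaces it instead of raising an error.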

    Is there a way to specify the sampling value? My PySpark job reads an array of structs (array:[{col:val1, col2:val2}]) as a string when the data is empty (array:[]). Is there a way to specify a higher sampling value so that it reads the data values as well? I tried specifying the schema but ended up with every column in the DataFrame as null.

