Writing to a Database from Spark

One of the great features of Spark is the variety of data sources it can read from and write to. If you already have a database, connecting to it and writing data from Spark is fairly simple. This example shows how to write to a database that supports JDBC connections.

Databases Supporting JDBC Connections

Spark can easily write to databases that support JDBC connections. MySQL, Oracle, and Postgres are common options. In this post we show an example using MySQL.

Downloading the Database JDBC Driver

A JDBC driver is needed to connect your database to Spark. The MySQL JDBC driver can be downloaded at https://dev.mysql.com/downloads/connector/j/. MySQL provides ZIP or TAR archives that contain the database driver. Inside each of these archives will be a mysql-connector-java-...-bin.jar file. This is the JDBC driver that enables Spark to connect to the database.

Writing Data from Spark to the Database

We now have everything we need to connect Spark to our database. If running within the spark-shell, use the --jars option to provide the location of your JDBC driver jar file on the command line:
spark-shell --jars ./mysql-connector-java-5.0.8-bin.jar

Once the spark-shell has started, we can insert data from a Spark DataFrame into our database. A sample of our DataFrame's contents is shown below.
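The original sample isn't reproduced here, so the rows below are hypothetical stand-ins — a small table of id, name, and age records — printed roughly the way DataFrame.show() renders a preview:

```python
# Hypothetical rows standing in for the post's DataFrame sample.
columns = ("id", "name", "age")
rows = [
    (1, "Alice", 34),
    (2, "Bob", 45),
    (3, "Carol", 29),
]

# Render the rows roughly the way DataFrame.show() would.
widths = [max(len(str(v)) for v in (col, *[r[i] for r in rows]))
          for i, col in enumerate(columns)]
border = "+" + "+".join("-" * w for w in widths) + "+"
print(border)
print("|" + "|".join(str(c).rjust(w) for c, w in zip(columns, widths)) + "|")
print(border)
for r in rows:
    print("|" + "|".join(str(v).rjust(w) for v, w in zip(r, widths)) + "|")
print(border)
```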

Spark DataFrames (as of Spark 1.4) have a write() method that can be used to write to a database. The write() method returns a DataFrameWriter object. DataFrameWriter objects have a jdbc() method, which is used to save DataFrame contents to an external database table via JDBC. The jdbc() method takes a JDBC URL, destination table name, and a Java Properties object containing other connection information.
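Concretely, the three arguments look like this. The host, database, table, and credentials are placeholders; note that in PySpark the connection information is passed as a plain dict rather than a java.util.Properties object:

```python
# Hypothetical connection details -- substitute your own.
jdbc_url = "jdbc:mysql://localhost:3306/mydb"   # JDBC URL
table = "people"                                # destination table name
properties = {                                  # other connection information
    "user": "spark_user",
    "password": "secret",
    "driver": "com.mysql.jdbc.Driver",  # driver class name in Connector/J 5.x
}
```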

Here is an example of putting these various pieces together to write to a MySQL database.
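A sketch of what such an example plausibly looks like in PySpark, wrapped in a small helper for clarity — the connection details are placeholders, and `jdbc_write` is a name introduced here, not from the original post:

```python
def jdbc_write(df, host, port, database, table, user, password, mode="append"):
    """Write df's rows to `table` in a MySQL database over JDBC (sketch)."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "com.mysql.jdbc.Driver",  # Connector/J 5.x class name
    }
    # DataFrameWriter.jdbc() performs the actual insert.
    df.write.mode(mode).jdbc(url, table, properties=properties)
    return url  # returned only so callers can log the target

# Usage (assumes `df` is an existing DataFrame and MySQL is reachable):
# jdbc_write(df, "localhost", 3306, "mydb", "people", "spark_user", "secret")
```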

Notice in the above example we set the mode of the DataFrameWriter to "append" using df.write.mode("append"). The mode() method specifies how to handle the database insert when the destination table already exists. The default behavior is for Spark to create the destination table and insert data into it. If the table already exists, you will get a TableAlreadyExists exception. In order to write to an existing table you must use mode("append") as in the example above.
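For reference, these are the save-mode strings that DataFrameWriter.mode() accepts, summarized here by what each does when the destination table already exists:

```python
# Behavior of each save mode when the destination table already exists.
save_modes = {
    "error":     "raise an exception (the default)",
    "append":    "insert the DataFrame's rows into the existing table",
    "overwrite": "drop the existing table, recreate it, and insert",
    "ignore":    "leave the table untouched and skip the write",
}
```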

3 thoughts on “Writing to a Database from Spark”

  1. Simeon Kredatus

    Hey, is it possible to write a DataFrame that contains only a subset of a table's defined columns to that table? Is it also possible to let the database take care of the id, so that it is automatically appended on each call? Can you provide an example?

  2. Piyush

    Your article helped me solve an issue I've been stuck on for the past 2 days. I was trying to connect to AWS Redshift with another method (which I thought should work) without any success. I was able to connect to Redshift using your article. Cannot thank you enough.
