How to Load a Text File into Spark

Loading text files in Spark is a very common task, and luckily it is easy to do.

Below are a few examples of loading a text file (located in the Big Datums GitHub repo) into an RDD in Spark. If you have looked at the Spark documentation, you will notice that its examples do not include the file:// prefix. This prefix is often needed, however, because many Spark installations use HDFS as the default location for reading input and writing output files. The examples below were run in the Spark shell.

Create RDD from text file:
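A minimal sketch in the Spark shell, where sc is the SparkContext the shell provides. The path below is a hypothetical placeholder, not the actual Big Datums file location; note the file:// prefix for a local (non-HDFS) file:

```scala
// Load each line of a local text file into an RDD[String].
// "file:///path/to/users.txt" is a hypothetical path -- substitute your own.
val fileRdd = sc.textFile("file:///path/to/users.txt")

// Actions like count() trigger the actual read.
fileRdd.count()
```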

Create RDD from text file and filter header row:
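One common way to drop a header row is to grab the first line and filter it out; this sketch assumes the same hypothetical file path as above:

```scala
val fileRdd = sc.textFile("file:///path/to/users.txt")

// first() returns the header line; filter keeps every line that differs from it.
val header = fileRdd.first()
val dataRdd = fileRdd.filter(line => line != header)
```

Note that this also removes any data line identical to the header, which is usually acceptable for files with a distinctive header row.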

Create RDD of usernames by splitting file records on the field delimiter (in this case “\t”) and retaining only the second field (username):
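A sketch of the split-and-project step, assuming a tab-delimited file where the username is the second field (index 1); the path is again a hypothetical placeholder:

```scala
val usernameRdd = sc.textFile("file:///path/to/users.txt")
  // Split each record on the tab delimiter into an Array[String].
  .map(line => line.split("\t"))
  // Keep only the second field (index 1), the username.
  .map(fields => fields(1))
```

If the file has a header row, apply the header-filtering step first so the header's second field does not end up in the result.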
