Hadoop – Setting Configuration Parameters on Command Line

Often when running MapReduce jobs, it is convenient to set configuration parameters from the command line. This avoids hard-coding settings such as the number of mappers, the number of reducers, or the max split size. A job class gets command-line option parsing with little effort by implementing Tool and extending Configured.

Below is a simple example. Note that there are a fair number of differences between the code below and this other simple MapReduce example, even though the input and output are the same. Notice that the job is executed by ToolRunner's static run() method.
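A minimal sketch of such a job, under a few assumptions: it targets Hadoop 2 or later (for Job.getInstance), the class name WordCountTool is hypothetical, and the mapper and reducer are Hadoop's bundled TokenCounterMapper and IntSumReducer, chosen only to keep the listing short. Treat it as an illustration of the Tool/Configured pattern rather than a definitive implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical class name; any job class can follow the same pattern.
public class WordCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already consumed the generic options (-D, -conf, ...),
        // so args holds only the job's own arguments.
        if (args.length != 2) {
            System.err.println("Usage: WordCountTool [generic options] <input dir> <output dir>");
            return 2;
        }

        // getConf() returns the Configuration that ToolRunner populated,
        // including any -D property=value pairs from the command line.
        Job job = Job.getInstance(getConf(), "word count (Tool example)");
        job.setJarByClass(WordCountTool.class);

        // Bundled mapper/reducer: emit (token, 1) and sum the counts per token.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options into the Configuration,
        // then calls run() with the remaining arguments.
        int exitCode = ToolRunner.run(new Configuration(), new WordCountTool(), args);
        System.exit(exitCode);
    }
}
```

Because the job is built from getConf() rather than a fresh Configuration, any property supplied on the command line is visible to the job without further code.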

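Here is the general form for running the job and setting the mapred.max.split.size configuration parameter (named mapreduce.input.fileinputformat.split.maxsize in Hadoop 2 and later) on the command line. The generic options go after the main class name and before the job's own arguments; the names in angle brackets are placeholders:

    hadoop jar <jar file> <main class> -D mapred.max.split.size=<bytes> <input dir> <output dir>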

Notice that the HDFS input and output directories are still passed as arguments: ToolRunner strips the generic options and hands the remaining arguments to run(). The list of supported command line options is shown below:

GENERIC_OPTION                                Description
-conf <configuration file>                    Specify an application configuration file.
-D <property=value>                           Use value for given property.
-fs <local|namenode:port>                     Specify a namenode.
-jt <local|jobtracker:port>                   Specify a job tracker. Applies only to job.
-files <comma separated list of files>        Specify comma separated files to be copied to the map reduce cluster. Applies only to job.
-libjars <comma separated list of jars>       Specify comma separated jar files to include in the classpath. Applies only to job.
-archives <comma separated list of archives>  Specify comma separated archives to be unarchived on the compute machines. Applies only to job.
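
To illustrate why the input and output directories still arrive as ordinary arguments, here is a simplified stand-in written in plain Java. It is not Hadoop's GenericOptionsParser: the class GenericOptionSketch and its fields are hypothetical, it handles only -D, and it assumes generic options come before the job's own arguments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration; handles only the -D option.
public class GenericOptionSketch {

    /** Properties collected from -D options, in the order given. */
    public final Map<String, String> properties = new LinkedHashMap<>();

    /** Arguments left over for the job itself (e.g. input and output dirs). */
    public final List<String> remainingArgs = new ArrayList<>();

    public GenericOptionSketch(String[] args) {
        int i = 0;
        // Consume leading -D options; stop at the first other argument.
        while (i < args.length) {
            String arg = args[i];
            if (arg.equals("-D") && i + 1 < args.length) {
                // Form: -D property=value (two separate tokens)
                addProperty(args[i + 1]);
                i += 2;
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                // Form: -Dproperty=value (single token)
                addProperty(arg.substring(2));
                i += 1;
            } else {
                break;
            }
        }
        // Everything from the first non -D token onward belongs to the job.
        for (; i < args.length; i++) {
            remainingArgs.add(args[i]);
        }
    }

    private void addProperty(String pair) {
        int eq = pair.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("Expected property=value but got: " + pair);
        }
        properties.put(pair.substring(0, eq), pair.substring(eq + 1));
    }

    public static void main(String[] args) {
        String[] example = {"-D", "mapred.max.split.size=1048576", "/input", "/output"};
        GenericOptionSketch parsed = new GenericOptionSketch(example);
        System.out.println("properties = " + parsed.properties);
        System.out.println("remaining  = " + parsed.remainingArgs);
    }
}
```

In the sketch, the -D pair ends up in the properties map while /input and /output are left in remainingArgs, which mirrors the split between the Configuration and the args array passed to run().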
