Wednesday, June 11, 2014

What is the need to use job.setJarByClass in the driver section of a Hadoop MapReduce solution?

Have you ever thought of this?

We already provide the jar to be executed when we run the hadoop command, don't we?

$ hadoop jar /some-jar.jar
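
For instance, a full invocation (the driver class name and the input/output paths here are just placeholders for illustration) would look something like:

$ hadoop jar /some-jar.jar com.example.MyDriver /input/path /output/path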

So why do we need the following line of code in the driver section while setting up the Job object's properties:
job.setJarByClass(SomeClass.class);

The answer is very simple. This is how you help Hadoop figure out which jar it should ship to the nodes that perform the Map and Reduce tasks. Your some-jar.jar might have various other jars on its classpath, and your driver code might live in a separate jar from your Mapper and Reducer classes.

Hence, with the setJarByClass method we tell Hadoop to locate the relevant jar as the one that contains the class passed as its parameter. So you should usually pass your Mapper implementation, your Reducer implementation, or any other class that lives in the same jar as the Mapper and Reducer. Also make sure that both the Mapper and the Reducer are part of the same jar.
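To put this in context, here is a minimal driver sketch. The class names (WordCountDriver, WordCountMapper, WordCountReducer) and the input/output arguments are hypothetical; the point is simply where setJarByClass fits in:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        // Point Hadoop at the jar that contains the Mapper and Reducer;
        // any class living in that same jar works as the argument.
        job.setJarByClass(WordCountMapper.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}

Here WordCountMapper and WordCountReducer are assumed to be packaged in the same jar as the driver, which is the simplest setup and the one you should aim for.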

Saturday, June 7, 2014

Alternative to deprecated DistributedCache class in Hadoop 2.2.0

As of Hadoop 2.2.0, if you use the org.apache.hadoop.filecache.DistributedCache class to add files to your job's distributed cache, the compiler will warn you that this class is deprecated.

In earlier versions of Hadoop, we used the DistributedCache class in the following fashion to make files available locally to all mappers and reducers:
// In the main driver class using the new mapreduce API
Configuration conf = getConf();
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
...
Job job = new Job(conf);
...

// In the mapper class, usually in the setup method
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

But now, with Hadoop 2.2.0, the functionality for adding files to the distributed cache has been moved to the org.apache.hadoop.mapreduce.Job class. You may also notice that the Job constructor we used to call has been deprecated as well; instead we should use the new factory method Job.getInstance(Configuration conf). The alternative solution looks as follows:

// In the main driver class using the new mapreduce API
Configuration conf = getConf();
...
Job job = Job.getInstance(conf);
...
job.addCacheFile(new URI(filename));

// In the mapper class, usually in the setup method
URI[] localPaths = context.getCacheFiles();
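
As a fuller illustration, here is a rough sketch of a mapper that reads one of the cached files in its setup method. The class name, the tab-separated "key\tcount" file layout, and the lookup-table use case are all made up for this example; it also relies on the usual YARN behaviour of symlinking cached files into the task's working directory under their base names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> lookup = new HashMap<String, Integer>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // The cached file is available in the task's working directory
            // under its base name, so we can open it with plain Java IO.
            String fileName = new Path(cacheFiles[0].getPath()).getName();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Hypothetical tab-separated "key<TAB>count" layout
                    String[] parts = line.split("\t");
                    lookup.put(parts[0], Integer.parseInt(parts[1]));
                }
            } finally {
                reader.close();
            }
        }
    }
}

The map method would then consult the in-memory lookup table built in setup, which is the typical reason for distributing a small side file in the first place.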

Any feedback, good or bad, is most welcome.
