Saturday, June 7, 2014

Alternative to deprecated DistributedCache class in Hadoop 2.2.0

As of Hadoop 2.2.0, if you use the org.apache.hadoop.filecache.DistributedCache class to add files to your job's distributed cache, your compiler will warn you that the class is deprecated.

In earlier versions of Hadoop, we used the DistributedCache class in the following fashion to make files available locally to all mappers and reducers:
// In the main driver class using the new mapreduce API
Configuration conf = getConf();
...
// Register the file with the distributed cache
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
...
Job job = new Job(conf);
...

// In the mapper class, typically in the setup(Context) method;
// note that getLocalCacheFiles() takes a Configuration, not a Job
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

With Hadoop 2.2.0, the functionality for adding files to the distributed cache has been moved to the org.apache.hadoop.mapreduce.Job class. You may also notice that the Job constructor we used above has been deprecated as well; instead, we should use the new factory method Job.getInstance(Configuration conf). The alternative solution looks as follows:

// In the main driver class using the new mapreduce API
Configuration conf = getConf();
...
Job job = Job.getInstance(conf);
...
// java.net.URI(String) throws URISyntaxException, which must be handled or declared
job.addCacheFile(new URI(filename));

// In the mapper class, typically in the setup(Context) method;
// this returns the URIs as they were added, not localized paths
URI[] cacheFiles = context.getCacheFiles();
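
One point worth calling out, and a recurring source of confusion in the comments below: context.getCacheFiles() returns the URIs exactly as they were added, not the paths of the localized copies on each node. The simplest way to read a cached file inside a task is through the symlink that YARN creates in the task's working directory. Here is a minimal sketch of a mapper doing that; the file path /user/me/lookup.txt and the link name lookup are hypothetical, and the #lookup fragment on the URI is what sets the symlink name:

// Hypothetical driver call:
// job.addCacheFile(new URI("/user/me/lookup.txt#lookup"));

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "lookup" is the symlink in the task's working directory,
        // pointing at the localized copy of the cached file
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // parse each line into an in-memory lookup table here
            }
        }
    }
}

If you omit the fragment, the symlink name defaults to the file's base name (lookup.txt in this example).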

Comments:

  1. Hello Amarkant,
    Thank you for highlighting this new feature.
    I tried to reproduce this implementation using Amazon EMR. However, when I try to open the cached files, the URI (representing a file stored in S3) seems to get mangled: instead of "s3n://mybucket/my/file" it turns into "s3n:/mybucket/my/file", so I get a java.io.FileNotFoundException.
    Any idea why that "/" is being removed, and how to solve this issue?
    Cheers,
    David

  2. That is indeed a bit weird. I haven't tried it on EMR with 2.2.0 yet, but you may try two things:
    1) s3n://mybucket/my/file#file
    2) s3n:////mybucket/my/file

  3. Hello Amarkant,
    Thank you for your response. I tried both possibilities but with no luck. I'm getting the same Exception still.

    1) s3n:/ngs.dcr.repository/test/ADN76_S1.paired.sorted.bai#ADN76_S1.paired.sorted.bai (No such file or directory)

    2) s3n:/ngs.dcr.repository/test/ADN76_S1.paired.sorted.bai (No such file or directory)

    The second one is even weirder, as it seems to just ignore all those extra slash characters.
    I'll continue trying to figure out why this happens, but if you have time to try it on EMR, please let me know what results you obtain.
    Thank you again,
    David

  4. I tried Path[] localPaths = context.getLocalCacheFiles();
    in the setup method of my reducer, but I'm not getting any values. It works fine in the mapper. Please help me with a solution.

    Replies
    1. Hi Anupriya,
      Try using context.getCacheFiles(), since context.getLocalCacheFiles() is also deprecated.

  5. Thanks for the blog post.
    I've tried to switch to the new way, unfortunately without success.
    I get the java.io.FileNotFoundException: /user/devclient/dev/mrjoins/poc/movies_poc.dat (No such file or directory)

    When I used the deprecated functionality, it worked in the following way:
    First, in my main driver class, I called DistributedCache.addCacheFile(new URI("/user/devclient/dev/mrjoins/poc/movies_poc.dat"), conf);
    The String argument in my case is the HDFS location of the file I needed to be cached.

    Next, in my mapper class, calling DistributedCache.getLocalCacheFiles(context.getConfiguration()); returned an org.apache.hadoop.fs.Path array, the elements of which were the following:
    /data/yarn/nm/usercache/devclient/appcache/application_1422517810379_0006/container_e52_1422517810379_0006_01_000002/movies_poc.dat
    which is the location of the cached file on my local machine.

    Then I called the method getName() on the Path instance, and it produced the following output:
    movies_poc.dat
    which was exactly the file I needed to retrieve from cache.


    However, with the new suggested way, I get different results.
    First, as stated in your blog, in my main driver class I call job.addCacheFile(new URI("/user/devclient/dev/mrjoins/poc/movies_poc.dat"));

    Next, calling context.getCacheFiles(); in the mapper class produces a java.net.URI array, the elements of which are the following:
    /user/devclient/dev/mrjoins/poc/movies_poc.dat
    which is completely different from what the deprecated method returned; it is just the HDFS location of the file I needed to be cached.
    As a consequence, I get the java.io.FileNotFoundException: /user/devclient/dev/mrjoins/poc/movies_poc.dat (No such file or directory), because this file doesn't exist on my local machine; it is stored on HDFS only.

    Could this mean that the file wasn't cached by the call to job.addCacheFile(new URI("/user/devclient/dev/mrjoins/poc/movies_poc.dat")); in the main driver class?

    Any help on how to fix this issue would be greatly appreciated, thanks.

  6. This would work only in a single-node cluster and not in a multi-node cluster.

  7. I am receiving a runtime error: java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.addCacheFile(Ljava/net/URI;)
    There is no compile-time error, but when I run the code from the Unix terminal on Cloudera with the hadoop jar command, it gives this error.
    Can anyone please help?

Any feedback, good or bad, is most welcome.
