Friday, December 28, 2012

EMR Streaming job using Java code for mapper and reducer [Creating custom jar]

Here is a basic sample of how to create a custom jar for an EMR streaming job.

Let's assume that the mapper code needs to read a CSV file from EMR's distributed cache, read the CSV files in the input S3 bucket, do some calculations, and print CSV output lines to standard output.
There will be one main class, which contains one implementation of each of the following classes:

org.apache.hadoop.mapreduce.Mapper;
org.apache.hadoop.mapreduce.Reducer;


Each of these has to override its map() or reduce() method, respectively, to do the desired job.

The Java class for the job, containing both the Mapper and the Reducer, looks like the following:
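To make the roles of map() and reduce() concrete, here is a minimal plain-Java sketch of the flow, with no Hadoop classes involved. The class name LocalMapReduceSketch, the "key,number" line format, and the choice of summing values are all assumptions made for illustration; a real job does whatever calculation it needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of map -> group by key -> reduce. Not Hadoop API. */
final class LocalMapReduceSketch {

    private LocalMapReduceSketch() {}

    /** "Map" step: turn one CSV line of the assumed form "key,number" into a (key, value) pair. */
    static String[] map(String csvLine) {
        String[] fields = csvLine.split(",", 2);
        return new String[] { fields[0].trim(), fields[1].trim() };
    }

    /** "Reduce" step: combine every value seen for one key into a single CSV output line. */
    static String reduce(String key, List<String> values) {
        long sum = 0;
        for (String value : values) {
            sum += Long.parseLong(value);
        }
        return key + "," + sum;
    }

    /** Runs map over every line, groups the pairs by key, then reduces each group. */
    static List<String> run(List<String> inputLines) {
        // TreeMap keeps keys sorted, loosely mimicking the sorted keys a reducer receives.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : inputLines) {
            String[] pair = map(line);
            List<String> values = grouped.get(pair[0]);
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(pair[0], values);
            }
            values.add(pair[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }
}
```

In the real job the grouping in run() is done by Hadoop between the map and reduce phases; only the bodies of map() and reduce() are yours to write.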

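import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SomeJob extends Configured implements Tool {

    private static final String JOB_NAME = "My Job";

    /**
     * This is Mapper. Input: (byte offset, line of text). Output: (Text, Text).
     */
    public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

        private Text outputKey = new Text();
        private Text outputValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {

            // Get the cached file (the first file added to the distributed cache)
            Path file = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];

            File fileObject = new File(file.toString());
            // Do whatever required with file data
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            outputKey.set("Some key calculated or derived");
            outputValue.set("Some Value calculated or derived");
            context.write(outputKey, outputValue);
        }
    }

    /**
     * This is Reducer. Input: (Text, all Text values for that key). Output: (Text, Text).
     */
    public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

        private Text outputKey = new Text();
        private Text outputValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
                InterruptedException {
            outputKey.set("Some key calculated or derived");
            outputValue.set("Some Value calculated or derived");
            context.write(outputKey, outputValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        try {
            Configuration conf = getConf();
            // args[2] is the URI of the CSV file to place in the distributed cache
            DistributedCache.addCacheFile(new URI(args[2]), conf);
            Job job = new Job(conf);

            job.setJarByClass(SomeJob.class);
            job.setJobName(JOB_NAME);

            job.setMapperClass(MapJob.class);
            job.setReducerClass(ReduceJob.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // args[0] is the input path, args[1] is the output path
            FileInputFormat.setInputPaths(job, args[0]);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        } catch (Exception e) {
            e.printStackTrace();
            return 1;
        }

    }

    public static void main(String[] args) throws Exception {

        if (args.length < 3) {
            System.out
                    .println("Usage: SomeJob <input path> <output path> <cache file URI>");
            System.exit(-1);
        }

        int result = ToolRunner.run(new SomeJob(), args);
        System.exit(result);
    }

}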
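The setup() method above leaves the handling of the cached file open ("Do whatever required with file data"). One common choice is to load it into an in-memory lookup that map() can consult. The following is only a sketch under assumptions: the helper name CsvLookup is hypothetical, the cached file is assumed to hold two columns per line (key, then value), and the naive comma split does not handle quoted fields.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: loads a two-column CSV file into an in-memory lookup map. */
final class CsvLookup {

    private CsvLookup() {}

    /** Reads "key,value" lines; blank lines and lines without a comma are skipped. */
    static Map<String, String> load(File csvFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(csvFile.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // Split on the first comma only, so the value may itself contain commas.
                String[] fields = line.split(",", 2);
                if (fields.length == 2) {
                    lookup.put(fields[0].trim(), fields[1].trim());
                }
            }
        }
        return lookup;
    }
}
```

Inside setup() this could be called as CsvLookup.load(fileObject), with the result stored in a field of the mapper. This only works if the cached file is small enough to fit in each task's memory.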

Now, to spawn the cluster, the command should look like this:

ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 11  --name "Java Pipeline" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--mapred-config-file, s3://com.versata.emr/conf/mapred-site-tuned.xml"

This command should return a job ID, which is then used to add steps. The cluster executes the steps in order, distributing the work of each step across its nodes.

To add job steps (replace <job ID> with the ID returned by the command above):

Step 1:

ruby elastic-mapreduce --jobflow <job ID> --jar s3://somepath/job-one.jar --arg s3://somepath/input-one --arg s3://somepath/output-one --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

Step 2:

ruby elastic-mapreduce --jobflow <job ID> --jar s3://somepath/job-two.jar --arg s3://somepath/output-one --arg s3://somepath/output-two --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0
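The run() method of SomeJob reads three positional arguments: args[0] as the input path, args[1] as the output path, and args[2] as the URI of the file for the distributed cache. Assuming each --arg value reaches main() in the order it is given on the command line, it can help to validate the arguments up front so that a misconfigured step fails with a clear message. The class name JobArgs is hypothetical; this is a sketch, not part of the job above.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical holder for the three positional arguments that SomeJob.run() reads. */
final class JobArgs {
    final String inputPath;
    final String outputPath;
    final URI cacheFile;

    private JobArgs(String inputPath, String outputPath, URI cacheFile) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.cacheFile = cacheFile;
    }

    /** Checks the argument count and the cache file URI syntax before any job is configured. */
    static JobArgs parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                    "Expected <input path> <output path> <cache file URI>, got "
                            + args.length + " argument(s)");
        }
        try {
            return new JobArgs(args[0], args[1], new URI(args[2]));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid cache file URI: " + args[2], e);
        }
    }
}
```

Any arguments beyond the first three are ignored by this sketch, just as run() ignores them.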
