Thursday, May 23, 2013

Amazon EMR: How to Add More Than 256 Steps to a Cluster?

If you have been using Amazon EMR for long and complex tasks, you might know that EMR currently limits the number of steps that can be added to a cluster to 256.

But at times it can be difficult to stay within 256 steps. Perhaps the problem at hand is complex and has to be broken into several steps, each run over varied sets of data. Or perhaps you have long-running jobs handling multiple tasks for a Hive-based data warehouse. Whatever the reason may be, you shouldn't get depressed about it! There is a simple workaround:

Manually connect to the master node and submit job steps, just like you would run them on your local machine!
Yeah, that's it. Simple.

EMR's CLI already provides everything we need here.
Assuming that you already have a cluster spawned and know its JobFlowID, follow the steps below to submit job steps directly to the master node:

  1. Move your executables to the master node
    In order to run your job step, you will need to move the JAR and/or other files required by your job to the master node. This can be done using the EMR CLI's --scp option:
    ruby elastic-mapreduce --jobflow JobFlowID --scp myJob.jar
  2. Execute the hadoop command, just as you would on your local machine.
    This can be done using the EMR CLI's --ssh option (see the combined sketch after these steps):
    ruby elastic-mapreduce --jobflow JobFlowID --ssh 'hadoop jar myJob.jar inputPath outputPath otherArguments'
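
Putting the two together, here is a minimal end-to-end sketch. It assumes that --scp drops myJob.jar into the hadoop user's home directory on the master node (where the --ssh command then runs); JobFlowID and the input/output paths are placeholders to substitute with your own:

ruby elastic-mapreduce --jobflow JobFlowID --scp myJob.jar
ruby elastic-mapreduce --jobflow JobFlowID --ssh 'hadoop jar myJob.jar inputPath outputPath'
ruby elastic-mapreduce --jobflow JobFlowID --ssh 'hadoop fs -ls outputPath'

Since these jobs are launched over SSH rather than submitted as EMR steps, they do not count against the 256-step limit.
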
There are other ways as well; refer to the EMR documentation for more.


Tuesday, May 14, 2013

How to read a file in HDFS using Hadoop

You may want to read programmatically from HDFS in your mapper or reducer. Following up on my post on How to write a file in HDFS using Hadoop, here is simple code to read a file from HDFS:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHDFS {

 public static void main(String[] args) throws Exception {

  if (args.length < 1) {
   System.out.println("Usage: ReadFromHDFS ");
   System.out.println("Example: ReadFromHDFS 'hdfs:/localhost:9000/myFirstSelfWriteFile'");
   System.exit(-1);
  }

  try {
   Path path = new Path(args[0]);
   // Get a handle to the FileSystem named by the default configuration (HDFS here).
   FileSystem fileSystem = FileSystem.get(new Configuration());
   // open() returns an FSDataInputStream; wrap it to read the file line by line.
   BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileSystem.open(path)));
   try {
    String line = bufferedReader.readLine();
    while (line != null) {
     System.out.println(line);
     line = bufferedReader.readLine();
    }
   } finally {
    // Closing the reader also closes the underlying HDFS stream.
    bufferedReader.close();
   }
  } catch (IOException e) {
   e.printStackTrace();
  }
 }
}
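
The snippet above runs as a standalone program. If you want to do the same from within a mapper or reducer, as mentioned at the start, the pattern is identical; the main difference is reusing the job's Configuration from the task context instead of creating a new one. Here is a minimal sketch, where SideFileMapper, the side-file path, and the key/value types are hypothetical placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {

 private List<String> sideData = new ArrayList<String>();

 @Override
 protected void setup(Context context) throws IOException, InterruptedException {
  // Reuse the job's Configuration so we talk to the same HDFS as the job itself.
  FileSystem fileSystem = FileSystem.get(context.getConfiguration());
  Path path = new Path("/mySideFile"); // hypothetical side file in HDFS
  BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileSystem.open(path)));
  try {
   String line = bufferedReader.readLine();
   while (line != null) {
    sideData.add(line);
    line = bufferedReader.readLine();
   }
  } finally {
   bufferedReader.close();
  }
 }

 @Override
 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  // sideData is now available to every map() call on this task.
  context.write(value, new Text("side file has " + sideData.size() + " lines"));
 }
}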

Any feedback, good or bad, is most welcome.
