Wednesday, June 11, 2014

What is the need to use job.setJarByClass in the driver section of Hadoop MapReduce solution?

Have you ever thought of this?

We do provide the jar to be executed by hadoop while executing hadoop command, don't we?

$ hadoop jar /some-jar.jar

So, why do we need to have the following line of code in our Driver section while declaring the Job object properties:

Answer to that is very simple. Here you help Hadoop to find out that which jar it should send to nodes to perform Map and Reduce tasks. Your some-jar.jar might have various other jars in it's classpath, also your driver code might be in a separate jar than that of your Mapper and Reducer classes.

Hence, using this setJarByClass method we tell Hadoop to find out the relevant jar by finding out that the class specified as it's parameter to be present as part of that jar. So usually you should provide either MapperImplementation.class or your Reducer implementation or any other class which is present in the same jar as that pf Mapper and Reducer. Also make sure that both Mapper and Reducer are part of the same jar.

Saturday, June 7, 2014

Alternative to deprecated DistributedCache class in Hadoop 2.2.0

As of Hadoop 2.2.0, if you use org.apache.hadoop.filecache.DistributedCache class to load files you want to add to your job as distributed cache, then your compiler will warn you regarding this class being deprecated.

In earlier versions of Hadoop, we used DistributedCache class in the following fashion to add files to be available to all mappers and reducers locally:
// In the main driver class using the new mapreduce API
Configuration conf = getConf();
DistributedCache.addCacheFile(new Path(filename).toUri(), conf);
Job job = new Job(conf);

// In the mapper class, mostly in the setup method
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);

But now, with Hadoop 2.2.0, the functionality of addition of files to distributed cache has been moved to the org.apache.hadoop.mapreduce.Job class. You may also notice that the constructor we used to use for the Job  class has also been deprecated and instead we should be using the new factory method getInstance(Configuration conf). The alternative solution would look as follows:

// In the main driver class using the new mapreduce API
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new URI(filename));

// In the mapper class, mostly in the setup method
URI[] localPaths = context.getCacheFiles();

Wednesday, May 21, 2014

DataWarehousing with Redshift: Benefits of a columnar data storage

This a short article stating the benefits of a columnar storage architecture.
In traditional database systems, records are stored into disk blocks by row. Each row is stored one after other. Each row sits in a block. Within those blocks each column sits one after another.
Following diagram illustrates this row-based storage architecture:

This kind of data storage architecture has been created keeping in mind that often a complete record is queried. This is a good fit for scenarios where in CRUD operations are in picture, and data is sought often for a complete row. This architecture is good for OLTP type of applications. 

However, for data analytics and data warehousing use-cases this type of storage architecture isn't a good fit. Amazon's Redshift takes up this issue and have come up with a columnar storage which helps in optimizing data warehousing related queries. In this architecture values for each column are stored sequentially into disk blocks.

Following diagram illustrates this column-based storage architecture:

Row-wise storage disadvantages:

  • If block size is smaller than the size of a record, storage for an entire record may take more than one block.
  • If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.

So, if you can see, storing records in row-wise fashion isn't storage space optimized.

Column-wise storage advantages:

  • Each data block holds column field values for as many as three times as many records as row-based storage.
  • Requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.
  • Since each block holds the same type of data, it can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O.
Hence, by using a columnar storage architecture Redshift has reduced the disk space being used to store same data resulting in reduced number of seeks it would be required to read the stored data in order to process an user query, ultimately increasing the query performance on large datasets.

Redshift achieves high performance over queries on very huge amount of data. Along with the columnar architecture there are other features Redshift has added in order to achieve this performance, they are:

  • Massively parallel processing
  • Data compression
  • Query optimization
  • Compiled code

I will take these up in detail in one of future posts. Try out Redshift if you need to analyze your data using some of the popular BI tools like Tableau and Jaspersoft. If your dataset size is huge and you are struggling to get good query performace, then I would certainly suggest you to try out Redshift once. Read more about it here.

Disclaimer: Many details including diagrams were taken from official AWS documentations.

Any feedback, good or bad is most welcome.


Email *

Message *