Tuesday, April 9, 2013

HDFS split versus Mapper input split?

The following two types of splitting are completely separate:
  1. Splitting files into HDFS blocks
  2. Splitting files to be distributed to the mappers

Now, by default, if you are using FileInputFormat, these two types of splitting coincide: each input split covers exactly one HDFS block.

But you can always have a custom way of splitting for the second point above (or even no splitting at all, i.e. have one complete file go to a single mapper).

Also, you can change the HDFS block size independently of the way your InputFormat splits the input data.
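To see why the two knobs are independent: FileInputFormat derives the split size from three values (a minimum split size, a maximum split size, and the HDFS block size), so changing any one of them shifts the split size without touching the others. The sketch below mirrors the max/min formula used in Hadoop's FileInputFormat; the concrete sizes chosen are illustrative, not Hadoop defaults.

```java
public class SplitSize {
    // Mirrors the split-size computation in Hadoop's FileInputFormat:
    // the split size is the block size, clamped between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb; // HDFS block size (example value)

        // With permissive min/max, the split size simply tracks the block size:
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Raising the minimum forces larger (and therefore fewer) splits,
        // without changing how HDFS stores the blocks:
        System.out.println(computeSplitSize(blockSize, 128 * mb, Long.MAX_VALUE));
    }
}
```

So a job can process 128 MB logical splits over a file stored as 64 MB physical blocks, or vice versa.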

Another important point to note here is that, while files are physically broken into blocks when stored in HDFS, the split used to distribute data to mappers involves no physical cutting at all: an input split is purely logical, a byte range over the file.
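A logical split, then, is little more than a (start offset, length) pair recorded against the file. The toy sketch below (my own illustration, not Hadoop code) generates such ranges for a given file and split size; nothing about the file itself is touched.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalSplits {
    // A logical split is just a byte range {start, length} over the file;
    // no bytes are copied or moved. Returned as long[2] = {start, length}.
    static List<long[]> split(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            splits.add(new long[] { start, Math.min(splitSize, fileSize - start) });
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // A 200 MB file with 64 MB splits yields four ranges, the last 8 MB long.
        for (long[] s : split(200 * mb, 64 * mb)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```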

Let's discuss it with an example:

Suppose we want to load a 110 MB text file into HDFS, with both the HDFS block size and the input split size set to 64 MB.
  1. The number of mappers is based on the number of input splits, not the number of HDFS blocks.
  2. When we set the HDFS block size to 64 MB, it is exactly 67108864 (64*1024*1024) bytes. HDFS does not care about record boundaries, so the file may well be cut in the middle of a line.
  3. Now we have two input splits (and so two map tasks). The last line of the first block and the first line of the second block are not meaningful on their own. TextInputFormat is responsible for reassembling complete lines and giving them to the map tasks. What TextInputFormat does is:
    • The first mapper reads to the end of the first block, and then continues reading to finish the line that straddles the boundary (the last incomplete line of the first block + the first incomplete line of the second block).
    • In the second block it seeks past the partial first line to the start of the next complete line, and reads from there for the second mapper.
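The two rules above can be simulated on a small byte buffer. This is an assumed simplification of how a line record reader behaves, not Hadoop's actual LineRecordReader code: a reader whose split does not start at offset 0 skips to the first newline, and every reader finishes the line it is inside when it reaches its split end, even if that means reading a few bytes past the boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitDemo {
    // Read the lines belonging to the split [start, end) of 'data',
    // following the boundary rules described above.
    static List<String> readLines(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: skip the partial line, since the
            // previous reader is responsible for completing it.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Start new lines only while before 'end', but finish the current
        // line even if that runs past the split boundary.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        int boundary = 8; // falls in the middle of "bravo"
        System.out.println(readLines(data, 0, boundary));           // [alpha, bravo]
        System.out.println(readLines(data, boundary, data.length)); // [charlie]
    }
}
```

Note how "bravo" is processed exactly once, by the first reader, even though the boundary cuts it in half.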
