In this post we’ll see what is data locality in Hadoop and how it helps in minimizing the network traffic and increasing the overall throughput of the cluster.
Data locality optimization in Hadoop
To understand data locality in Hadoop you will need to understand how file is stored in HDFS and how MapReduce job calculates the number of input splits and launch map tasks to process data referred by those splits.
HDFS stores a file by dividing it into blocks of 128 MB (which is the default block size). These blocks are then stored in different nodes across the Hadoop cluster. There is also replication of blocks (by default replication factor is 3) so each block is stored on 3 different nodes for redundancy.
- Refer Replica Placement Policy in Hadoop Framework to understand how blocks are replicated in HDFS keeping cluster topology in view.
MapReduce job splits its input into input splits where split size is the size of an HDFS block, which is 128 MB by default. Hadoop creates one map task for each split i.e. if there are 8 input splits then 8 map tasks will be launched.
It is actually the client running the MapReduce job that calculates the splits for the job by calling getSplits().That split information is used by YARN ApplicationMaster to try to schedule map tasks on the same node where split data is residing thus making the task data local. If map tasks are spawned at random locations then each map task has to copy the data it needs to process from the DataNode where that split data is residing, resulting in lots of cluster bandwidth. By trying to schedule map tasks on the same node where split data is residing what Hadoop framework does is to send computation to data rather than bringing data to computation, saving cluster bandwidth. This is called the data locality optimization.
Note here that it is not always possible to launch the map task on the same node where the input data resides because of resource constraints, in that case Hadoop framework will try to minimize the distance by trying to make map task rack local, if that is also not possible then it runs map task on different rack.
Categories based on data proximity
Based on where data for the mapper resides there are three categories.
- Data local– If map task runs on the same node where the split data resides it is referred as data local. This is the optimal scenario.
- Rack local– If the map task working on the data is launched on different node but in the same rack where the data resides this is known as rack local.
- Different rack- If rack local is also not possible then map task is launched on a node on a different rack. In this case data has to be transferred between racks from the node where the split data resides to the node where map task is running.
That's all for this topic Data Locality in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!
>>>Return to Hadoop Framework Tutorial Page
Related Topics
You may also like-