In the post What is Big Data it was already discussed that data of such volume poses challenges in the form of-
- How to store such huge data.
- How to process it.
This post gives an introduction to one of the most widely used big data technologies, the Hadoop framework. Apache Hadoop is an open source framework, written in the Java programming language, that provides both-
- Distributed storage.
- Parallel processing of large data sets on a cluster of nodes.
Apache Hadoop framework
Hadoop framework is designed to scale up from a single machine to thousands of machines in a cluster, where each node in the cluster offers local computation and storage.
Apart from storage and processing of large data sets, the Hadoop framework provides other features like high availability (data blocks are replicated across nodes) and fault tolerance (the framework itself is designed to detect and handle failures at the application layer).
Just to make it clear, a set of machines running HDFS and MapReduce is known as a Hadoop cluster, and each machine in the cluster is known as a node.
Hadoop modules
Hadoop framework includes the following modules-
- Hadoop Common- This module contains the common utilities used by the other Hadoop modules.
- Hadoop Distributed File System (HDFS) – HDFS is the module that provides the distributed storage in the Hadoop framework. It provides high throughput access to application data by breaking a large file into blocks and storing those blocks on different nodes across the cluster.
- Hadoop MapReduce - This module is the implementation of the MapReduce programming model where large data sets are processed in parallel. Each map task works on a part of the input data, and the output of the map tasks is then processed in the reduce phase to give the final output. A minimal sketch of a MapReduce program follows this list.
- Hadoop YARN- The Yet Another Resource Negotiator (YARN) module is used for job scheduling and managing cluster resources.
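To make the MapReduce model a bit more concrete, here is a minimal word count sketch, the classic Hadoop example. The class names are illustrative; only the Mapper and Reducer APIs shown belong to Hadoop MapReduce.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase - each map task gets a part of the input (a split) and
// emits (word, 1) pairs for every word it sees in its split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce phase - all counts for the same word end up at one reducer,
// which sums them to produce the final output for that word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```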
How Hadoop works
Now that we know about the Hadoop modules, let's see how the Hadoop framework actually works.
When a huge file is put into HDFS, the Hadoop framework splits that file into blocks (block size is 128 MB by default). These blocks are then copied to nodes across the cluster. Each block is also replicated (the default replication factor is 3) so that the failure of a node or the corruption of a data block won't result in data loss.
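As a rough illustration, block size and replication can also be specified through the HDFS Java API when a file is written. The path and values below are only examples; in practice these usually come from the cluster configuration (dfs.blocksize, dfs.replication).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/huge-file.txt");  // example path
    // create(path, overwrite, bufferSize, replication, blockSize)
    // 3 replicas per block, 128 MB blocks - the usual defaults
    FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
    out.writeUTF("some data");
    out.close();
    fs.close();
  }
}
```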
The MapReduce program that uses the data is also distributed across the nodes, preferably to the same nodes where the data blocks reside, to take advantage of data locality. That way data is processed in parallel, as different blocks of the file are processed on different nodes simultaneously by the MapReduce code.
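The piece that ties the map and reduce code together is the driver, which submits the job; the framework then schedules one map task per input split, preferring nodes that hold the corresponding blocks. A minimal driver for the word count sketch above (using the same illustrative class names) could look like this.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input is split into blocks/splits; one map task per split,
    // scheduled close to the data where possible (data locality).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```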
YARN is responsible for allocating resources to the various running applications.
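For example, a MapReduce job can tell YARN how much memory its containers need through configuration properties. The property names below are standard MapReduce/YARN properties; the values are purely illustrative and must fit within the limits configured on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceHints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative container sizes this job asks YARN for
    conf.setInt("mapreduce.map.memory.mb", 2048);             // per map task container
    conf.setInt("mapreduce.reduce.memory.mb", 4096);          // per reduce task container
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);   // ApplicationMaster container

    Job job = Job.getInstance(conf, "job with explicit container sizes");
    // rest of the job setup is the same as in the word count driver above
  }
}
```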
A pictorial representation of how the Hadoop framework works is as follows.
Advantages of Hadoop
Some of the advantages of Hadoop are as follows-
- Open source– Apache Hadoop is open source so it has a vibrant community contributing to its development. It can be downloaded freely and can also be modified, if needed, as per business requirements.
- Scalability– Hadoop is scalable; if more nodes are required in a cluster they can be added very easily. Also, if you write a Hadoop MapReduce program that works properly on a five node cluster you don't need to change anything to make it work on a larger cluster.
- Fault tolerance– The Hadoop framework is designed to be fault tolerant; data blocks are replicated across nodes so that there is no data loss. Also, node health checks are done within the framework so that tasks running on a dead node can be rerun on another node.
- Cost effective– The Hadoop framework runs on a cluster of commodity hardware; there is no need for high end specialized hardware for a Hadoop cluster.
Limitations of Hadoop
- Not suited for small files – Hadoop works better with a small number of large files than with a large number of small files, as the overhead involved negates the benefit. Also, having a large number of small files will overload the NameNode, which stores metadata about the files.
- Better for batch processing – The Hadoop MapReduce programming model is essentially a batch processing system. It does not support processing of streaming data.
- Slower processing – In Hadoop, data is distributed and processed across the cluster. There is a lot of disk I/O involved: initially the data block is read from disk, the intermediate map task output is written back to disk, and that data, after some processing by the Hadoop framework, is read again in the reduce phase. All these steps add to latency. There is no in-memory processing like that offered by Apache Spark.
That's all for this topic Introduction to Hadoop Framework. If you have any doubt or any suggestions to make please drop a comment. Thanks!
>>>Return to Hadoop Framework Tutorial Page