What are the daemons that are required to start HDFS?

Hadoop = HDFS + MapReduce (Part – I)

We discussed in the last post that Hadoop has many components in its ecosystem, such as Pig, Hive, HBase, Flume, Sqoop, Oozie, etc. But the two core components that form the kernel of Hadoop are HDFS and MapReduce. We will discuss HDFS in more detail in this post.

HDFS is the Hadoop Distributed File System, which is responsible for storing data on the cluster in Hadoop. Files in HDFS are split into blocks before they are stored on the cluster. The typical size of a block is 64 MB or 128 MB; for example, a 300 MB file stored with a 128 MB block size is split into three blocks (128 MB + 128 MB + 44 MB). The blocks belonging to one file are then stored on different nodes, and each block is also replicated to ensure high reliability. To delve deeper into how HDFS achieves all this, we first need to understand the Hadoop daemons.

Hadoop Daemons

A daemon, in computing terms, is a process that runs in the background. Hadoop has five such daemons: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. Each daemon runs separately in its own JVM. We discuss the NameNode, Secondary NameNode and DataNode in this post, as they are associated with HDFS.

  • NameNode – It is the master node, which is responsible for storing the metadata for all the files and directories. It has information such as the blocks that make up a file, and where those blocks are located in the cluster.
  • DataNode – It is the slave node that contains the actual data. It reports information about the blocks it contains to the NameNode in a periodic fashion.

It should be understood that it is highly important that the NameNode runs all the time. A failure of the NameNode makes the cluster inaccessible, as there would be no information on where the files are located in the cluster. For this very reason, we have the Secondary NameNode.

  • Secondary NameNode – It periodically merges the changes recorded in the edit log into the NameNode's filesystem image, so that the edit log doesn't grow too large in size. It also keeps a copy of the merged image, which can be used in case of failure of the NameNode.

We will now discuss the following with respect to HDFS:

  1. Writing a file on the cluster
  2. Reading a file from the cluster
  3. Fault tolerance strategy
  4. Replication strategy

Writing a file on the cluster

  1. The user, through a client, requests to write data on the Hadoop cluster.
  2. The user sets the replication factor (default 3) and block size through the configuration options (see the sketch after this list).
  3. The client splits the file into blocks and contacts the NameNode.
  4. The NameNode returns a list of DataNodes (in increasing order of their distance from the client node).
  5. The client sends the data to the first DataNode, which, while receiving the data, forwards it to the next DataNode (which does the same, and this forms the replication pipeline).
  6. The DataNodes send acknowledgements to the NameNode on successfully receiving the data.
  7. The client repeats the same process for all the other blocks that constitute the file.
  8. When all the blocks are written on the cluster, the NameNode closes the file and stores the metadata information.
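
To make the write path concrete, here is a minimal client-side sketch using the Hadoop Java FileSystem API. It is only an illustration: the NameNode address, file path and sample content are placeholder values, and the replication factor and block size are passed explicitly to mirror step 2.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");          // placeholder path

        short replication = 3;                // default replication factor
        long blockSize = 128L * 1024 * 1024;  // 128 MB block size

        // create(path, overwrite, bufferSize, replication, blockSize):
        // the client asks the NameNode for target DataNodes and streams the
        // data through the replication pipeline described above.
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```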

Reading data from the cluster

  1. The user provides the filename to the client.
  2. The client passes the filename to the NameNode.
  3. The NameNode sends back the names of the blocks that constitute the file. It also sends the locations (DataNodes) where the blocks are available (again in increasing order of the distance from the client).
  4. The client then downloads the data from the nearest DataNode (see the sketch after this list).
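
A matching read sketch, assuming the same placeholder address and path as before. The call to open() is what triggers the NameNode lookup in steps 2 and 3; the returned stream then pulls each block from the nearest DataNode.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");          // placeholder path

        // open() asks the NameNode for the block locations; the stream then
        // reads each block from the nearest available DataNode.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```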

Fault tolerance strategy

There are three types of failure that can occur, namely node failure, communication failure and data corruption.

  • In case of NameNode failure, the Secondary NameNode comes into play. The NameNode then has to be restored with the help of the merged copy of the NameNode image.
  • The DataNode sends a heartbeat message to the NameNode every 3 seconds to inform the NameNode that it is alive. If the NameNode doesn't receive a heartbeat message from a DataNode in 10 minutes (configurable), it considers the DataNode to be dead. It then stores replicas of that DataNode's blocks on other DataNodes.
  • The client receives an ACK from the DataNode confirming that it has received the data. If it doesn't after several tries, it is understood that either there is a network failure or the DataNode has failed.
  • A checksum is sent along with the data to check for data corruption (a small sketch follows this list).
  • Periodically, the DataNodes send a report containing the list of blocks that are uncorrupted. The NameNode then updates its list of valid blocks for each DataNode.
  • For all such under-replicated blocks, the NameNode adds other DataNodes to the replication pipeline.
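
As a small illustration of the checksum mechanism mentioned above, the Hadoop Java API lets a client ask the cluster for a file-level checksum that is derived from the per-block checksums. The address and path are again placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");          // placeholder path

        // HDFS stores a checksum with every block; getFileChecksum() returns a
        // checksum for the whole file computed from those block checksums.
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum != null) {
            System.out.println("Algorithm: " + checksum.getAlgorithmName());
            System.out.println("Checksum : " + checksum);
        }
        fs.close();
    }
}
```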

Replication strategy

The replication factor is set to three by default (it can be configured; a small sketch after the list below shows the relevant configuration properties). The cluster is organised in terms of racks, where each rack contains DataNodes.

  • The NameNode tries to make the client's own node the first DataNode replica. If it is not free, then any node in the same rack as that of the client is made the first replica.
  • The other two replicas are then stored on two different DataNodes in a rack different from the rack of the first replica.
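
The heartbeat interval, replication factor and block size discussed in this post are ordinary Hadoop configuration properties. The following is a minimal sketch for inspecting them from client code; the fallback values passed to the getters are simply the defaults mentioned above, and the property names are the ones used by recent Hadoop versions (older releases used slightly different names, e.g. dfs.block.size).

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsSettingsExample {
    public static void main(String[] args) {
        // Picks up core-site.xml / hdfs-site.xml if they are on the classpath.
        Configuration conf = new Configuration();

        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        int replication = conf.getInt("dfs.replication", 3);
        long blockSizeBytes = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);

        System.out.println("Heartbeat interval (s): " + heartbeatSeconds);
        System.out.println("Replication factor    : " + replication);
        System.out.println("Block size (bytes)    : " + blockSizeBytes);
    }
}
```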

Before I end this post, I would like to point out that HDFS is not a fit for all types of applications and files.

HDFS is a fit when:

  • Files to be stored are large in size.
  • Your application needs to write once and read many times.
  • You want to use cheap, commonly available hardware.

And it is not a fit when:

  • You want to store a large number of small files. It is better suited to storing millions of large files than billions of small files.
  • There are multiple writers. HDFS is only designed for writing at the end of a file, not at a random offset.

I hope this post gives you an overview of HDFS and how it provides failure support, data recoverability and consistency. In the next post we will talk about MapReduce in detail.


References

1. Hadoop – The Definitive Guide (Tom White)

Source: https://learnhadoopwithme.wordpress.com/tag/hadoop-daemons/
