Hadoop Architecture
- Hadoop divides a file into chunks (typically 64 MB in size) and stores each chunk on a DataNode.
- Each chunk is replicated multiple times (typically 3 times) to guard against node failure.
If a node fails, the chunks it held are automatically re-replicated from other nodes to keep the replication factor the same as before.
- One node in the Hadoop cluster is called the NameNode.
This node stores only the meta-data for chunks of files and keeps this information in memory.
This allows the NameNode to respond very quickly when asked for the whereabouts of a file's chunks.
- When chunks are needed, the NameNode only provides their locations.
The chunks themselves are read directly from the DataNodes.
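As a rough illustration of this split of responsibilities, the sketch below uses Hadoop's Java FileSystem API to ask for the block locations of a file; the path /data/large-file.bin is just a made-up example. Each BlockLocation lists the DataNodes that hold a replica of that chunk, which is also where the actual reads would go.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration.
        Path file = new Path("/data/large-file.bin");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this call from its in-memory meta-data;
        // no file data is transferred here.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // getHosts() lists the DataNodes holding replicas of this chunk.
            System.out.printf("offset=%d length=%d replicas=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```

In everyday use a client would not do this by hand; opening the file with fs.open(file) performs the same NameNode lookup behind the scenes and then streams the data straight from the DataNodes.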
Why huge block-sizes?
Let's say HDFS is storing a 1000 MB file. With a 4 KB block size, 256,000 requests would be needed to read that file (1 request per block).
In HDFS, those requests go across a network and come with a lot of overhead.
Additionally, each request is processed by the NameNode to figure out the block's physical location.
With 64 MB blocks, the number of requests goes down to 16, which is far more efficient in terms of network traffic.
It also reduces the load on the NameNode and shrinks the meta-data for the entire file, allowing the meta-data to be kept in memory.
Thus, for large files, a bigger block size in HDFS is a boon.
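The arithmetic behind those numbers is simple; the small sketch below just recomputes the request counts for the 1000 MB example at both block sizes (plain arithmetic, not an HDFS call).

```java
public class BlockCountExample {
    public static void main(String[] args) {
        long fileSizeBytes = 1000L * 1024 * 1024;   // the 1000 MB file from the example

        long smallBlock = 4L * 1024;                // 4 KB block size
        long largeBlock = 64L * 1024 * 1024;        // 64 MB block size

        // One request per block; round up for a partial last block.
        long smallBlockRequests = (fileSizeBytes + smallBlock - 1) / smallBlock;
        long largeBlockRequests = (fileSizeBytes + largeBlock - 1) / largeBlock;

        System.out.println("4 KB blocks  -> " + smallBlockRequests + " requests");  // 256000
        System.out.println("64 MB blocks -> " + largeBlockRequests + " requests");  // 16
    }
}
```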