Hadoop Distributed File System

    Current trends in web application development and the exponential growth of the information these applications process have created a need for file systems oriented toward high performance, scalability, reliability, and availability. Search-industry giants such as Google and Yahoo could not stand aside from this problem.

    The specifics of Google's applications and of its computing infrastructure, built on a huge number of inexpensive servers with their inherent constant failures, led to the development of its own proprietary distributed file system, the Google File System (GFS). The system is aimed at automatic recovery from failures, high fault tolerance, and high throughput for streaming data access. It is designed to work with large volumes of data, which implies large sizes of stored files, so GFS is optimized for the corresponding operations. In particular, to simplify the implementation and increase efficiency, GFS does not implement the standard POSIX interface.

    The open-source answer to GFS was the Hadoop project, with its Hadoop Distributed File System (HDFS). The project is actively supported and developed by Yahoo (a team of 18 people). Below we compare the terminology used in the two systems, establish the correspondence between them, and then look at HDFS in more detail:
    |                                                        | HDFS             | GFS           |
    | Main server                                            | NameNode         | Master        |
    | Slave servers                                          | DataNode servers | Chunk servers |
    | Append and snapshot operations                         | -                | +             |
    | Automatic recovery after a failure of the main server  | -                | +             |
    | Implementation language                                | Java             | C++           |

    HDFS is the distributed file system used in the Hadoop project. An HDFS cluster consists primarily of a NameNode server and DataNode servers, which store the data itself. The NameNode server manages the file system namespace and client access to the data. To offload the NameNode server, data transfer takes place only between the client and a DataNode server.
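    To make this division of labour concrete, here is a minimal sketch using the standard Hadoop Java client: listing a directory is a pure metadata request answered by the NameNode, while file contents would later be read from DataNode servers directly. The NameNode URI and the path are placeholders, not values from this article.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; substitute your cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Metadata-only request: the NameNode answers it, no DataNode is contacted.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```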

    [Figure: HDFS architecture (hdfs_arch)]

    Secondary NameNode:


    The main NameNode server records every transaction that changes the file system metadata in a log file called the EditLog. When the main NameNode server starts, it reads the HDFS image (stored in the FsImage file) and applies to it all the changes accumulated in the EditLog. It then writes a new image with the changes applied and begins working with a clean log file. Note that the main NameNode server performs this work itself only at startup; afterwards these operations are delegated to the secondary NameNode server. The FsImage and EditLog are ultimately stored on the main server.
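    A toy sketch of the checkpoint procedure described above, under deliberately simplified assumptions: the namespace is just a set of paths, the "FsImage" holds one path per line, and the "EditLog" holds lines such as "ADD /a" or "DEL /a". All names and formats here are invented for illustration and are not the real NameNode data structures.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy checkpoint illustration; not the actual NameNode implementation.
public class CheckpointSketch {
    public static void checkpoint(Path fsImage, Path editLog) throws IOException {
        // 1. Read the last saved image into memory.
        Set<String> namespace = new LinkedHashSet<>();
        if (Files.exists(fsImage)) {
            namespace.addAll(Files.readAllLines(fsImage, StandardCharsets.UTF_8));
        }

        // 2. Replay every metadata transaction accumulated in the EditLog.
        if (Files.exists(editLog)) {
            for (String line : Files.readAllLines(editLog, StandardCharsets.UTF_8)) {
                String[] op = line.split(" ", 2);
                if (op[0].equals("ADD")) namespace.add(op[1]);
                else if (op[0].equals("DEL")) namespace.remove(op[1]);
            }
        }

        // 3. Write the merged image and continue with a clean (empty) log.
        Files.write(fsImage, new ArrayList<>(namespace), StandardCharsets.UTF_8);
        Files.write(editLog, new byte[0]);
    }
}
```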

    Replication mechanism:


    [Figure: HDFS replication (hdfs_repl)]

    If the NameNode server detects the failure of one of the DataNode servers (by the absence of heartbeat messages from it), the data replication mechanism starts:

    - selection of new DataNode servers for new replicas
    - balancing the placement of data on DataNode servers

    The same actions are performed if replicas become corrupted or if the replication factor of a block is increased.
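    Below is a conceptual sketch of the first step, selecting new DataNode servers for the missing replicas (balancing is omitted). The data structures and the selection policy are simplified inventions for illustration, not the actual NameNode logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch only: the types and policy are invented, not NameNode code.
public class ReplicationMonitorSketch {

    /**
     * For every block, drop replicas held by failed nodes and, if the block is now
     * under-replicated, schedule new copies on live DataNode servers.
     */
    public static List<String> planReReplication(Map<String, Set<String>> blockLocations,
                                                 Set<String> liveDataNodes,
                                                 int replicationFactor) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : blockLocations.entrySet()) {
            Set<String> holders = entry.getValue();
            holders.retainAll(liveDataNodes);          // forget replicas on failed nodes

            int missing = replicationFactor - holders.size();
            for (String candidate : liveDataNodes) {
                if (missing <= 0) break;
                if (holders.add(candidate)) {          // a live node that lacks this block
                    plan.add("copy block " + entry.getKey() + " to " + candidate);
                    missing--;
                }
            }
        }
        return plan;
    }
}
```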

    Replica Placement Strategy:


    Data is stored as a sequence of fixed-size blocks. Copies of the blocks (replicas) are stored on several servers, three by default. They are placed as follows:

    - the first replica is placed on the local node
    - the second replica on another node in the same rack
    - the third replica on an arbitrary node in another rack
    - the remaining replicas are placed arbitrarily

    When reading data, the client selects the DataNode server holding a replica that is closest to it.
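    A simplified sketch of the placement rules listed above, assuming a ready-made node-to-rack map; the node and rack names, the map itself, and the method names are invented for the example and do not reproduce the real placement code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative placement policy following the rules described in the text.
public class ReplicaPlacementSketch {

    public static List<String> chooseTargets(String writerNode,
                                             Map<String, String> nodeToRack,
                                             int replicas) {
        Set<String> targets = new LinkedHashSet<>();
        targets.add(writerNode);                                    // 1st: local node
        String localRack = nodeToRack.get(writerNode);

        for (Map.Entry<String, String> e : nodeToRack.entrySet()) { // 2nd: other node, same rack
            if (targets.size() >= 2 || targets.size() >= replicas) break;
            if (e.getValue().equals(localRack) && !e.getKey().equals(writerNode)) {
                targets.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) { // 3rd: any node, other rack
            if (targets.size() >= 3 || targets.size() >= replicas) break;
            if (!e.getValue().equals(localRack)) {
                targets.add(e.getKey());
            }
        }
        for (String node : nodeToRack.keySet()) {                   // rest: any remaining node
            if (targets.size() >= replicas) break;
            targets.add(node);
        }
        return new ArrayList<>(targets);
    }
}
```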

    Data Integrity:


    The relaxed data integrity model implemented in the file system does not guarantee that replicas are identical, so HDFS shifts data integrity checking onto the clients. When a file is created, the client computes checksums every 512 bytes, and these are subsequently stored on the DataNode server. When a file is read, the client receives the data together with the checksums, and if they do not match, it turns to another replica.
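    To make the 512-byte idea concrete, here is a small sketch that computes one CRC32 checksum per 512-byte chunk on write and re-verifies the chunks on read; it only illustrates the chunking scheme and does not reproduce HDFS's actual checksum format or storage layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Per-chunk checksumming illustration (512 bytes per checksum, as in the text).
public class ChecksumSketch {
    static final int BYTES_PER_CHECKSUM = 512;

    /** Compute one CRC32 value for every 512-byte chunk of the data. */
    public static List<Long> computeChecksums(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    /** Re-compute the checksums and compare; a mismatch means the replica is corrupt. */
    public static boolean verify(byte[] data, List<Long> expected) {
        return computeChecksums(data).equals(expected);
    }
}
```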

    Data Recording:


    When writing data to HDFS, an approach aimed at achieving high throughput is used. The application writes in streaming mode, while the HDFS client caches the written data in a temporary local file. When the file accumulates enough data for one HDFS block, the client contacts the NameNode server, which registers the new file, allocates a block, and returns to the client a list of DataNode servers that will store the block replicas. The client begins transferring the block data from the temporary file to the first DataNode server in the list. That DataNode server saves the data on disk and forwards it to the next DataNode server in the list. In this way the data is transmitted in pipelined mode and replicated on the required number of servers. When the write is finished, the client notifies the NameNode server, which records the file creation transaction.
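    From the client's point of view all of this pipelining is hidden behind an ordinary output stream. A minimal write sketch using the Hadoop Java client API; the NameNode URI and the file path are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster address; block size and replication factor
        // come from the cluster configuration unless overridden.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"));
        out.writeBytes("streamed data goes here\n"); // buffered by the client,
                                                     // shipped to DataNodes block by block
        out.close();                                 // NameNode records the file creation
        fs.close();
    }
}
```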

    Data Deletion:


    For the sake of data safety (so that a deletion can be rolled back), deletion in the file system follows a specific procedure. The file is first moved to the /trash directory specially designated for this purpose, and after a certain time has elapsed it is physically deleted:

    - the file is deleted from the HDFS namespace
    - the blocks associated with the data are released
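    A short sketch of deleting a file through the client API. FileSystem.delete removes the file from the namespace; whether it first passes through /trash depends on the cluster's trash settings rather than on this call. The URI and path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Remove the file from the HDFS namespace; the blocks that belonged
        // to it are released by the NameNode and reclaimed on the DataNodes.
        boolean deleted = fs.delete(new Path("/user/demo/sample.txt"), false);
        System.out.println("deleted: " + deleted);
        fs.close();
    }
}
```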

    Current disadvantages:


    - no automatic restart of the main server after its failure (this functionality is implemented in GFS)
    - no append operation (planned for version 0.19.0) and no snapshot operation (both functions are implemented in GFS)

    You can read about what is coming in future versions of HDFS on the project wiki on the Apache Foundation website. Additional information and the opinions of people working with Hadoop can be found in the blogs of companies actively using this technology: Yahoo, A9, Facebook, Last.fm, Laboratory

    Sources:


    - Dhruba B. Hadoop Distributed File System, 2007
    - Tom W. A Tour of Apache Hadoop
    - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System
    - O. Sukhoroslov. New Technologies for Distributed Storage and Processing of Large Amounts of Data

    This article is introductory; its purpose is to give the reader a feel for the current developments in this area. In case of positive feedback and/or reader interest, we will prepare a number of additional related articles:
    • Installing Hadoop Core + HBase on Windows OS (+ a PHP class that implements interaction with HBase via the REST API)
    • A translation of the article “MapReduce: Simplified Data Processing on Large Clusters”
