HDFS has proven to be a scalable, fault-tolerant, distributed storage solution that is quickly being adopted across industries. Distributed storage, combined with the ability to scale out linearly, makes the entire Hadoop framework very cost effective.
With constant improvements in the Hadoop ecosystem, we are seeing new and exciting features added to the various components running on top of Hadoop – HDFS being one of them. Heterogeneous storage support in HDFS is one such major addition.
What is HDFS?
HDFS is a Java-based distributed file system that provides high-throughput access to application data. It provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks.
HDFS combined with YARN provides an ideal data and computing platform in the form of Apache Hadoop – a framework that can be used reliably for enterprise data. Add to this the linear growth of storage that can be achieved simply by adding more “commodity” hardware to the cluster.
Apache Hadoop – Brief Journey
We recently celebrated 10 years of Apache Hadoop – a journey that has seen tremendous improvement in the reliability and overall performance of the framework. Hadoop was specifically designed to run on commodity-grade hardware, keeping in mind that failures are bound to happen in a cluster. The distributed processing power combined with robust, durable storage triggered a lot of interest.
The framework, however, was typically deployed on homogeneous computing nodes, partly because of the problems posed by uneven computing capacity in a heterogeneous cluster. Over time, improvements have made it possible for Hadoop to run seamlessly in a heterogeneous cluster as well.
Hadoop today is moving fast to meet enterprise-grade requirements. One of them happens to be support for the variety of hardware found in enterprise data centers.
Apache Hadoop – Storage Model
The hardware horizon is changing fast as well. We are seeing a new class of nodes with higher storage density and lower processing power. These nodes are designed with just one goal: more storage at a lower price, with processing power as the trade-off. At the other end of the spectrum, we are seeing faster media such as SSDs, which provide much lower latency. Applications with random I/O patterns can benefit greatly if they can use these media.
Another important aspect of a storage medium to consider is throughput. The performance of ad-hoc and batch-processing applications is largely governed by the throughput of the storage medium they read from and write to.
Until Hadoop 2.3, HDFS used a single storage model: all the data directories attached to a DataNode were treated as one uniform storage medium.
Apache Hadoop – Heterogeneous Storage Model
With the release of Hadoop 2.3, HDFS started supporting a heterogeneous storage model. So what does the HDFS heterogeneous storage model really do?
It allows you to declare the type of storage medium used on the nodes in the cluster. The notion of a Storage Type identifies the kind of underlying storage medium.
Let us see how this fits into the whole picture. A few lines earlier I mentioned the dense storage nodes with large capacity and little processing power. I can now add these nodes to my cluster. Since they are not really meant for computing, I will use them for storage only – holding infrequently accessed or unused data, which I can term archive data.
The HDFS heterogeneous storage model allows me to declare the kind of data directory available on each node in my cluster. In the case above, I can say that the storage type on the node is ARCHIVE.
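As an illustration, a data directory is tagged with its storage type by prefixing the path in the dfs.datanode.data.dir property of hdfs-site.xml on the DataNode. The paths below are hypothetical examples, not values from any particular cluster:

```xml
<!-- hdfs-site.xml (DataNode): tag each data directory with its storage type. -->
<!-- Untagged directories default to DISK. Paths are illustrative only. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/grid/hdfs/data,[ARCHIVE]/grid/hdfs/archive,[SSD]/grid/hdfs/ssd</value>
</property>
```

The DataNode then reports each directory's storage type to the NameNode, which uses it when placing replicas.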
The Hadoop 2.6 release made further improvements to the heterogeneous storage model with its phase 2 implementation, adding support for storage types such as SSD and RAM_DISK. At the time of writing this article, the HDFS heterogeneous storage model supports four storage types – ARCHIVE, DISK, SSD and RAM_DISK – with DISK being the default.
Also, starting with the Hadoop 2.6 release, you can attach a Storage Policy to a directory or a file in HDFS. A storage policy defines how many replicas are to be placed on each tier, or storage type. You can remove or change the policy on a directory or file. However, at this time you cannot modify existing policies or define custom ones.
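A quick sketch of attaching and inspecting a policy, assuming a running HDFS cluster; /data/archive is a hypothetical path. (The hdfs storagepolicies subcommands shown here are the syntax from Hadoop 2.7 onwards; in the 2.6 release the equivalent operations were exposed as dfsadmin subcommands.)

```shell
# Attach the Cold policy to a directory so new replicas land on ARCHIVE storage
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD

# Verify which policy is in effect on that path
hdfs storagepolicies -getStoragePolicy -path /data/archive
```

Note that changing a policy only affects where future blocks are written; it does not move blocks that are already on disk.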
Heterogeneous Storage Model – Use cases
The HDFS heterogeneous storage model opens up some interesting avenues. We can now use storage policies such as Hot, Cold, Warm and All_SSD to control the placement of replicas.
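On a running cluster, you can see exactly which policies are available with the following command (a sketch; it requires a live NameNode to answer):

```shell
# List all storage policies the cluster supports, with their replica placement rules
hdfs storagepolicies -listPolicies
```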
For example, I can set the All_SSD storage policy on the HDFS directory that holds all the HBase data. HBase applications can then see a significant boost in performance from the improved IOPS.
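A sketch of that HBase scenario, again assuming a running cluster; /apps/hbase/data is a hypothetical HBase root directory, not a required location:

```shell
# Pin the HBase data directory to SSD-backed storage
hdfs storagepolicies -setStoragePolicy -path /apps/hbase/data -policy ALL_SSD

# Existing blocks are not relocated automatically; the Mover tool (Hadoop 2.6+)
# migrates already-written replicas to match the new policy
hdfs mover -p /apps/hbase/data
```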
Or I can mark data as ‘Cold’ – data that I want to persist for a long time without accessing it frequently.
Or I can use RAM storage to hold a replica for higher throughput and IOPS.
With the ever-increasing adoption of Hadoop and HDFS as a reliable and durable storage solution, heterogeneous storage gives us the ability to utilize the cluster intelligently.