Hadoop 2.0 – Storage Consumption
With the Hortonworks Data Platform on Nutanix solution, you have the flexibility to start small with a single block and scale up incrementally, a node at a time. This provides the best of both worlds: the ability to start small and grow to massive scale without any impact on performance.
The diagram below shows a typical workflow when a client starts a job that uses MapReduce. We want to focus on what happens when a DataNode writes to disk.
Hadoop 2.0 Workflow
1. Client submits a job
2. Response with ApplicationID
3. Containers Launch Context
4. Start ApplicationMaster
5. Get Capabilities
6. Request / Receive Containers
7. Container Launch Requests
8. Data being written
In step 8 of Figure 9, Node 1 is writing to its local disk and creating local copies. By default, DFS replication is set to 3, meaning that for every piece of data written, three copies are created. The first copy is stored on the local node (A1), the second copy is placed off-rack where possible, and the third copy is placed on a random node in the same rack as the second copy. This placement provides data availability and allows multiple nodes to use the copies of the data to parallelize their work for faster results. When new jobs are run, NodeManagers are selected where the data resides to reduce network congestion and increase performance. RF3 with Hadoop carries an overhead of 3X.
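The replication arithmetic above can be sketched in a few lines (a minimal illustration of the storage math, not Hadoop API code):

```python
def hdfs_raw_consumed_gb(logical_gb, dfs_replication=3):
    """Raw disk consumed by HDFS for a logical dataset.

    With the default dfs.replication of 3, every block is written
    three times: once on the local node (A1), once off-rack where
    possible, and once on a random node in the second copy's rack.
    """
    return logical_gb * dfs_replication

# 10 TB of logical data consumes 30 TB of raw disk at RF3.
print(hdfs_raw_consumed_gb(10_000))  # 30000
```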
Hadoop 2.0 on Nutanix- Storage Consumption
Both Hadoop and Nutanix have similar architectures built around data locality and the use of a replication factor for availability and throughput. This section describes the impact of changing the replication factor on HDFS and ADSF.
Test & Development Environments
For test and development environments, the HDFS replication factor can be set to 1. Because the performance requirements are lower, you can drop the value and save on storage consumption. With the Acropolis replication factor set to 2, availability is handled by ADSF.
Hadoop on ADSF Parameters for Test/Dev
Item | Detail | Rationale
HDFS Replication Factor (RF) | 1 | Performance is less critical; data availability is handled by Nutanix
Acropolis Replication Factor (RF) | 2 | Data availability
In the diagram above, once the local DataNode writes A1, ADSF creates B1 locally and places the second copy based on Nutanix availability domains. Because the Hadoop DataNodes only have knowledge of the A1 copy, you can use Acropolis High Availability (HA) to quickly restart the NameNode in the event of a failure. With this configuration, the HDFS/ADSF solution has an overhead of 2X.
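Because each HDFS replica is itself replicated by the storage fabric, the end-to-end overhead is simply the product of the two replication factors. A small sketch of that arithmetic:

```python
def end_to_end_overhead(hdfs_rf, adsf_rf):
    """Raw bytes stored per logical byte when HDFS runs on ADSF:
    each HDFS replica is itself replicated by the storage fabric.
    """
    return hdfs_rf * adsf_rf

# Test/dev: HDFS RF1 on ADSF RF2 -> 2X overhead,
# versus 3X for traditional bare-metal Hadoop at RF3.
print(end_to_end_overhead(1, 2))  # 2
```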
In production environments, a minimum of HDFS RF2 should be used so the NameNode has multiple options for placing containers where YARN can work with local data. RF2 on HDFS also improves job reliability if a physical node or VM goes down due to an error or maintenance. Using the recommendations below, YARN jobs can quickly restart via Hadoop's built-in mechanisms while retaining enterprise-class data availability with ADSF.
Hadoop on ADSF Parameters for Production
Item | Detail | Rationale
HDFS Replication Factor (RF) | 2 | Hadoop job reliability and parallelization
Acropolis Replication Factor (RF) | 2 | Data availability
In the diagram above, once the local DataNode writes A1, ADSF creates B1 locally and places the second copy based on Nutanix availability domains. HDFS also writes A2, so the same process repeats, with C1 and C2 created synchronously. Because the Hadoop DataNodes have knowledge of both A1 and A2, both copies can be used for task parallelization.
In this environment you would potentially have one extra copy of data versus traditional Hadoop. To address the extra storage consumption, you can apply EC-X. As an example, a 30-node Hadoop cluster built with NX-6235 nodes would have ~900 TB of raw capacity. If you set the EC-X strip width to 18/1, you can calculate the overhead as follows.
Usable Storage = ((20% × Total raw capacity / ADSF RF overhead) + (80% × Total raw capacity / EC-X overhead)) / HDFS RF
Usable Storage = ((0.2 × 9252 GB / 2) + (0.8 × 9252 GB / 1.06)) / 2
Usable Storage = (925.2 GB + 6982.6 GB) / 2
Usable Storage = 7907.8 GB / 2
Usable Storage = 3953.9 GB
Therefore 9252 GB / 3953.9 GB = 2.34X overhead, which is less than traditional Hadoop's 3X.
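The calculation above can be reproduced with a short script. This is a sketch of the worked example only: the 20%/80% RF/EC-X split and the ~1.06 overhead for an 18/1 strip are taken from the example, not from any Nutanix sizing tool.

```python
def usable_storage_gb(raw_gb, adsf_rf=2, hdfs_rf=2,
                      ecx_overhead=1.06, rf_fraction=0.2):
    """Usable HDFS capacity on Nutanix with EC-X enabled.

    Assumes 20% of raw capacity stays fully RF-protected and the
    remaining 80% is erasure coded; the example uses ~1.06 overhead
    for an 18/1 EC-X strip.
    """
    rf_part = rf_fraction * raw_gb / adsf_rf
    ecx_part = (1 - rf_fraction) * raw_gb / ecx_overhead
    return (rf_part + ecx_part) / hdfs_rf

usable = usable_storage_gb(9252)
print(round(usable, 1))         # 3953.9 GB usable
print(round(9252 / usable, 2))  # 2.34X effective overhead
```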
Nutanix provides the ideal combination of compute and high-performance local storage, delivering the best possible architecture for Hadoop and other distributed applications and giving you more space to perform business analytics.