Archives for September 2015


Make Hadoop More Resilient and Space Efficient with HDP and Nutanix

Hadoop 2.0 – Storage Consumption

With the Hortonworks Data Platform on Nutanix solution you have the flexibility to start small with a single block and scale out incrementally, one node at a time. This provides the best of both worlds: the ability to start small and grow to massive scale without any impact on performance.

The diagram below shows a typical workflow when a client submits a MapReduce job. We want to focus on what happens when a DataNode writes to disk.


Hadoop 2.0 Workflow
1. Client submits a job
2. Response with ApplicationID
3. Container Launch Context
4. Start ApplicationMaster
5. Get Capabilities
6. Request / Receive Containers
7. Container Launch Requests
8. Data being written

In step 8 from Figure 9, Node 1 writes to its local disk and creates local copies. By default, DFS replication is set to 3, which means that for every piece of data that is created, 3 copies are stored. The 1st copy is stored on the local node (A1), the 2nd copy is placed off-rack if possible, and the 3rd copy is placed randomly in the same rack as the 2nd copy. This is done for data availability and allows multiple nodes to use the copies of data to parallelize their efforts and get fast results. When new jobs are run, NodeManagers are selected where the data resides to reduce network congestion and increase performance. RF3 with Hadoop therefore carries a 3X storage overhead.
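The 3X overhead is simple to quantify. A minimal sketch (the function name and the 100 GB figure are illustrative, not from any Hadoop API):

```python
def raw_consumption_gb(logical_data_gb, dfs_replication=3):
    """Physical storage consumed by HDFS for a given amount of logical data.

    With the default dfs.replication of 3, every block is stored three
    times: once on the local node (A1), once off-rack if possible, and
    once more in the same rack as the second copy.
    """
    return logical_data_gb * dfs_replication

# 100 GB of logical data consumes 300 GB of raw storage: a 3X overhead.
print(raw_consumption_gb(100))        # 300
print(raw_consumption_gb(100) / 100)  # 3.0
```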

Hadoop 2.0 on Nutanix- Storage Consumption

Both Hadoop and Nutanix have similar architectures around data locality, using replication factor for availability and throughput. This section gives a good idea of the impact of changing the replication factor on HDFS and ADFS.

Test & Development Environments

For test and development environments the HDFS replication factor can be set to 1. Since the performance requirement is lower, you can drop the value and save on storage consumption. With the Acropolis replication factor set to 2, availability is handled by ADFS.

Hadoop on ADFS Parameters for Test/Dev

Item                                 Detail    Rationale
HDFS Replication Factor (RF)         1         Performance isn't as important; data availability handled by Nutanix
Acropolis Replication Factor (RF)    2         Data availability
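The HDFS side of this table maps to a single standard property. A sketch of the relevant hdfs-site.xml entry (dfs.replication is the stock HDFS setting; the Acropolis RF is configured on the Nutanix storage container, not in Hadoop):

```xml
<!-- hdfs-site.xml: test/dev - rely on Nutanix RF2 for availability -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```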


In the above diagram, once the local DataNode writes A1, ADFS will create B1 locally and will create the 2nd copy based on Nutanix availability domains. Since the Hadoop DataNodes only have knowledge of the A1 copy, you can use Acropolis High Availability (HA) to quickly restart your NameNode in the event of a failure. With this configuration the HDFS/ADFS solution has an overhead of 2X.

Production Environments

In production environments a minimum of HDFS RF 2 should be used, so that the scheduler has multiple options for placing YARN containers next to local data. RF2 on HDFS also helps with job reliability if a physical node or VM goes down due to error or maintenance. YARN jobs can quickly restart using the built-in mechanisms, and with the recommendations below you get enterprise-class data availability with ADFS.

Hadoop on ADFS Parameters for Production

Item                                 Detail    Rationale
HDFS Replication Factor (RF)         2         Hadoop job reliability and parallelization
Acropolis Replication Factor (RF)    2         Data availability
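As with test/dev, the HDFS side of this recommendation is the same standard property, just set to 2 (again a sketch of the hdfs-site.xml entry; the Acropolis RF lives on the Nutanix storage container):

```xml
<!-- hdfs-site.xml: production - RF2 for job reliability and parallelization -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```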


In the above diagram, once the local DataNode writes A1, ADFS will create B1 locally and will create the 2nd copy based on Nutanix availability domains. HDFS also writes A2, so the same process happens with C1 and C2 being created synchronously. Since the Hadoop DataNodes have knowledge of both A1 and A2, both copies can be used for task parallelization.
In this environment you would potentially have 1 extra copy of data versus traditional Hadoop. To address the extra storage consumption you can apply EC-X. As an example, you may have a 30-node Hadoop cluster built from NX-6235 nodes with ~900 TB of raw capacity. If you set the EC-X strip width to 18/1, the overhead works out as follows.

Useable Storage = ((20% × Total RAW capacity / ADSF RF Overhead) + (80% × Total RAW capacity / EC-X Overhead)) / HDFS RF
Useable Storage = ((0.2 × 9252 GB / 2) + (0.8 × 9252 GB / 1.06)) / 2
Useable Storage = (925.2 GB + 6982.6 GB) / 2
Useable Storage = 7907.8 GB / 2
Useable Storage = 3953.9 GB
Therefore the overhead is 9252 GB / 3953.9 GB = 2.34X, which is less than traditional Hadoop's 3X.
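The arithmetic above can be checked with a short script (the 20/80 hot/cold split, the 1.06 EC-X overhead for an 18/1 strip, and the 9252 GB raw figure are the values used in the example):

```python
raw_gb = 9252        # raw capacity used in the example
adsf_rf = 2          # Acropolis replication factor
ecx_overhead = 1.06  # 18/1 strip: 18 data strips + 1 parity (~19/18)
hdfs_rf = 2          # HDFS replication factor in production

hot = 0.20 * raw_gb / adsf_rf        # hot data kept at Acropolis RF2
cold = 0.80 * raw_gb / ecx_overhead  # cold data protected by EC-X
usable_gb = (hot + cold) / hdfs_rf

print(round(usable_gb, 1))           # 3953.9
print(round(raw_gb / usable_gb, 2))  # 2.34 (overhead factor)
```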

Nutanix provides the ideal combination of compute and high-performance local storage, delivering the best possible architecture for Hadoop and other distributed applications while giving you more space to perform business analytics.


Why virtualize Hadoop nodes on the Nutanix Xtreme Computing Platform?


o Make Hadoop an App: Prism’s HTML 5 user interface makes managing infrastructure pain-free with one-click upgrades. Integrated data protection can be used to manage golden images for Hadoop across multiple Nutanix clusters. Painful firmware upgrades are easily addressed and time saved.
o No Hypervisor Tax: The Acropolis Hypervisor is included with all Nutanix clusters. Acropolis High Availability and automated Security Technical Implementation Guides (STIG) keep your data available and secure.
o Hardware utilization: Bare-metal Hadoop deployments average 10-20% CPU utilization, a major waste of hardware resources and datacenter space. Virtualizing Hadoop allows for better hardware utilization and flexibility. Virtualization can also help right-size your solution: if your job completion times are meeting their windows, there is no need to buy more hardware; if more resources are needed, they can easily be adjusted.
o Elastic MapReduce and scaling: Dynamic addition and removal of Hadoop nodes based on load allow you to scale based upon your current needs, not what you expect, keeping supply and demand in step. Hadoop DataNodes can be cloned out in seconds.
o DevOps: Big Data scientists demand performance, reliability, and a flexible scale model. IT operations relies on virtualization to tame server sprawl, increase utilization, encapsulate workloads, manage capacity growth, and alleviate disruptive outages caused by hardware downtime. By virtualizing Hadoop, Data Scientists and IT Ops mutually achieve all objectives while preserving autonomy and independence for their respective responsibilities.
o Sandboxing of jobs: Buggy MapReduce jobs can quickly saturate hardware resources, creating havoc for remaining jobs in the queue. Virtualizing Hadoop clusters encapsulates and sandboxes MapReduce jobs from other important sorting runs and general purpose workloads.
o Batch Scheduling & Stacked workloads: Allow all workloads and applications to co-exist, e.g. Hadoop, Virtual Desktops and Servers. Schedule job runs during off-peak hours to take advantage of idle night time and weekend hours that would otherwise go to waste. Nutanix also allows you to bypass the flash tier for sequential workloads, which avoids the time it takes to rewarm the cache for mixed workloads.
o New Hadoop economics: Bare metal implementations are expensive and can spiral out of control. Downtime and underutilized CPUs on physical servers can jeopardize project viability. Virtualizing Hadoop reduces complexity and ensures success for sophisticated projects with a scale-out, grow-as-you-go model – a perfect fit for Big Data projects.
o Blazing fast performance: Up to 3,500 MB/s of sequential throughput in a compact 2U 4-node cluster. A TeraSort benchmark yields 529 MB/s in the same 2U cluster.
o Unified data platform: Run multiple data processing platforms along with Hadoop YARN on a single unified data platform, Acropolis Distributed File System (ADFS).
o Flash SSDs for NoSQL: The summaries that roll up to a NoSQL database like HBase are used to run business reports and are typically memory and IOPS-heavy. Nutanix couples SSD tiers with dense memory capacities, and its automatic tiering technology can transparently bring IOPS-heavy workloads to the SSD tier.
o Analytic High-density Engine: With the Nutanix solution you can start small and scale. A single Nutanix block comes packed with up to 40TB of storage and 96 cores in a compact 2U footprint. Given the modularity of the solution, you can granularly scale per-node (up to ~10TB/24 cores), per-block (up to ~40TB/96 cores), or with multiple blocks, giving you the ability to accurately match supply with demand and minimize the upfront CapEx.
o Change management: Maintain environmental control and separation between development, test, staging, and production environments. Snapshots and fast clones can help in sharing production data with non-production jobs, without requiring full copies and unnecessary data duplication.
o Business continuity and data protection: Nutanix can provide replication across sites for additional protection of the NameNode and DataNodes. Replication can be set up to avoid sending wasteful temporary data across the WAN by using per-VM replication and container-based replication.
o Data efficiency: The Nutanix solution is truly VM-centric for all compression policies. Unlike traditional solutions that perform compression mainly at the LUN level, the Nutanix solution provides all of these capabilities at the VM and file level, greatly increasing efficiency and simplicity. These capabilities ensure the highest possible compression/decompression performance on a sub-block level. While developers may or may not run jobs with compression, IT Operations can ensure cold data is effectively stored. Nutanix Erasure Coding can also be applied on top of compression savings.
o Automatic Auto-Leveling and Auto-Archive: Nutanix spreads data evenly across the cluster, ensuring local drives don’t fill up and cause an outage while space is still available elsewhere. Using Nutanix cold storage nodes, cold data can be moved off compute nodes, freeing up room for hot data while not consuming additional licenses.
o Time-sliced clusters: Like public cloud EC2 environments, Nutanix can provide a truly converged cloud infrastructure allowing you to run your Hadoop, server and desktop virtualization on a single converged cloud. Get the efficiency and savings you require with a converged cloud on a truly converged architecture.


Why is Nutanix delivering the Acropolis Hypervisor?

CEO Dheeraj Pandey talks about why Nutanix is delivering another hypervisor, and about Prism.