Hadoop For Free: Skip Flash

I see more and more people looking at getting started with Hadoop, but it can be risky if you don’t have the skills and time in your organization. To top that off, you have to buy new equipment that will most likely only be used 20% of the time.

It takes more resources and time when deploying Hadoop for the first time.


Nutanix has always supported mixed workloads, but Hadoop can be blunt force trauma on storage for a variety of reasons:

1) Hadoop was never built for shared storage. It was architected around data locality, which is also a core architectural design feature of Nutanix.

2) Large ingest and working sets can destroy the storage cache for your remaining workloads. With Nutanix you can bypass the flash tier for sequential traffic. If the workloads were sized properly, the HDDs are usually relatively inactive, as they are meant to be cold storage. Using the idle disks for Hadoop gives infrastructure and Hadoop teams the ability to test the waters before carrying on.

In the case of customers running the NX-8150, they might never need to buy extra nodes for compute. With 20 HDDs at your disposal, the raw disk gives great performance without flash. If performance is fine running just from HDD, you can save additional cost by adding storage-only nodes. The storage-only nodes don’t require additional licensing from Cloudera or Hortonworks.

While the HDDs are idle, the Hadoop admins will play.


Performance on Cloudera with 4 nodes of NX-8150 using no flash

Green = Writes Blue = Reads


In the above case CPU was only at 50%, so you could run additional workloads even while Hadoop was running. If your goal is just test/dev, you can also set the HDFS replication factor to 1, since Nutanix already provides enterprise-class redundancy. When you add in erasure coding, the effective storage overhead will be less than 2X, compared to 3X with traditional Hadoop.


Please hit me up on Twitter @dlink7 if you have any questions.



Splunk on the Nutanix Acropolis Hypervisor

With Nutanix, customers can start their Splunk deployment small and then scale out the infrastructure as needed to meet data ingest and retention requirements. And the Nutanix Xtreme Computing Platform ensures that the system will remain available, providing consistent ingest, indexing, and search performance. Administrators can focus on Splunk and the application, not on the infrastructure, adding compute and storage resources transparently to the cluster as the environment grows.

Running Splunk on Acropolis is simple and secure, and it can scale.


If you can open a web browser and log into Prism, you’re good to go. There is no hypervisor management to set up, and all scaling and data migration issues are handled by the storage fabric.



Nutanix believes that a system is only as secure as the last time its security posture was checked. Nutanix provides a Security Technical Implementation Guide (STIG) that uses machine-readable code to automate compliance against rigorous common standards. You can quickly assess and remediate your platform to ensure that all regulatory requirements are met or exceeded.


With the Nutanix solution you can start small and scale. A single Nutanix block (up to four nodes) provides 20-40+ TB of storage and 96+ cores in a compact footprint. Given the modularity of the solution, you can scale per node, giving you the ability to accurately match supply with demand and minimize the upfront CapEx. If the performance of the solution already meets your requirements, you can simply add storage nodes and save cash for another project.

Download and read all the work Steve Poitras put into the Splunk on Nutanix document to get all of your sizing information.

< Splunk on Nutanix >

If you can’t be bothered to read, hit up the sizer and it will do all the work for you!



Nutanix Top Of Rack Integration: Video

Quick demo of the Top of Rack integration inside of Prism.


The Easy Button For Hadoop

The Nutanix team just finished wrapping up the latest revision of Hadoop on Nutanix. Personally, I’ve had a great time working on the newest reference architecture for Hadoop running on top of the Acropolis hypervisor. It was fun to work with so many people across the company to get the testing and validation finished. Engineering, Solutions, Alliances, Marketing and SEs alike all helped at some point.

< read more here >


Make Hadoop an Application, Not an Adventure.

In a couple of weeks we’ll discuss how to bring blazing fast performance to your analytic jobs while empowering your application team to deploy additional resources and make changes on the fly. The first webinar will start on November 3, 2015 at 7:00 AM PST.

The big added bonus is that Andy Nelson is joining me on the webinar. Andy is a Distributed Systems Specialist at Nutanix and has also worked in the Office of the CTO at VMware. He brings a wealth of knowledge on Hadoop and a variety of open source projects, and can even talk a mean game of containers. Andy has also been a speaker at Hadoop World and can speak to this ever-changing landscape.

Sign up for the webinar -> HERE

Andy’s Last Talk at Hadoop World: Scalable On Demand Hadoop Clusters with Docker and Mesos


Make Hadoop More Resilient and Space Efficient with HDP and Nutanix

Hadoop 2.0 – Storage Consumption

With the Hortonworks Data Platform on Nutanix solution you have the flexibility to start small with a single block and scale up incrementally a node at a time. This provides the best of both worlds–the ability to start small and grow to massive scale without any impact to performance.

The below diagram shows a typical workflow when a client starts a job that is using MapReduce. We want to focus on what happens when a DataNode writes to disk.


Hadoop 2.0 Workflow
1. Client submits a job
2. Response with ApplicationID
3. Containers Launch Context
4. Start ApplicationMaster
5. Get Capabilities
6. Request / Receive Containers
7. Container Launch Requests
8. Data being written

In step 8 of Figure 9, Node 1 is writing to the local disk and creating local copies. By default, DFS replication is set to 3, which means that three copies are created for every piece of data. The first copy is stored on the local node (A1), the second copy is placed off rack if possible, and the third copy is placed on a random node in the same rack as the second copy. This is done for data availability, and it allows multiple nodes to use the copies of data to parallelize their efforts and get fast results. When new jobs are run, NodeManagers are selected where the data resides to reduce network congestion and increase performance. RF3 with Hadoop has a storage overhead of 3X.
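The placement rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the actual HDFS block placement code; the function name, the rack and node names, and the fallback behavior for small clusters are all assumptions made for the example.

```python
import random


def place_replicas(writer, topology, rng=None):
    """Choose three nodes for one block: local, off rack, same rack as the second.

    `topology` maps a rack name to its list of node names (all names are made up).
    """
    rng = rng or random.Random()
    all_nodes = [node for nodes in topology.values() for node in nodes]
    if writer not in all_nodes or len(all_nodes) < 3:
        raise ValueError("need a known writer and at least three nodes")
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)

    # 1st copy: the node doing the write.
    first = writer

    # 2nd copy: prefer a node on another rack; fall back to the local rack.
    remote_racks = [r for r, nodes in topology.items() if r != local_rack and nodes]
    second_rack = rng.choice(remote_racks) if remote_racks else local_rack
    second = rng.choice([n for n in topology[second_rack] if n != first])

    # 3rd copy: a random different node on the same rack as the 2nd copy,
    # or any remaining node if that rack has no other node.
    candidates = [n for n in topology[second_rack] if n not in (first, second)]
    if not candidates:
        candidates = [n for n in all_nodes if n not in (first, second)]
    third = rng.choice(candidates)
    return [first, second, third]


# Hypothetical two-rack cluster; node1 plays the role of the writer (copy A1).
topology = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4", "node5"]}
print(place_replicas("node1", topology))
```

With a two-rack layout like this one, the first entry is always the writer and the other two land together on the remote rack, which is why losing a single rack never removes all three copies.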

Hadoop 2.0 on Nutanix- Storage Consumption

Both Hadoop and Nutanix have similar architectures built around data locality and the use of a replication factor for availability and throughput. This section gives a good idea of the impact of changing the replication factor on HDFS and ADFS.

Test & Development Environments

For test and development environments, the HDFS replication factor can be set to 1. Because the performance requirement is lower, you can drop the value and save on storage consumption. With the Acropolis replication factor set to 2, availability is handled by ADFS.

Hadoop on ADFS Parameters for Test/Dev

Item                                Detail   Rationale
HDFS Replication Factor (RF)        1        Performance isn’t as important; data availability is handled by Nutanix
Acropolis Replication Factor (RF)   2        Data availability


In the above diagram, once the local DataNode writes A1, ADFS creates B1 locally and creates the second copy based on Nutanix availability domains. Because the Hadoop DataNodes only know about the A1 copy, you can use Acropolis High Availability (HA) to quickly restart your NameNode in the event of a failure. With this configuration, the combined HDFS / ADFS solution has a storage overhead of 2X.
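The 2X figure follows from multiplying the two replication factors, since each copy HDFS writes is replicated again by the storage layer underneath. A minimal sketch of that arithmetic, using a hypothetical helper name and ignoring compression and erasure coding:

```python
def combined_overhead(hdfs_rf, acropolis_rf):
    """Raw capacity consumed per unit of user data when both layers replicate."""
    return hdfs_rf * acropolis_rf


scenarios = {
    "traditional Hadoop (HDFS RF3, no second layer)": combined_overhead(3, 1),
    "test/dev (HDFS RF1, Acropolis RF2)": combined_overhead(1, 2),
    "production before EC-X (HDFS RF2, Acropolis RF2)": combined_overhead(2, 2),
}
for name, overhead in scenarios.items():
    print(f"{name}: {overhead}X")
```

The production row shows why setting both layers to RF2 costs one more copy than traditional Hadoop until erasure coding is applied.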

Production Environments

In production environments, a minimum of HDFS RF 2 should be used so that there are multiple block locations where YARN can place containers to work with local data. RF2 on HDFS also helps with job reliability if a physical node or VM goes down due to error or maintenance. With the recommendations below, YARN jobs can quickly restart using the built-in mechanisms, and you get enterprise-class data availability from ADFS.

Hadoop on ADFS Parameters for Production

Item                                Detail   Rationale
HDFS Replication Factor (RF)        2        Hadoop job reliability and parallelization
Acropolis Replication Factor (RF)   2        Data availability


In the above diagram, once the local DataNode writes A1, ADFS creates B1 locally and creates the second copy based on Nutanix availability domains. HDFS also writes A2, so the same process happens with C1 and C2 being created synchronously. Because the Hadoop DataNodes know about both A1 and A2, both copies can be used for task parallelization.
In this environment you would potentially have one extra copy of data versus traditional Hadoop. To address the extra storage consumption you can apply EC-X. As an example, you may have a 30-node Hadoop cluster built from NX-6235 nodes, which would have ~900 TB of raw capacity. An 18/1 strip stores one parity block for every 18 data blocks, so the EC-X overhead is 19/18, or roughly 1.06. The overhead ratio does not depend on the amount of raw capacity, and the calculation below works through it with 9252 GB of raw capacity.

Usable Storage = ((20% * Total RAW capacity / ADFS RF Overhead) + (80% * Total RAW capacity / EC-X Overhead)) / HDFS RF
Usable Storage = ((0.2 * 9252 GB / 2) + (0.8 * 9252 GB / 1.06)) / 2
Usable Storage = (925.2 GB + 6982.6 GB) / 2
Usable Storage = 7907.8 GB / 2
Usable Storage = 3953.9 GB
Therefore 9252 GB / 3953.9 GB = 2.34X overhead, which is less than traditional Hadoop.
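If you want to try other strip widths or replication factors, the same arithmetic is easy to script. This is a rough sketch, not a sizing tool: the function and parameter names are made up for illustration, and it assumes that the 20% share is the data kept at the Acropolis replication factor while the remaining 80% is erasure coded.

```python
def usable_capacity(raw_gb, hdfs_rf=2, acropolis_rf=2,
                    ecx_overhead=19 / 18, replicated_share=0.2):
    """Usable capacity after Acropolis RF, EC-X and HDFS replication are applied.

    ecx_overhead is (data + parity) / data, so an 18/1 strip is 19/18.
    """
    replicated = replicated_share * raw_gb / acropolis_rf
    erasure_coded = (1 - replicated_share) * raw_gb / ecx_overhead
    return (replicated + erasure_coded) / hdfs_rf


raw = 9252  # GB, the raw capacity used in the calculation above

# Same rounded EC-X overhead (1.06) as the worked example.
rounded = usable_capacity(raw, ecx_overhead=1.06)
print(f"usable: {rounded:.1f} GB, overhead: {raw / rounded:.2f}X")

# Unrounded 19/18 overhead for an 18/1 strip.
exact = usable_capacity(raw)
print(f"usable: {exact:.1f} GB, overhead: {raw / exact:.2f}X")
```

With the rounded 1.06 value the first line reproduces the 3953.9 GB and 2.34X figures. Using the unrounded 19/18 gives a marginally better result, so the numbers above are slightly conservative.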

Nutanix provides the ideal combination of compute and high-performance local storage, providing the best possible architecture for Hadoop and other distributed applications and giving you more space to perform business analytics.


Why virtualize Hadoop nodes on the Nutanix Xtreme Computing Platform?


o Make Hadoop an App: Prism’s HTML 5 user interface makes managing infrastructure pain free with one-click upgrades. Integrated data protection can be used to manage golden images for Hadoop across multiple Nutanix clusters. Painful firmware upgrades are easily addressed, saving time.
o No Hypervisor Tax: The Acropolis Hypervisor is included with all Nutanix clusters. Acropolis High Availability and automated Security Technical Implementation Guides (STIGs) keep your data available and secure.
o Hardware utilization: Bare-metal Hadoop deployments average 10-20% CPU utilization, a major waste of hardware resources and datacenter space. Virtualizing Hadoop allows for better hardware utilization and flexibility. Virtualization can also help you right-size your solution. If your job completion times are meeting their windows, there is no need to buy more hardware. If more resources are needed, they can easily be adjusted.
o Elastic MapReduce and scaling: Dynamic addition and removal of Hadoop nodes based on load allows you to scale based upon your current needs, not what you expect. Enable supply and demand to be in true synergy. Hadoop DataNodes can be cloned out in seconds.
o DevOps: Big Data scientists demand performance, reliability, and a flexible scale model. IT operations relies on virtualization to tame server sprawl, increase utilization, encapsulate workloads, manage capacity growth, and alleviate disruptive outages caused by hardware downtime. By virtualizing Hadoop, data scientists and IT Ops mutually achieve all objectives while preserving autonomy and independence for their respective responsibilities.
o Sandboxing of jobs: Buggy MapReduce jobs can quickly saturate hardware resources, creating havoc for the remaining jobs in the queue. Virtualizing Hadoop clusters encapsulates and sandboxes MapReduce jobs from other important sorting runs and general-purpose workloads.
o Batch Scheduling & Stacked workloads: Allow all workloads and applications to co-exist, e.g. Hadoop, virtual desktops and servers. Schedule job runs during off-peak hours to take advantage of idle night-time and weekend hours that would otherwise go to waste. Nutanix also allows you to bypass the flash tier for sequential workloads, which avoids the time it would otherwise take to rewarm the cache for mixed workloads.
o New Hadoop economics: Bare-metal implementations are expensive and can spiral out of control. Downtime and underutilized CPUs, both consequences of running workloads on physical servers, can jeopardize project viability. Virtualizing Hadoop reduces complexity and ensures success for sophisticated projects with a scale-out, grow-as-you-go model that is a perfect fit for Big Data projects.
o Blazing fast performance: Up to 3,500 MB/s of sequential throughput in a compact 2U 4-node cluster. A TeraSort benchmark yields 529 MB/s in the same 2U cluster.
o Unified data platform: Run multiple data processing platforms along with Hadoop YARN on a single unified data platform, Acropolis Distributed File System (ADFS).
o Flash SSDs for NoSQL: The summaries that roll up to a NoSQL database like HBase are used to run business reports and are typically memory and IOPS-heavy. Nutanix couples SSD tiers with dense memory capacities, and its automatic tiering technology can transparently bring IOPS-heavy workloads to the SSD tier.
o Analytic High-density Engine: With the Nutanix solution you can start small and scale. A single Nutanix block comes packed with up to 40TB of storage and 96 cores in a compact 2U footprint. Given the modularity of the solution, you can granularly scale per node (up to ~10TB/24 cores), per block (up to ~40TB/96 cores), or with multiple blocks, giving you the ability to accurately match supply with demand and minimize the upfront CapEx.
o Change management: Maintain environmental control and separation between development, test, staging, and production environments. Snapshots and fast clones can help in sharing production data with non-production jobs, without requiring full copies and unnecessary data duplication.
o Business continuity and data protection: Nutanix can provide replication across sites to provide additional protection for the NameNode and DataNodes. Replication can be set up to avoid sending wasteful temporary data across the WAN by using per-VM replication and container-based replication.
o Data efficiency: The Nutanix solution is truly VM-centric for all compression policies. Unlike traditional solutions that perform compression mainly at the LUN level, the Nutanix solution provides all of these capabilities at the VM and file level, greatly increasing efficiency and simplicity. These capabilities ensure the highest possible compression/decompression performance on a sub-block level. While developers may or may not run jobs with compression, IT Operations can ensure cold data is effectively stored. Nutanix Erasure Coding can also be applied on top of compression savings.
o Automatic Auto-Leveling and Auto-Archive: Nutanix spreads data evenly across the cluster, ensuring local drives don’t fill up and cause an outage while space is still available elsewhere. Using Nutanix cold storage nodes, cold data can be moved off compute nodes, freeing up room for hot data while not consuming additional licenses.
o Time-sliced clusters: Like public cloud EC2 environments, Nutanix can provide a truly converged cloud infrastructure allowing you to run your Hadoop, server and desktop virtualization on a single converged cloud. Get the efficiency and savings you require with a converged cloud on a truly converged architecture.


My View of Hadoop Distributions from the Passenger Seat

This blog post comes from the passenger seat of my Yukon as I head to the lake. It’s a brief musing of thoughts on Hadoop that wouldn’t really fit into 140 characters.

Hadoop is here and making its way into the enterprise. According to IDC, data growth will explode 50X over the next decade. The 50X doesn’t even include all the data that will flow through your business. This data represents competitive advantage; you just need the ability to collect and analyze it.

Hadoop, known mostly for analyzing large datasets in a batch process, is rapidly changing. “Just In Time” processing is now a reality. SQL and NoSQL are getting mashed together, and data stored in HDFS no longer has to be moved out to be analyzed.

Battle of the Distros

MapR – Taking the approach of releasing a more proprietary release of Hadoop. Fast out of the gate, they seem to be doing well. My fear is that they will get too far down one path and won’t be able to use the power of the community. They do have committers on their team, so that should help. They also have partnerships with Google and Amazon.

Cloudera – A mix between proprietary and open source. They have seen success, and it can be attributed to the tools that they have built to help run and maintain their distribution. There is lots of talk about Impala and its super fast query performance against HDFS and HBase. Jeff Hammerbacher, previously from Facebook, gives them a lot of street cred.

Hortonworks – Taking the long-term approach, Hortonworks is 100% open source. They make their revenue off training and support services for their distribution. They have an Impala-like project called Stinger. The difference is that they are still using Hive, just speeding it up by orders of magnitude. I personally dig Hortonworks because they seem to have strong support around virtualizing Hadoop. I also like Hortonworks’ partnership with Microsoft, which is sure to help speed up SQL performance.

Intel – Seems to be focusing on security, making the best use of their CPUs and SSDs for compression and encryption. Personally, I don’t see how that gives them a leg up on the other distributions, as all of them could use their hardware. Intel seems to be going the OEM route, which is not surprising. I think their relationship with SAP will bode really well for them in the enterprise space.

Please leave your own thoughts; it’s a very interesting landscape.


Zero to Big Data in 15 Minutes – Journey to Hadoop

I recently downloaded the Hortonworks Sandbox on my MacBook Air. It was super simple and it really didn’t take any brain power to get going. In my quest to learn Hadoop, the first couple of tutorials ingrain the basics and the terminology needed to proceed. The download contains a virtual machine configured with Apache Hadoop and all the material to get you going. It was great to see Pig, Hive, HCatalog and HBase in action.

The downside is that I actually want to learn how many mappers I need to configure and how many reducers I need. I think those types of questions may take a data scientist, and I am not one. From working at Nutanix and following some other Hadoop companies, I would pick platforms that stay as close to the open source projects as possible. If you stray too far, when new tools come out you won’t be seeing them on your tool belt anytime soon. Hortonworks has figured this out.

If you have a spare 15 minutes, click the link below.