Hadoop For Free: Skip Flash

I see more and more people looking at getting started with Hadoop, but it can be risky if you don’t have the skills and time in your organization. On top of that, you have to buy new equipment that will most likely only be used 20% of the time.

Deploying Hadoop for the first time takes extra resources and time.


Nutanix has always supported mixed workloads, but Hadoop can be blunt-force trauma on storage for a variety of reasons:

1) Hadoop was never built for shared storage; it was architected around data locality, which is a core architectural design feature of Nutanix.

2) Large ingest and working sets can destroy the storage cache for your remaining workloads. With Nutanix you can bypass the flash tier for sequential traffic. If the workload was sized properly, the HDDs are usually relatively inactive, as they are meant to be cold storage. Using the idle disks for Hadoop gives infrastructure and Hadoop teams the ability to test the waters before carrying on.

In the case of customers running the NX-8150, they might never need to buy extra nodes for compute. With 20 HDDs at your disposal, the raw disk gives great performance without flash. If your performance is fine running just from HDD, you can save additional cost by adding storage-only nodes. The storage-only nodes don’t require additional licensing from Cloudera or Hortonworks.

While the HDDs are idle, the Hadoop admins will play.


Performance on Cloudera with 4 nodes of NX-8150 using no flash

(Chart legend: Green = Writes, Blue = Reads)

In the above case CPU was only at 50%, so you could run additional workloads even while Hadoop was running. If your goal is just Test/Dev, you can also set the HDFS replication factor to 1, since Nutanix provides enterprise-class redundancy already. When you add in erasure coding, the effective capacity overhead will be less than 2X, compared to 3X with traditional Hadoop.
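To put rough numbers on that capacity claim, here is a quick sketch in Python. The 4:1 erasure-coding stripe is an assumption for illustration; the actual stripe width depends on cluster size.

```python
def hdfs_raw_per_usable(replication=3):
    """Raw TB consumed per usable TB with plain HDFS replication."""
    return float(replication)

def nutanix_raw_per_usable(data_blocks=4, parity_blocks=1):
    """Raw TB per usable TB with HDFS replication set to 1 and
    redundancy provided by the storage layer's erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

# Traditional Hadoop: 3x raw capacity for every usable TB.
print(hdfs_raw_per_usable())     # 3.0
# HDFS replication=1 on Nutanix with an assumed 4:1 EC stripe: 1.25x.
print(nutanix_raw_per_usable())  # 1.25
```

Even with a narrower stripe the overhead stays well under 2X, which is where the capacity savings over bare-metal Hadoop come from.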


Please hit me up on twitter @dlink7 if you have any questions.



Why virtualize Hadoop nodes on the Nutanix Xtreme Computing Platform?


o Make Hadoop an App: Prism’s HTML 5 user interface makes managing infrastructure pain free with one-click upgrades. Integrated data protection can be used to manage golden images for Hadoop across multiple Nutanix clusters. Painful firmware upgrades are easily addressed, saving time.
o No Hypervisor Tax: The Acropolis Hypervisor is included with all Nutanix clusters. Acropolis High Availability and automated Security Technical Implementation Guides (STIG) keep your data available and secure.
o Hardware utilization: Bare-metal Hadoop deployments average 10-20% CPU utilization, a major waste of hardware resources and datacenter space. Virtualizing Hadoop allows for better hardware utilization and flexibility. Virtualization can also help right-size your solution. If your job completion times are meeting their windows, there is no need to buy more hardware. If more resources are needed, they can easily be adjusted.
o Elastic MapReduce and scaling: Dynamic addition and removal of Hadoop nodes based on load allow you to scale based upon your current needs, not what you expect. Enable supply and demand to be in true synergy. Hadoop DataNodes can be cloned out in seconds.
o DevOps: Big Data scientists demand performance, reliability, and a flexible scale model. IT operations relies on virtualization to tame server sprawl, increase utilization, encapsulate workloads, manage capacity growth, and alleviate disruptive outages caused by hardware downtime. By virtualizing Hadoop, data scientists and IT Ops mutually achieve all objectives while preserving autonomy and independence for their respective responsibilities.
o Sandboxing of jobs: Buggy MapReduce jobs can quickly saturate hardware resources, creating havoc for remaining jobs in the queue. Virtualizing Hadoop clusters encapsulates and sandboxes MapReduce jobs, isolating them from other important sorting runs and general-purpose workloads.
o Batch Scheduling & Stacked workloads: Allow all workloads and applications to co-exist, e.g. Hadoop, virtual desktops and servers. Schedule job runs during off-peak hours to take advantage of idle nighttime and weekend hours that would otherwise go to waste. Nutanix also allows you to bypass the flash tier for sequential workloads, which can avoid the time it takes to rewarm the cache for mixed workloads.
o New Hadoop economics: Bare-metal implementations are expensive and can spiral out of control. Downtime and underutilized CPUs on physical servers can jeopardize project viability. Virtualizing Hadoop reduces complexity and ensures success for sophisticated projects with a scale-out, grow-as-you-go model – a perfect fit for Big Data projects.
o Blazing fast performance: Up to 3,500 MB/s of sequential throughput in a compact 2U 4-node cluster. A TeraSort benchmark yields 529 MB/s in the same 2U cluster.
o Unified data platform: Run multiple data processing platforms along with Hadoop YARN on a single unified data platform, Acropolis Distributed File System (ADFS).
o Flash SSDs for NoSQL: The summaries that roll up to a NoSQL database like HBase are used to run business reports and are typically memory- and IOPS-heavy. Nutanix couples SSD tiers with dense memory capacities. Its automatic tiering technology can transparently move IOPS-heavy workloads to the SSD tier.
o Analytic High-density Engine: With the Nutanix solution you can start small and scale. A single Nutanix block comes packed with up to 40TB storage and 96 cores in a compact 2U footprint. Given the modularity of the solution, you can granularly scale per-node (up to ~10TB/24 cores), per-block (up to ~40TB/96 cores), or with multiple blocks, giving you the ability to accurately match supply with demand and minimize the upfront CapEx.
o Change management: Maintain environmental control and separation between development, test, staging, and production environments. Snapshots and fast clones can help in sharing production data with non-production jobs, without requiring full copies and unnecessary data duplication.
o Business continuity and data protection: Nutanix can provide replication across sites to provide additional protection for the NameNode and DataNodes. Replication can be set up to avoid sending wasteful temporary data across the WAN, using per-VM replication and container-based replication.
o Data efficiency: The Nutanix solution is truly VM-centric for all compression policies. Unlike traditional solutions that perform compression mainly at the LUN level, the Nutanix solution provides all of these capabilities at the VM and file level, greatly increasing efficiency and simplicity. These capabilities ensure the highest possible compression/decompression performance on a sub-block level. While developers may or may not run jobs with compression, IT Operations can ensure cold data is effectively stored. Nutanix Erasure Coding can also be applied on top of compression savings.
o Automatic Auto-Leveling and Auto-Archive: Nutanix will spread data evenly across the cluster, ensuring local drives don’t fill up and cause an outage while space is available elsewhere. Using Nutanix cold storage nodes, cold data can be moved off compute nodes, freeing up room for hot data while not consuming additional licenses.
o Time-sliced clusters: Like public cloud EC2 environments, Nutanix can provide a truly converged cloud infrastructure allowing you to run your Hadoop, server and desktop virtualization on a single converged cloud. Get the efficiency and savings you require with a converged cloud on a truly converged architecture.


Why Virtualize Hadoop?

Shortly after coming to Nutanix I wrote an article on virtualizing Hadoop; one of the points of the article was:

One of the bottlenecks with virtualizing Hadoop is its distributed nature. Traditional SANs usually don’t scale as servers are added; typically two storage controllers are shared by all the servers. This model is not conducive to a TeraSort benchmark. Nutanix virtualizes the storage controller, so as you add servers the controllers don’t become the bottleneck.
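As a back-of-the-envelope illustration of that point (the throughput figures here are hypothetical, not benchmark results): with a dual-controller SAN, aggregate throughput is capped by the controller pair no matter how many servers you add, while a controller-VM-per-node design grows with the cluster.

```python
def san_throughput_mbps(num_servers, per_controller=1000, controllers=2):
    """Shared array: aggregate throughput is capped by the controllers."""
    return min(num_servers * per_controller, controllers * per_controller)

def pernode_throughput_mbps(num_servers, per_node=875):
    """Controller VM on every node: aggregate grows with the cluster."""
    return num_servers * per_node

for n in (2, 4, 8, 16):
    print(n, san_throughput_mbps(n), pernode_throughput_mbps(n))
```

Past a handful of nodes the SAN curve goes flat while the per-node curve keeps climbing, which is exactly what a TeraSort run exposes.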

Because of the above point I believe Nutanix makes virtualization a reality for Hadoop, and this is also why the “Grey Beards” of the world are going to cringe when they hear SAN and Hadoop in the same sentence. Using that as a background, here are a few points on why virtualizing Hadoop can save time and money.
[Read more…]


Zero to Big Data in 15 Minutes – Journey to Hadoop

I recently downloaded the Hortonworks Sandbox on my MacBook Air. It was super simple, and it really didn’t take any brain power to get going. In my quest to learn Hadoop, the first couple of tutorials ingrain the basics and the terminology needed to proceed. The download contains a virtual machine configured with Apache Hadoop and all the material to get you going. It was great to see Pig, Hive, HCatalog and HBase in action.

The downside is that I actually want to learn how many mappers I need to configure and how many reducers I need. I think those types of questions may take a data scientist, and I am not one. From working at Nutanix and following some other Hadoop companies, I would pick platforms that stay as close to the open-source projects as possible. If you stray too far, when new tools come out you won’t be seeing them anytime soon on your tool belt. Hortonworks has figured this out.

If you have a spare 15 minutes, click the link below.


How to Virtualize Hadoop the Nutanix Way

Prior to working at Nutanix I wrote an article just before VMworld 2012, VMware and Nutanix Give Wings to Hadoop. Now, with my feet grounded in the company, I think I can deliver some technical content on how to put some wind beneath your wings when virtualizing Hadoop.

One of the bottlenecks with virtualizing Hadoop is its distributed nature. Traditional SANs usually don’t scale as servers are added; typically two storage controllers are shared by all the servers. This model is not conducive to a TeraSort benchmark. Nutanix virtualizes the storage controller, so as you add servers the controllers don’t become the bottleneck.

[Read more…]


Dwayne’s New Year Resolutions

  1. Deploy Hadoop in a business-meaningful way. Lots of talk and reading in 2012, but I never got a chance to roll up my sleeves in a way that had any impact on others. Looking to create some value for one of my clients, free of charge. If you think you have something we can tackle, let’s talk.
  2. Complete a 20Km race by June 30, 2013; complete a 40Km race by Nov 30, 2013.
  3. Keep Saturdays tech-free. Since I love working in my chosen field, I have been known to neglect family and friends by not giving them proper attention. My mind wanders a lot to technology; I don’t think that’s bad, because I love it, but the result is I don’t fully pay attention to a lot of other things going on around me.

Only three but I think it’s going to take work to keep on track.

Happy New Year


#EUC Tip 80: VDI with Time Sharing, Spin Down VDI, Spin UP Cloud\Hadoop

Don’t let all of those expensive CPU cycles go to waste. On the weekends and during off hours, spin down your VDI environment and run some other workloads like vCloud Director or Hadoop (Project Serengeti). The code below is kinda simple, but that’s what makes PowerShell awesome, I guess. To schedule the code to run, visit http://www.virtu-al.net/

WARNING: Before running code you find on the Internet, test, test, test
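The actual PowerShell script is behind the link, but the time-slicing decision it automates can be sketched in a few lines (the workload names and cutoff hours here are illustrative assumptions, not the real script):

```python
from datetime import datetime

def workload_for(now: datetime) -> str:
    """Decide which workload owns the cluster at a given time.

    Weekends and off-hours (before 07:00, after 19:00) go to the
    batch/Hadoop farm; business hours go to the VDI desktops.
    """
    if now.weekday() >= 5:              # Saturday or Sunday
        return "hadoop"
    if now.hour < 7 or now.hour >= 19:  # overnight on weekdays
        return "hadoop"
    return "vdi"

print(workload_for(datetime(2013, 1, 7, 10)))  # Monday 10:00 -> vdi
print(workload_for(datetime(2013, 1, 5, 10)))  # Saturday    -> hadoop
```

The real script would wrap the same check around PowerCLI calls to power VMs down and up; the scheduling logic itself is this trivial.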

[Read more…]


VMware and Nutanix Give Wings to Hadoop

Nutanix has the fortune and misfortune of being thought of as strictly a platform for VDI. I am glad that they serve the VDI market, or else I probably wouldn’t have heard of them. Nutanix has many benefits, one of which is that their value proposition is based on software. Local controller VMs on each node give commodity local storage (if you can call Fusion-io commodity) SAN-like abilities in the form of their file system, NDFS. Where am I going with this? Hadoop.
[Read more…]