Prior to working at Nutanix, I wrote an article just before VMworld 2012, “VMware and Nutanix Give Wings to Hadoop”. Now, with my feet grounded in the company, I think I can deliver some technical content on how to put some wind beneath your wings when virtualizing Hadoop.
One of the bottlenecks with virtualizing Hadoop is its distributed nature. Traditional SANs usually don’t scale as servers are added: typically two storage controllers are shared by all the servers. This model is not conducive to a TeraSort benchmark. Nutanix virtualizes the storage controller, so as you add servers the controllers don’t become the bottleneck.
The second issue with Hadoop is that a lot of temporary data is created when running MapReduce jobs. Putting these temporary files onto a system where they incur a RAID or replication penalty wouldn’t make much sense. It’s only after the Reduce job ends that the results are stored on a highly reliable filesystem. If the job fails before finishing, the task is simply rerun on another node in the Hadoop cluster.
Hadoop’s Distributed File System (HDFS) uses the concept of data locality, much like the Nutanix Distributed File System. Hadoop breaks the input of a MapReduce job into “splits”. Ideally, Hadoop wants each split to be stored on one node and processed there, which saves network traffic and helps with efficiency.
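To make the idea of splits and locality concrete, here is a minimal sketch in Python. It is not Hadoop’s actual scheduler; the names `Split`, `make_splits` and `pick_node`, the 128-byte split size and the node names are all made up for illustration. It cuts an input into fixed-size splits, then prefers a free node that already holds a split’s data before falling back to a remote one.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Split:
    index: int   # position of the split within the input
    offset: int  # starting byte of the split
    length: int  # number of bytes in the split


def make_splits(file_size: int, split_size: int) -> List[Split]:
    """Cut an input of file_size bytes into splits of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits: List[Split] = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append(Split(len(splits), offset, length))
        offset += length
    return splits


def pick_node(split_index: int,
              locations: Dict[int, List[str]],
              free_nodes: Set[str]) -> Tuple[str, bool]:
    """Return (node, is_local). Prefer a free node that already stores the split."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    local = [n for n in locations.get(split_index, []) if n in free_nodes]
    if local:
        return local[0], True
    # No free node holds the data, so the split must be read over the network.
    return sorted(free_nodes)[0], False


# Hypothetical input: 300 bytes, 128-byte splits, two free nodes.
splits = make_splits(300, 128)
locations = {0: ["node-a"], 1: ["node-c"], 2: ["node-b"]}
for s in splits:
    node, is_local = pick_node(s.index, locations, {"node-a", "node-b"})
    print(s.index, s.length, node, "local" if is_local else "remote")
```

In this toy run the second split has to be read remotely because the only node holding it is busy, which is exactly the network traffic that data locality tries to avoid.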
The above image tells the story of how Nutanix comes to the rescue. Nutanix uses a concept called Replication Factor (RF) to enable resiliency. If you’re familiar with Hadoop, it uses the same concept. An RF of 2 means there are two copies of the data on separate nodes within the cluster. An RF of 1 means there will be only one copy of the data. The great thing about an RF of 1 is that you get the speed of staying on the PCIe bus without the limitations. With an RF of 1 you are still able to vMotion your virtual machine around without the cost of doing a Storage vMotion.
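A quick back-of-the-envelope calculation shows why RF matters for temporary data. The sketch below is plain arithmetic, not a Nutanix tool: the function name and the 100 GiB figure are hypothetical, and it assumes every logical byte is simply written once per copy, ignoring compression, caching and metadata.

```python
GIB = 1024 ** 3


def physical_bytes_written(logical_bytes: int, replication_factor: int) -> int:
    """Bytes that land on storage when each logical byte is kept RF times."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_bytes * replication_factor


# Hypothetical volume of intermediate MapReduce output.
temp_data = 100 * GIB

for rf in (1, 2):
    written = physical_bytes_written(temp_data, rf)
    print(f"RF {rf}: {written // GIB} GiB written")
```

Under these assumptions, an RF of 2 doubles the writes for data that is thrown away when the job finishes, and the second copy also has to travel to another node.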
The following command will allow you to set the RF to 1 on your container/NFS volume on Nutanix. The ncli can be run from any one of the Storage Controller Virtual Machines, or you can download the needed files from the admin console. id is the ID of the container.
ncli ctr edit id=
The replication factor can also be applied at the Persistent Cache Tier. This setting is system-wide, so you will want to exercise extreme caution and understand the impact before setting it.
ncli ctr edit id=
The website below explains the impact of these changes on I/O flowing through the Nutanix cluster.
For more information on running Hadoop virtualized on Nutanix, please download the reference architecture posted under Technical Guides.