Intel 3D XPoint and NVM Express – Data Locality Matters For Hyper-Converged Systems

Today, data locality within Nutanix provides many benefits: easier addition of new hardware to the cluster, better performance, reduced network congestion, and consistent performance. Some people will argue that 10 Gb networking is fast enough and that latency isn’t a problem. The impact may not be as noticeable on light workloads today, but network congestion can already be measured and noticed on high-performing systems, and it is only going to get worse as new technologies like 3D XPoint and NVM Express come to market.

Nutanix only ever sends one write over the network, while vendors without data locality could potentially be sending two writes over the network and serving remote reads across it as well. When you look at the density and performance of 3D XPoint, it becomes evident that data locality is going to be a must-have checkbox feature.

3D XPoint Improvements over NAND

You can also see below that NVM Express can drive 6 GB/s! That is gigabytes, not gigabits. A 10 Gb link will become a bottleneck with even one write going across the network, let alone two writes plus reads.
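To put rough numbers on that (my own back-of-the-envelope math, not a benchmark), the little sketch below compares one NVM Express device against a 10 GbE link for one versus two network copies of every write:

```python
# Back-of-the-envelope comparison of NVMe device throughput vs. a 10 GbE link.
# The 6 GB/s figure comes from the Intel material referenced above; the rest
# are illustrative assumptions, not measured numbers.

NVME_GBps = 6.0              # one NVMe device, gigabytes per second
TEN_GBE_GBps = 10 / 8        # 10 gigabits per second ~= 1.25 GB/s, ignoring protocol overhead

for network_copies in (1, 2):    # 1 remote write with data locality, 2 without it
    link_demand = NVME_GBps * network_copies
    print(f"{network_copies} network write(s): {link_demand:.2f} GB/s of replica traffic "
          f"vs. {TEN_GBE_GBps:.2f} GB/s available on 10 GbE "
          f"({link_demand / TEN_GBE_GBps:.1f}x oversubscribed)")
```

Even the best case is several times more traffic than a single 10 GbE link can carry, and that is before any remote reads are added on top.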

For full insight into 3D XPoint and NVM Express and the impact of data locality, watch the video from Intel below.


Betting Against Intel and Moore’s Law for Storage

When I hear about companies essentially betting against Moore’s Law, it makes me think of crazy thoughts like giving up peanut butter or beer. Too crazy, right?! Take a look at what Intel is currently doing for storage vendors below:


Some of the more interesting ones in the hyper-converged space are detailed below.

Increased number of execution buffers and execution units:
More parallel instructions and more threads, all leading to more throughput.

CRC instructions:
For data protection: every enterprise vendor should be checking their data for consistency and guarding against bit rot.
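As a rough illustration of the idea (this is not Nutanix’s on-disk checksum format, just the general pattern), checksum-on-write and verify-on-read looks like the sketch below; the hardware CRC instructions accelerate exactly this kind of work. Note the hardware instruction implements CRC-32C, while Python’s zlib.crc32 used here is plain CRC-32:

```python
# Minimal sketch of checksum-based bit-rot detection. zlib.crc32 uses the
# standard CRC-32 polynomial; Intel's CRC instruction accelerates CRC-32C,
# but the write/verify pattern is the same.
import zlib

def write_block(block: bytes) -> tuple[bytes, int]:
    """Store the data together with its checksum."""
    return block, zlib.crc32(block)

def read_block(block: bytes, stored_crc: int) -> bytes:
    """Verify the checksum before returning data; flag silent corruption."""
    if zlib.crc32(block) != stored_crc:
        raise IOError("checksum mismatch: possible bit rot, recover from replica")
    return block

data, crc = write_block(b"some extent data")
assert read_block(data, crc) == data
```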

Intel Advanced Vector Extensions for wide integer vector operations (XOR, P+Q)
Nutanix uses this capability to calculate XOR for EC-X (erasure coding). It consumes only a fraction of a CPU’s cycles per piece of data, which really helps as Nutanix parallelizes the operation across all of the CPUs in the cluster. Other vendors could use this for RAID 6.
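Here is a toy sketch of the XOR parity primitive itself (not Nutanix’s actual EC-X strip layout): lose any single data block and you rebuild it by XOR-ing the survivors with the parity. The wide AVX registers let the CPU chew through this in big vector-sized bites:

```python
# Toy XOR parity: one parity block protects a strip of data blocks.
# Losing any single block is recoverable by XOR-ing the remaining blocks.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks in one strip
parity = xor_blocks(data)            # the XOR that AVX accelerates

# Simulate losing block 1 and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```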

Intel Secure Hash Algorithm Extensions
Nutanix uses these extensions to accelerate SHA-1 fingerprinting for inline and post-process dedupe. The Intel SHA Extensions are designed to provide a performance increase over single-buffer software implementations that use general-purpose instructions.
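Conceptually, fingerprint-based dedupe looks something like the sketch below. The fixed chunk size and the in-memory store are my simplifications, not the Nutanix on-disk format; the SHA Extensions speed up the hash call itself:

```python
# Minimal sketch of fingerprint-based dedupe: hash each chunk, store unique
# chunks once, and reference duplicates by fingerprint. Chunk size and the
# in-memory store are illustrative simplifications.
import hashlib

CHUNK = 16 * 1024                    # illustrative chunk size
store: dict[str, bytes] = {}         # fingerprint -> unique chunk

def dedupe_write(data: bytes) -> list[str]:
    refs = []
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        fp = hashlib.sha1(chunk).hexdigest()   # the step the SHA Extensions accelerate
        store.setdefault(fp, chunk)            # only the first copy is kept
        refs.append(fp)
    return refs

refs = dedupe_write(b"\x00" * CHUNK * 4)       # four identical chunks...
assert len(store) == 1                          # ...stored exactly once
```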

Below is a video that talks about how Nutanix does dedupe.

Transactional Synchronization Extensions (TSX)
Adds hardware transactional memory support, speeding up execution of multi-threaded software through lock elision. Multi-threaded apps like the Nutanix Controller virtual machine can enjoy a 30-40% boost when multiple threads are used. This provides a big boost for in-RAM caching.

Reduced VM Entry / Exit Latency
This helps the virtual machine, in this case the virtual storage controller: it rarely has to exit into the virtual machine monitor thanks to a shadowing process. The end result is low latency, and the penalty of user space versus kernel is taken off the table. This also happens to be one of the reasons you can virtualize giant Oracle databases.

Intel VT-d
Improves reliability and security through device isolation using hardware-assisted remapping, and improves I/O performance and availability through direct assignment of devices. Nutanix directly maps SSDs and HDDs to the controller VM and removes the yo-yo effect of going through the hypervisor.

Intel Quick Assist


Intel has a Quick Assist card that does full offload for compression and SHA-1 hashes. Guess what? The features on this card will be moving onto the CPU in the future. Nutanix could use this card today but chooses not to for serviceability reasons, but you can bet your bottom dollar that we’ll use it once the feature is baked into the CPUs.

To top everything above, the next Intel CPUs will deliver 44 cores per two-socket server and 88 threads with hyper-threading!

If you want a full breakdown of all the features, you can watch this video with Intel at Tech Field Day.


Nutanix and EVO:RAIL\VSAN – Data Placement

Nutanix and VSAN\EVO:RAIL are different in many ways. One such way is how data is spread across the cluster.

• VSAN is a distributed object file system
• VSAN metadata lives with the VM; each VM has its own witness
• Nutanix is a distributed file system
• Nutanix metadata is global

VSAN\EVO:RAIL breaks up its objects (VMDKs) into components. Those components get placed evenly across the cluster; I am not sure of the exact algorithm, but it appears to be capacity based. Once the components are placed on a node, they stay there until:

• They are deleted
• The 255 GB component (default size) fills up and another one is created
• The Node goes offline and a rebuild happens
• Maintenance mode is issued and someone selects the evacuate data option.

So in a brand-new cluster, things are pretty evenly distributed.


VSAN distributes data with the use of components


Nutanix uses data locality as the main principle in placing all initial data. One copy is written locally and one copy remotely. As more writes occur, the secondary copies of the data keep getting spread evenly across the cluster. Reads stay local to the node. Nutanix uses extents and extent groups (4 MB) as the mechanism to coalesce the data.

Whether a Nutanix cluster is new or has been running for a long time, things are kept level and balanced based on a percentage of overall capacity. This method accounts for clusters with mixed nodes\needs. More here.



Nutanix places copies of data with the use of extent groups.
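As a hedged sketch of the placement principle (illustrative only, not the actual Nutanix placement logic), the first copy always lands on the node hosting the VM and the second copy goes to whichever remote node has the most free room, which is what keeps the cluster level over time:

```python
# Toy sketch of locality-aware replica placement: the first copy stays on the
# node hosting the VM, the second goes to the least-full remote node. This
# illustrates the principle only, not Nutanix's real placement code.
import random

nodes = {"node-a": 0, "node-b": 0, "node-c": 0, "node-d": 0}   # bytes placed per node
EXTENT_GROUP = 4 * 1024 * 1024                                 # 4 MB extent group

def place_extent_group(local_node: str) -> tuple[str, str]:
    nodes[local_node] += EXTENT_GROUP                          # copy 1: always local
    remote = min((n for n in nodes if n != local_node), key=nodes.get)
    nodes[remote] += EXTENT_GROUP                              # copy 2: least-full remote node
    return local_node, remote

for _ in range(1000):
    place_extent_group(random.choice(list(nodes)))             # VMs writing on random hosts
print(nodes)                                                   # secondary copies stay roughly level
```

Add a fifth, empty node to that dictionary and it immediately starts absorbing secondary copies, which is the expansion behaviour described below.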

So you go to expand your cluster…

With VSAN, after you add a node (compute, SSD, HDD) to a cluster and vMotion a workload over to the new node, what happens? Essentially nothing. The additional capacity gets added to the cluster, but there is no additional performance benefit. The VMs that are moved to the new node continue to hit the same resources across the cluster. The additional flash and HDD sit there idle.


Impact of adding a new node with VSAN and moving virtual machines over.


When you add a node to Nutanix and vMotion workloads over, they start writing locally and benefit from the additional flash resources right away. Not only is this important from a performance perspective, it also keeps the available data capacity level across the cluster in the event of a failure.



Impact of adding a new node with Nutanix and moving virtual machines over.

Since data is spread evenly across the cluster, in the event of a hard drive failing all of the nodes in a Nutanix cluster can help rebuild the data. With VSAN, only the nodes containing the affected components can help with the rebuild.
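A rough model shows why that matters (the numbers below are illustrative assumptions, not benchmark results): rebuild time shrinks roughly in proportion to how many nodes can contribute bandwidth.

```python
# Rough model of a distributed rebuild: if more surviving nodes contribute
# bandwidth, rebuild time falls. All numbers here are illustrative.

FAILED_DATA_TB = 4.0            # data on the failed drive, in TB
PER_NODE_MBps = 200.0           # rebuild bandwidth each node can spare (assumed)

def rebuild_hours(participating_nodes: int) -> float:
    total_MBps = participating_nodes * PER_NODE_MBps
    return FAILED_DATA_TB * 1024 * 1024 / total_MBps / 3600

print(f"3 nodes helping:  {rebuild_hours(3):.1f} h")
print(f"31 nodes helping: {rebuild_hours(31):.1f} h")
```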

Note: Nutanix rebuilds cold data to cold data (HDD to HDD), while VSAN rebuilds data into the SSD cache. If you lose an SSD with VSAN, all the backing HDDs need to be rebuilt. That HDD data on VSAN will flood into the cluster’s SSD tier and affect performance. This is one of the reasons I believe 13 RAID controllers were pulled from the HCL. I find it very interesting because one of the RAID controllers pulled is one that Nutanix uses today.

Nutanix will always write the minimum of two copies of data in the cluster regardless of the cluster’s state. If it can’t, the guest won’t get the acknowledgment. When VSAN has an absent host, it will write only one copy if the other half of the components are on the absent host. At some point VSAN will decide it has written too much with only one copy and start the component rebuild before the 60-minute timer. I don’t know the exact algorithm here either; it’s just what I have observed after shutting a host down. I think this is one of the reasons VSAN recommends writing three copies of data.

[Update: VMware changed the KB article after this post. It was three copies of data and has been adjusted to two copies (FT > 0). I’m not sure what changed on their side; there is no explanation for the change in the KB.]

Data locality has an important role to play in performance, network congestion, and availability.

More on Nutanix – EVO

Learn more about Nutanix


EVO RAIL: Status Quo for Nutanix

Some will make a big splash about the launch of EVO RAIL, but the reality is that things remain status quo. While I do work for Nutanix and am admittedly biased, the fact is that Nutanix as a company was formed in 2009 and has been selling since 2011. VSAN, and now EVO RAIL, is a validation of what Nutanix has been doing over the last 5 years. In this case, a high tide lifts all boats.

Nutanix will continue to partner with VMware for all solutions, just like VDI, RDS, SRM, server virtualization, big data applications like Splunk, and private cloud. Yes, we will compete with VSAN, but I think the products are worlds apart, mostly due to architectural decisions. Nutanix helps to sell vSphere and enables all the solutions that VMware provides today. Nutanix has various models that serve Tier 1 SQL\Oracle all the way down to the remote branch where you might want only a handful of VMs. Today EVO RAIL is positioned to serve only Tier 2, test/dev, and VDI. The presentation I sat in on as a vExpert confirmed Tier 1 was not a current use case. I feel this is a mistake for EVO RAIL. By not being able to address Tier 1, in which I would include VDI, you end up creating silos in the data center, which is exactly what SDDC should be trying to eliminate.

Nutanix Use Cases

Some of the available Nutanix use cases


Nutanix is still the king of scale, but I am interested to hear more about EVO RACK, which is still in tech preview. EVO RAIL in version 1.0 will only scale to 16 nodes\servers, or 4 appliances. Nutanix doesn’t really have a limit but tends to follow hypervisor limits; most Nutanix RAs are around 48 nodes from a failure-domain perspective.

Some important differences between Nutanix and EVO RAIL:

* Nutanix starts at 3 nodes, EVO RAIL starts at 4 nodes.

* Nutanix uses heat-optimized tiering based on data analysis plus a RAM cache that can be deduped; EVO RAIL uses caching from SSD (70% of all SSD is used for cache).

* You can buy one Nutanix node at a time; EVO RAIL is only sold four nodes at a time, though I think this has to do with trying to keep a single SKU. SMBs in the market will find it hard to make that jump. On the enterprise side, you need to be able to have different node types if your compute\capacity needs don’t match up.

* Nutanix can scale with different node types offering different levels of storage and compute; EVO RAIL today is a hard-locked configuration. You are unable to even change the amount of RAM from the OEM vendor. CPUs are only 6-core, which leads to needing more nodes = more licenses.

* EVO RAIL is only spec’d for 250 desktops\100 general server VMs per appliance. Nutanix can deliver 440 desktops per 2U appliance with a medium Login VSI workload, and 200 general server VMs when enabling features like inline dedupe on the 3460 series. In short, we have no limits as long as you don’t have CPU\RAM contention.


* Nutanix has one storage controller (VM) per host that takes care of VM-caliber snapshots, inline compression, inline dedupe, MapReduce dedupe, MapReduce compression, analytics, cluster health, replication, and hardware support. EVO RAIL will have the EVO management software (web server), a vCenter VM, a Log Insight VM, a VM from the OEM vendor for hardware support, and a vSphere Replication VM if needed.

* Nutanix is able to have separation between compute and storage clusters; EVO RAIL is one large compute cluster with only one storage container. By having separation, you can have smaller compute clusters and still enjoy one giant volume. This is really just a matter of design flexibility.

* Nutanix can run with any license of vSphere; the EVO RAIL license is Enterprise Plus. I am not sure how that will affect pricing. I suspect the OEMs will be made to keep it at normal prices because it would affect the rest of their business.

* Nutanix can manage multiple large\small clusters with Prism Central; EVO RAIL has no multi-cluster management.

* With Nutanix you get to use all of the available hard drives for all of the data out of the box. With EVO RAIL you have to increase the stripe width to take advantage of all the available disks when data is moved from cache to hard disk.

* Nutanix offers both analysis and built-in troubleshooting tools in the virtual storage controller. You don’t have to add another VM to provide those services.

Chad Sakac mentioned in one of his articles, “my application stack has these rich protection/replication and availability SLAs – because it’s how it was built before we started thinking about CI models”, that you might not pick EVO RAIL and instead go to a Vblock. I disagree on the CI part. Nutanix has the highest levels of data protection today: synchronous writes, bit-rot prevention, all data checksummed, data continuously scrubbed during quiet periods, and Nutanix-based snapshots for backup and DR.

It’s a shame that EVO RAIL went with the form factor it did. VSAN can lose up to 3 nodes at any one time, which is good, but in the current design it will need 5 copies of data to ensure that a block going down will not cause data loss as you scale the solution. I think they should have stayed with a 1-node, 2U solution. Nutanix has a feature called Availability Domains that allows us to survive a whole block going down while the cluster can still function. This feature doesn’t require any additional storage capacity, just the minimum two copies of data.

More information on Availability Domains can be found in the Nutanix Bible.
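As a hedged sketch of the idea (not the actual implementation), block awareness simply means the second copy is never placed on a node in the same block as the first, so a whole block can fail without losing both replicas, and you still only pay for two copies:

```python
# Toy sketch of block-aware replica placement: the second copy is never put in
# the same block (chassis) as the first, so a whole block can fail without
# losing both replicas. Illustrative only, not the actual implementation.

block_of = {                       # node -> block (chassis) it lives in
    "node-1": "block-A", "node-2": "block-A",
    "node-3": "block-B", "node-4": "block-B",
    "node-5": "block-C", "node-6": "block-C",
}

def pick_replica_node(local_node: str) -> str:
    candidates = [n for n in block_of
                  if block_of[n] != block_of[local_node]]     # skip the local block
    return candidates[0]                                       # e.g. least-full of these in practice

replica = pick_replica_node("node-1")
assert block_of[replica] != block_of["node-1"]                 # survives losing block-A
```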


* Nutanix can scale past 32 nodes; VSAN is supported for 32 nodes, yet EVO RAIL is only supported for 16 nodes. I don’t know why they made this decision.

* Prism Element has no limit on the number of objects it can manage. EVO RAIL is still limited by the number of components. I believe the limited hardware specs are being used to cap the number of components so this does not become an issue in the field.

* With Nutanix, when you add a node you can enjoy the performance benefits right away. With EVO RAIL you have to wait until new VMs are created to make use of the new flash and hard drives (or perform a maintenance operation). A lot of this has to do with how Nutanix controls the placement of data; data locality helps with this.

I think the launch of EVO RAIL shows how important hardware still is in achieving five nines of availability. Look out, dual-headed storage architectures: your lunch just got a little smaller again.


Scale-Out Storage – In the Hypervisor Kernel or in a VM?

A new tech note from Nutanix discusses architectural considerations for implementing a converged, scale-out storage fabric that runs across a cluster of nodes. The paper focuses on high availability and resiliency for virtualizing business-critical applications, and covers running storage services both embedded in the hypervisor kernel and as a virtual machine in user space.

Scale-Out Storage – In the Hypervisor Kernel or in a VM?