Archives for February 2014


When To Use: Nutanix Shadow Clones vs VCAI

First kick at the can with Google Hangouts. Apparently I didn't set up the webinar On Air correctly, but the content made it out alive!

Post any questions below.


Nutanix Cache Money: When to Spend, When to Save

Spend wisely

Every Nutanix virtual storage controller has local cache to serve the workloads running directly on its node. A question that comes up is whether the local cache should be increased. No one ever complained about having too much cache, but on a hyperconverged appliance we want to keep RAM available for the running workloads if needed. I would never recommend simply giving every controller virtual machine (CVM) 50 GB or 80 GB of RAM and seeing where that gets you.

The cache on the CVM is automatically adjusted when the RAM of the CVM is increased. I recommend increasing the CVM memory in 2 GB increments and tracking the effectiveness of each change. Even starting with 16 GB of RAM on a system that has 256 GB available is only ~6% of the RAM resources.
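To make the sizing trade-off concrete, here is a quick sketch of the arithmetic above (the function name is mine, not a Nutanix tool): what fraction of host RAM a given CVM size consumes.

```python
# Illustrative arithmetic only: the CVM's share of host RAM at the
# starting points discussed above (16 GB CVM on a 256 GB host).
def cvm_ram_share(cvm_gb, host_gb):
    """Return the fraction of host RAM consumed by the CVM."""
    return cvm_gb / host_gb

# A 16 GB CVM on a 256 GB host uses 6.25% of host memory.
print(round(cvm_ram_share(16, 256) * 100, 2))  # 6.25
```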

Nutanix CVM Resources starting points

| Setting            | Base (Non-Dedupe) | Inline Dedupe     |
| Memory Size        | Increase to 16 GB | Increase to 24 GB |
| Memory Reservation | Increase to 16 GB | Increase to 24 GB |

Go to any CVM IP address and check the Stargate diagnostic page at http://<CVM IP>:2009, and use the guidelines below before increasing the RAM on the CVM. You may need to allow access to port 2009 if you're accessing the page from a different subnet; this is covered in the setup guide.

Extent Cache

| Amount of CVM RAM | Extent Cache Hits | Extent Cache Usage | Recommendation            |
| 16 GB             | 70% – 95%         | > 3200 MB          | Increase CVM RAM to 18 GB |
| 18 GB             | 70% – 95%         | > 4000 MB          | Increase CVM RAM to 20 GB |
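The guideline above can be expressed as a small decision helper. This is a minimal sketch of the rule of thumb, not a Nutanix tool; the thresholds come from the table, and the function and dictionary names are my own.

```python
# Sketch of the extent-cache sizing guideline: bump CVM RAM in 2 GB
# steps while the cache is both busy (usage above threshold) and
# reasonably effective (hit rate in the 70-95% band).
GUIDELINES = {
    16: {"usage_mb": 3200, "next_gb": 18},
    18: {"usage_mb": 4000, "next_gb": 20},
}

def recommend_cvm_ram(current_gb, hit_rate_pct, usage_mb):
    """Return the suggested CVM RAM size in GB per the guideline table."""
    rule = GUIDELINES.get(current_gb)
    if rule and 70 <= hit_rate_pct <= 95 and usage_mb > rule["usage_mb"]:
        return rule["next_gb"]
    return current_gb  # no change recommended

print(recommend_cvm_ram(16, 85, 3300))  # 18
print(recommend_cvm_ram(18, 60, 4100))  # 18 (hit rate below the band; no change)
```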

NOTE: Going higher than 20 GB of RAM on a CVM will automatically start allowing RAM to be used for dedupe. If you don't enable dedupe past 20 GB of RAM, you will be wasting RAM resources. You can prevent this from happening with gflags; it's best to contact support on how to limit the RAM used for dedupe if you feel your workload won't benefit from it.

Using the Prism UI you can assess whether more RAM will help the hit rate ratio. Cache from dedupe is referred to as content cache. The content cache spans RAM and flash, so it is possible to have a high hit rate ratio with little being served from RAM.

In the Analysis section of the UI check to see how much physical RAM is making up the content cache and what your return on it is.

If the memory being saved is over 50% of the physical memory being used and the hit rate ratio is above 90%, you can bump up the CVM memory.
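That rule of thumb is compact enough to state as a one-line predicate; this hypothetical helper (my naming, not Prism's) just encodes the two thresholds above.

```python
# Content-cache check from the rule of thumb above: bump CVM RAM when
# dedupe is saving more than 50% of the physical RAM it occupies AND
# the hit rate ratio is above 90%.
def should_bump_for_content_cache(saved_pct, hit_rate_pct):
    """Both conditions must hold before adding more CVM memory."""
    return saved_pct > 50 and hit_rate_pct > 90

print(should_bump_for_content_cache(60, 95))  # True
print(should_bump_for_content_cache(40, 95))  # False
```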

NOTE: For both extent cache and content cache it is possible to have a low hit rate ratio and high resource usage and still benefit from more RAM. On a very busy system the working set may be too large and may be getting cycled through the cache before it can get a second hit. Our recommendation is to increase the CVM memory if you know your maximum CPU limit for the host; available memory can help the running workload instead of sitting idle.

Hopefully this helps in giving some context before changing settings in your environment.

Learn more about Nutanix with The Nutanix Bible


Teradici PCoIP Hardware Accelerator (APEX) – Is It Working?

For whatever reason Teradici decided to scrap the APEX name and is going with PCoIP Hardware Accelerator.

If you have one in your box, how do you tell if it’s working? You can view information about each virtual machine the Hardware Accelerator is currently monitoring by typing the following command at the ESXi prompt:

/opt/teradici/pcoip-ctrl -V


Using the Hardware Accelerator Dashboard

Teradici has also developed a Hardware Accelerator dashboard interface to simplify demonstrations, evaluations, and proof-of-concept trials. The dashboard interface monitors CPU usage and performance, as shown below.


NX7110 – GPU\PCoIP Hardware Accelerator Node

Model NX-7110
Server Compute Dual Intel Ivy Bridge E5-2680v2, 20 cores / 2.8 GHz
Storage Capacity 2x 400 GB SSD, 6x 1 TB HDDs
Memory Configurable; 128 GB or 256 GB
VM Density Users – GRID K1 vSGA: 8-64, GRID K2 vSGA: 6-48, GRID K2 vDGA: 6 *
Network Connections 2x 10 GbE, 2x 1 GbE, 1x 10/100 BASE-T RJ45
Certifications UL, CSA, CE, VCCI, KCC, C-Tick, EAC
GPU | PCoIP 3x PCIe expansion slots;
Up to 2x GRID K1, 3x GRID K2, 1x APEX

* GPU density dependent on use case

Appliance Specifications

Dimensions Height: 3.5’’ (88mm), Width: 17.2’’ (437mm), Depth:31’’ (787mm)
Weight 54 lbs. (24.5kg) stand-alone / 73 lbs. (33.1kg) package
System Cooling 5x80mm heavy duty fans with PWM fan speed controls
Operating Environment Op Temp Rng: 50°-95°F (10°-35°C)
Non-Op Temp Rng: -40°-158°F (-40°- 70°C)
Op Humidity Rng (non-condensing): 8-90%
Non-Op Humidity Rng: 5-95%
Power Consumption 1200W maximum, 900W typical
Power Supply (Dual supply) 1.1kW Out @100-120V, 12.0-10.0A, 50-60Hz;
1.62kW Out @180-240V, 11.3-8.5A, 50-60Hz
Thermal Dissipation 4090 BTU/hr maximum, 3070 BTU/hr typical
Operating Requirements Input Voltage: 100-240V AC auto-range, Input Frequency: 50-60Hz

Nutanix Disk Self-Healing: Laser Surgery vs The Scalpel

When it comes to healing I think most people would agree with:

• Corrective action should be taken right away
• Fixing the underlying issue shouldn’t cause something else to fail

Wouldn’t it make sense to take that same philosophy and apply it to your infrastructure?

Nutanix, being a converged distributed solution, is designed for failure. It's not a matter of if, but a matter of when. Distributed systems are extremely complex for some of the following reasons:
• Implementing consistent, distributed, fine-grained metadata.
• No single node has complete knowledge.
• Isolating faulty peers through consensus.
• Handling communications to peers running older or newer software during a rolling upgrade.

These are the reasons you need safeguards in place. This post explains how Nutanix is able to heal from drive failure and node failure while allowing current workloads to carry on like nothing out of the ordinary happened.

Corrective Action

Nutanix Failed Disk Drive = Recovery starts immediately

Nutanix Node Failure = Recovery starts within 60 seconds

Low Impact

Running workloads are not forced to go under the knife while the system works around the failed component. Nutanix prioritizes internal replication traffic: each node has a queue called Admission Control, and VM IO (front-end adapter) and maintenance tasks get a 75/25 split on each node in the cluster.

When a drive or node goes down, the metadata is quickly scanned to see which workloads have been affected. This work is evenly distributed amongst all the nodes in the cluster by running a MapReduce job. The replication tasks are queued into a cluster-wide background task scheduler and trickle-fed to the nodes as their resources permit: more nodes, faster rebuild. By default, 10 of the 25 tasks per second can be used for internal replication. This number can be changed, but it is recommended to leave it alone. Internal replication tasks have a higher priority than auto-tiering, but auto-tiering won't be starved. Remember, Nutanix has the ability to write to a remote node but prefers to write locally for speed.

In a four-node cluster, we will have 40 parallel replication tasks on the cluster. As of today, data is moved around in what we call an extent group. An extent group is 4 MB in size, which means we are trying to replicate at most 40 * 4 = 160 MB of data at any given time. Nutanix used a 16 MB extent group for a while but found it could move more data using a smaller size with less impact on the IO path. This type of knowledge only comes through product maturity with the code base.

Since data is equally spread out across the cluster, and the bandwidth reserved per disk for replication is 75 MB/s (including reading and writing the replica), the maximum bandwidth across nodes will be 75 * the number of disks, which is significantly higher than 160 MB/s. So we can assume that the max replication throughput will be 160 MB/s.
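The throughput arithmetic above can be captured in a couple of lines. This is just the worked example restated as code (constants and function name are mine): 10 replication tasks per node per second, 4 MB per extent group.

```python
# Worked version of the replication-throughput arithmetic above.
EXTENT_GROUP_MB = 4   # extent group size, per the post
TASKS_PER_NODE = 10   # background replication tasks/sec per node, per the post

def replication_throughput_mb_s(nodes):
    """Cluster-wide internal replication throughput in MB/s."""
    return nodes * TASKS_PER_NODE * EXTENT_GROUP_MB

print(replication_throughput_mb_s(4))   # 160 (the four-node example)
print(replication_throughput_mb_s(32))  # 1280
```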

6000 family specs

If we were to take a 32-node Nutanix cluster (512 TB raw) made up of NX-6260s at 50% capacity, using 4 * 4 TB drives per node, this is how we would figure out the rebuild time of a drive:

32 nodes * 40 MB/s = 1280 MB/s of rebuild throughput. No single disk is the bottleneck, as the data is placed on all of the drives with proper redundancy.

4 TB at 50% = 2 TB of actual data to rebuild = ~28 minutes to rebuild a 4 TB drive under heavy load. If you had to rebuild the whole node, it would take ~1.8 hours.
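As a sanity check on those figures, here is the same estimate as a short function (my own sketch, using the 10-tasks-per-node and 4 MB extent-group numbers from earlier). It lands close to the post's ~28-minute / ~1.8-hour figures; small differences come from rounding and TB-to-MB conversion.

```python
# Rough rebuild-time estimate for the 32-node example above.
EXTENT_GROUP_MB = 4
TASKS_PER_NODE = 10

def rebuild_minutes(data_tb, nodes):
    """Minutes to re-replicate data_tb of data at full background rate."""
    throughput_mb_s = nodes * TASKS_PER_NODE * EXTENT_GROUP_MB  # 1280 for 32 nodes
    return data_tb * 1_000_000 / throughput_mb_s / 60

print(round(rebuild_minutes(2, 32)))          # ~26 min for one half-full 4 TB drive
print(round(rebuild_minutes(8, 32) / 60, 1))  # ~1.7 hours for a half-full node
```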

Lightly loaded clusters will complete these tasks faster, and therefore get back to parity quicker.


One of the biggest concerns is rebuild time. If a 4 TB HDD fails, the rebuild could take days on a traditional array; the failure of a second HDD could push rebuild times out to a week or so, and the data is vulnerable while the disks are being rebuilt.

This fast, low-impact rebuild is only possible when data is spread across the cluster at write time instead of being limited to certain disks. You also need a mechanism to re-level the data in the cluster when it becomes unbalanced. Nutanix provides both of these core services for distributed converged storage. Not having a mechanism to balance your storage can put your cluster at significant risk of data loss and negative performance impact.

How many times can you go under the knife?