Feb 02

Nutanix Disk Self-Healing: Laser Surgery vs The Scalpel

When it comes to healing, I think most people would agree on two things:

• Corrective action should be taken right away
• Fixing the underlying issue shouldn’t cause something else to fail

Wouldn’t it make sense to take that same philosophy and apply it to your infrastructure?

Nutanix, being a converged distributed solution, is designed for failure. It’s not a matter of if, but a matter of when. Distributed systems are extremely complex, for reasons including:
• Implementing consistent, distributed, fine-grained metadata.
• No single node has complete knowledge.
• Isolating faulty peers through consensus.
• Handling communication with peers running older or newer software during a rolling upgrade.

It’s for these reasons that you need safeguards in place. This post explains how Nutanix is able to heal from drive and node failures while allowing current workloads to carry on as if nothing were out of the ordinary.

Corrective Action

Nutanix Failed Disk Drive = Recovery starts immediately

Nutanix Node Failure = Recovery is started within 60 seconds

Low Impact

Running workloads are not forced to go under the knife while the system hunts for the failed component. Nutanix prioritizes internal replication traffic: each node has a queue called Admission Control, and VM IO (front-end adapter) and maintenance tasks get a 75/25 split on each node in the cluster.
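
To make that split concrete, here is a minimal sketch of how a per-node queue might divide its slots 75/25. The names here are hypothetical illustrations, not Nutanix’s actual implementation:

```python
# Hypothetical sketch of a per-node Admission Control split.
# Nutanix's real implementation is not public; this only
# illustrates the 75/25 ratio described above.

def admission_split(total_slots: int, vm_io_share: float = 0.75):
    """Divide a node's slots between front-end VM IO and maintenance."""
    vm_io_slots = int(total_slots * vm_io_share)
    maintenance_slots = total_slots - vm_io_slots
    return vm_io_slots, maintenance_slots

vm_io, maintenance = admission_split(100)
print(vm_io, maintenance)  # 75 25
```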

When a drive or node goes down, the metadata is quickly scanned to see which workloads have been affected. This work is evenly distributed amongst all the nodes in the cluster via a MapReduce job. The replication tasks are queued into a cluster-wide background task scheduler and trickle-fed to the nodes as their resources permit: more nodes, faster rebuild. By default, 10 tasks per second out of the 25 can be used for internal replication. This number can be changed, but it’s recommended to leave it alone. Internal replication tasks have a higher priority than auto-tiering, but auto-tiering won’t be starved. Remember, Nutanix has the ability to write to a remote node but prefers to write locally for speed.
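
As a rough model of that scheduling order (a sketch under assumptions; the real Curator and scheduler internals aren’t shown here), you can think of the maintenance share as a priority queue in which replication outranks auto-tiering but is capped at 10 of the 25 slots, so tiering is never starved:

```python
import heapq

# Hypothetical model: lower number = higher priority.
PRIORITY = {"replication": 0, "auto_tiering": 1}
MAINTENANCE_SLOTS = 25   # maintenance share from the 75/25 split
REPLICATION_CAP = 10     # default replication tasks per second per node

def schedule_round(tasks):
    """Pick one second's worth of maintenance tasks: replication first,
    capped at 10 slots so auto-tiering still gets the remainder."""
    heap = [(PRIORITY[kind], i, kind) for i, kind in enumerate(tasks)]
    heapq.heapify(heap)
    picked, replication_used = [], 0
    while heap and len(picked) < MAINTENANCE_SLOTS:
        _, _, kind = heapq.heappop(heap)
        if kind == "replication":
            if replication_used >= REPLICATION_CAP:
                continue  # cap hit: skip, leaving slots for tiering
            replication_used += 1
        picked.append(kind)
    return picked

round_ = schedule_round(["replication"] * 20 + ["auto_tiering"] * 20)
print(round_.count("replication"), round_.count("auto_tiering"))  # 10 15
```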

In a four-node cluster, we will have 40 parallel replication tasks on the cluster. As of today, data is moved around in what we call an extent group. An extent group is 4 MB in size, which means we are replicating a maximum of 40 × 4 = 160 MB of data at any given time. Nutanix did use a 16 MB extent group for a while, but found it could move more data using a smaller size with less impact on the IO path. This type of knowledge only comes through product maturity with the code base.

Since data is spread evenly across the cluster, and the bandwidth reserved per disk for replication is 75 MB/s (including reading the replica and writing the replica), the maximum bandwidth across nodes will be 75 MB/s × the number of disks, which is significantly higher than 160 MB/s. So we can assume that the max replication throughput will be 160 MB/s.
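
Putting those numbers together as a back-of-the-envelope check (the disk count per node is an assumption for illustration):

```python
# Which limit binds: the task scheduler or the disks?
EXTENT_GROUP_MB = 4    # size of an extent group
TASKS_PER_NODE = 10    # default replication tasks per second per node
NODES = 4
DISKS_PER_NODE = 4     # assumed for illustration
DISK_REPL_BW_MB = 75   # per-disk bandwidth reserved for replication

# Scheduler-side limit: 4 nodes * 10 tasks/s * 4 MB = 160 MB/s.
task_bound = NODES * TASKS_PER_NODE * EXTENT_GROUP_MB

# Disk-side limit: 75 MB/s * total disks = 1200 MB/s, far above that.
disk_bound = DISK_REPL_BW_MB * NODES * DISKS_PER_NODE

print(min(task_bound, disk_bound))  # 160 -> replication tops out at ~160 MB/s
```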

6000 family specs


If we were to take a 32-node Nutanix cluster (512 TB raw) made up of 6260s at 50% capacity, each node using 4 × 4 TB drives, this is how we would figure out the rebuild time of a drive:

32 nodes × 40 MB/s = 1280 MB/s of rebuild throughput. No single disk is the bottleneck, as the data is placed on all of the drives with proper redundancy.

4 TB at 50% = 2 TB of actual data to rebuild = ~28 minutes to rebuild a 4 TB drive under heavy load. If you had to rebuild the whole node, it would take ~1.8 hours.
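
The same arithmetic as a runnable sketch (it assumes binary units, 1 TB = 1024² MB, which reproduces the ~28-minute figure):

```python
# Back-of-the-envelope rebuild-time check for the 32-node 6260 example.
NODES = 32
REBUILD_MB_PER_NODE = 40     # MB/s of rebuild bandwidth per node
DRIVE_TB = 4
CAPACITY_USED = 0.50         # drives are 50% full
DRIVES_PER_NODE = 4

cluster_mb_per_s = NODES * REBUILD_MB_PER_NODE        # 1280 MB/s
drive_data_mb = DRIVE_TB * CAPACITY_USED * 1024**2    # ~2 TB of data

drive_minutes = drive_data_mb / cluster_mb_per_s / 60  # ~27 min (post rounds up)
node_hours = drive_minutes * DRIVES_PER_NODE / 60      # ~1.8 hours for 4 drives

print(f"single drive: ~{drive_minutes:.0f} min, whole node: ~{node_hours:.1f} h")
```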

Lightly loaded clusters will complete these tasks faster, and therefore get back to parity quicker.

RAID?


One of the biggest concerns with RAID is rebuild times. If a 4 TB HDD fails, an average rebuild can take days. The failure of a second HDD could push the rebuild time to a week or so … and the data is vulnerable while the disks are being rebuilt.

This fast, low-impact rebuild is only possible when you spread the data across the cluster as you write it, instead of limiting yourself to certain disks. You also need a mechanism to re-level the data in the cluster when it becomes unbalanced. Nutanix provides both of these core services for distributed converged storage. Not having a mechanism to balance your storage can put your cluster at significant risk of data loss and negative performance impact.

How many times can you go under the knife?


Comments

  1. Samuel Rothenbuehler says:

    Hi Dwayne

    Thanks for your great post! Just a quick question regarding the Curator tasks: your calculation implies that the 4 MB extent groups are processed in one second (e.g. four-node cluster: 4 × 10 × 4 = 160 MB/s). Is that a simplification, or does a task to replicate one extent group really always take one second?

    Sam

    P.S.: I’m working for a Nutanix partner (Amanox) in Switzerland.

    • dlessner says:

      Hi Samuel

      It does take about 1 sec to complete. When the extent groups were 16 MB, it varied. The smaller extent group size works better overall in the system, especially when the load is high. I plan to have a video of a rebuild taking place; it’s on my to-do list anyway.

      Thanks for reading.

  2. Thomas says:

    Hi,

    I’m a little bit confused by your demonstration:

    You state as a postulate:

    “Since data is spread evenly across the cluster, and the bandwidth reserved per disk for replication is 75 MB/s (including reading the replica and writing the replica), the maximum bandwidth across nodes will be 75 MB/s × the number of disks, which is significantly higher than 160 MB/s. So we can assume that the max replication throughput will be 160 MB/s.”

    The explanation for 160 MB/s is pretty straightforward in this case: 10 tasks/sec × 4 nodes × 4 MB.

    But afterwards you use 32 nodes × 40 MB.

    The 40 MB comes from 4 MB × 10 tasks/sec, am I correct? As I said, it was just a little confusion, I just want to be sure 😉

    • dlessner says:

      Hi Thomas

      Thanks for reading. The 75 MB/s is per disk, so if you have 4 HDDs, the top speed per node at the HDD layer would be 300 MB/s. Out of that 300 MB/s, 40 MB/s per node can be used for rebuilds. The rest, when under heavy load, can be used for auto-tiering and for reads/writes when there is a flash miss.

Trackbacks

  1. […] A Nutanix cluster is limited to roughly 40 MB/s per node for rebuilding HDDs. Nutanix can achieve linear rebuild times because the data is evenly spread out through the cluster. More info here. […]

  2. […] Nutanix Distributed File System is designed for hardware failure and is self-healing. Always-on operation includes detection of silent data corruption and repair of errors around data […]
