Why losing a disk on Nutanix is no big deal (*No Humans Required)

When Acropolis DFS detects an accumulation of errors for a particular disk (e.g., I/O errors or bad sectors) it is the Hades service running the Controller VM. The purpose of Hades is to simplify the break-fix procedures for disks and to automate several tasks that previously required manual user actions. Hades aids in fixing failing devices before the device become unrecoverable.

Nutanix has a unified component called Stargate that manages the responsibility of receiving and processing data. All read and write requests are sent to the Stargate process running on that node. Once Stargate sees delays in responses to I/O requests to a disk, it marks a disk offline. Hades then automatically removes the disk from the data path and runs smartctl checks against it. If the checks pass, Hades then automatically marks the disk online and returns it to service. If Hades’ smartctl checks fail, or if Stargate marks a disk offline three times within one hour (regardless of the smartctl check results), Hades automatically removes the disk from the cluster, and following occurs:

• The disk is marked for removal within the cluster Zeus configuration.
• This disk is unmounted.
• The Red LED of the disk is turned on to provide a visual indication of the failure.
• The cluster automatically begins to create new replicas of any data that is stored on the disk.

The failed disk is marked as a tombstoned Disk to prevent it from being used again without manual intervention.

When disk is marked offline, an alert is triggered, and is immediately removed from the storage pool by the system. Curator then identifies all extents stored on the failed disk, and Acropolis DSF is then prompted to re-replicate copies of the associated replicas to restore the desired replication factor. By the time the Nutanix administrators become aware of the disk failure via Prism, SNMP trap, or email notification, Acropolis DSF will be well on its way to healing the cluster.

Acropolis DSF data rebuild architecture provides faster rebuild times and no performance impact to workloads supported by the Nutanix cluster when compared to traditional RAID data protection schemes. RAID groups or sets typically comprise a small number of drives. When a RAID set performs a rebuild operation, typically one disk is selected to be the rebuild target. The other disks that comprise the RAID set must divert enough resources to quickly rebuild the data on the failed disk. This can lead to performance penalties for workloads served by the degraded RAID set. Acropolis DSF can distribute remote copies found on any individual disk among the remaining disks in the Nutanix cluster. Therefore Acropolis DSF replication operations can happen as background processes with no impact to cluster operations or performance. Acropolis DSF can access all disks in the cluster at any given time as a single, unified pool of storage resources. This architecture provides a very advantageous consequence. As the cluster size grows, the length of time needed to recover from a disk failure decreases as every node in the cluster participates in the replication. Since the data needed to rebuild a disk is distributed throughout the cluster, more disks are involved in the rebuild process. This increases the speed at which the affected extents are re-replicated.

It’s important to note that Nutanix also keeps the performance consistent during the rebuild operations. For hybrid systems Nutanix rebuilds cold data to cold data so large hard drives do not flood the cache of the SSD’s. For all flash systems Nutanix has quality of service implemented for backend I/O to prevent user I/O from being impacted.

In addition to a many-to-many rebuild approach to data availability, the Acropolis DFS data rebuild architecture ensures that all healthy disks are available for use all of the time. Unlike most traditional storage arrays, there’s no need for “hot-spare” or standby drives in a Nutanix cluster. Since data can be rebuilt to any of the remaining healthy disks, reserving physical resources for failures is unnecessary. Once healed, you can lose the next drive/node.


  1. Nice post. Would be great to do one on the next steps i.e. What happens when a new disk is added to replace the failed disk.

  2. Irwin Fletcher says:

    Thanks for the susinct description of this process. It would also be great to see how this process compares to VSAN or other competitors.

Speak Your Mind