Under the Covers of a Distributed Virtual Computing Platform – Part 3: Metadata

Part 1 was an overview of the magic of the Nutanix Distributed File System (NDFS).
Part 2 was an overview of Zookeeper and how it maintains configuration across a distributed cluster built for virtual workloads.
Part 3 covers the reason why Nutanix can scale to infinity: a distributed metadata layer made up of Medusa and Apache Cassandra.

Before starting at Nutanix I wrote a brief article on Medusa, Nutanix: Medusa and No Master. Medusa is a Nutanix abstraction layer that sits in front of a NoSQL database holding the metadata for all data in the cluster. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra. As virtual machines move around the nodes (servers) in the cluster, the cluster always knows where their data sits. This ability to quickly locate all the data is why hard drive failures, node failures, and even whole block* failures can occur and the cluster carries on.
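To make the "no master" idea concrete, here is a toy sketch of how a Cassandra-style system spreads metadata ownership across nodes with a consistent-hash token ring. The node names, key format, and single-token-per-node design are illustrative assumptions, not Nutanix's actual implementation:

```python
import hashlib
from bisect import bisect_right

class MetadataRing:
    """Toy consistent-hash ring: each node owns a token range, and a key's
    metadata lives on the node owning the first token at or after the key's
    hash, plus the next replication_factor - 1 nodes around the ring."""

    def __init__(self, nodes, replication_factor=2):
        self.rf = replication_factor
        # One token per node, derived from the node name. Real systems
        # assign tokens explicitly or use many virtual tokens per node.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key):
        """Return the nodes responsible for this key's metadata."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.rf)]

ring = MetadataRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("vdisk-42:extent-7")  # the nodes holding this metadata
```

Because any node can run the same hash computation, there is no master to ask: every node can independently work out who owns a piece of metadata.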

When a file reaches 512K in size, the cluster creates a vDisk to hold the data; files smaller than 512K are stored directly inside Cassandra. Cassandra runs on all nodes of the cluster. These nodes communicate with each other once a second using the Gossip protocol, ensuring that the state of the database is current on all nodes.
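The once-a-second gossip exchange can be modeled with a few lines of code. This is a simplified sketch of gossip in general, not the actual Cassandra implementation; the node names and the `(version, value)` entry format are assumptions made for the example:

```python
import random

def gossip_round(states):
    """One toy gossip round: every node picks a random peer and the pair
    merge their views, keeping the highest-versioned copy of each entry.
    Repeating this spreads any update to every node."""
    nodes = list(states)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = {}
        for key in states[node].keys() | states[peer].keys():
            # Tuples compare version-first, so the newer entry wins.
            merged[key] = max(states[node].get(key, (0, None)),
                              states[peer].get(key, (0, None)))
        states[node] = dict(merged)
        states[peer] = dict(merged)

# node-a observes a change; a few gossip rounds spread it everywhere
states = {n: {} for n in ["node-a", "node-b", "node-c"]}
states["node-a"]["disk-status"] = (1, "healthy")
for _ in range(10):
    gossip_round(states)
```

The appeal of gossip is that no node coordinates the exchange, yet the number of rounds needed for everyone to converge grows only logarithmically with cluster size.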

A vDisk is a subset of available storage within a container. The cluster automatically creates and manages vDisks within an NFS container. As a general rule you will see one vDisk per VMDK, since most VMDKs are over 512K. While the vDisk is abstracted away from the virtualization admin, it's important to understand: vDisks are how Nutanix can present vast amounts of storage to virtual machines while holding only a subset of the total on any one node.

vDisks are block-level devices for VMDKs, mapped seamlessly through the Nutanix NFS Datastore.

vDisks are made up of extents and extent groups, which help serialize the data to disk. This process also helps avoid misalignment issues with older operating systems. All of the blocks that make up a vDisk are maintained by Medusa. As workloads migrate between flash and HDD automatically, consistency is maintained across the cluster: if hot data is in flash on one node, its replica is also in flash on another node, and likewise when the data is stored on HDD.
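The vDisk-to-extents-to-extent-groups layering can be sketched as a simple mapping exercise. The extent size and extents-per-group constants below are illustrative assumptions, not Nutanix's real on-disk parameters:

```python
EXTENT_SIZE = 1024 * 1024   # assumed 1 MB extents (illustrative only)
EXTENTS_PER_GROUP = 4       # assumed group size (illustrative only)

def build_extent_map(vdisk_size_bytes):
    """Slice a vDisk into fixed-size extents and pack consecutive extents
    into extent groups, the units that are actually serialized to disk.
    A metadata layer like Medusa would track a map of this shape."""
    n_extents = -(-vdisk_size_bytes // EXTENT_SIZE)  # ceiling division
    extent_map = {}
    for e in range(n_extents):
        extent_map[e] = {
            "offset": e * EXTENT_SIZE,            # byte offset in the vDisk
            "extent_group": e // EXTENTS_PER_GROUP,
        }
    return extent_map

emap = build_extent_map(10 * 1024 * 1024)  # a 10 MB vDisk
```

Keeping this map in metadata rather than in fixed disk geometry is what lets extent groups move between flash and HDD without the VM noticing.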

Cassandra does depend on Zeus to gather information about the cluster configuration.

* You need at least three blocks before you can survive a whole block going down; we call this feature block awareness.
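The placement rule behind block awareness can be sketched in a few lines: a replica must land on a node in a different block than the primary copy. The cluster layout, node names, and function shape below are assumptions for illustration, not the actual placement code:

```python
def pick_replica_nodes(nodes, primary, rf=2):
    """Block-aware placement sketch: choose replica targets whose block
    differs from the primary's, so losing a whole block still leaves a
    surviving copy. `nodes` maps node name -> block id."""
    primary_block = nodes[primary]
    candidates = [n for n, blk in nodes.items()
                  if n != primary and blk != primary_block]
    if len(candidates) < rf - 1:
        raise RuntimeError("not enough blocks for block awareness")
    return candidates[: rf - 1]

# A three-block cluster: block-1 holds two nodes, the others one each
cluster = {"a1": "block-1", "a2": "block-1",
           "b1": "block-2", "c1": "block-3"}
replicas = pick_replica_nodes(cluster, primary="a1")
```

Note that node `a2` is never chosen even though it is a different node, because it shares a block (and therefore power) with the primary `a1`.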


  1. Nice man, keep the great articles coming.

  2. First of all, great articles; keep up the good work!

    I’m wondering: if I understand correctly, a block can have up to 4 nodes, and a node is exchangeable between blocks. A node runs individually and only shares power through the block.
    So how does the node know in which block it is placed, and do all blocks have unique identifiers? Both seem necessary to me to become “block aware”.

    Another thing I’m wondering: if I have multiple blocks divided over two sites in a stretched environment, can I make sure replication between blocks is done between blocks in different sites, so that when a site fails I still have my data and I’m able to recover my VMs?

    • dlessner says:

      Hi Rob

      Thanks for the questions. Once a node is placed in the chassis and added to the cluster, you wouldn’t be able to move it unless you updated the configuration. Today we don’t support stretched clusters; you would have to use our async replication.

      I am in Canada, and async is OK for the most part. The EMEA Nutanix SEs have your back, though: they are pushing engineering for sync replication.
