Part 1 was an overview of the magic of the Nutanix Distributed File System (NDFS).
Part 2 was an overview of Zookeeper and how it maintains configuration across a distributed cluster built for virtual workloads.
Part 3 covers the reason Nutanix can scale to infinity: a distributed metadata layer made up of Medusa and Apache Cassandra.
Before starting at Nutanix I wrote a brief article on Medusa, Nutanix: Medusa and No Master. Medusa is a Nutanix abstraction layer that sits in front of a NoSQL database holding the metadata for all data in the cluster. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra. As virtual machines move around the nodes (servers) in the cluster, they always know where their data sits. That ability to quickly locate all the data is why hard drives, nodes, and even whole blocks* can fail and the cluster can carry on.
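To make the idea concrete, here is a toy sketch of what a metadata layer like Medusa does conceptually: map a piece of vDisk data to the nodes holding its replicas, so reads can carry on against surviving replicas after a failure. This is illustrative Python only; the class and method names are my own invention, not Nutanix internals.

```python
class MetadataStore:
    """Toy stand-in for the distributed metadata store behind Medusa."""

    def __init__(self):
        # (vdisk_id, extent_id) -> list of node IDs holding replicas
        self._locations = {}

    def record_write(self, vdisk_id, extent_id, nodes):
        """Remember which nodes hold replicas of this piece of data."""
        self._locations[(vdisk_id, extent_id)] = list(nodes)

    def locate(self, vdisk_id, extent_id):
        """Return the nodes that still hold this data; any replica will do."""
        return self._locations.get((vdisk_id, extent_id), [])

    def node_failed(self, node_id):
        """Drop a failed node from every replica list; reads carry on
        against the remaining replicas."""
        for key, nodes in self._locations.items():
            self._locations[key] = [n for n in nodes if n != node_id]


store = MetadataStore()
store.record_write("vdisk-1", 0, ["node-a", "node-b"])
store.node_failed("node-a")
print(store.locate("vdisk-1", 0))  # the surviving replica: ['node-b']
```

The real system also has to keep this map consistent across every node, which is where Cassandra comes in below.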
When a file reaches 512K in size, the cluster creates a vDisk to hold the data. Files smaller than 512K are stored inside Cassandra itself. Cassandra runs on all nodes of the cluster. These nodes communicate with each other once a second using the Gossip protocol, ensuring that the state of the database is current on all nodes.
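The placement rule above is simple enough to sketch. This is a hypothetical illustration of the 512K threshold described in the text, not actual Nutanix logic; the function name is mine.

```python
SMALL_FILE_LIMIT = 512 * 1024  # the 512K threshold from the article

def place_file(size_bytes):
    """Illustrative: decide where a file of a given size would live."""
    if size_bytes < SMALL_FILE_LIMIT:
        return "cassandra"  # small files live inline in the metadata store
    return "vdisk"          # the cluster creates a vDisk to hold the data

print(place_file(4 * 1024))     # a 4K file -> 'cassandra'
print(place_file(1024 * 1024))  # a 1 MB file -> 'vdisk'
```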
A vDisk is a subset of available storage within a container. The cluster automatically creates and manages vDisks within an NFS container. As a general rule you will see a vDisk for every vmdk, since most are over 512K. While the vDisk is abstracted away from the virtualization admin, it is still important to understand. vDisks are how Nutanix can present vast amounts of storage to virtual machines while holding only a subset of the total on any one node.
vDisks are made up of extents and extent groups, which help serialize the data to disk. This process also helps avoid misalignment issues with older operating systems. All of the blocks that make up a vDisk are maintained by Medusa. As workloads migrate between flash and HDD automatically, consistency is maintained across the cluster: if hot data is in flash on one node, its replica is also in flash on another node, and likewise if the data is stored on HDD.
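The hierarchy and the tier-consistency rule above can be sketched as a small data model. Everything here is a hedged illustration under my own naming, not the actual on-disk layout.

```python
from dataclasses import dataclass, field

@dataclass
class Extent:
    extent_id: int
    tier: str  # 'flash' or 'hdd', the tier this copy currently sits on

@dataclass
class ExtentGroup:
    group_id: int
    extents: list = field(default_factory=list)

@dataclass
class VDisk:
    name: str
    extent_groups: list = field(default_factory=list)

def replicas_tier_consistent(primary, replica):
    """Check that each extent's replica sits on the same storage tier,
    mirroring the flash/HDD consistency described above."""
    return all(p.tier == r.tier for p, r in zip(primary, replica))


primary = [Extent(0, "flash"), Extent(1, "hdd")]
replica = [Extent(0, "flash"), Extent(1, "hdd")]
print(replicas_tier_consistent(primary, replica))  # True
```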
Cassandra depends on Zeus to gather information about the cluster configuration.
* You need at least three blocks before the cluster can survive a whole block going down; we call this feature block awareness.