Under the Covers of a Distributed Virtual Computing Platform – Part 3: Metadata

Part 1 was an overview of the magic of the Nutanix Distributed File System (NDFS).
Part 2 was an overview of ZooKeeper with regard to maintaining configuration across a distributed cluster built for virtual workloads.
Part 3 covers the reason Nutanix can keep scaling as nodes are added: a distributed metadata layer made up of Medusa and Apache Cassandra.

Before starting at Nutanix I wrote a brief article on Medusa, "Nutanix: Medusa and No Master". Medusa is a Nutanix abstraction layer that sits in front of a NoSQL database that holds the metadata of all data in the cluster. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra. As virtual machines move around the nodes (servers) in the cluster, they know where all their data is sitting. The ability to quickly know where all the data is sitting is why hard drives, nodes and even whole blocks* can fail and the cluster can carry on.

When a file reaches 512K in size, the cluster creates a vDisk to hold the data. Files smaller than 512K are stored inside Cassandra. Cassandra runs on all nodes of the cluster. These nodes communicate with each other once a second, using the Gossip protocol, ensuring that the state of the database is current on all nodes.
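The size rule above can be sketched in a few lines of Python. This is only an illustration of the threshold described in the text, not Nutanix code; the function name `placement` and the return labels are made up for the example.

```python
# Hypothetical illustration of the 512K rule described above.
SMALL_FILE_LIMIT = 512 * 1024  # 512K threshold taken from the text


def placement(size_bytes: int) -> str:
    """Return where a file of this size would live under the rule above."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    # Files that reach 512K get a vDisk; smaller files stay in the metadata store.
    return "vdisk" if size_bytes >= SMALL_FILE_LIMIT else "cassandra"
```

A 4K config file would come back as "cassandra", while a multi-gigabyte vmdk would come back as "vdisk".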

A vDisk is a subset of available storage within a container. The cluster automatically creates and manages vDisks within an NFS container. A general rule is that you will see a vDisk for every vmdk, since most of the time they are over 512K. While the vDisk is abstracted away from the virtualization admin, it’s important to understand. vDisks are how Nutanix is able to present vast amounts of storage to virtual machines while having only a subset of the total amount on any one node.

vDisks are block-level devices for VMDKs. These are mapped seamlessly through the Nutanix NFS datastore.

vDisks are made up of extents and extent groups, which help to serialize the data to disk. This process also helps avoid misalignment issues with older operating systems. All of the blocks that make up a vDisk are maintained by Medusa. As workloads migrate between flash and HDD automatically, consistency is maintained across the cluster. If hot data is in flash on one node in the cluster, its replica is also in flash on another node, and vice versa if the data is stored on HDD.
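To picture how a vDisk breaks down into extents and extent groups, here is a rough Python sketch. The sizes (`EXTENT_SIZE`, `EXTENTS_PER_GROUP`) and the simple arithmetic mapping are assumptions picked for illustration; the real cluster looks these locations up in Medusa rather than computing them this way.

```python
from dataclasses import dataclass

# Both values are illustrative assumptions, not figures from the article.
EXTENT_SIZE = 1024 * 1024   # assumed size of one extent (1 MiB)
EXTENTS_PER_GROUP = 4       # assumed number of extents in one extent group


@dataclass(frozen=True)
class ExtentLocation:
    extent_group: int       # which extent group holds the byte
    extent: int             # which extent within the vDisk holds the byte
    offset_in_extent: int   # byte position inside that extent


def locate(vdisk_offset: int) -> ExtentLocation:
    """Map a byte offset in a vDisk to a hypothetical extent location."""
    if vdisk_offset < 0:
        raise ValueError("offset must be non-negative")
    extent = vdisk_offset // EXTENT_SIZE
    return ExtentLocation(
        extent_group=extent // EXTENTS_PER_GROUP,
        extent=extent,
        offset_in_extent=vdisk_offset % EXTENT_SIZE,
    )
```

The point of the sketch is only that every byte of a vDisk resolves to an extent inside an extent group, and that this mapping is the kind of metadata Medusa keeps track of.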

Cassandra does depend on Zeus to gather information about the cluster configuration.

* You need three blocks before you can survive a whole block going down. We call this feature block awareness.
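To make the footnote a little more concrete, here is a minimal Python sketch of the idea behind block awareness: keep a replica on a node that sits in a different block than the primary copy. The function names and the node-to-block dictionary are hypothetical, and this is not the actual Nutanix placement logic; only the three-block minimum comes from the text.

```python
from typing import Dict, Optional

MIN_BLOCKS = 3  # three-block minimum stated in the footnote above


def is_block_aware(node_to_block: Dict[str, str]) -> bool:
    """True when the cluster spans enough blocks to survive losing one."""
    return len(set(node_to_block.values())) >= MIN_BLOCKS


def pick_replica_node(primary: str, node_to_block: Dict[str, str]) -> Optional[str]:
    """Pick a node for the replica that is not in the primary's block."""
    primary_block = node_to_block[primary]
    # Sort so the choice is deterministic in this toy example.
    candidates = sorted(
        node for node, block in node_to_block.items() if block != primary_block
    )
    return candidates[0] if candidates else None
```

With nodes n1 and n2 in block A, n3 in block B and n4 in block C, a replica for data on n1 would land on n3, so losing all of block A still leaves a copy.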


Under the Covers of a Distributed Virtual Computing Platform – Part 1: Built For Scale and Agility

There is lots of talk in the industry about who had software-defined storage first and who was using which components. I don’t want to go down that rat hole since it’s all marketing and it won’t help you enable your business at the end of the day. I want to really get into the nitty gritty of the Nutanix Distributed File System (NDFS). NDFS has been in production for over a year and a half with good success; take a read of the article in the Wall Street Journal.

Below are the core services and components that make NDFS tick. There are actually over 13 services. For example, our replication is distributed across all the nodes to provide speed and low impact on the system. The replication service is called Cerebro, which we will get to in this series.
Nutanix Distributed File System

This isn’t some home-grown science experiment: the engineers who wrote the code come from Google, Facebook and Yahoo, where these components were invented. It’s important to realize that all components are replaceable, or future-proofed if you will. The services and libraries provide the APIs, so as new innovations happen in the community, Nutanix is positioned to take advantage.

All the services mentioned above run on multiple nodes in the cluster in a master-less fashion to provide availability. The nodes talk over 10 GbE and are able to scale in a linear fashion. There is no performance degradation as you add nodes. Other vendors have to use InfiniBand because they don’t share the metadata across all of the nodes. Those vendors end up putting a full copy of the metadata on each node, which eventually causes them to hit a performance cliff, and the scaling stops. Each Nutanix node acts as a storage controller, allowing you to do things like have a datastore of 10,000 VMs without any performance impact.
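The difference between sharing metadata across nodes and keeping a full copy on every node can be sketched with a simple hash ring, the general technique Cassandra-style stores use to spread keys. This is a generic Python illustration under my own assumptions (the `MetadataRing` class, MD5 tokens and a replica count of 3 are all made up for the example), not the modified Cassandra that Nutanix runs.

```python
import bisect
import hashlib
from typing import List


def _token(value: str) -> int:
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class MetadataRing:
    """Toy hash ring: each key is owned by a few nodes, not by all of them."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self._ring = sorted((_token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    def owners(self, key: str, replicas: int = 3) -> List[str]:
        """Return the nodes holding this key (replica count is an assumption)."""
        size = len(self._ring)
        # First node clockwise from the key's token, wrapping around the ring.
        start = bisect.bisect_right(self._tokens, _token(key)) % size
        count = min(replicas, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]
```

Adding a node to a ring like this only changes ownership for the keys near its token, which is the intuition behind each node carrying a slice of the metadata instead of all of it.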

While the diagram can look a little daunting, rest assured the complexity has been abstracted away for the end user. It’s a radical shift in data center architecture and will be fun breaking it down.