Part 1 was an overview of the magic of the Nutanix Distributed File System (NDFS).
Part 2 was an overview of Zookeeper and how it maintains configuration across a distributed cluster built for virtual workloads.
Part 3 covered the reason Nutanix can scale to infinity: a distributed metadata layer made up of Medusa and Apache Cassandra.
Part 4: Stargate is the main point of contact for the Nutanix cluster. All read and write requests are sent to Stargate for processing. Stargate checksums data before writing it and verifies the checksum when reading it back. Data integrity is number one.
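That checksum-on-write, verify-on-read discipline can be sketched like this. This is a toy illustration using an in-memory dict and SHA-256; the checksum algorithm and storage layout Stargate actually uses aren't specified here.

```python
import hashlib


def write_block(store: dict, block_id: str, data: bytes) -> None:
    """Store the block alongside a checksum computed at write time."""
    store[block_id] = (data, hashlib.sha256(data).hexdigest())


def read_block(store: dict, block_id: str) -> bytes:
    """Verify the checksum on every read before returning the data."""
    data, checksum = store[block_id]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError(f"checksum mismatch on block {block_id}")
    return data
```

If anything corrupts the stored bytes between write and read, the read fails loudly instead of silently returning bad data.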
Stargate has six components that make up the service:
Front-end adapter: Receives read/write requests from the ESXi host. It keeps track of incoming writes and helps localize traffic in the cluster for performance. The front-end adapter lets your 3+N storage controllers work together to alleviate hot spots and even run mixed workloads.
Admission control: Determines which requests to forward to vDisk controllers, based on the type of request and the number of outstanding requests. Admission control balances guest I/O against maintenance tasks: replication, snapshots, and continual data scrubbing.
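One way to picture that balancing act is a weighted scheduler. This is a minimal sketch under my own assumptions: the share ratio, the back-pressure threshold, and the class names are all illustrative, not Nutanix-documented values.

```python
class AdmissionController:
    """Toy weighted scheduler: forward GUEST_SHARE guest requests for
    every maintenance request, and back off guest I/O when too many
    requests are already outstanding."""

    GUEST_SHARE = 4  # illustrative ratio, not a Nutanix-documented value

    def __init__(self, max_outstanding: int = 32):
        self.max_outstanding = max_outstanding
        self.guest_served = 0

    def next_request(self, guest_q: list, maint_q: list, outstanding: int):
        # Under pressure, stop admitting new guest I/O entirely.
        if outstanding >= self.max_outstanding:
            return maint_q.pop(0) if maint_q else None
        # Serve guest I/O up to its share, then slip in one maintenance task.
        if guest_q and (self.guest_served < self.GUEST_SHARE or not maint_q):
            self.guest_served += 1
            return guest_q.pop(0)
        self.guest_served = 0
        return maint_q.pop(0) if maint_q else None
```

The point is the shape of the trade-off, not the exact numbers: guest I/O gets priority, but replication and scrubbing are never starved.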
vDisk controller: Responds to incoming requests based on whether the I/O is random or sequential. Random I/O is sent to the oplog. Sequential requests are sent directly to the extent store, unless they are short, small requests; these are treated as random I/O and sent to the oplog as well. The vDisk controller takes the first step in serializing all the writes, so you don't have to worry about disk alignment. It also plays a critical role in Nutanix heat-optimized tiering, moving data up and down the tiers through adaptive algorithms and bringing data closer to the compute. The vDisk controller is the poster child for virtualized Hadoop.
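The routing decision above can be sketched in a few lines. The size cutoff and the way sequentiality is detected here are assumptions for illustration; the real controller's heuristics aren't documented in this post.

```python
RANDOM_IO_CUTOFF = 64 * 1024  # illustrative size threshold in bytes


def route_write(length: int, offset: int, last_end_offset: int) -> str:
    """Decide where a write goes: large sequential writes stream straight
    to the extent store; random or short writes land in the oplog first."""
    sequential = (offset == last_end_offset)  # continues the previous write
    if sequential and length >= RANDOM_IO_CUTOFF:
        return "extent_store"
    return "oplog"  # random I/O, and short sequential bursts too
```

Note the third branch from the text: a write can be sequential yet still go to the oplog if it's too small to be worth streaming.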
Oplog: Receives random write requests, which are copied to an oplog on another node for high availability. The oplog is stored on a subset of the PCIe SSD/enterprise flash to provide faster write acknowledgement. Data in the oplog drains to the extent store in the order in which it was received from the vDisk controller. The oplog gives you what the new "flash bolt-on" products coming onto the market today promise, but with snapshots, backup, and redundancy, and without having to change the guest VM or worry about a million and one gotchas. It just works.
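The oplog's write-then-drain behavior can be sketched as follows: acknowledge only after the entry exists locally and on a peer, then drain in FIFO order. Lists stand in for the peer replica and the extent store; this is a sketch of the flow, not the on-flash format.

```python
from collections import deque


class Oplog:
    """Toy oplog: a write is acknowledged only after it is recorded
    locally and mirrored to a peer node, then drained to the extent
    store in the order received from the vDisk controller."""

    def __init__(self, peer: list, extent_store: list):
        self.entries = deque()
        self.peer = peer                  # oplog replica on another node
        self.extent_store = extent_store  # stand-in for the extent store

    def write(self, data) -> bool:
        self.entries.append(data)
        self.peer.append(data)            # replicate for high availability
        return True                       # ack once both copies exist

    def drain(self) -> None:
        while self.entries:               # FIFO preserves write order
            self.extent_store.append(self.entries.popleft())
```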
Extent store: Responds to requests by reading and writing to the physical disks on the node, and sends a copy to an extent store on another node for high availability.
All requests sent to the extent store are preceded by a metadata query in Medusa.
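That metadata-first ordering looks roughly like this. Dicts stand in for the Medusa metadata map and the physical disks; the `(disk, offset)` location tuple is my own simplification of whatever Medusa actually stores.

```python
def read_extent(metadata: dict, disks: dict, extent_id: str) -> bytes:
    """Read-path sketch: query the metadata layer first (Medusa's role)
    to learn which disk holds the extent, then read from that disk."""
    location = metadata.get(extent_id)   # metadata query precedes any I/O
    if location is None:
        raise KeyError(f"no metadata for extent {extent_id}")
    disk, offset = location
    return disks[disk][offset]
```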
Extent cache: Holds a copy of the most recently returned read requests. The extent cache is stored in memory, a local cache that sits right beside the guest VM, so reads never have to leave the confines of the server. Normally the extent cache is 2 GB on each node, but this can be changed for specific workloads. I don't have any customers that have changed it, but I know that as you scale the extent cache, performance scales as well. I've seen engineering raise the extent cache to 64 GB and achieve 160,000 IOPS of random reads. That's the power of software: not being restricted to predetermined factory values.
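A cache of "the most recently returned reads" with a tunable size can be sketched as a byte-budgeted LRU cache. The LRU eviction policy is my assumption here; the post only says the cache holds recent reads and that its size (2 GB by default) can be changed.

```python
from collections import OrderedDict


class ExtentCache:
    """Toy in-memory read cache: keeps the most recently used reads
    and evicts the least recently used entries when over budget."""

    def __init__(self, capacity_bytes: int = 2 * 1024**3):  # ~2 GB default
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]
        return None                          # miss: caller reads the disk

    def put(self, key, data: bytes) -> None:
        if key in self.entries:
            self.used -= len(self.entries.pop(key))
        self.entries[key] = data
        self.used += len(data)
        while self.used > self.capacity:     # evict least recently used
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
```

Raising `capacity_bytes` is the software-level equivalent of the 2 GB-to-64 GB change described above: no hardware swap, just a bigger budget.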
I hope this helps a bit. There is lots more to dive down into the weeds with, so please post your questions and thoughts.