Dec
14

AOS 5.0 – Adapt Not React – Performance

In AOS 5.0 is Adaptive replica selection is intelligent data placement for the extent store. Rather than use a random selection placement decisions are based on this capacity and queue length, these metrics are used to create a weighted random selection. The current algorithm was great for spreading all of the work load around for fast rebuilds but could cause issues with heterogeneous clusters. With mixed clusters with different tiers size, CPU strength, and running various workloads could have some nodes could be taxed more than others. It also didn’t take in to account the need for rebuilding data if the affected nodes had heavy running workloads.

This new algorithm can prevent weaker nodes from getting overburden and their hot tier from filling up and reduce the risk of having busy disks. It can also allow for lower utilized nodes to send their replicas to each other and allow busier nodes to have less replica traffic being delivered to them. If we take the example of our storage only nodes we can ensure that replicas will go to the storage only nodes while we’re not sending replicas to other computer-based nodes. This new algorithm also reduces the need to run auto balancing from a capacity perspective. By reducing the need to react we also reserve CPU cycles for workloads and save on wear and tear of the drives.
In a rudimentary static placement systems this ability to have adaptive replicas would also solve the problem of moving data that then blows up your cache.

The two less used nodes send their replication traffic to each other. The high-performing node is not impacted by incoming replica traffic.

The two less used nodes send their replication traffic to each other. The high-performing node is not impacted by incoming replica traffic.

Since we have a high performing NoSQL database collecting disk usage and performance stats for each disc we can use those stats to create a fitness value. If we can collect stats for a disc we assume the worst case and place a low number for the probability. If we can’t grab stats there is likely chance that something bad is happening to that disc. The disks once assigned a fitness value can be selected by a weighted random lottery to prevent some nodes taking all of the traffic.

As the product continues to mature were trying to avoid problems from even happening. Whether VDI, Splunk, SAP, SharePoint, SQL your workloads can get very consistent high performance on top of data locality.

The doctor says prevention is always the best medicine.

Speak Your Mind

*