Client Tuning Recommendations for ABS (Acropolis Block Services)

o For large block sequential workloads with I/O sizes of 1 MB or larger, it's beneficial to increase the iSCSI MaxTransferLength from 256 KB to 1 MB.

* Windows: Details on the MaxTransferLength setting are available at the following link: https://blogs.msdn.microsoft.com/san/2008/07/27/microsoft-iscsi-software-initiator-and-isns-server-timers-quick-reference/.

* Linux: Set node.conn[0].iscsi.MaxRecvDataSegmentLength in the /etc/iscsi/iscsid.conf file.
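On Linux, assuming the standard Open-iSCSI initiator, the change might look like this in /etc/iscsi/iscsid.conf (1 MB = 1048576 bytes); log out of and back into the target for the change to take effect:

```
# /etc/iscsi/iscsid.conf
# Raise the receive segment size from the 256 KB default to 1 MB
# for large-block sequential workloads.
node.conn[0].iscsi.MaxRecvDataSegmentLength = 1048576
```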

o For workloads with large storage queue depth requirements, it can be beneficial to increase the initiator and device iSCSI client queue depths.

* Windows: Details on the MaxPendingRequests setting are available at the following link: https://blogs.msdn.microsoft.com/san/2008/07/27/microsoft-iscsi-software-initiator-and-isns-server-timers-quick-reference/.

* Linux: Settings in the /etc/iscsi/iscsid.conf file; initiator limit: node.session.cmds_max (default: 128); device limit: node.session.queue_depth (default: 32)
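Again assuming Open-iSCSI, both queue-depth limits live in the same file; the values below are illustrative examples, not Nutanix recommendations:

```
# /etc/iscsi/iscsid.conf
# Initiator-wide outstanding command limit (default: 128)
node.session.cmds_max = 256
# Per-device (LUN) queue depth (default: 32)
node.session.queue_depth = 64
```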

For more best practices, download the ABS best practices guide.


Nutanix AFS – Domain Activation

Well, if it's not DNS stealing hours of your life, the next thing to make your partner angry as you miss family supper is Active Directory (AD). In more complex AD setups you may find yourself going to the command line to attach your AFS instance to AD.

Some important requirements to remember:

    While a deployment could fail due to AD, the FSVMs (file server VMs) still get deployed. You can complete the join-domain process from the UI or NCLI afterwards.


    The user attaching to the domain must be a domain admin or have similar rights. Why? The join-domain process creates one computer account in the default Computers OU and creates a service principal name (SPN) for DNS. If you don't use the default Computers OU, you will have to use the organizational-unit option from NCLI to point it at the appropriate OU. The computer account can be created in a specified container by using a forward slash to denote hierarchies (for example, organizational_unit/inner_organizational_unit).



    An example command:

    ncli> fs join-domain uuid=d9c78493-d0f6-4645-848e-234a6ef31acc organizational-unit="stayout/afs" windows-ad-domain-name=tenanta.com preferred-domain-controller=tenanta-dc01.tenanta.com windows-ad-username=bob windows-ad-password=dfld#ld(3&jkflJJddu

    AFS needs at least one writable DC to complete the domain join. After the domain join, it can authenticate using a local read-only DC. Timing (latency) may cause problems here. To pick an individual DC, you can use preferred-domain-controller from NCLI.

NCLI Join-Domain Options

file-server | fs : Minerva file server

join-domain : Join the File Server to the Windows AD domain specified.

Required Argument(s):
uuid : UUID of the FileServer
windows-ad-domain-name : The windows AD domain the file server is
associated with.
windows-ad-username : The name of a user account with administrative
privileges in the AD domain the file server is associated with.
windows-ad-password : The password for the above Windows AD account

Optional Argument(s):
organizational-unit : An Organizational unit container is where the AFS
machine account will be created as part of domain join
operation. Default container OU is "computers". Examples:
Engineering, Department/Engineering.
overwrite : Overwrite the AD user account.
preferred-domain-controller : Preferred domain controller to use for
all join-domain operations.

NOTE: preferred-domain-controller needs to be an FQDN

If you need to do further troubleshooting, you can SSH into one of the FSVMs and run

afs get_leader

Then navigate to /data/logs and look at the minerva logs.

This shouldn't be an issue in most environments, but I've included the ports used just in case.

Required AD Permissions

Delegating permissions in an Active Directory (AD) enables the administrator to assign permissions in the directory to unprivileged domain users. For example, to enable a regular user to join machines to the domain without knowing the domain administrator credentials.

Adding the Delegation
To enable a user to join and remove machines to and from the domain:
- Open the Active Directory Users and Computers (ADUC) console as domain administrator.
- Right-click the CN=Computers container (or desired alternate OU) and select "Delegate control".
- Click "Next".
- Click "Add" and select the required user and click "Next".
- Select "Create a custom task to delegate".
- Select "Only the following objects in the folder" and check "Computer objects" from the list.
- Additionally select the options "Create selected objects in the folder" and "Delete selected objects in this folder". Click "Next".
- Select "General" and "Property-specific", then select the following permissions from the list:
- Reset password
- Read and write account restrictions
- Read and write DNS host name attributes
- Validated write to DNS host name
- Validated write to service principal name
- Write servicePrincipalName
- Write operatingSystem
- Write operatingSystemVersion
- Write operatingSystemServicePack
- Click "Next".
- Click "Finish".
After that, wait for AD replication to finish; the delegated user can then use their credentials to join AFS to a domain.

Domain Port Requirements

The following services and ports are used by the AFS file server for Active Directory communication.

UDP and TCP Port 88
Forest level trust authentication for Kerberos
UDP and TCP Port 53
DNS from client to domain controller and domain controller to domain controller
UDP and TCP Port 389
LDAP to handle normal queries from client computers to the domain controllers
UDP and TCP Port 123
NTP traffic for the Windows Time Service
UDP and TCP Port 464
Kerberos Password Change for replication, user and computer authentication, and trusts
TCP Ports 3268 and 3269
Global Catalog from client to domain controllers
UDP and TCP Port 445
SMB protocol for file replication
UDP and TCP Port 135
Port-mapper for RPC communication
TCP High Ports
Randomly allocated high TCP ports for RPC, from 49152 through 65535


    AOS 5.0 – Adapt Not React – Performance

    AOS 5.0 introduces adaptive replica selection: intelligent data placement for the extent store. Rather than using random selection, placement decisions are based on capacity and queue length, and these metrics are used to create a weighted random selection. The old algorithm was great at spreading the workload around for fast rebuilds, but it could cause issues with heterogeneous clusters. In mixed clusters with different tier sizes, CPU strength, and various running workloads, some nodes could be taxed more than others. It also didn't take into account the need to rebuild data when the affected nodes had heavy running workloads.

    This new algorithm can prevent weaker nodes from getting overburdened and their hot tier from filling up, and it reduces the risk of having busy disks. It also allows lower-utilized nodes to send their replicas to each other, so that busier nodes have less replica traffic delivered to them. If we take the example of our storage-only nodes, we can ensure that replicas go to the storage-only nodes while we're not sending replicas to the compute nodes. This new algorithm also reduces the need to run auto-balancing from a capacity perspective. By reducing the need to react, we also reserve CPU cycles for workloads and save on wear and tear of the drives.
    In rudimentary static placement systems, this ability to have adaptive replicas would also solve the problem of moving data that then blows up your cache.

    The two less-used nodes send their replication traffic to each other. The high-performing node is not impacted by incoming replica traffic.

    Since we have a high-performing NoSQL database collecting disk usage and performance stats for each disk, we can use those stats to create a fitness value. If we can't collect stats for a disk, we assume the worst case and assign it a low probability, because there is a good chance that something bad is happening to that disk. Once the disks are assigned a fitness value, they can be selected by a weighted random lottery to prevent some nodes from taking all of the traffic.
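The weighted random lottery itself is a standard technique. As a rough sketch (this is illustrative, not Nutanix's actual implementation), replica target selection from per-disk fitness values could look like:

```python
import random

def pick_replica_target(fitness):
    """Pick a replica target by weighted random lottery.

    fitness maps a disk/node name to its fitness value (higher means
    more free capacity and shorter queues). Higher-fitness targets are
    proportionally more likely to win, but no single target wins every
    draw, which spreads replica traffic around the cluster.
    """
    total = sum(fitness.values())
    draw = random.uniform(0, total)
    for target, score in fitness.items():
        draw -= score
        if draw <= 0:
            return target
    return target  # guard against floating-point rounding

# Example: a lightly loaded storage-only node receives most replicas,
# while a busy compute node still occasionally gets some.
weights = {"storage-node": 8.0, "busy-compute-node": 2.0}
picks = [pick_replica_target(weights) for _ in range(1000)]
print(picks.count("storage-node") > picks.count("busy-compute-node"))
```

The node names and weights above are invented for the example; the point is only that the lottery biases, rather than forces, placement.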

    As the product continues to mature, we're trying to prevent problems from ever happening. Whether you run VDI, Splunk, SAP, SharePoint, or SQL, your workloads can get consistently high performance on top of data locality.

    The doctor says prevention is always the best medicine.


    Docker Datacenter 2.0 for Virtual Admins

    Just a short video walking through how easy it is to get an environment up and running with Docker Datacenter 2.0 on top of AHV.

    High level points:

    * If you can deploy a VM, you can set up Docker Datacenter
    * Management of new Docker hosts is easily done with pre-generated code to paste into new hosts
    * Docker Datacenter can run both services and compose apps side by side in the same Docker Datacenter environment

    Later this week I hope to have a post talking about the integration with Docker Datacenter and the Docker trusted registry.


      SRM and Commvault Health Check

      The NCC health check pds_share_vms_check verifies that the protection domains do not share any VMs. It is good practice to run this health check after configuring either SRM or IntelliSnap from Commvault. It's one of over 200 checks NCC provides.

      This check is available from the NCC 2.2.5 release and is part of the full health check that you can run by using the following command:

      nutanix@cvm$ ncc health_checks run_all

      You can also run this check separately by using the following command:

      nutanix@cvm$ ncc health_checks data_protection_checks protection_domain_checks pds_share_vms_check

      A protection domain is a group of VMs that you can replicate together on a desired schedule.

      A VM can be part of two protection domains if the following conditions are met:

      A protection domain (Async DR or Metro Availability) is created, and the VM is added as a protected entity of this protection domain. The vstore containing the VM is also protected, either by using ncli or by an external third-party product such as Commvault or SRM; protecting a vstore automatically creates a protection domain. These protection mechanisms are mutually exclusive, which means that backups of the VM might fail if the VM is in two protection domains.


      If the check returns a FAIL status, the reported VMs need to be removed from the listed protection domains so that they remain inside only one protection domain.
      If you're using Metro Availability, you may have to move the VM to another container or stop protecting the vstore.


      Making A Better Distributed System – Nutanix Degraded Node Detection


      Distributed systems are hard; there's no doubt about that. One of the major problems is what to do when a node is unhealthy and may be affecting the performance of the overall cluster. Fail hard, fail fast is a distributed-systems principle, but how do you go about detecting an issue before a failure even occurs? AOS 4.5.3, 4.6.2, and 4.7 will include the Nutanix implementation of degraded node detection and isolation. A badly performing hardware component or a network issue can be a death by a thousand cuts, whereas a failure is pretty cut and dried. If a remote CVM is not performing well, it can affect the acknowledgement of writes coming from other hosts, and other factors may affect performance, like:

      * Significant network bandwidth reduction
      * Network packet drops
      * CPU Soft lockups
      * Partially bad disks
      * Hardware issues

      The list of issues can even be unknown, so Nutanix Engineering has come up with a scoring system that uses votes to make sure everything can be compared.
      Services running on each node of the cluster publish scores/votes for services running on other nodes. Peer health scores are computed based on various metrics like RPC latency, RPC failures/timeouts, and network latency. If the services running on one node consistently receive bad scores for a large period (~10 minutes), the other peers will convict that node as a degraded node.
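Nutanix's actual scoring code isn't public, but the windowed-conviction idea can be sketched in a few lines; the threshold, score scale, and class name here are all invented for illustration:

```python
from collections import defaultdict, deque

WINDOW_S = 600   # convict only after ~10 minutes of bad scores
BAD_SCORE = 0.3  # scores below this (on an assumed 0-1 scale) are "bad"

class DegradedNodeDetector:
    """Collects peer-published health scores per node and convicts a
    node as degraded only if every score in the window is bad."""

    def __init__(self):
        self.scores = defaultdict(deque)  # node -> deque of (ts, score)

    def publish(self, node, score, now):
        q = self.scores[node]
        q.append((now, score))
        while q and q[0][0] < now - WINDOW_S:  # drop stale scores
            q.popleft()

    def is_degraded(self, node, now):
        q = self.scores[node]
        # Require nearly a full window of history before convicting.
        if not q or now - q[0][0] < WINDOW_S * 0.9:
            return False
        return all(score < BAD_SCORE for _, score in q)
```

A single good score inside the window resets the conviction, which mirrors the conservative stance described in the post.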

      Walk, Crawl, Run – Degraded Node Expectations:

      A node will not be marked as degraded if the current cluster Fault Tolerance (FT) level is less than the desired value. Upgrades and break-fix actions will not be allowed while a node is in the degraded state. A node will only be marked as degraded if we get bad peer health scores for 10 minutes. In AOS 4.5.3, the first shipping AOS release to include this feature, the default settings are that degraded node logging is enabled but the degraded node action is disabled. While the peer scoring is always on, the action side is disabled in the first release as an ultra-conservative approach.

      In AOS 4.5.3, if the degraded node action setting is enabled, leadership of critical services will not be hosted on the degraded node. A degraded node will be put into maintenance mode and its CVM will be rebooted. Services will not start on this CVM upon reboot. An alert will be generated for the degraded node.

      In AOS 4.7 and AOS 4.6.2, additional user controls will be provided to select an "action policy" for when a degraded node is detected. Options should include No Action, Reboot CVM, and Shutdown Node.

      To enable the degraded node action setting use the NCLI command:

      nutanix@cvm:~$ ncli cluster edit-params disable-degraded-node-monitoring=false

      This feature will further increase availability and resilience for Nutanix customers. While top performance numbers grab the headlines, remember that the first step is to have a running cluster.

      AI for the control plane………… Maybe we’ll get out voted for our jobs!


      Updated Best Practices – Nutanix DR and Backup & vSphere + Commvault

      Two best practice guides have been updated this week. The Nutanix DR and Backup Best Practices guide is located in the Support Portal.

      <DR and Backup Best Practices>

      The update was around bandwidth sizing and added a link to WolframAlpha, which spits out the sizing formula for you.

      The vSphere and Commvault Best Practice Guide added some guidance around IntelliSnap and sizing. At this time, IntelliSnap with Metro is not supported, but streaming is a fully supported option.



      Enterprise Cloud for SMB: Invisible Operations

      SMB (Small Medium Business) is one of the worst terms in the information technology industry. The acronym implies that there is a big difference between what a small company needs versus what a large company needs in terms of requirements. I personally think that couldn't be further from the truth. Whether you are a company of 500 people or 50,000 people, you still want the highest levels of service and uptime for your customers. Both small and large companies want operational efficiency, fractional consumption, and reduced security risk in the sector they operate in. You can also make the point that the smaller business has to be more efficient than bigger companies because it doesn't have the same economies of scale as its bigger enterprise counterparts.

      While I don't think the vast majority of SMBs are saying, "I need to get a cloud strategy," they are saying things like, "How can I finish all of this work on my plate before the weekend?", "How do I get to use feature X so I can put off buying more capacity?", and "How can I spend more time with customers instead of waiting for this upgrade to finish?"
      Nutanix gives businesses the benefits of public cloud, but on your terms, answering a lot of the above questions. The time spent reading complex HCLs and performing a nuanced cha-cha of upgrading separate pieces of infrastructure just so you can stay on support doesn't provide any real value to the business. Along with a mounting headache, a lot of these upgrade activities bring a lot of risk when the tasks are not automated with a health check before you proceed. People want to talk about hardware failure rates, but most often it's us humans who bring the most risk in regards to downtime.

      A Gartner study projected that through 2015, "80 percent of outages impacting mission-critical services will be caused by people and process issues, and more than 50 percent of those outages will be caused by change, configuration, release integration and hand-off issues." With numbers that high, it's easy to see why people look to the cloud to reduce risk. Nutanix's commitment to one-click everything, including upgrades for the hypervisor, Acropolis software, BIOS/BMC, and the hard drives, can save countless hours and contributes to uptime.

      A lot of time is also spent on the management layers, which really only exist to run applications. The virtualization management layer has become a sore spot in terms of training and maintenance to keep the lights on. Nutanix's control plane, Prism, runs on every node and eliminates the need for an outside management layer. When you combine Prism with AHV, you don't have to worry about installing, managing, and supporting a product to provide services like analytics, call home, and live migration. No extra SQL or Oracle licensing to support your management layer.

      Nutanix's ability to achieve one-click everything is possible because it was designed for web-scale. Scale allows self-healing, the ability to have clusters in mixed versions/states to handle complicated upgrade scenarios, adding capacity independently of the hypervisor, and seamless patching for security updates. A key enabler is the ability to handle metadata in a dynamic fashion; from 3 nodes to 1,000 nodes, all Nutanix customers reap the benefits. Let's take a look at the upgrade use cases that highlight why Nutanix is the best choice for businesses of all sizes, including SMB.


      So a new version of Acropolis was released and you want to get the updated SCMA (Security Configuration Management Automation) and the new performance improvements. What will be the impact to the running VMs?

      Below is a Nutanix cluster in a healthy state. All VMs are writing one copy locally and one remote copy. As more writes come into the system, the remote copies are evenly distributed across the cluster because of the intelligent metadata afforded by web-scale technologies.


      Controller VM on Node 1 goes down for an upgrade


      The SQL VM will already have knowledge of the other controller VMs, and it's business as usual. No VMs or data need to be moved for an upgrade, so the process is fast and efficient. The system is designed to handle these events transparently and gracefully. In the event of an upgrade/failure, I/Os will be redirected to other controller VMs within the cluster. Nutanix ALWAYS writes the minimum 2 copies of data, which is not the same for other Hyper-Converged Infrastructure (HCI) vendors. By enforcing availability, the new working set can also invalidate old data on the node that is being upgraded. This protects you from data loss if a drive goes down in the system during the upgrade and allows for a quicker rebuild if something bad happens to the node being upgraded.

      This is clearly better than a 3-tier architecture, where typically if you lose one controller, you're down to 50% of your storage performance. Other HCI vendors will give you the option to move all of the data off the system first or use RAID to overcome the oversight. If you take the use case of moving all data off the node, how long will that take?

      Other HCI vendors requiring to move data off the node


      By moving data off the node, the flash tier can quickly fill up and queueing will occur. Performance will be impacted, not to mention the time to copy off TBs of data. Some vendors that use RAID have no current way of rebuilding the data on the upgrading/failed node. The point to make here is that having elastic metadata like Nutanix allows you to self-heal and have low impact when carrying out maintenance operations.

      So what happens if the node never comes back up for Nutanix?


      Nutanix has the ability to rebuild data at the same tier at which it failed. This allows the cold storage tier not to impact the performance tier, which would otherwise have to be down-migrated with ILM. HDDs rebuild to other HDDs, and SSDs rebuild to SSDs. This control and limiting of rebuilds helps to prevent network congestion, especially on clusters that are only running 1 GbE.

      While getting down into the weeds can highlight why Nutanix is different from traditional and other HCI vendors, the point is that Nutanix allows you to elevate your IT staff. You can forget the non-value tasks and start to focus on bringing delight to your customers. The person wearing 5 different hats, like security, networking, storage, virtualization, and backup, can now find a glimmer of hope in their daily work life and start to think about work-life balance. Nutanix as the platform for Enterprise Cloud is perfect for any size of business, especially the small and medium-sized business.


      Impact of Nutanix VSS Hardware Support

      When 4.6 was released, I wrote about how the newly added VSS support with Nutanix Guest Tools (NGT) was the gem of the release. That was a fairly big compliment considering some of the important updates that were in the release, like cross-hypervisor DR and another giant leap in performance.

      I finally set some time aside to test the impact of taking an application-consistent snapshot with VMware Tools vs. Nutanix VSS hardware support.

      When we run an application-consistent snapshot workflow without NGT on ESXi, we take an ESXi snapshot so VMware Tools can be used to quiesce the file system. Every time we take an ESXi snapshot, it results in the creation of delta disks. During this process, ESXi "stuns" the VM to remap the virtual disks to these delta files. The length of the stun depends on the number of virtual disks attached to the VM and the speed at which the delta disks can be created (the capability of the underlying storage to process NFS metadata update operations, plus releasing/creating/acquiring lock files for all the virtual disks). During this time, the VM is totally unresponsive: no application will run inside the VM, and pings to the VM will fail.

      We then delete the snapshot (after backing up the files via a hardware snapshot on the Nutanix side), which results in another set of stuns (deleting a snapshot causes two stuns: one fixed-time stun plus another stun based on the number of virtual disks). This essentially means that we cause two or three stuns in rapid succession. These stuns cause metadata updates in addition to the flushing of data during the VSS snapshot operations.

      Customers have reported that in a set of VMs running Microsoft clustering, these VMs can be voted out due to heartbeat failure. VMware gives customers guidance on increasing timers if you're using Microsoft clustering to get around this situation.
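If you want to measure the stun window yourself, a simple probe loop is enough. This harness is generic and my own sketch (not an NGT or VMware tool); the probe callable could wrap a ping or a TCP connect to the VM being snapshotted:

```python
import time

def longest_unresponsive_gap(probe, duration_s=30.0, interval_s=0.5):
    """Call probe() repeatedly for duration_s seconds and return the
    longest gap (in seconds) between two successful probes -- a rough
    proxy for how long a snapshot stun left the VM unresponsive."""
    longest = 0.0
    last_ok = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        now = time.monotonic()
        if probe():
            if last_ok is not None and now - last_ok > longest:
                longest = now - last_ok
            last_ok = now
        time.sleep(interval_s)
    return longest
```

For example, probe could be `lambda: subprocess.call(["ping", "-c", "1", "-W", "1", vm_ip]) == 0`, with vm_ip standing in for the VM under test; run the loop while the snapshot is taken and compare the reported gap with and without NGT.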

      To test this out, I used HammerDB with SQL Server 2014 running on Windows 2012 R2. The tests were run on ESXi 6.0 with hardware version 11.


      VMware Tools with VSS based Snapshot
      I was going to try to stitch the images together because of the time the process took, but decided to leave them as is.


      The total process took ~4 minutes.

      NGT with VSS Hardware Support based Snapshot
      NGT-based VSS snapshots don't cause VM stuns. The application will be quiesced temporarily within Windows to flush the data, but pings and other things should continue to work.


      The total process took ~1 minute.


      NGT with VSS hardware support is the belle of the ball! There is no fixed number for the maximum stun time; it depends on how heavy the workload is. But what we can see is the effect of not using NGT for application-consistent snapshots, and it's pretty big. The collapsing of ESXi snapshots causes additional load and should be avoided if possible. NGT offers a hypervisor-agnostic approach and currently works with AHV as well.

      Note: Hypervisor snapshot consolidation is better in ESXi 6 than ESXi 5.5.

      Thanks to Karthik Chandrasekaran and Manan Shah for all their hard work and contribution to this blog post.


      Save Your Time With Nutanix Automatic Support

      Best Industry Support

      The feature known as Pulse is enabled by default and automatically sends cluster status information to Nutanix customer support. After you have completed initial setup, created a cluster, and opened port 80 or 8443 in your firewall, AOS sends a Pulse message from each cluster once every 24 hours. Each message includes cluster configuration and health status that Nutanix Support can use to address any cluster operation issues.

      AOS can also send automatic alert email notifications to Nutanix Support by default through port 80 or 8443. As with Pulse, any configured firewall must have these ports open. Here are some examples of conditions that will automatically generate a proactive case with Nutanix Support at Priority Level P4:

      The Stargate process is down for more than 3 hours
      Curator scan fails
      Hardware Clock Failure
      Faulty RAM module
      Power Supply failure
      Unable to fetch IPMI SDR repository (IPMI Error)
      HyperV networking
      System operations
      Disk Capacity > 90%
      Bad Drive

      You can optionally use your own SMTP server to send Pulse and alert notifications. If you do not or cannot configure an SMTP server, another option is to implement an HTTP proxy as part of your overall support scheme.

      While the best thing is to never get a call, the second best is not waiting in line to open a ticket. Have a great week!