Jul
04

Will DR and Backup power AHV sales to 50%?

.Next has come and gone like your favorite holiday: tons of hustle and bustle with great euphoric feelings, followed by hitting a wall and being extremely tired. The Nutanix conference was chock-full of announcements, but the most powerful to me were the ones related to DR and backup. AHV, built with cloud in mind, is surging, but adoption has been slowed by third-party backup support. You can have this great automated hypervisor with best-in-class management, but if you can’t back it up easily it will curb adoption.

This number will grow rapidly now with all of the backup and DR options

So before .Next 2017, DR and backup options for AHV included:
• Commvault with support for IntelliSnap
• Time Stream – native DR, including:
o 1-node backup with the NX-1155
o Backup/DR to storage-only clusters
o Cloud Connect to AWS and Azure
o DR/backup to a full cluster
• Any backup software with agents

After the .Next 2017 announcements, backup and DR support for AHV includes:
• HYCU from Comtrade – Rapidly deployed software using turn-key appliances. A great choice if you have some existing hardware that you can use or place onto Nutanix. Point, click, done. Check out more here.
• Rubrik – Hardware-based appliances that do the heavy lifting for you. Check out more here.
• Veeam – Probably best known for making backup easy on ESXi, Veeam has announced support for AHV later this year. Nutanix added Veeam as a Strategic Technology Partner within the Nutanix Elevate Alliance Partner Program. Going Green!
• Druva – Nutanix users can now take full advantage of Druva Phoenix’s unique cloud-first approach, with centralized data management and security. Only ESXi today, and agents with AHV, but agentless support is coming. More here.
• Backup and DR to full Nutanix clusters gains near-sync to achieve very low RPOs. Read more on near-sync here.
• Xi Cloud Services – a native cloud extension to the Nutanix Enterprise Cloud Platform, which powers more than 6,000 end-customers around the globe. This announcement marks another significant step towards the realization of our Enterprise Cloud vision – delivering a true cloud experience for any application, in any deployment model, using an open platform approach. For the first time, Nutanix software will be able to be consumed as a cloud service.

Maybe 50% for AHV is a lofty goal, but I can see 40% of new sales by next year as people focus on their business rather than day-to-day headaches. With very strong backing in backup and DR, AHV growth will flourish.

Jun
30

The Down Low on Near-Sync On Nutanix

Nutanix refers to its current implementation of redirect-on-write snapshots as vDisk-based snapshots. Nutanix has continued to improve its snapshot implementation by adding Light-Weight Snapshots (LWS) to provide near-sync replication. LWS uses markers instead of creating full snapshots for RPOs of 15 minutes and under. LWS further reduces the overhead of managing metadata and removes the overhead of the high number of frequent snapshots caused by long snapshot chains. The administrator doesn’t have to set a policy choosing between vDisk snapshots and LWS; the Acropolis Operating System (AOS) transitions between the two forms of replication based on the RPO and available bandwidth. If the network can’t handle the low RPO, replication will transition out of near-sync. When the network can meet the near-sync requirements again, AOS will start using LWS again. In over-subscribed networks, near-sync can provide almost the same level of protection as synchronous replication without impacting the running workload.

The administrator only needs to set the RPO; no knowledge of near-sync is needed.

The tradeoff is that all changes are handled in SSD when near-sync is enabled. Because of this tradeoff, Nutanix reserves a percentage of SSD space to be used by LWS when it’s enabled.

[Diagram: near-sync replication]

In the above diagram, a vDisk-based snapshot is first taken and replicated to the remote site. Once the full replication is complete, LWS begins on the set schedule. If there is no remote site set up, LWS happens locally right away. If you have the bandwidth available, life is good, but that’s not always the case in the real world. If you miss your RPO target repeatedly, replication automatically transitions back to vDisk-based snapshots. Once vDisk-based snapshot replication completes fast enough to meet the RPO again, it automatically transitions back to near-sync. Both the transition out of and the transition into near-sync are controlled by advanced settings called gflags.
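Purely to illustrate the behavior described above, here is a rough sketch of the decision logic in Python. This is not Nutanix code; the function name, the miss counter, and the threshold values are assumptions standing in for the real gflag-controlled logic.

# Illustrative sketch only: models the LWS / vDisk-based snapshot transition
# described above. Names and thresholds are assumptions, not actual AOS gflags.

LWS_MAX_RPO_SECONDS = 15 * 60      # near-sync (LWS) applies to RPOs of 15 minutes and under
MAX_CONSECUTIVE_MISSES = 3         # hypothetical "missed RPO" threshold before falling back

def choose_replication_mode(rpo_seconds, consecutive_missed_rpos):
    """Return 'LWS' (near-sync) or 'VDISK' (vDisk-based snapshots)."""
    if rpo_seconds > LWS_MAX_RPO_SECONDS:
        return "VDISK"             # RPO is too long for near-sync to apply
    if consecutive_missed_rpos >= MAX_CONSECUTIVE_MISSES:
        return "VDISK"             # bandwidth isn't keeping up; fall back
    return "LWS"                   # network is keeping up; keep using markers

print(choose_replication_mode(rpo_seconds=300, consecutive_missed_rpos=0))   # -> LWS
print(choose_replication_mode(rpo_seconds=300, consecutive_missed_rpos=5))   # -> VDISK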
On the destination side, AOS creates hydration points. Hydration points are the way for LWS to transition into vDisk-based snapshots. The process for inline hydration is:

1. Create a staging area for each VM (consistency group) that’s protected by the protection domain.
2. The staging area is essentially a directory with a set of vDisks for the VM.
3. Any new incoming LWS is applied to that same set of vDisks.
4. The staging area is snapshotted from time to time, which yields individual vDisk-backed snapshots.

The source side doesn’t need to hydrate because a vDisk-based snapshot is taken every hour.
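As a mental model only, here is a small Python sketch of the staging idea above. The class and method names are hypothetical; this is not the actual AOS implementation.

# Hypothetical model of inline hydration on the destination, one staging area per
# protected VM (consistency group). Not real AOS code.

class StagingArea:
    def __init__(self, vm_name):
        self.vm_name = vm_name
        self.vdisks = {}        # steps 1-2: a directory with a set of vDisks for the VM
        self.snapshots = []

    def apply_lws(self, lws_deltas):
        # Step 3: apply each incoming LWS (marker-based delta) to the same vDisks.
        for vdisk_id, delta in lws_deltas.items():
            self.vdisks.setdefault(vdisk_id, []).append(delta)

    def snapshot(self):
        # Step 4: snapshot the staging area from time to time, yielding an
        # ordinary vDisk-backed snapshot.
        self.snapshots.append({vdisk: list(deltas) for vdisk, deltas in self.vdisks.items()})

stage = StagingArea("vm01")
stage.apply_lws({"vdisk-1": b"changed-bytes"})
stage.snapshot()
print(len(stage.snapshots))   # -> 1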

Have questions? Please leave a comment.

Jun
29

ROBO Deployments & Operations Best Practices on Nutanix

The Nutanix platform’s self-healing design reduces operational and support costs, such as unnecessary site visits and overtime. With Nutanix, you can proactively schedule projects and site visits on a regular cadence, rather than working around emergencies. Prism, our end-to-end infrastructure management tool, streamlines remote cluster operations via one-click upgrades, while also providing simple orchestration for multiple cluster upgrades. Following the best practices in this new document ensures that your business services are quickly restored in the event of a disaster. The Nutanix Enterprise Cloud Platform makes deploying and operating remote and branch offices as easy as deploying to the public cloud, but with control and security on your own terms.

One section I would like to call out in the doc is how to seed your customer data if you’re dealing with poor WAN links.

Seed Procedure

The following procedure lets you use seed cluster (SC) storage capacity to bypass the network replication step. In the course of this procedure, the administrator stores a snapshot of the VMs on the SC while it’s installed in the ROBO site, then physically ships it to the main datacenter.

1. Install and configure application VMs on a ROBO cluster.
2. Create a protection domain (PD) called PD1 on the ROBO cluster for the VMs and volume groups.
3. Create an out-of-band snapshot S1 for the PD on ROBO with no expiration.
4. Create an empty PD called PD1 (same name used in step 2) on the SC.
5. Deactivate PD1 on the SC.
6. Create remote sites on the ROBO cluster and the SC.
7. Retrieve snapshot S1 from the ROBO cluster to the SC (via Prism on the SC).
8. Ship the SC to the datacenter.
9. Re-IP the SC.
10. Create remote sites on the SC cluster and on the datacenter main cluster (DC1).
11. Create PD1 (same name used in steps 2 and 4) on DC1.
12. Deactivate PD1 on DC1.
13. Retrieve S1 from the SC to DC1 (via Prism on DC1). Prism generates an alert here; although it appears to be a full data replication, the SC transfers metadata information only.
14. Create remote sites on DC1 and the ROBO cluster.
15. Set up a replication schedule for PD1 on the ROBO cluster in Prism.
16. Once the first scheduled replication is successful, you can delete snapshot S1 to reclaim space.

To get all of the best practices, please download the full document here: https://portal.nutanix.com/#/page/solutions/details?targetId=BP-2083-ROBO-Deployment:BP-2083-ROBO-Deployment

Jun
28

Rubrik and AHV: Say No to Proxies

For the last couple of years I have been a huge fan of backup software that removes the need for proxies. Rubrik provides a proxy-less backup solution by using the Nutanix Data Services virtual IP address to talk directly to each individual virtual disk that it needs to back up.
Rubrik and Nutanix have some key advantages with this solution:
• AOS 5.1+ with the version 3 APIs provides changed region tracking, so backups are quick and efficient with no hypervisor-based snapshot (see the sketch after this list).
• With AHV and data locality, Rubrik can grab the most recently changed data without flooding the network, which can happen when the copy source and the VM don’t live on the same host. With Nutanix, the reads happen locally.
• Rubrik has access to every virtual disk by making an iSCSI connection, bypassing the need for proxies.
• AOS can redirect the second RF copy away from a node with its advanced data placement if the backup load becomes too great during a backup window, protecting your mission-critical apps that run 24/7.
• Did I mention no proxies? 🙂
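To make the changed region tracking bullet concrete, here is a toy Python sketch of the idea: read only the byte ranges reported as changed from a disk attached over iSCSI instead of reading the whole virtual disk. The device path and region list are placeholders; in the real integration the regions come from the Nutanix v3 APIs and the disk is attached through the Data Services IP.

# Toy illustration of changed-region-based reads. The region list and device
# path are placeholders, not real API calls.

def read_changed_regions(device_path, regions):
    """regions: iterable of (offset, length) tuples reported as changed."""
    chunks = []
    with open(device_path, "rb") as disk:
        for offset, length in regions:
            disk.seek(offset)
            chunks.append((offset, disk.read(length)))   # read only what changed
    return chunks

# Example: back up two changed 1 MiB regions instead of reading the entire disk.
# changed = [(0, 1 << 20), (512 << 20, 1 << 20)]
# backup_data = read_changed_regions("/dev/sdb", changed)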

Stop by the Rubrik booth and catch their session if you’re at .Next this week.

Jun
20

Backing Up AFS Home Shares with Commvault

You cannot back up Acropolis File Services (AFS) home shares with Commvault software until you change a setting on AFS. You need to let Commvault access the home share without the use of reparse points. A home share is the repository for users’ personal files, and it distributes the top-level directories across all of the file server VMs for performance and ease of management. The home share contains reparse point attributes in its top-level directories to help with referrals. Since Commvault automatically skips these directories for backup because of the reparse points, we make the change below.

AFS can disable reparse points for registered clients while leaving reparse points enabled for clients that are not registered. I would list all of your proxies and media agents with this command.

Run this command on any file server VM:

scli smbcli set --section=global --param="backup hosts" --value="10.20.6.100"

Feb
17

IP Fail-Over with AFS

A short video showing the client IP address moving around the cluster to quickly restore connectivity for your users running on Acropolis File Services.

Feb
07

Nutanix AFS – Maximums

Nutanix AFS Maximums – Tested limits. (ver 2.0.2)
Configurable Item – Maximum Value
Number of connections per FSVM – 250 for 12 GB of memory; 500 for 16 GB; 1000 for 24 GB; 1500 for 32 GB; 2000 for 40 GB; 2500 for 60 GB; 4000 for 96 GB
Number of FSVMs – 16, or equal to the number of CVMs (choose the lower number)
Max RAM per FSVM – 96 GB (tested)
Max vCPUs per FSVM – 12
Data size for home share – 200 TB per FSVM
Data size for general purpose share – 40 TB
Share name – 80 characters
File server name – 15 characters
Share description – 80 characters
Windows Previous Versions – 24 (1 per hour), adjustable with support
Throttle bandwidth limit – 2048 MBps
Data protection bandwidth limit – 2048 MBps
Max recovery time objective for Async DR – 60 minutes
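If you want to turn the connection limits above into a quick FSVM memory sizing check, a trivial lookup might look like the following Python sketch (purely illustrative, not a Nutanix sizing tool):

# Tested connection limits per FSVM from the table above: (max connections, GB of RAM).
CONNECTION_TIERS = [
    (250, 12), (500, 16), (1000, 24), (1500, 32),
    (2000, 40), (2500, 60), (4000, 96),
]

def min_fsvm_memory_gb(connections_per_fsvm):
    """Return the smallest tested memory size that supports the connection count."""
    for max_connections, memory_gb in CONNECTION_TIERS:
        if connections_per_fsvm <= max_connections:
            return memory_gb
    raise ValueError("More than 4000 connections per FSVM exceeds the tested limits")

print(min_fsvm_memory_gb(1200))   # -> 32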

Jan
08

Client Tuning Recommendations for ABS (Acropolis Block Services)

• For large block sequential workloads, with I/O sizes of 1 MB or larger, it’s beneficial to increase the iSCSI MaxTransferLength from 256 KB to 1 MB.

o Windows: Details on the MaxTransferLength setting are available at the following link: https://blogs.msdn.microsoft.com/san/2008/07/27/microsoft-iscsi-software-initiator-and-isns-server-timers-quick-reference/.

o Linux: Settings in the /etc/iscsi/iscsid.conf file; node.conn[0].iscsi.MaxRecvDataSegmentLength

• For workloads with large storage queue depth requirements, it can be beneficial to increase the initiator and device iSCSI client queue depths.

o Windows: Details on the MaxPendingRequests setting are available at the following link: https://blogs.msdn.microsoft.com/san/2008/07/27/microsoft-iscsi-software-initiator-and-isns-server-timers-quick-reference/.

o Linux: Settings in the /etc/iscsi/iscsid.conf file; Initiator limit: node.session.cmds_max (Default: 128); Device limit: node.session.queue_depth (Default: 32). (See the sketch after this list.)
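For the Linux settings, a small helper like the following Python sketch can report what is currently set in /etc/iscsi/iscsid.conf. The "suggested" values are assumptions drawn from the guidance above (1 MB transfer length, larger queue depths), not official recommendations; adjust them for your workload and re-login the iSCSI sessions after any change.

# Illustrative helper only: print current vs. example values for the iSCSI
# initiator settings discussed above. Suggested values are assumptions.

ISCSID_CONF = "/etc/iscsi/iscsid.conf"

SUGGESTED = {
    "node.conn[0].iscsi.MaxRecvDataSegmentLength": "1048576",   # 1 MB
    "node.session.cmds_max": "256",                             # initiator limit (default 128)
    "node.session.queue_depth": "64",                           # device limit (default 32)
}

def read_settings(path=ISCSID_CONF):
    """Parse simple 'key = value' lines, ignoring comments and blank lines."""
    settings = {}
    with open(path) as conf:
        for line in conf:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

if __name__ == "__main__":
    current = read_settings()
    for key, suggested in SUGGESTED.items():
        print(f"{key}: current={current.get(key, '<unset>')} suggested={suggested}")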

For more best practices, download the ABS best practices guide.

Jan
05

Nutanix AFS – Domain Activation

Well, if it’s not DNS stealing hours of your life, the next thing to make your partner angry as you miss family supper is Active Directory (AD). In more complex AD setups you may find yourself going to the command line to attach your AFS instance to AD.

Some important requirements to remember:

    While a deployment could fail due to AD, the FSVMs (file server VMs) still get deployed. You can do the join-domain process from the UI or NCLI afterwards.

    The user attaching to the domain must be a domain admin or have similar rights. Why? The join-domain process will create one computer account in the default Computers OU and create a service principal name (SPN) for DNS. If you don’t use the default Computers OU, you will have to use the organizational-unit option from NCLI to change it to the appropriate OU. The computer account can be created in a specified container by using a forward slash to denote hierarchies (for example, organizational_unit/inner_organizational_unit).

    For example, the command used to join the domain and place the computer account in the stayout/afs OU was:

    ncli> fs join-domain uuid=d9c78493-d0f6-4645-848e-234a6ef31acc organizational-unit="stayout/afs" windows-ad-domain-name=tenanta.com preferred-domain-controller=tenanta-dc01.tenanta.com windows-ad-username=bob windows-ad-password=dfld#ld(3&jkflJJddu

    AFS needs at least one writable DC to complete the domain join. After the domain join, it can authenticate using a local read-only DC. Timing (latency) may cause problems here. To pick an individual DC, you can use preferred-domain-controller from NCLI.

NCLI Join-Domain Options

Entity:
file-server | fs : Minerva file server

Action:
join-domain : Join the File Server to the Windows AD domain specified.

Required Argument(s):
uuid : UUID of the FileServer
windows-ad-domain-name : The windows AD domain the file server is
associated with.
windows-ad-username : The name of a user account with administrative
privileges in the AD domain the file server is associated with.
windows-ad-password : The password for the above Windows AD account

Optional Argument(s):
organizational-unit : An Organizational unit container is where the AFS
machine account will be created as part of domain join
operation. Default container OU is "computers". Examples:
Engineering, Department/Engineering.
overwrite : Overwrite the AD user account.
preferred-domain-controller : Preferred domain controller to use for
all join-domain operations.

NOTE: preferred-domain-controller needs to be FQDN

If you need to do further troubleshooting you can ssh into one of the FSVMs and run

afs get_leader

Then navigate to /data/logs and look at the minerva logs.

This shouldn't be an issue in most environments, but I've included the ports used just in case.


Required AD Permissions

Delegating permissions in Active Directory (AD) enables the administrator to assign permissions in the directory to unprivileged domain users; for example, it enables a regular user to join machines to the domain without knowing the domain administrator credentials.

Adding the Delegation
---------------------
To enable a user to join and remove machines to and from the domain:
- Open the Active Directory Users and Computers (ADUC) console as domain administrator.
- Right-click the CN=Computers container (or desired alternate OU) and select "Delegate Control".
- Click "Next".
- Click "Add" and select the required user and click "Next".
- Select "Create a custom task to delegate".
- Select "Only the following objects in the folder" and check "Computer objects" from the list.
- Additionally select the options "Create selected objects in the folder" and "Delete selected objects in this folder". Click "Next".
- Select "General" and "Property-specific", select the following permissions from the list:
- Reset password
- Read and write account restrictions
- Read and write DNS host name attributes
- Validated write to DNS host name
- Validated write to service principal name
- Write servicePrincipalName
- Write Operating System
- Write Operating System Version
- Write OperatingSystemServicePack
- Click "Next".
- Click "Finish".
After that, wait for AD replication to finish; the delegated user can then use their credentials to join AFS to a domain.


Domain Port Requirements

The following services and ports are used by the AFS file server for Active Directory communication.

UDP and TCP port 88 – Kerberos, including forest-level trust authentication
UDP and TCP port 53 – DNS from client to domain controller and domain controller to domain controller
UDP and TCP port 389 – LDAP, to handle normal queries from client computers to the domain controllers
UDP and TCP port 123 – NTP traffic for the Windows Time Service
UDP and TCP port 464 – Kerberos password change for replication, user and computer authentication, and trusts
UDP and TCP ports 3268 and 3269 – Global Catalog from client to domain controllers
UDP and TCP port 445 – SMB protocol for file replication
UDP and TCP port 135 – port mapper for RPC communication
UDP and TCP high ports – randomly allocated TCP high ports for RPC, from port 49152 to port 65535

    Dec
    14

    AOS 5.0 – Adapt Not React – Performance

    New in AOS 5.0, adaptive replica selection is intelligent data placement for the extent store. Rather than using purely random selection, placement decisions are based on capacity and queue length; these metrics are used to create a weighted random selection. The previous algorithm was great for spreading all of the workload around for fast rebuilds, but it could cause issues with heterogeneous clusters. In mixed clusters with different tier sizes, CPU strengths, and various running workloads, some nodes could be taxed more than others. It also didn’t take into account the need to rebuild data when the affected nodes were running heavy workloads.

    This new algorithm can prevent weaker nodes from getting overburdened and their hot tier from filling up, and it reduces the risk of having busy disks. It also allows less utilized nodes to send their replicas to each other, so busier nodes have less replica traffic delivered to them. If we take the example of our storage-only nodes, we can ensure that replicas go to the storage-only nodes while not sending replicas to other compute nodes. This new algorithm also reduces the need to run auto-balancing from a capacity perspective. By reducing the need to react, we also reserve CPU cycles for workloads and save on wear and tear of the drives.
    In rudimentary static placement systems, this ability to adapt replica placement would also solve the problem of moving data that then blows up your cache.

    The two less used nodes send their replication traffic to each other. The high-performing node is not impacted by incoming replica traffic.

    Since we have a high-performing NoSQL database collecting disk usage and performance stats for each disk, we can use those stats to create a fitness value. If we can’t collect stats for a disk, we assume the worst case and assign it a low probability, because there’s a likely chance that something bad is happening to that disk. Once the disks are assigned a fitness value, they can be selected by a weighted random lottery, which prevents some nodes from taking all of the traffic.
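    As a rough sketch of the fitness-weighted lottery described above, the following Python snippet shows the general shape of the idea. The fitness formula and the values here are my own illustration, not the actual AOS algorithm.

import random

# Illustrative only: pick a replica target with a fitness-weighted random lottery.
# The fitness formula is a made-up example, not the actual AOS formula.

def fitness(disk):
    if disk.get("stats_missing"):
        return 0.01                    # no stats: assume the worst, give a tiny weight
    free_fraction = disk["free_bytes"] / disk["capacity_bytes"]
    return free_fraction / (1 + disk["queue_length"])       # favor empty, idle disks

def pick_replica_target(disks):
    weights = [fitness(d) for d in disks]
    return random.choices(disks, weights=weights, k=1)[0]   # weighted lottery, not a strict max

disks = [
    {"name": "nodeA-ssd0", "free_bytes": 800e9, "capacity_bytes": 1e12, "queue_length": 2},
    {"name": "nodeB-ssd0", "free_bytes": 200e9, "capacity_bytes": 1e12, "queue_length": 10},
    {"name": "storage-only-ssd0", "free_bytes": 950e9, "capacity_bytes": 1e12, "queue_length": 0},
]
print(pick_replica_target(disks)["name"])   # usually a lightly loaded disk, but not always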

    As the product continues to mature, we’re trying to keep problems from happening in the first place. Whether it’s VDI, Splunk, SAP, SharePoint, or SQL, your workloads can get very consistent, high performance on top of data locality.

    The doctor says prevention is always the best medicine.