Aug
    07

Battle Royale: View Composer vs. Instant Clones – Deploy

Horizon 7 added Instant Clones, with the ability to clone a full desktop in 4-5 seconds. What's the catch? There isn't really one, but nobody explains that it still takes a bit of time to prep the desktops. For testing purposes, I decided to clone 100 desktops with View Composer and 100 desktops with Instant Clones.

For these tests I used an NX-3460-G4 with Windows 10 desktops, each with 2 vCPU and 2 GB of RAM.

    Impact of cloning 100 desktops with View Composer

[Chart: 100-desktop View Composer deployment]

You can see the hypervisor IOPS and disk IOPS. The impact really shows up in what is happening on the backend and in the CPU used to create the desktops: roughly 16,000 IOPS to create the desktops with Composer.

    Impact of cloning 100 desktops with Instant-Clones

[Chart: 100-desktop Instant Clone deployment]
You can see an initial bump in IOPS due to the replica that has to be copied without VAAI. The replica also has to get fingerprinted, which does take some time; in my testing it took about eight minutes. The reduction in IOPS is amazing. While you still need performance for running the desktops, you don't have to worry about provisioning destroying your performance. Disk IOPS peaked at only ~1,200.

    Summary VC vs Instant Clone

Deploy 100 Desktops
* View Composer: 5 min
* Instant Clone: 14 min
  – virtual disk digest: 8.22 min
  – clone 100 desktops: 1.4 min

While the overall process took longer, the impact is a lot better with Instant Clones. With hundreds of desktops, Instant Clones are a powerful tool to have in your back pocket. Once Instant Clones get GPU support, I think they will really take off as the default choice. If you have performance-related questions, I encourage you to talk to your Nutanix SE, who can put you in touch with the Solution and Performance Team at Nutanix.


    Aug
    01

    The Tale Of Two Lines: Instant-Clones on Nutanix

There was a part of me that wanted to hate on Instant Clones, which are new in Horizon 7, but the fact is they're worth the price of admission. Instant Clones have very low overhead and provide true on-demand desktops, or as VMware is tagging it, Just-In-Time desktops.

On-demand desktops with View Composer….. not happening

In my health care days, the non-persistent desktops and shift changes always resulted in some blunt force trauma around 7 am and 7 pm when staff would start their day. The only real way to counterbalance the added load of login storms was to make sure the desktops were pre-built. This of course means you need to have some desktops sitting around doing nothing, waiting for these two time periods in the day, or use generic logins, where the user never disconnects, which was another bag of problems.

Instant Clones' ability to clone a live running VM by simply quiescing it is really amazing. Have you ever changed the name of a desktop and then Windows tells you to reboot? If you're like me, you try to do 5 or 6 other things before you have to reboot, which usually ends up in a mess. Instant Clones use a feature called Clone Prep to add the VM to AD and change its name, all without having to reboot the VM. When you see a power-on operation inside of vCenter, it's actually just quiescing the desktop, so there is very low overhead.

The steps during Clone Prep. MS does not support Clone Prep, but they didn't support View Composer either, so I don't see it being any different.

When I went to test Instant Clones, I wanted to see if on-demand desktops were actually possible without destroying node density. I ran two tests with Login VSI: one run with 400 knowledge worker users and all the desktops pre-deployed, and one run with 400 knowledge worker users where I started with only 50 desktops. I set the desktop pool to always have at least 30 free desktops until the pool got to 400 desktops.

Instant Clones deliver on-demand desktops with very low overhead.

The darker blue line represents the on-demand test, and you can see that the impact over 400 users is pretty small. This is pretty remarkable given that the CPU and memory hit from booting desktops is almost entirely eliminated.

It's not all unicorns and rainbows, however; Instant Clones do have some limitations in the first release:

    No dedicated Desktop Pools
    No RDS Desktop or Application Pools
    Limited SVGA Support – Fixed max resolution & number of monitors
    No 3D Rendering / GPU Support
    No Sysprep support – Single SID across pool
    No VVOL or VAAI NFS Hardware Clones support (Smaller desktops pools may take longer to provision)
    No Powershell
    No Multi-VLAN Support in a single Pool
    No Reusable Computer Accounts
    No Persistent Disks – Use Writable Volumes \ Flex App \ Unidesk \ RES …….

vMotion is supported

Like anything, use case will dictate when this gets used, but it's a powerful tool inside of Horizon. I plan to show some of the differences between View Composer and Instant Clones in my next posts. Also keep in mind that you still need high I/O to service your desktops. Size for the peaks or face the wrath of your end users.

    Jul
    28

The Impact Of App Layering On Your VDI Environment

I was testing Instant Clones in Horizon 7, and it was pretty much a requirement to use some form of application virtualization and to get your user data stored off the desktops. My decision on what to select for testing was based on the fact that I already had ProfileUnity from Liquidware Labs, and App Volumes is bundled with View at the higher tiers. I wanted to see the impact of layering on CPU and login times. I also used uberAgent to collect some of the results. While testing, I would run one test run with uberAgent to collect login times and then one with the uberAgent agent turned off to collect CPU metrics.

    I used three separate applications, each in their own layer.

    * Gimp 2.8
    * iTunes 10
    * VLC

I used App Volumes 2.11, since 3.0 is kind of dead in the water and not recommended for existing customers, so I can't see a lot of people using it till the next release. ProfileUnity was version 6.5.

I first did a base run with no AppStacks or FlexApps but with a roaming profile stored on Acropolis File Services. The desktops were instant clones running the Horizon 7 agent and Office 2013, on Windows 10 with 2 vCPU and 2 GB of RAM. The CPU percentages listed are a factor of both vCPUs.

    Base Run
[Chart: base run]

So, not too bad: a 14-second login. There's probably some cleanup I could do to make it faster, but that also wouldn't be that realistic if you're thinking about an enterprise desktop, so I was happy with this.

I tested one layer at a time until all three applications were in use. There was a gradual increase in CPU and login time for each layer. The CPU cost comes from the agent and from attaching the VMDK to the desktop.

    App Volumes with 3 AppStacks

[Chart: App Volumes with 3 AppStacks]

So with 3 layers, the CPU jumped by ~20% and the login time went up ~9 seconds with App Volumes.

    3 Flex Apps

[Chart: 3 FlexApps]

With 3 FlexApps, CPU jumped a bit and login times went up ~4 seconds.


    Overall Review

[Chart: layering review]

    What does this all mean?

Well, if you have users that only disconnect and reconnect and rarely log out, then this means absolutely nothing for the most part. If you have a user base that gets fresh new desktops all the time, and things like large shift changes, then it means your densities will go down. I like to say, "Looking is for free, and touching is going to cost you." Overall I still feel this is a small price to pay to have a successful VDI deployment, and layering will help out the process.

    Jul
    19

    Securing the Supply Chain with Nutanix and Docker #dockercon2016

I was watching the below video from DockerCon 2016, and there were lots of striking similarities between what Nutanix and Docker are doing to secure working environments for the Enterprise Cloud. There is no sense turning on the alarm for your house and then not locking the doors. You need to close all the gaps for your infrastructure and the applications that live on top of it.

The most interesting part of the session for me was the section on security scanning and gating. Docker Security Scanning is available as an add-on to Docker hosted private repositories on both Docker Cloud and Docker Hub. Scans run each time a build pushes a new image to your private repository; they also run when you add a new image or tag. Most scans complete within an hour, but large repositories may take up to 24 hours to scan. The scan traverses each layer of the image, identifies the software components in each layer, and indexes the SHA of each component.
[Diagram: Docker Security Scanning]
    The scan compares the SHA of each component against the Common Vulnerabilities and Exposures (CVE) database. The CVE is a “dictionary” of known information security vulnerabilities. When the CVE database is updated, the service reviews the indexed components for any that match the new vulnerability. If the new vulnerability is detected in an image, the service sends an email alert to the maintainers of the image.

    A single component can contain multiple vulnerabilities or exposures and Docker Security Scanning reports on each one. You can click an individual vulnerability report from the scan results and navigate to the specific CVE report data to learn more about it.
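Triggering a scan is just a normal push to the hosted private repository. A minimal sketch, assuming a hypothetical repository name of myorg/webapp:

# Tag the locally built image for the private repo, then push it;
# the push is what kicks off a Security Scanning run on the hosted repo.
docker tag webapp:latest myorg/webapp:1.0
docker push myorg/webapp:1.0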

On the Nutanix side of the fence, all code is scanned with two different vulnerability scanners at every step of the development life cycle. To top that off, Nutanix already applies an intrinsic security baseline and monitors and self-heals that baseline with SCMA (Security Configuration Management Automation), leveraging the SaltStack framework so that your production systems can self-heal from any deviation and are always in compliance. Features like two-factor authentication (2FA) and cluster lockdown further enhance the security posture. A cluster-wide setting can forward all logs to a central host as well. All CVEs related to the product are tracked, with an internal turnaround time of 72 hours for critical patches! There is some added time in getting a release cut, but it's fast, and everything is tested as a whole instead of as a one-off change that could have a domino effect.

When evaluating infrastructure and development environments for a security-conscious shop, it's imperative to choose one that is built with a security-first approach and that continually iterates on patching new threats, thereby reducing the attack surface. Docker is doing some great work on this front.


      Jul
      14

Nutanix Acropolis File Services – Requires 2 Networks

      When configuring Acropolis File Services you may be prompted with the following message:

      “File server creation requires two unique networks to be configured beforehand.”

The reason is that you need two managed networks for AFS. I've seen this come up a lot lately, so I thought I would explain why. While it may change over time, this is the current design.

[Diagram: AFS file server VM networks]

      The above diagram shows one file server VM running on a node, but you can put multiple file server VMs on a node for multitenancy.

      The file server VM has two network interfaces. The first interface is a static address used for the local file server VM service that talks to the Minerva CVM service running on the Controller VM. The Minerva CVM service uses this information to manage deployment and failover; it also allows control over one-click upgrades and maintenance. Having local awareness from the CVM enables the file server VM to determine if a storage fault has occurred and, if so, if action should be taken to rectify it. The local address also lets the file server VM claim vDisks for failover and failback. The file server VM service sends a heartbeat to its local Minerva CVM service each second, indicating its state and that it’s alive.
The second network interface on the file server VM, also referred to as the public interface, services client SMB requests. Based on the resource called, the file server VM determines whether to service the request locally or to use DFS to refer the request to the appropriate file server VM that owns the resource. This second network can be dynamically reassigned to other file server VMs for high availability.

If you need help setting up the two managed networks, there is a KB article on portal.nutanix.com -> KB3406
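For reference, here is a rough sketch of what creating the two managed networks can look like with acli on AHV. The network names, VLAN IDs, and IP ranges below are made-up examples, so follow KB3406 for the exact procedure:

# Internal network for the file server VM <-> Minerva CVM traffic
nutanix@cvm$ acli net.create afs-internal vlan=100 ip_config=10.10.100.1/24
nutanix@cvm$ acli net.add_dhcp_pool afs-internal start=10.10.100.50 end=10.10.100.100
# Client-facing (public) network for SMB traffic
nutanix@cvm$ acli net.create afs-client vlan=200 ip_config=10.10.200.1/24
nutanix@cvm$ acli net.add_dhcp_pool afs-client start=10.10.200.50 end=10.10.200.100

Both networks get an IP pool because AFS expects managed networks it can pull addresses from.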

      Jul
      13

      Backing up AFS with Commvault

This is by no means a best practice guide for AFS and Commvault, but I wanted to make sure that Commvault could be used to back up Acropolis File Services (AFS). If you want more details on AFS, I suggest reading this great post on the Nutanix website.

Once I applied the file server license to CommServe, I was off to the races. I had 400 users spread out on 3 file server VMs making up the file server called eucfs.tenanta.com. The file server had two shares, but I was focused on backing up the user share.

[Screenshots: AFS user share and user sessions in Commvault]

I found performance could be increased by adding more readers for the backup job. My media agent was configured with 8 vCPU, and it seemed to be the bottleneck. If I gave the media agent more CPU, I am sure I would have had an even faster backup time.

[Screenshot: Commvault data readers setting]

I was able to get almost 600 GB/hour, which I am told is a good number for file backup. It looks like there is lots of room to improve, though. The end goal will be to try to back up a million files and see what happens over the course of time.

[Screenshot: backup throughput]

Like all good backup stories, it's all about the restores, and it appears to drill down really nicely.

[Screenshot: restore browse of the user share]

      Jul
      12

      Just In Time Desktops (Instant Clones) on Nutanix

JIT desktops are supported on Nutanix. One current limitation of JIT is that it doesn't support VAAI for NFS hardware clones. The great part for Nutanix customers is that where VAAI clones stop, Shadow Clones kick into effect! So if you want to keep a lower amount of RAM configured for the View Storage Accelerator, you're perfectly OK in doing that.

The Nutanix Distributed Filesystem has a feature called 'Shadow Clones', which allows for distributed caching of particular vDisks or VM data in a 'multi-reader' scenario. A great example of this is a VDI deployment, where many 'linked clones' forward read requests to a central master or 'base VM'. In the case of VMware View, this is the replica disk, which is read by all linked clones. This also works in any other multi-reader scenario (e.g., deployment servers, repositories, App Volumes, etc.).
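Shadow Clones are enabled by default on recent AOS releases. If you ever need to check or flip the setting, it is exposed through ncli; this is a sketch from memory, so verify against the documentation for your AOS version:

nutanix@cvm$ ncli cluster edit-params enable-shadow-clones=true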

You can read more about Shadow Clones in this Tech Note -> HERE

      An Introduction to Instant Clones -> HERE

        Jul
        11

        SRM and Commvault Health Check

The NCC health check pds_share_vms_check verifies that protection domains do not share any VMs. It is good practice to run this health check after configuring SRM or after using IntelliSnap from Commvault. It's one of over 200 checks NCC provides.

        This check is available from the NCC 2.2.5 release and is part of the full health check that you can run by using the following command:

        nutanix@cvm$ ncc health_checks run_all

        You can also run this check separately by using the following command:

        nutanix@cvm$ ncc health_checks data_protection_checks protection_domain_checks pds_share_vms_check

        A protection domain is a group of VMs that you can replicate together on a desired schedule.

A VM can end up in two protection domains when both of the following are true:

* A protection domain (Async DR or Metro Availability) is created, and the VM is added as a protected entity of this protection domain.
* The vstore containing the VM is protected by using ncli or by an external third-party product such as Commvault or SRM. Protecting a vstore automatically creates a protection domain.

These protection mechanisms are mutually exclusive, which means that backups of the VM might fail if the VM is in two protection domains.

        Solution

If the check returns a FAIL status, the reported VMs need to be removed from some of the listed protection domains so that they remain in only one protection domain.
If you're using Metro Availability, you may have to move the VM to another container or stop protecting the vstore.
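To see which protection domains a VM landed in before deciding which one to pull it from, something like the following should do it. The syntax here is from memory and the protection domain and VM names are hypothetical, so confirm against the ncli reference for your AOS version:

nutanix@cvm$ ncli pd ls
nutanix@cvm$ ncli pd unprotect name=commvault-pd vm-names=Win10-042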

        Jul
        09

        Making A Better Distributed System – Nutanix Degraded Node Detection


Distributed systems are hard, there's no doubt about that. One of the major problems is what to do when a node is unhealthy and could be affecting the performance of the overall cluster. Fail hard, fail fast is a distributed systems principle, but how do you go about detecting an issue before a failure even occurs? AOS 4.5.3, 4.6.2, and 4.7 will include the Nutanix implementation of degraded node detection and isolation. A poorly performing hardware component or network issue can be a death by a thousand cuts, versus a failure, which is pretty cut and dried. If a remote CVM is not performing well, it can affect the acknowledgement of writes coming from other hosts. Other factors may also affect performance, like:

        * Significant network bandwidth reduction
        * Network packet drops
        * CPU Soft lockups
        * Partially bad disks
        * Hardware issues

The list of issues can even be unknown, so Nutanix Engineering has come up with a scoring system that uses votes so that everything can be compared.
Services running on each node of the cluster publish scores/votes for services running on other nodes. Peer health scores are computed from various metrics like RPC latency, RPC failures/timeouts, and network latency. If the services running on one node consistently receive bad scores for a long period (~10 minutes), the other peers will convict that node as a degraded node.

        Walk, Crawl, Run – Degraded Node Expectations:

A node will not be marked as degraded if the current cluster Fault Tolerance (FT) level is less than the desired value. Upgrades and break-fix actions will not be allowed while a node is in the degraded state. A node will only be marked as degraded if it gets bad peer health scores for 10 minutes. In AOS 4.5.3, the first shipping AOS release to include this feature, the default settings are that degraded node logging is enabled but the degraded node action is disabled. In AOS 4.6.2 and 4.7, additional user controls will be provided to select an "action policy" for when a degraded node is detected; options should include No Action, Reboot CVM, or Shutdown Node. While the peer scoring is always on, the action side is disabled in the first release as an ultra-conservative approach.

In AOS 4.5.3, if the degraded node action setting is enabled, leadership of critical services will not be hosted on the degraded node. A degraded node will be put into maintenance mode and its CVM will be rebooted. Services will not start on this CVM upon reboot. An alert will be generated for the degraded node.


        To enable the degraded node action setting use the NCLI command:

        nutanix@cvm:~$ ncli cluster edit-params disable-degraded-node-monitoring=false
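To confirm where the setting currently stands, you can dump the cluster-wide parameters from the same ncli area the edit-params command lives in:

nutanix@cvm:~$ ncli cluster get-params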

This feature will further increase availability and resilience for Nutanix customers. While top performance numbers grab the headlines, remember the first step is to have a running cluster.

AI for the control plane………… Maybe we'll get voted out of our jobs!

        Jul
        07

        Updated Best Practices – Nutanix DR and Backup & vSphere + Commvault

Two best practice guides have been updated this week. The Nutanix DR and Backup Best Practices guide is located in the Support Portal.

        <DR and Backup Best Practices>

The update was around bandwidth sizing, and a link to Wolfram|Alpha was added, which spits out the sizing formula for you.
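As a generic illustration of the kind of math involved (this is not the exact formula from the guide): take the data that changes per day and divide it by the replication window you have. With a made-up 500 GB of daily change and a 12-hour window:

# Hypothetical numbers only; use the formula from the best practice guide
echo "scale=1; 500 * 1024 / (12 * 3600)" | bc    # ~11.8 MB/s sustained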

The vSphere and Commvault Best Practice Guide added some guidance around IntelliSnap and sizing. At this time, IntelliSnap with Metro Availability is not supported, but streaming is a fully supported option.

        <link>