Archives for July 2016

Jul
28

The Impact Of App Layering On Your VDI Environment

I was testing Instant Clones in Horizon 7, and it was pretty much a requirement to use some form of application virtualization and to get user data stored off the desktops. My decision on what to select for testing came down to what I already had: ProfileUnity from Liquidware Labs, and App Volumes, which is bundled with the higher View editions. I wanted to see the impact of layering on CPU and login times. I also used UberAgent to collect some of the results. While testing, I would do one run with UberAgent to collect login times and then one with UberAgent turned off to collect CPU metrics.

I used three separate applications, each in their own layer.

* Gimp 2.8
* iTunes 10
* VLC

I used App Volumes 2.11, since 3.0 is kind of dead in the water and not recommended for existing customers, so I can’t see a lot of people using it until the next release. ProfileUnity was version 6.5.

I first did a base run with no AppStacks or Flex Apps but with a roaming profile stored on Acropolis File Services. The desktops were Windows 10 instant clones with 2 vCPUs and 2 GB of RAM, running the Horizon 7 agent and Office 2013. The CPU percentages listed are a factor of both vCPUs.

Base Run

So, not too bad: a 14-second login. There is probably some cleanup I could do to make it faster, but that would not be very realistic if you’re thinking about an enterprise desktop, so I was happy with this.

I tested one layer at a time until all three applications were in use. There was a gradual increase in CPU and login time with each layer; the CPU cost comes from the agent and from attaching the VMDK to the desktop.

App Volumes with 3 AppStacks


So with three layers, CPU jumped by ~20% and login time went up by ~9 seconds with App Volumes.

3 Flex Apps


With three Flex Apps, CPU jumped a bit and login time went up by ~4 seconds.


Overall Review


What does this all mean?

Well, if you have users that only disconnect and reconnect and rarely log out, then this means absolutely nothing for the most part. If you have a user base that gets fresh new desktops all the time, with things like large shift changes, then it means your densities will go down. I like to say, “Looking is free, but touching is going to cost you.” Overall, I still feel this is a small price to pay for a successful VDI deployment, and layering will help the process along.

Jul
19

Securing the Supply Chain with Nutanix and Docker #dockercon2016

I was watching the video below from DockerCon 2016, and there were lots of striking similarities between what Nutanix and Docker are doing to secure working environments for the Enterprise Cloud. There is no sense in turning on the alarm for your house and then not locking the doors. You need to close all the gaps in your infrastructure and in the applications that live on top of it.

The most interesting part of the session for me was the section on security scanning and gating. Docker Security Scanning is available as an add-on to Docker-hosted private repositories on both Docker Cloud and Docker Hub. Scans run each time a build pushes a new image to your private repository, and also when you add a new image or tag. Most scans complete within an hour, though large repositories may take up to 24 hours. The scan traverses each layer of the image, identifies the software components in each layer, and indexes the SHA of each component.

The scan compares the SHA of each component against the Common Vulnerabilities and Exposures (CVE) database, a “dictionary” of known information security vulnerabilities. When the CVE database is updated, the service reviews the indexed components for any that match the new vulnerability. If a new vulnerability is detected in an image, the service sends an email alert to the maintainers of the image.

A single component can contain multiple vulnerabilities or exposures and Docker Security Scanning reports on each one. You can click an individual vulnerability report from the scan results and navigate to the specific CVE report data to learn more about it.
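
As a rough mental model of that scan loop, here is a toy sketch in Python (my own illustration, not Docker’s implementation; the component and CVE data below are invented):

```python
# Toy sketch of per-layer component indexing and CVE matching.
# Not Docker's code; the component and CVE data are invented.
import hashlib

# Hypothetical index mapping a component's SHA to known CVE IDs.
cve_index = {
    hashlib.sha256(b"openssl-1.0.1f").hexdigest(): ["CVE-2014-0160"],
}

image_layers = [
    {"layer": "base", "components": [b"openssl-1.0.1f", b"bash-4.3"]},
    {"layer": "app",  "components": [b"myapp-1.0"]},
]

def scan_image(layers):
    """Walk each layer, hash each component, and look it up in the CVE index."""
    findings = []
    for layer in layers:
        for component in layer["components"]:
            sha = hashlib.sha256(component).hexdigest()
            for cve in cve_index.get(sha, []):
                findings.append((layer["layer"], component.decode(), cve))
    return findings

for layer, component, cve in scan_image(image_layers):
    print(f"ALERT: {component} in layer '{layer}' matches {cve}")
```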

On the Nutanix side of the fence, all code is scanned with two different vulnerability scanners at every step of the development life cycle. To top that off, Nutanix applies an intrinsic security baseline, and monitors and self-heals that baseline with SCMA, the Security Configuration Management Automation, which leverages the SaltStack framework so that your production systems can self-heal from any deviation and are always in compliance. Features like two-factor authentication (2FA) and cluster lockdown further enhance the security posture. A cluster-wide setting can forward all logs to a central host as well. All CVEs related to the product are tracked, with an internal turnaround time of 72 hours for critical patches! There is some added time in getting a release cut, but it is fast, and everything is tested as a whole instead of as a one-off change that could have a domino effect.

When evaluating infrastructure and development environments for a security-conscious organization, it’s imperative to choose one built with a security-first approach that continually iterates on patching new threats, thereby reducing the attack surface. Docker is doing some great work on this front.


Jul
14

Nutanix Acropolis File Services – Requires Two Networks

When configuring Acropolis File Services, you may be prompted with the following message:

“File server creation requires two unique networks to be configured beforehand.”

The reason is that you need two managed networks for AFS. I’ve seen this come up a lot lately, so I thought I would explain why. While it may change over time, this is the current design.

[Diagram: one file server VM on a node, with its two network interfaces]

The above diagram shows one file server VM running on a node, but you can put multiple file server VMs on a node for multitenancy.

The file server VM has two network interfaces. The first interface is a static address used for the local file server VM service that talks to the Minerva CVM service running on the Controller VM. The Minerva CVM service uses this information to manage deployment and failover; it also allows control over one-click upgrades and maintenance. Having local awareness from the CVM enables the file server VM to determine if a storage fault has occurred and, if so, whether action should be taken to rectify it. The local address also lets the file server VM claim vDisks for failover and failback. The file server VM service sends a heartbeat to its local Minerva CVM service each second, indicating its state and that it’s alive.
The second network interface on the file server VM, also referred to as the public interface, is used to service client SMB requests. Based on the resource called, the file server VM determines whether to service the request locally or to use DFS to refer the request to the file server VM that owns the resource. This second network can be dynamically reassigned to other file server VMs for high availability.
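
As a purely conceptual sketch of that heartbeat on the internal interface (not Nutanix code; the address and function are hypothetical placeholders):

```python
# Conceptual sketch: the file server VM service reports to its local
# Minerva CVM service every second over the internal (static) interface.
# Not Nutanix code; the address and function are placeholders.
import time

HEARTBEAT_INTERVAL_SECS = 1

def send_heartbeat(minerva_cvm_addr: str, state: str) -> None:
    """Stand-in for the FSVM-to-Minerva-CVM-service heartbeat RPC."""
    print(f"heartbeat -> {minerva_cvm_addr}: state={state}, alive=True")

# A real service would loop forever; bounded here for illustration.
for _ in range(3):
    send_heartbeat("10.0.0.10", state="serving")
    time.sleep(HEARTBEAT_INTERVAL_SECS)
```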

If you need help setting up the two managed networks, there is a KB article on portal.nutanix.com -> KB3406

Jul
13

Backing up AFS with Commvault

This is by no means a best practice guide for AFS and Commvault, but I wanted to make sure that Commvault could be used to back up Acropolis File Services (AFS). If you want more details on AFS, I suggest reading this great post on the Nutanix website.

Once I applied the file server license to CommServe, I was off to the races. I had 400 users spread across three file server VMs making up the file server eucfs.tenanta.com. The file server had two shares, but I was focused on backing up the user share.


I found that performance could be increased by adding more readers for the backup job. My media agent was configured with 8 vCPUs for the last run, and it seemed to be the bottleneck. If I were to give the media agent more CPU, I am sure I would have had an even faster backup time.


I was able to get almost 600 GB/hour, which I am told is a good number for file backup. It looks like there is lots of room to improve, though. The end goal will be to try to back up a million files and see what happens over the course of time.


Like all good backup stories, it’s all about the restores, and the restore browse drills down really nicely.


Jul
12

Just In Time Desktops (Instant Clones) on Nutanix

JIT desktops are supported on Nutanix. One current limitation of JIT is that it doesn’t support VAAI for NFS hardware clones. The great part for Nutanix customers is that where VAAI clones stop, Shadow Clones kick into effect! So if you want to keep a lower amount of RAM configured for the View Storage Accelerator, you’re perfectly OK doing that.

The Nutanix Distributed Filesystem has a feature called ‘Shadow Clones’, which allows for distributed caching of particular vDisks or VM data in a ‘multi-reader’ scenario. A great example: during a VDI deployment, many ‘linked clones’ forward read requests to a central master or ‘base VM’. In the case of VMware View, this is called the replica disk and is read by all linked clones. Shadow Clones also help in any other multi-reader scenario (e.g., deployment servers, repositories, App Volumes, etc.).
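
Conceptually, the mechanism boils down to spotting the multi-reader pattern and then serving reads from a local cached copy on each node. A simplified Python sketch of that idea (my illustration, not the actual NDFS logic, which among other things requires the vDisk to be effectively read-only):

```python
# Simplified illustration of shadow-clone-style behavior: once a vDisk is
# observed being read by multiple nodes, each node caches a local copy so
# reads stop traversing the network. Not the actual NDFS implementation.

class VDisk:
    def __init__(self, name: str):
        self.name = name
        self.readers = set()       # node IDs observed reading this vDisk
        self.shadow_nodes = set()  # nodes holding a local cached copy

MULTI_READER_THRESHOLD = 2  # illustrative; the real heuristics differ

def read(vdisk: VDisk, node_id: str) -> str:
    vdisk.readers.add(node_id)
    if len(vdisk.readers) >= MULTI_READER_THRESHOLD:
        vdisk.shadow_nodes.add(node_id)  # promote: cache the vDisk locally
    if node_id in vdisk.shadow_nodes:
        return f"{node_id}: local shadow read of {vdisk.name}"
    return f"{node_id}: remote read of {vdisk.name}"

replica = VDisk("view-replica-disk")
for node in ["node-1", "node-2", "node-3", "node-2"]:
    print(read(replica, node))
```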

You can read more about Shadow Clones in this Tech Note -> HERE

An Introduction to Instant Clones -> HERE

Jul
11

SRM and Commvault Health Check

The NCC health check pds_share_vms_check verifies that protection domains do not share any VMs. It is good practice to run this health check after configuring SRM or after using IntelliSnap from Commvault. It’s one of over 200 checks NCC provides.

This check is available from the NCC 2.2.5 release and is part of the full health check that you can run by using the following command:

nutanix@cvm$ ncc health_checks run_all

You can also run this check separately by using the following command:

nutanix@cvm$ ncc health_checks data_protection_checks protection_domain_checks pds_share_vms_check

A protection domain is a group of VMs that you can replicate together on a desired schedule.

A VM can end up in two protection domains if both of the following happen:

* A protection domain (Async DR or Metro Availability) is created, and the VM is added as a protected entity of this protection domain.
* The vstore containing the VM is protected by using ncli or by an external third-party product such as Commvault or SRM. Protecting a vstore automatically creates a protection domain.

These protection mechanisms are mutually exclusive, which means that backups of the VM might fail if the VM is in two protection domains.
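
Conceptually, the check boils down to finding any VM that appears in more than one protection domain, along these lines (a sketch, not the actual NCC plugin code; the names are made up):

```python
# Conceptual version of what pds_share_vms_check verifies: no VM should
# belong to more than one protection domain. Not the actual NCC code.
from collections import defaultdict

protection_domains = {
    "async-dr-pd": {"vm-01", "vm-02"},
    "vstore-pd-commvault": {"vm-02", "vm-03"},  # vm-02 overlaps -> FAIL
}

def shared_vms(pds: dict) -> dict:
    """Return each VM that appears in two or more protection domains."""
    membership = defaultdict(set)
    for pd_name, vms in pds.items():
        for vm in vms:
            membership[vm].add(pd_name)
    return {vm: pd_set for vm, pd_set in membership.items() if len(pd_set) > 1}

overlaps = shared_vms(protection_domains)
print("FAIL" if overlaps else "PASS", overlaps)
```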

Solution

If the check returns a FAIL status, the reported VMs need to be removed from some of the listed protection domains so that they remain in only one protection domain.
If you’re using Metro Availability, you may have to move the VM to another container or stop protecting the vstore.

Jul
09

Making A Better Distributed System – Nutanix Degraded Node Detection


Distributed systems are hard, there’s no doubt about that. One of the major problems is what to do when a node is unhealthy and affecting the performance of the overall cluster. Fail hard, fail fast is a distributed-system principle, but how do you go about detecting an issue before a failure even occurs? AOS 4.5.3, 4.6.2, and 4.7 will include the Nutanix implementation of degraded node detection and isolation. A badly performing hardware component or a network issue can be a death by a thousand cuts, versus a failure, which is pretty cut and dried. If a remote CVM is not performing well, it can affect the acknowledgement of writes coming from other hosts. Other factors that may affect performance include:

* Significant network bandwidth reduction
* Network packet drops
* CPU soft lockups
* Partially bad disks
* Hardware issues

The list of issues can even be unknown, so Nutanix Engineering has come up with a scoring system that uses votes so that everything can be compared. Services running on each node of the cluster publish scores (votes) for services running on other nodes. Peer health scores are computed from various metrics like RPC latency, RPC failures/timeouts, network latency, and so on. If services running on one node consistently receive bad scores over a long period (~10 minutes), the other peers will convict that node as degraded.
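
To make the voting idea concrete, here is a toy version of the convict logic (my illustration only; the real metrics, thresholds, and quorum rules live inside AOS):

```python
# Toy illustration of peer-vote-based degraded node detection: each node
# scores its peers, and a node that a majority of peers scored badly for
# ~10 minutes is convicted as degraded. Not the actual AOS implementation.
from collections import defaultdict, deque

WINDOW_SECS = 600      # ~10 minutes of score history
BAD_SCORE = 30         # illustrative threshold for a "bad" vote
QUORUM_FRACTION = 0.5  # a majority of peers must agree

history = defaultdict(deque)  # target node -> recent (ts, voter, score)

def record_vote(ts: int, voter: str, target: str, score: int) -> None:
    votes = history[target]
    votes.append((ts, voter, score))
    while votes and votes[0][0] < ts - WINDOW_SECS:
        votes.popleft()  # drop votes older than the window

def is_degraded(target: str, peers: list) -> bool:
    """Convict if more than a quorum of peers cast bad votes in the window."""
    bad_voters = {voter for _, voter, score in history[target] if score < BAD_SCORE}
    return len(bad_voters) / max(len(peers), 1) > QUORUM_FRACTION

peers = ["cvm-a", "cvm-b", "cvm-c"]
for t in range(0, 601, 60):  # ten minutes of bad votes against cvm-d
    for peer in peers:
        record_vote(t, peer, "cvm-d", score=10)
print("cvm-d degraded?", is_degraded("cvm-d", peers))  # True
```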

Crawl, Walk, Run – Degraded Node Expectations:

A node will not be marked as degraded if the current cluster Fault Tolerance (FT) level is less than the desired value. Upgrades and break-fix actions will not be allowed while a node is in the degraded state. A node will only be marked as degraded if it receives bad peer health scores for 10 minutes. In AOS 4.5.3, the first shipping AOS release to include this feature, the default settings are that degraded node logging is enabled but the degraded node action is disabled. While the peer scoring is always on, the action side is disabled in the first release as an ultra-conservative approach.

In AOS 4.5.3, if the degraded node action setting is enabled, leadership of critical services will not be hosted on the degraded node. A degraded node will be put into maintenance mode, and its CVM will be rebooted. Services will not start on this CVM upon reboot. An alert will be generated for the degraded node.

In AOS 4.7 and AOS 4.6.2, additional user controls will be provided to select an “action policy” for when a degraded node is detected. Options should include No Action, Reboot CVM, or Shutdown Node.

To enable the degraded node action setting, use the following NCLI command:

nutanix@cvm:~$ ncli cluster edit-params disable-degraded-node-monitoring=false

This feature will further increase availability and resilience for Nutanix customers. While top performance numbers grab the headlines, remember that the first step is to have a running cluster.

AI for the control plane… maybe we’ll get outvoted for our jobs!

Jul
07

Updated Best Practices – Nutanix DR and Backup & vSphere + Commvault

Two best practice guides have been updated this week. The Nutanix DR and Backup Best Practices guide is located in the Support Portal.

<DR and Backup Best Practices>

The update was around bandwidth sizing, and a link to WolframAlpha was added that works out the sizing formula for you.
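
The guide’s exact formula is behind that link, but as a back-of-napkin illustration of the kind of math involved (my own simplification, assuming you must replicate a day’s changed data within a fixed replication window; this is not the guide’s formula):

```python
# Rough replication bandwidth estimate (illustrative only; see the best
# practice guide and the WolframAlpha link for the real sizing formula).
def replication_bandwidth_mbps(daily_change_gb: float,
                               replication_window_hours: float) -> float:
    """Bandwidth needed to move one day's changed data within the window."""
    bits = daily_change_gb * 1024**3 * 8          # GB -> bits
    seconds = replication_window_hours * 3600     # hours -> seconds
    return bits / seconds / 1_000_000             # bits/s -> Mbps

# Example: 100 GB of daily change replicated over an 8-hour window.
print(f"{replication_bandwidth_mbps(100, 8):.0f} Mbps")  # ~30 Mbps
```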

The vSphere and Commvault Best Practice Guide added some guidance around IntelliSnap and sizing. At this time, IntelliSnap with Metro Availability is not supported, but streaming is a fully supported option.

<link>

Jul
06

Chad Sakac talks about EMC selling Nutanix with Dell Technologies

What will happen with Dell XC when EMC and Dell come together? Chad Sakac talks about it at the 18:40 mark from the ThinkAhead IT conference.

From NextConf 2016:
Nutanix and Dell OEM relationship: Dell’s Alan Atkinson spoke to attendees about extending the OEM relationship and continuing to help our joint customers (including Williams) on their journeys to the Enterprise Cloud with confidence.