Aug
01

The Tale Of Two Lines: Instant-Clones on Nutanix

Part of me wanted to hate on Instant Clones, which are new in Horizon 7, but the fact is they’re worth the price of admission. Instant clones provide true on-demand desktops with very low overhead, or as VMware is tagging them, Just-In-Time desktops.

On-demand desktops with View Composer….. not happening

In my health care days, non-persistent desktops and shift changes always resulted in some blunt force trauma around 7 am and 7 pm when staff would start their day. The only real way to counterbalance the added load of login storms was to make sure the desktops were pre-built. This of course means you need to have some desktops sitting around doing nothing while waiting for these two time periods in the day, or you use generic logins and the user never disconnects, which was another bag of problems.

The ability of instant clones to clone a live running VM by simply quiescing the VM is really amazing. Have you ever changed the name of a desktop and then Windows tells you to reboot? If you’re like me, you try to do 5 or 6 other things before you have to reboot, which usually ends up in a mess. Instant clones use a feature called Clone Prep to add the VM to AD and change its name, all without having to reboot the VM. When you see a power-on operation inside of vCenter, it’s actually just quiescing the desktop, so there is very low overhead.

The steps during Clone Prep. MS does not support Clone Prep but they didn’t for View Composer so I don’t see it being any different.

When I went to test instant clones I wanted to see if on-demand desktops were actually possible without destroying node densities. I did two test runs with Login VSI: one run with 400 knowledge users and all the desktops pre-deployed, and one run with 400 knowledge users where I started with only 50 desktops. I had set the desktop pool to always keep at least 30 free desktops until the pool reached 400 desktops.
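
The pool policy used in the on-demand run can be sketched as a small function. This is only a model of the sizing rule described above (start at 50 desktops, keep 30 free, cap at 400), not Horizon's actual provisioning logic; the function and parameter names are made up for illustration.

```python
def target_pool_size(active_sessions, initial=50, min_spare=30, max_size=400):
    """Number of desktops the pool should hold for a given session count.

    Hypothetical helper: models "keep at least min_spare free desktops
    until the pool reaches max_size", starting from an initial pool.
    """
    wanted = active_sessions + min_spare
    return min(max_size, max(initial, wanted))


if __name__ == "__main__":
    # Walk through a few points of the login ramp.
    for sessions in (0, 20, 21, 200, 370, 400):
        print(sessions, target_pool_size(sessions))
```

Under this model the pool only starts growing once more than 20 users are logged in, and it stops growing at 370 sessions, when the 400-desktop cap is reached.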

Instant-clones delivers on-demand desktops with very low overhead.

The darker blue line represents the on-demand test, and you can see that the impact over 400 users is pretty small. This is pretty remarkable: the CPU and memory consumption normally incurred on boot is almost eliminated.

It’s not all unicorns and rainbows, however. Instant clones do have some limitations in the first release:

* No dedicated Desktop Pools
* No RDS Desktop or Application Pools
* Limited SVGA Support – Fixed max resolution & number of monitors
* No 3D Rendering / GPU Support
* No Sysprep support – Single SID across pool
* No VVOL or VAAI NFS Hardware Clones support (Smaller desktops pools may take longer to provision)
* No PowerShell
* No Multi-VLAN Support in a single Pool
* No Reusable Computer Accounts
* No Persistent Disks – Use Writable Volumes \ Flex App \ Unidesk \ RES …….

vMotion is supported.

Like anything, the use case will dictate when this gets used, but it’s a powerful tool inside of Horizon. I plan to show some of the differences between View Composer and Instant Clones in my next posts. Also keep in mind that you still need high IO to service your desktops. Size for the peaks or face the wrath of your end users.

Jul
28

The Impact On App Layering On Your VDI Environment

I was testing instant clones in Horizon 7, and it is pretty much a requirement to use some form of application virtualization and get your user data stored off the desktops. My decision on what to select for testing was based on the fact that I already had ProfileUnity from Liquidware Labs and that App Volumes is bundled with View in the higher editions. I wanted to see the impact of layering on CPU and login times. I also used UberAgent to collect some of the results. While testing I would do one run with UberAgent on to collect login times and then one with UberAgent turned off to collect CPU metrics.

I used three separate applications, each in their own layer.

* Gimp 2.8
* iTunes 10
* VLC

I used App Volumes 2.11 since 3.0 is kind of dead in the water and not recommended for existing customers, so I can’t see a lot of people using it until the next release. ProfileUnity was version 6.5.

I first did a base run with no AppStacks or Flex Apps but with a roaming profile stored on Acropolis File Services. The desktops were instant clones running Windows 10 with 2 vCPU and 2 GB of RAM, the Horizon 7 agent and Office 2013. Where you see a CPU % listed, it is a factor of both vCPUs.

Base Run

So not too bad: a 14-second login. There is probably some clean-up I could do to make it faster, but that is also not realistic if you’re thinking about enterprise desktops, so I was happy with this.

I tested with 1 layer at a time until I was using all 3 applications. There was a gradual increase in CPU and login time for each layer. The CPU cost comes from the agent and from attaching the vmdk to the desktop.

App Volumes with 3 AppStacks

So with 3 layers the CPU jumped by ~20% and the login time went up ~9 secs with App Volumes.

3 Flex Apps

With 3 Flex Apps CPU jumped a bit and login times went up ~4 sec.
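
To put the two results side by side, here is the arithmetic on the approximate numbers quoted above (14 s base login, roughly +9 s with three AppStacks, roughly +4 s with three Flex Apps). These are the rounded values from this post, so treat the output as ballpark only.

```python
BASE_LOGIN_SECS = 14  # base run, roaming profile only

# Approximate login-time increase with three layers attached (from this post)
ADDED_SECS = {"App Volumes (3 AppStacks)": 9, "ProfileUnity (3 Flex Apps)": 4}


def login_summary(base, added):
    """Return {name: (total_secs, percent_increase_over_base)}."""
    return {
        name: (base + extra, round(100 * extra / base))
        for name, extra in added.items()
    }


if __name__ == "__main__":
    for name, (total, pct) in login_summary(BASE_LOGIN_SECS, ADDED_SECS).items():
        print(f"{name}: ~{total} s login, ~{pct}% over base")
```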


Overall Review

What does this all mean?

Well if you have users that only disconnect and reconnect and rarely log out then this means absolutely nothing for the most part. If you have a user base that gets fresh new desktops all of the time and things like large shift changes then it means your densities will go down. I like to say “Looking is for free, and touching is going to cost you”. Overall I still feel this is a small price to pay to have a successful VDI deployment and layering will help out the process.

Jul
09

Making A Better Distributed System – Nutanix Degraded Node Detection

Distributed systems are hard, there’s no doubt about that. One of the major problems is what to do when a node is unhealthy and may be affecting the performance of the overall cluster. Fail hard, fail fast is a distributed systems principle, but how do you go about detecting an issue before a failure even occurs? AOS 4.5.3, 4.6.2 and 4.7 will include the Nutanix implementation of degraded node detection and isolation. A badly performing hardware component or network issue can be death by a thousand cuts, versus a failure, which is pretty cut and dried. If a remote CVM is not performing well it can affect the acknowledgement of writes coming from other hosts, and other factors may affect performance, like:

* Significant network bandwidth reduction
* Network packet drops
* CPU Soft lockups
* Partially bad disks
* Hardware issues

The list of issues can even be unknown, so Nutanix Engineering has come up with a scoring system that uses votes to make sure everything can be compared.
Services running on each node of the cluster will publish scores/votes for services running on other nodes. Peer health scores will be computed based on various metrics like RPC latency, RPC failures/timeouts, network latency, etc. If services running on one node consistently receive bad scores for an extended period (~10 minutes), the other peers will convict that node as a degraded node.
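
The idea of peers convicting a node after a sustained run of bad scores can be sketched in a few lines. This is a toy model, not the Nutanix implementation: the 0-100 score scale, the threshold of 50, averaging the peer scores, and all names here are assumptions made for illustration. Only the ~10 minute window comes from the description above.

```python
class PeerHealthTracker:
    """Toy model: a node is degraded after ~10 minutes of bad peer scores."""

    def __init__(self, window_secs=600, bad_score=50):
        self.window_secs = window_secs  # ~10 minutes, per the post
        self.bad_score = bad_score      # hypothetical threshold (0-100 scale)
        self._bad_since = {}            # node -> time its bad streak began

    def record(self, timestamp, target, peer_scores):
        """Record one round of scores that peers published about `target`."""
        if not peer_scores:
            return
        average = sum(peer_scores) / len(peer_scores)
        if average < self.bad_score:
            # Keep the start of the streak; do not overwrite it.
            self._bad_since.setdefault(target, timestamp)
        else:
            # One healthy round resets the streak in this simple model.
            self._bad_since.pop(target, None)

    def is_degraded(self, target, now):
        start = self._bad_since.get(target)
        return start is not None and now - start >= self.window_secs


if __name__ == "__main__":
    tracker = PeerHealthTracker()
    for ts in range(0, 601, 60):
        tracker.record(ts, "node-c", [20, 30, 10])
    print(tracker.is_degraded("node-c", now=600))
```

Resetting on a single healthy round is a design choice of the sketch; a real system would more likely tolerate some noise before clearing a streak.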

Crawl, Walk, Run – Degraded Node Expectations:

A node will not be marked as degraded if the current cluster Fault Tolerance (FT) level is less than the desired value. Upgrades and break-fix actions will not be allowed while a node is in the degraded state. A node will only be marked as degraded if we get bad peer health scores for 10 minutes. In AOS 4.5.3, the first shipping AOS release to include this feature, the default settings are that degraded node logging is enabled but degraded node action is disabled. While peer scoring is always on, the action side is disabled for the first release as an ultra-conservative approach.

In AOS 4.5.3, if the degraded node action setting is enabled, leadership of critical services will not be hosted on the degraded node. A degraded node will be put into maintenance mode and its CVM will be rebooted. Services will not start on this CVM upon reboot. An alert will be generated for the degraded node.

In AOS 4.7 and AOS 4.6.2, additional user controls will be provided to select an “action policy” for when a degraded node is detected. Options should include No Action, Reboot CVM or Shutdown Node.
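
Putting the expectations together, the decision could look roughly like the sketch below. The policy names come from the options listed above; the function, its parameters and the way FT is represented are hypothetical and are not a Nutanix API.

```python
from enum import Enum


class ActionPolicy(Enum):
    NO_ACTION = "No Action"
    REBOOT_CVM = "Reboot CVM"
    SHUTDOWN_NODE = "Shutdown Node"


def decide(bad_score_minutes, current_ft, desired_ft, policy):
    """Return the configured action for a node, or None if it is not marked degraded."""
    if current_ft < desired_ft:
        # Cluster is already below its desired fault tolerance: do not mark.
        return None
    if bad_score_minutes < 10:
        # Bad peer scores have not lasted the full 10 minutes yet.
        return None
    return policy


if __name__ == "__main__":
    print(decide(12, current_ft=1, desired_ft=1, policy=ActionPolicy.REBOOT_CVM))
```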

To enable the degraded node action setting use the NCLI command:

nutanix@cvm:~$ ncli cluster edit-params disable-degraded-node-monitoring=false

The feature will further increase the availability and resilience for Nutanix customers. While top performance numbers grab the headlines, remember the first step is to have a running cluster.

AI for the control plane………… Maybe we’ll get outvoted for our jobs!

Jun
29

Nutanix Security Configuration Management Automation at Work #DOD #PCI

A short video of someone changing the security settings for an Apache Tomcat directory and files. It really could be anything: dropping a firewall, opening a port, and the list goes on. The video shows how often the settings are checked, and then we manually run the automation framework to check over 600 DOD/PCI-level requirements in minutes.
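
For readers who have not seen this kind of automation, the general pattern is "compare against a baseline, then put it back". The sketch below shows that pattern for file permissions only. It is a generic illustration, not how Nutanix implements its checks; the function names and the baseline format are assumptions.

```python
import os
import stat


def find_drift(baseline):
    """baseline: {path: expected_mode}. Return {path: actual_mode} for drifted paths."""
    drift = {}
    for path, expected in baseline.items():
        actual = stat.S_IMODE(os.stat(path).st_mode)
        if actual != expected:
            drift[path] = actual
    return drift


def remediate(baseline):
    """Reset every drifted path to its baseline mode; return the paths that were fixed."""
    drifted = find_drift(baseline)
    for path in drifted:
        os.chmod(path, baseline[path])
    return sorted(drifted)
```

Running `remediate` on a schedule gives the self-healing behaviour: someone loosens a permission, and the next pass puts it back.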

Jun
27

Nutanix Search to Find, Build, Create and Improve

To streamline access to features, Nutanix lets you quickly search for data points and reduces the clicks required to find information through the search function. Prism Pro delivers a web-like search engine experience for your Nutanix environment. Administrators can simply enter common tasks and entities into the search bar to perform searches. The interface displays the returned results in four vertical columns, each representing a different type of result relating to the search query.
The four columns present a list of entities, top analytics about the entities, appropriate actions, related alerts, and help topics that relate to the entities. The help topics provide links to online Nutanix documentation that can help explain features and clarify how to configure them or perform corrective actions.

The search function offers autocomplete to help administrators identify or complete the string that they want to search for.

Nutanix embodies a radically new approach to enterprise infrastructure—one that simplifies every step of the infrastructure life cycle, from buying and deploying to managing, scaling, and supporting.

Read more about managing your infrastructure with Prism Pro from Brian Suhr

May
12

Impact of Nutanix VSS Hardware Support

When 4.6 was released I wrote about how the newly added VSS support with Nutanix Guest Tools (NGT) was the gem of the release. It was a fairly big compliment considering some of the important updates that were in the release, like cross-hypervisor DR and another giant leap in performance.

I finally set some time aside to test the impact of taking an application-consistent snapshot with VMware Tools vs the Nutanix VSS Hardware Support.

In an application-consistent snapshot workflow without NGT on ESXi, we take an ESXi snapshot so VMware Tools can be used to quiesce the file system. Every time we take an ESXi snapshot, it results in the creation of delta disks. During this process ESXi “stuns” the VM to remap virtual disks to these delta files. The length of the stun depends on the number of virtual disks attached to the VM and the speed at which the delta disks can be created (the capability of the underlying storage to process NFS meta-data update operations + releasing/creating/acquiring lock files for all the virtual disks). During this time, the VM is totally unresponsive. No application will run inside the VM, and pings to the VM will fail.

We then delete the snapshot (after backing up the files via hardware snap on the Nutanix side) which results in another set of stuns (deleting a snapshot causes two stuns, one fixed time stun + another stun based on the number of virtual disks). This essentially means that we are causing two or three stuns in rapid succession. These stuns cause meta-data updates in addition to the flushing of data during the VSS snapshot operations.
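
Since pings fail while a VM is stunned, one simple way to see the stuns is to ping the VM throughout the snapshot and look for gaps. The helper below turns a series of ping results into unreachable windows. It is a sketch: collecting the samples (for example one ping per second) is left out, and the sample format is an assumption.

```python
def outage_windows(samples):
    """Find unreachable windows in ping results.

    samples: list of (timestamp_secs, reachable) tuples in time order.
    Returns a list of (first_failed_ts, first_recovered_ts); the end is
    None if the VM was still unreachable when the capture stopped.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


if __name__ == "__main__":
    pings = [(0, True), (1, False), (2, False), (3, True),
             (4, True), (5, False), (6, True)]
    print(outage_windows(pings))
```

Two or three short windows close together would match the create-then-delete snapshot pattern described above.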

Customers have reported that in a set of VMs running Microsoft clustering, VMs can be voted out due to heartbeat failure. VMware gives customers guidance on increasing timers if you’re using Microsoft clustering to get around this situation.

To test this out I used HammerDB with SQL Server 2014 running on Windows Server 2012 R2. The tests were run on ESXi 6.0 with hardware version 11.

VMware Tools with VSS based Snapshot
I was going to try to stitch the images together because of the time it took but decided to leave as is.

The total process took ~4 minutes.

NGT with VSS Hardware Support based Snapshot
NGT-based VSS snapshots don’t cause VM stuns. The application will be stunned temporarily within Windows to flush the data, but pings and other things should keep working.

The total process took ~1 minute.

Conclusion

NGT with VSS hardware support is the Belle of the Ball! There is no fixed number for the maximum stun time; it depends on how heavy the workload is. What we can see is the effect of not using NGT for application-consistent snapshots, and it’s pretty big. The collapsing of ESXi snapshots causes additional load and should be avoided if possible. NGT offers a hypervisor-agnostic approach and currently works with AHV as well.

Note: Hypervisor snapshot consolidation is better in ESXi 6 than ESXi 5.5.

Thanks to Karthik Chandrasekaran and Manan Shah for all their hard work and contribution to this blog post.

Apr
27

SAP Best Practices and Sizing on Nutanix

At the heart of SAP Business Suite is the SAP ERP application, which is supplemented by SAP CRM, SAP SRM, SAP PLM, and SAP SCM. From financial accounting through manufacturing, logistics, sales, marketing, and human resources, SAP Business Suite manages all the key mission-critical business processes that occur each day in companies around the world. SAP NetWeaver is the technical foundation for many SAP applications; it is a solution stack of SAP’s technology products.

Deploying and operating SAP Business Suite applications in your environment is not a trivial task. Nutanix enterprise cloud platforms provide the reliability, predictability, and performance that the SAP Business Suite demands, all with an efficient and elegant management interface.

The Nutanix platform offers SAP customers a range of benefits, including:

• Lower risk and cost on the first hyperconverged platform SAP-certified for NetWeaver applications.
• A turnkey validated framework that dramatically reduces the time to deploy your SAP applications.
• Mission-critical availability with a self-healing foundation and VM-centric data protection, including support for the top enterprise backup solutions.
• Flexibility to choose among industry-leading SAP-supported hypervisors.
• Simplified operations, including application- and VM-level metrics alongside single-click provisioning and upgrades.
• Reduced TCO from infrastructure right-sized for your SAP workload.
• A best-in-class worldwide support system whose knowledge and commitment to customer service has earned the Omega NorthFace Scoreboard Award for three consecutive years.

Read the Solution Note for best practices with both Hyper-V and VMware and sizing guidelines => SAP Solution Note

Apr
24

Save Your Time With Nutanix Automatic Support

Best Industry Support

The feature known as Pulse is enabled by default and sends cluster status information automatically to Nutanix customer support. After you have completed initial setup, created a cluster, and opened ports 80 or 8443 in your firewall, AOS sends a Pulse message from each cluster once every 24 hours. Each message includes cluster configuration and health status that can be used by Nutanix Support to address any cluster operation issues.

AOS can also send automatic alert email notifications to Nutanix Support by default through ports 80 or 8443. As with Pulse, any configured firewall must have these ports open. Some examples of conditions that will automatically generate a proactive case with Nutanix Support at Priority Level P4:

* The Stargate process is down for more than 3 hours
* Curator scan fails
* Hardware Clock Failure
* Faulty RAM module
* Power Supply failure
* Unable to fetch IPMI SDR repository (IPMI Error)
* Hyper-V networking
* System operations
* Disk Capacity > 90%
* Bad Drive
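
A few of the conditions above have clear thresholds, so they are easy to picture as rules. The sketch below encodes three of them. The `ClusterStatus` fields and the function are hypothetical and only illustrate the idea; the real alerting lives inside AOS.

```python
from dataclasses import dataclass


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of a few health values for one cluster."""
    stargate_down_hours: float = 0.0
    disk_used_fraction: float = 0.0
    curator_scan_failed: bool = False


def p4_case_reasons(status):
    """Return the reasons (if any) that would open a proactive P4 case."""
    reasons = []
    if status.stargate_down_hours > 3:
        reasons.append("Stargate process down for more than 3 hours")
    if status.curator_scan_failed:
        reasons.append("Curator scan failed")
    if status.disk_used_fraction > 0.90:
        reasons.append("Disk capacity above 90%")
    return reasons


if __name__ == "__main__":
    print(p4_case_reasons(ClusterStatus(disk_used_fraction=0.95)))
```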

You can optionally use your own SMTP server to send Pulse and alert notifications. If you do not or cannot configure an SMTP server, another option is to implement an HTTP proxy as part of your overall support scheme.

While the best thing is never to get a call, 2nd best is not waiting in line to open a ticket. Have a great week!

Mar
16

The Benefits of Enterprise Cloud & Hyperconverged Infrastructure by @stu

Wikibon Senior Analyst Stu Miniman (@stu) talks about how hyper-convergence delivers simplicity in the move to cloud and how that requires operational change.

Feb
17

AHV – Most Secure Hypervisor by Default

Cybersecurity threats grow and change every day, demanding perpetual vigilance and adaptation to the shifting security landscape. However, upgrading security in a traditional three-tier architecture is so time consuming and expensive, often involving multiple separate vendors, that some enterprises put off innovation. In light of competing security concerns—the need to reclaim resources for innovation versus the need to keep costs down—corporate and government environments demand a simpler approach: one vendor, with technology secured by design, and automated security compliance and reporting.
Nutanix has created a security development life cycle (SecDL) that addresses security at every layer in the deployment cycle, rather than applying it at the end as an afterthought. The SecDL implements security culture from top to bottom, ensuring that it is a foundational part of the design. SecDL reduces the time it takes to update code, which mitigates the risk of zero-day exploits.

Security is usually the last thing to get love when you’re under pressure. You will relax security if that is what it takes to get your system to work. With SCMA you don’t have to decide between security and a working system anymore.

Because traditional manual configuration and checks cannot keep up with the ever-growing list of security requirements, Nutanix provides Security Technical Implementation Guides (STIGs) that use machine-readable code to automate compliance against rigorous common standards. Today, Nutanix tracks over 1,700 security entities across storage and the Acropolis Hypervisor (AHV). With Nutanix Security Configuration Management Automation (SCMA), introduced in Acropolis Operating System (AOS) 4.6, you can quickly and continually assess and remediate your platform to ensure that it meets or exceeds all regulatory requirements.

As regulations become more cumbersome and threats continue to proliferate, a fully tested platform with security at the forefront is the best choice for meeting tomorrow’s challenges today. The Xtreme Computing Platform (XCP) shrinks the compliance auditing window from months to minutes, allowing you to focus instead on the applications that drive the business.

SCMA also covers frustrating maintenance scenarios in which you upgrade your storage or hypervisor software only to find that the new software has overwritten your careful configuration work, forcing you to go through all the settings again from scratch. Returning to the baseline manually is slow and error-prone, often causing significant problems, particularly when dealing with major release upgrades. Companies have had to delay upgrading their systems to preserve security compliance, even when an upgrade would offer new features required to support the business. Nutanix SCMA means that businesses don’t have to shoulder the burden of interoperability testing or go through cumbersome steps to manually inspect and revert the upgraded system to a known good state.

With SCMA, you can schedule Nutanix STIGs to run hourly, daily, weekly, or monthly. The automation checks have the lowest system priority within the virtual storage controller, ensuring that security checks do not interfere with platform performance.
Nutanix has embedded five STIGs covering Nutanix storage and AHV in the product. These STIGs are:

o Acropolis Virtual Storage Controller STIG
o Nutanix Prism Web Server STIG (for Tomcat)
o Nutanix Prism Proxy Server STIG (for Apache)
o Nutanix JRE8 STIG
o Acropolis Hypervisor STIG

With both the storage and the hypervisor meeting the highest levels of security out of the box, I think it’s safe to say that from day 1 through the life of the cluster you have the most secure platform for your workloads. You can’t simply do one without the other and be secure. It’s this end-to-end life cycle (SecDL) that makes Nutanix so different from other vendors on the market today.

The hamster wheel of keeping your environment secure just had its last spin with AOS 4.6.