Jul
28

The Impact Of App Layering On Your VDI Environment

I was testing instant clones in Horizon 7, and it was pretty much a requirement to use some form of application virtualization and to get user data stored off the desktops. I picked what to test based on what I had at hand: I already had ProfileUnity from Liquidware Labs, and App Volumes is bundled with the higher editions of View. I wanted to see the impact of layering on CPU and login times. I also used UberAgent to collect some of the results. For each test I did one run with UberAgent turned on to collect login times and then one with UberAgent turned off to collect CPU metrics.

I used three separate applications, each in their own layer.

* Gimp 2.8
* iTunes 10
* VLC

I used App Volumes 2.11 since 3.0 is kind of dead in the water and not recommended for existing customers, so I can’t see a lot of people using it until the next release. ProfileUnity was version 6.5.

I first did a base run with no AppStacks or Flex Apps but with a roaming profile stored on Acropolis File Services. The desktops were Windows 10 instant clones with 2 vCPUs and 2 GB of RAM, running the Horizon 7 agent and Office 2013. The CPU percentages shown are measured across both vCPUs.

Base Run
baserun

So not too bad: a 14 sec login. There is probably some cleanup I could do to make it faster, but that would not be very realistic if you’re thinking about an enterprise desktop, so I was happy with this.

I tested with one layer at a time until all three applications were attached. CPU and login time increased gradually with each layer. The CPU cost comes from the agent and from attaching the VMDK to the desktop.

App Volumes with 3 AppStacks

3appstacks

So with 3 layers the CPU jumped by ~20% and the login time went up ~9 secs with App Volumes.

3 Flex Apps

flexapp

With 3 Flex Apps CPU jumped a bit and login times went up ~4 sec.
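To put the login numbers side by side, here is a quick back-of-the-envelope sketch. It only uses the approximate figures quoted above (14 sec base, ~9 sec extra for App Volumes, ~4 sec extra for Flex Apps); the function and variable names are mine and purely illustrative.

```python
BASE_LOGIN_SECS = 14  # base run login time from the post

# Approximate login-time increases reported above with three layers attached.
ADDED_SECS = {
    "App Volumes (3 AppStacks)": 9,
    "ProfileUnity (3 Flex Apps)": 4,
}


def login_summary(base_secs, extra_secs):
    """Return (total login seconds, percent increase over the base run)."""
    total = base_secs + extra_secs
    pct_increase = round(100 * extra_secs / base_secs)
    return total, pct_increase


for name, extra in ADDED_SECS.items():
    total, pct = login_summary(BASE_LOGIN_SECS, extra)
    print(f"{name}: ~{total} sec login, ~{pct}% over the base run")
```

These are rough numbers from a single lab, so treat the percentages as a way to compare the two approaches rather than as a benchmark.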


Overall Review

layeringreview

What does this all mean?

Well, if you have users that only disconnect and reconnect and rarely log out, then this means absolutely nothing for the most part. If you have a user base that gets fresh new desktops all of the time, or you have things like large shift changes, then it means your densities will go down. I like to say “Looking is free, and touching is going to cost you”. Overall I still feel this is a small price to pay to have a successful VDI deployment, and layering will help the process.

Jul
09

Making A Better Distributed System – Nutanix Degraded Node Detection

Distributed systems are hard, there’s no doubt about that. One of the major problems is what to do when a node is unhealthy and may be affecting the performance of the overall cluster. Fail hard, fail fast is a distributed systems principle, but how do you go about detecting an issue before a failure even occurs? AOS 4.5.3, 4.6.2 and 4.7 will include the Nutanix implementation of degraded node detection and isolation. A badly performing hardware component or a network issue can be death by a thousand cuts, versus a failure, which is pretty cut and dried. If a remote CVM is not performing well it can affect the acknowledgement of writes coming from other hosts, and other factors may affect performance, like:

* Significant network bandwidth reduction
* Network packet drops
* CPU Soft lockups
* Partially bad disks
* Hardware issues

The list of issues can even be unknown, so Nutanix Engineering has come up with a scoring system that uses votes to make sure everything can be compared.
Services running on each node of the cluster will publish scores/votes for services running on other nodes. Peer health scores will be computed based on various metrics like RPC latency, RPC failures/timeouts, network latency, etc. If services running on one node consistently receive bad scores for a long period (~10 mins), then the other peers will convict that node as a degraded node.
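The voting idea can be sketched in a few lines. This is not the Nutanix implementation: the score scale, the `BAD_SCORE` cutoff, and the rule that more than half of the peers must agree are all assumptions of mine, and the sketch assumes peers publish votes regularly throughout the window. Only the ~10 minute window comes from the description above.

```python
from collections import defaultdict

BAD_SCORE = 50         # hypothetical cutoff; lower scores count as "bad"
WINDOW_SECS = 10 * 60  # the ~10 minute period mentioned above


def find_degraded(votes, now):
    """Return nodes that a majority of peers scored as bad for the whole window.

    votes: iterable of (timestamp_secs, voter, target, score) tuples.
    """
    nodes = set()
    recent = defaultdict(lambda: defaultdict(list))  # target -> voter -> scores
    for ts, voter, target, score in votes:
        nodes.update((voter, target))
        if voter != target and now - WINDOW_SECS <= ts <= now:
            recent[target][voter].append(score)

    degraded = set()
    for target, by_voter in recent.items():
        # A peer convicts only if every score it published in the window was bad.
        convicting = sum(
            1 for scores in by_voter.values()
            if all(s < BAD_SCORE for s in scores)
        )
        # Assumed rule: more than half of the other nodes must convict.
        if convicting > (len(nodes) - 1) / 2:
            degraded.add(target)
    return degraded


votes = [
    (100, "A", "D", 20), (100, "B", "D", 30), (100, "C", "D", 10),
    (500, "A", "D", 25), (500, "B", "D", 15), (500, "C", "D", 35),
    (100, "B", "A", 90), (500, "D", "A", 95),
    (100, "A", "C", 40), (500, "A", "C", 80),  # one bad blip, then healthy
]
print(find_degraded(votes, now=600))
```

In this made-up data, node D gets bad scores from all three peers across the window and is convicted, while node C had a single bad reading followed by a good one and is left alone. That is the point of requiring consistency over time instead of reacting to one slow RPC.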

Walk, Crawl, Run – Degraded Node Expectations:

A node will not be marked as degraded if the current cluster Fault Tolerance (FT) level is less than the desired value. Upgrades and break-fix actions will not be allowed while a node is in the degraded state. A node will only be marked as degraded if it gets bad peer health scores for 10 minutes. In AOS 4.5.3, the first shipping AOS release to include this feature, the default settings are that degraded node logging is enabled but degraded node action is disabled. In AOS 4.7 and AOS 4.6.2, additional user controls will be provided to select an “action policy” for when a degraded node is detected. Options should include No Action, Reboot CVM or Shutdown Node. While the peer scoring is always on, the action side is disabled for the first release as an ultra-conservative approach.

In AOS 4.5.3, if the degraded node action setting is enabled, leadership of critical services will not be hosted on the degraded node. The degraded node will be put into maintenance mode and its CVM will be rebooted. Services will not start on this CVM upon reboot. An alert will be generated for the degraded node.

To enable the degraded node action setting use the NCLI command:

nutanix@cvm:~$ ncli cluster edit-params disable-degraded-node-monitoring=false

The feature will further increase the availability and resilience for Nutanix customers. While top performance numbers grab the headlines, remember the first step is to have a running cluster.

AI for the control plane………… Maybe we’ll get voted out of our jobs!

Jun
29

Nutanix Security Configuration Management Automation at Work #DOD #PCI

A short video of someone changing the security settings for an Apache Tomcat directory and its files. It really could be anything: dropping a firewall, opening a port, and the list goes on. The video shows how often the settings are being checked, and then we manually run the automation framework to check over 600 DOD/PCI level requirements in minutes.

Jun
27

Nutanix Search to Find, Build, Create and Improve

To streamline access to features, Nutanix lets you quickly search for data points and reduces the clicks required to find information through the search function. Prism Pro delivers a web-like search engine experience for your Nutanix environment. Administrators can simply enter common tasks and entities into the search bar to perform searches. The interface displays the returned results in four vertical columns, each representing a different type of result relating to the search query.
The four columns present a list of entities, top analytics about the entities, appropriate actions, related alerts, and help topics that relate to the entities. The help topics provide links to online Nutanix documentation that can help explain features and clarify how to configure them or perform corrective actions.

search

The search function offers autocomplete to help administrators identify or complete the string that they want to search for.

auto

Nutanix embodies a radically new approach to enterprise infrastructure—one that simplifies every step of the infrastructure life cycle, from buying and deploying to managing, scaling, and supporting.

Read more about managing your infrastructure with Prism Pro from Brian Suhr

May
12

Impact of Nutanix VSS Hardware Support

When 4.6 was released I wrote about how the newly added VSS support with Nutanix Guest Tools (NGT) was the gem of the release. It was a fairly big compliment considering some of the important updates that were in the release, like cross-hypervisor DR and another giant leap in performance.

I finally set some time aside to test the impact of taking an application-consistent snapshot with VMware Tools vs the Nutanix VSS Hardware Support.

In an application-consistent snapshot workflow without NGT on ESXi, we take an ESXi snapshot so VMware Tools can be used to quiesce the file system. Every time we take an ESXi snapshot, it results in the creation of delta disks. During this process ESXi “stuns” the VM to remap virtual disks to these delta files. The length of the stun depends on the number of virtual disks that are attached to the VM and the speed at which the delta disks can be created (the capability of the underlying storage to process NFS metadata update operations plus releasing/creating/acquiring lock files for all the virtual disks). During this time, the VM is totally unresponsive. No application will run inside the VM, and pings to the VM will fail.

We then delete the snapshot (after backing up the files via hardware snap on the Nutanix side) which results in another set of stuns (deleting a snapshot causes two stuns, one fixed time stun + another stun based on the number of virtual disks). This essentially means that we are causing two or three stuns in rapid succession. These stuns cause meta-data updates in addition to the flushing of data during the VSS snapshot operations.

Customers have reported that in a set of VMs running Microsoft clustering, these VMs can be voted out due to heartbeat failure. VMware gives customers guidance on increasing timers if you’re using Microsoft clustering to get around this situation.

To test this out I used HammerDB with SQL Server 2014 running on Windows Server 2012 R2. The tests were run on ESXi 6.0 with hardware version 11.

sqlvm

VMware Tools with VSS based Snapshot
I was going to try to stitch the images together because of the time it took but decided to leave as is.
VMware-VSS-1vmwaretools

VMware-VSS-2vmwaretools

The total process took ~4 minutes.

NGT with VSS Hardware Support based Snapshot
NGT based VSS snapshots don’t cause VM stuns. The application will be stunned temporarily within Windows to flush the data, but pings and other things should work.

NGT-VSS-Snapshot

The total process took ~1 minute.

Conclusion

NGT with VSS hardware support is the Belle of the Ball! There is no fixed number for the max stun time, since it depends on how heavy the workload is, but what we can see is the effect of not using NGT for application-consistent snapshots, and it’s pretty big. The collapsing of ESXi snapshots causes additional load and should be avoided if possible. NGT offers a hypervisor-agnostic approach and currently works with AHV as well.

Note: Hypervisor snapshot consolidation is better in ESXi 6 than ESXi 5.5.

Thanks to Karthik Chandrasekaran and Manan Shah for all their hard work and contribution to this blog post.

Apr
27

SAP Best Practices and Sizing on Nutanix

At the heart of SAP Business Suite is the SAP ERP application, which is supplemented by SAP CRM, SAP SRM, SAP PLM, and SAP SCM. From financial accounting through manufacturing, logistics, sales, marketing, and human resources, SAP Business Suite manages all the key mission-critical business processes that occur each day in companies around the world. SAP NetWeaver is the technical foundation for many SAP applications; it is a solution stack of SAP’s technology products.

Deploying and operating SAP Business Suite applications in your environment is not a trivial task. Nutanix enterprise cloud platforms provide the reliability, predictability, and performance that the SAP Business Suite demands, all with an efficient and elegant management interface.

The Nutanix platform offers SAP customers a range of benefits, including:

• Lower risk and cost on the first hyperconverged platform SAP-certified for NetWeaver applications.
• A turnkey validated framework that dramatically reduces the time to deploy your SAP applications.
• Mission-critical availability with a self-healing foundation and VM-centric data protection, including support for the top enterprise backup solutions.
• Flexibility to choose among industry-leading SAP-supported hypervisors.
• Simplified operations, including application- and VM-level metrics alongside single-click provisioning and upgrades.
• Reduced TCO from infrastructure right-sized for your SAP workload.
• A best-in-class worldwide support system whose knowledge and commitment to customer service has earned the Omega NorthFace Scoreboard Award for three consecutive years.

Read the Solution Note for best practices with both Hyper-V and VMware and sizing guidelines => SAP Solution Note

Apr
24

Save Your Time With Nutanix Automatic Support

Best Industry Support

The feature known as Pulse is enabled by default and sends cluster status information automatically to Nutanix customer support. After you have completed initial setup, created a cluster, and opened ports 80 or 8443 in your firewall, AOS sends a Pulse message from each cluster once every 24 hours. Each message includes cluster configuration and health status that can be used by Nutanix Support to address any cluster operation issues.

AOS can also send automatic alert email notifications to Nutanix Support by default through ports 80 or 8443. As with Pulse, any configured firewall must have these ports open. Some examples of conditions that will automatically generate a proactive case with Nutanix Support at Priority Level P4:

* The Stargate process is down for more than 3 hours
* Curator scan fails
* Hardware Clock Failure
* Faulty RAM module
* Power Supply failure
* Unable to fetch IPMI SDR repository (IPMI Error)
* HyperV networking
* System operations
* Disk Capacity > 90%
* Bad Drive

You can optionally use your own SMTP server to send Pulse and alert notifications. If you do not or cannot configure an SMTP server, another option is to implement an HTTP proxy as part of your overall support scheme.

While the best thing is never to get a call, 2nd best is not waiting in line to open a ticket. Have a great week!

Mar
16

The Benefits of Enterprise Cloud & Hyperconverged Infrastructure by @stu

Wikibon Senior Analyst Stu Miniman (@stu) talks about how hyper-convergence is delivering simplicity in the move to cloud and how that requires operational change.

Feb
17

AHV – Most Secure Hypervisor by Default

Cybersecurity threats grow and change every day, demanding perpetual vigilance and adaptation to the shifting security landscape. However, upgrading security in a traditional three-tier architecture is so time consuming and expensive, often involving multiple separate vendors, that some enterprises put off innovation. In light of competing security concerns—the need to reclaim resources for innovation versus the need to keep costs down—corporate and government environments demand a simpler approach: one vendor, with technology secured by design, and automated security compliance and reporting.
Nutanix has created a security development life cycle (SecDL) that addresses security at every layer in the deployment cycle, rather than applying it at the end as an afterthought. The SecDL implements security culture from top to bottom, ensuring that it is a foundational part of the design. SecDL reduces the time it takes to update code, which mitigates the risk of zero-day exploits.

Security is usually the last thing to get love when you’re under pressure. You will ease up on security if it gets your system to work. With SCMA you don’t have to decide between security and a working system anymore.

Because traditional manual configuration and checks cannot keep up with the ever-growing list of security requirements, Nutanix provides Security Technical Implementation Guides (STIGs) that use machine-readable code to automate compliance against rigorous common standards. Today, Nutanix tracks over 1,700 security entities across storage and the Acropolis Hypervisor (AHV). With Nutanix Security Configuration Management Automation (SCMA) introduced in the Acropolis Operating system 4.6, you can quickly and continually assess and remediate your platform to ensure that it meets or exceeds all regulatory requirements.

As regulations become more cumbersome and threats continue to proliferate, a fully tested platform with security at the forefront is the best choice for meeting tomorrow’s challenges today. The Xtreme Computing Platform (XCP) shrinks the compliance auditing window from months to minutes, allowing you to focus instead on the applications that drive the business.

SCMA also covers frustrating maintenance scenarios in which you upgrade your storage or hypervisor software only to find that the new software has overwritten your careful configuration work, forcing you to go through all the settings again from scratch. Returning to the baseline manually is slow and error-prone, often causing significant problems, particularly when dealing with major release upgrades. Companies have had to delay upgrading their systems to preserve security compliance, even when an upgrade would offer new features required to support the business. Nutanix SCMA means that businesses don’t have to shoulder the burden of interoperability testing or go through cumbersome steps to manually inspect and revert the upgraded system to a known good state.

With SCMA, you can schedule Nutanix STIGs to run hourly, daily, weekly, or monthly. The automation checks have the lowest system priority within the virtual storage controller, ensuring that security checks do not interfere with platform performance.
Nutanix has embedded five STIGs covering Nutanix storage and AHV in the product. These STIGs are:

o Acropolis Virtual Storage Controller STIG
o Nutanix Prism Web Server STIG (for Tomcat)
o Nutanix Prism Proxy Server STIG (for Apache)
o Nutanix JRE8 STIG
o Acropolis Hypervisor STIG
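The check-and-remediate loop that SCMA runs on a schedule can be pictured with a tiny sketch. To be clear, this is only an illustration of the idea of comparing settings against a baseline and reverting drift; the setting names and values below are hypothetical and have nothing to do with how the Nutanix STIGs are actually encoded.

```python
def remediate(current, baseline):
    """Reset every setting that drifted from the baseline.

    Returns a dict of {setting: (found_value, expected_value)} for what changed.
    """
    drift = {
        key: (current.get(key), expected)
        for key, expected in baseline.items()
        if current.get(key) != expected
    }
    current.update(baseline)  # put all baseline settings back in place
    return drift


# Hypothetical baseline and a system where someone loosened a directory mode.
baseline = {"tomcat_dir_mode": "0750", "ssh_root_login": "no"}
system = {"tomcat_dir_mode": "0777", "ssh_root_login": "no", "hostname": "cvm-1"}

print(remediate(system, baseline))  # what had drifted
print(remediate(system, baseline))  # nothing left to fix on the next run
```

Running the same check again after remediation finds nothing to change, which is what makes it safe to schedule hourly or daily.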

With both the storage and the hypervisor meeting the highest levels of security out of the box, I think it’s safe to say that from day 1 through the life of the cluster you have the most secure platform for your workloads. You can’t simply do one without the other and be secure. It’s this end-to-end life cycle (SecDL) that makes Nutanix so different from other vendors on the market today.

The hamster wheel of keeping your environment secure just had its last spin with AOS 4.6.

Feb
16

Nutanix Volume Groups become 1st Class Citizens with 4.6

The Nutanix story around replication and snapshots is great, but when Volume Groups were first released to support MS Exchange on ESXi, they didn’t make the cut for DR. Since that first release, Volume Groups have taken on a life of their own and have been great at supporting older applications like Windows Failover Clustering.

What is a Volume Group?

A volume group is a collection of logically related vDisks called volumes. Each volume group is identified by a UUID. Each disk of the volume group also has a UUID, and a name, and is supported by a file on DSF. Disks in a volume group are also provided with integer IDs to specify the ordering of disks. For external attachment through iSCSI, the iSCSI target name identifies the volume group, and the LUN number identifies the disk in the group.

Volume groups are managed independently of the VMs to which volumes must be explicitly attached or detached. A volume group may be configured for either exclusive or shared access.
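If it helps to picture the pieces, here is a minimal data-model sketch of the description above. The class and field names are my own, and the backing file paths and target name are placeholders; this is not the Nutanix API, just the relationships (group UUID, disk UUID and name, integer ordering, and the iSCSI target/LUN mapping) written out.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Disk:
    index: int         # integer ID that gives the disk's order in the group
    name: str
    backing_file: str  # placeholder for the file on DSF that backs the disk
    uuid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class VolumeGroup:
    name: str
    shared: bool = False  # exclusive access unless configured as shared
    disks: list = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))

    def add_disk(self, name, backing_file):
        """Append a disk; its index is its position in the group."""
        disk = Disk(index=len(self.disks), name=name, backing_file=backing_file)
        self.disks.append(disk)
        return disk

    def iscsi_address(self, target_name, index):
        """The target name identifies the group, the LUN identifies the disk."""
        return (target_name, self.disks[index].index)


vg = VolumeGroup("exchange-vg", shared=True)
vg.add_disk("db", "placeholder/db-disk")
vg.add_disk("logs", "placeholder/logs-disk")
print(vg.iscsi_address("example-target", 1))
```

Note that nothing in the group refers to a VM. That matches the point above: attachment is a separate, explicit step.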

With 4.6, Volume Groups are now inside of Prism. If you’re using volume groups with AHV, the disks will automatically be attached to the guest VM.

Inside of Prism:
volumegroups-1

Setting up a Volume Group:

2016-02-15_14-46-16

XCP allows users to recover individual VMs and volume groups from snapshots. You can either replace the existing active VM with the snapshot copy or create a separate clone of a snapshot, preserving the active VM. Depending on the snapshot settings in use, the recovered VM is either crash-consistent or application-consistent when it comes back online. Restored volume groups come up in a crash-consistent state. When you restore a volume group, it maintains its application-specific settings, so reattachment is easy. If you do clone a volume group, the UUID will change.

Volume Groups and VMs can be in the same protection domain for snapshots and replication.
vgdr

Another Nutanix feature made easy by Prism.