Dec
12

Running IT: Docker and Cilium for Enterprise Network Security for Micro-Services

Well, I think 40 minutes is about as long as I can last watching an IT-related video while running; after that I need music! This time I watched another video from DockerCon, Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF.

Skip to 7:23. The quick overview of the presentation is that managing iptables to lock down micro-services isn’t going to scale and will be almost impossible to manage. Cilium is open source software for providing and transparently securing network connectivity and load balancing between application workloads such as application containers or processes. Cilium operates at Layer 3/4 to provide traditional networking and security services, as well as at Layer 7 to protect and secure the use of modern application protocols such as HTTP, gRPC and Kafka. BPF is used by a lot of the big web-scale properties like Facebook and Netflix to secure their environments and to provide troubleshooting. Like anything with a lot of options, there are a lot of ways to shoot yourself in the foot, so Cilium provides the wrapper to get it easily deployed and configured.

The presentation uses the example of locking down a Kafka cluster at layer 7, instead of leaving the whole API wide open, which is what would happen if you were only using iptables. Kafka is used for building real-time pipelines and streaming apps. Kafka is horizontally scalable and fault-tolerant, so it’s a good choice to run in Docker. Kafka is used by a third of Fortune 500 companies.
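
To make the layer 7 idea concrete, here is a minimal sketch of what such a policy could look like as a CiliumNetworkPolicy when Cilium is paired with Kubernetes. The labels, topic name, and port below are placeholders I made up for the example, and the exact schema depends on your Cilium version, so treat this as an illustration rather than a drop-in policy.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: kafka-allow-produce
spec:
  # Selects the Kafka broker endpoints (label is a placeholder)
  endpointSelector:
    matchLabels:
      app: kafka
  ingress:
  - fromEndpoints:
    # Only this producer workload may connect to the brokers (placeholder label)
    - matchLabels:
        app: order-service
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        # Layer 7 rule: this client may only produce to the "orders" topic;
        # consuming and administrative Kafka API calls are denied
        kafka:
        - role: "produce"
          topic: "orders"

Everything else against the brokers would be rejected by the policy, which is exactly the kind of per-API control that iptables alone can’t give you.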

Cilium Architecture

Cilium Integrates with:

Docker
Kubernetes
Mesos

Cilium runs as an agent on every host.
Cilium can provide policy from the host to a Docker micro-service, and even between two containers on the same host.

The demo didn’t pan out, but the second half of the presentation talks about Cilium using BPF with XDP. XDP is a further step in the evolution: it enables running a specific flavor of BPF program from the network driver, with direct access to the packet’s DMA buffer. This is, by definition, the earliest possible point in the software stack where programs can be attached, allowing for a programmable, high-performance packet processor in the Linux kernel networking data path.

Since XDP happens earlier, at the NIC, compared to iptables with ipset, CPU is saved, rules load faster, and latency under load is a lot better with XDP.

Dec
05

Handling Network Partition with Near-Sync

Near-Sync is GA!!!

Part 1: Near-Sync Primer on Nutanix
Part 2: Recovery Points and Schedules with Near-Sync

Perform the following procedure if a network partition (network isolation) occurs between the primary and remote sites.

The following scenarios may occur when a network partition happens.

1. The network between the primary site (site A) and the remote site (site B) is restored and both sites are working.
The primary site automatically tries to transition back into NearSync between site A and site B. No manual intervention is required.

2. Site B is not working or has been destroyed (for whatever reason), and you create a new site (site C) and want to establish a sub-hourly schedule from A to C.
Configure the sub-hourly schedule from A to C.
The configuration between A and C should succeed. No other manual intervention is required.

3. Site A is not working or has been destroyed (for whatever reason), and you create a new site (site C) and want to configure a sub-hourly schedule from B to C.
Activate the protection domain on site B and set up the schedule between site B and site C.

Dec
01

Supported Anti-Virus Offload for Nutanix Native File Services (AFS)


As the list grows with releases, I will try to keep this updated.

As of AFS 2.2.1, the supported ICAP-based AV vendors are:

McAfee Virus Scan Enterprise for Storage 1.2.0

Symantec Protection Engine 7.9.0

Kaspersky Security 10

Sophos Antivirus

Nutanix recommends adding the following file extensions for user profiles to the exclusion list when using AFS antivirus scanning:
.dat
.ini
.pol

Symantec Pre-Req

Each Symantec ICAP server needs the hot fix (SPE_7.9.0_HF03.zip) installed from http://www.symantec.com/docs/TECH216348.

Kaspersky Pre-Req

When running the Database Update task with the network folder as an update source, you might encounter an error after entering credentials.

Solution

To resolve, download and install the critical fix 13017 provided by Kaspersky.

Download Link:

https://support.kaspersky.com/13017

Nov
19

Nutanix Additional Cluster Health Tooling: Panacea

There are over 450 health checks in the Cluster Health UI inside of Prism Element. To provide additional help, a new script called “panacea” has been added. Panacea is bundled with NCC 3.5 and later to provide a user-friendly interface for very advanced troubleshooting. The Nutanix Support team can take these logs and correlate the results so you don’t have to wait for the problem to recur before fixing the issue.

The ability to quickly track retransmissions at a fine granularity in a distributed system is very important. I am hoping that in the future this new tooling will play into Nutanix’s ability to detect degraded nodes. Panacea can be run for a specific time interval during which logs will be analyzed; possible options are:
--last_no_of_hours
--last_no_of_days
--start_time
--end_time

Log in to any CVM within the cluster; the command can be run from /home/nutanix/ncc/panacea/

The output below is from using the tool to dig for network information.

A network outage can cause degraded performance. Cluster network outage
detection is based on the following schemes:
1) Cassandra Paxos Request Timeout Exceptions/Message Drops
2) CVM Degraded node scoring
3) Ping latency

In some cases, an intermittent network issue might NOT be reflected in ping latency, but it does have an impact on TCP throughput and packet
retransmission, leading to more request timeout exceptions.

TCP Retransmission:
-------------------
By default, Panacea tracks the TCP connections (destination port 7000) used by Cassandra between peer CVMs. This table displays stats of
packet retransmissions per minute per TCP socket. Frequent retransmission can cause delays in the application, and may reflect congestion on the host or in the network.
1) Local: Local CVM IP address
2) Remote: Remote CVM IP address
3) Max/Mean/Min/STD: number of retransmissions/min, calculated from
samples where retransmission happened.
4) 25%/50%/75%: value distribution; the percentage of samples that are less than the shown value, for % = 25, 50, and 75
5) Ratio: N/M, where N = number of samples where retransmission happened and
M = total samples in the entire data set

+--------------+--------------+-------+------+------+------+------+------+------+---------+
| Local        | Remote       | Max   | Mean | Min  | STD  | 25%  | 50%  | 75%  | Ratio   |
+--------------+--------------+-------+------+------+------+------+------+------+---------+
| XX.X.XXX.110 | XX.X.XXX.109 | 19.00 | 1.61 | 1.00 | 1.90 | 1.00 | 1.00 | 2.00 | 133/279 |
| XX.X.XXX.111 | XX.X.XXX.109 | 11.00 | 2.41 | 1.00 | 1.54 | 1.00 | 2.00 | 3.00 | 236/280 |
| XX.X.XXX.112 | XX.X.XXX.109 | 12.00 | 2.40 | 1.00 | 1.59 | 1.00 | 2.00 | 3.00 | 235/279 |
| XX.X.XXX.109 | XX.X.XXX.110 | 32.00 | 3.04 | 1.00 | 2.70 | 1.00 | 2.00 | 4.00 | 252/279 |
| XX.X.XXX.111 | XX.X.XXX.110 | 9.00  | 1.51 | 1.00 | 1.02 | 1.00 | 1.00 | 2.00 | 152/280 |
| XX.X.XXX.112 | XX.X.XXX.110 | 11.00 | 2.21 | 1.00 | 1.31 | 1.00 | 2.00 | 3.00 | 231/279 |
| XX.X.XXX.109 | XX.X.XXX.111 | 9.00  | 2.01 | 1.00 | 1.20 | 1.00 | 2.00 | 2.00 | 202/279 |
| XX.X.XXX.110 | XX.X.XXX.111 | 10.00 | 2.70 | 1.00 | 1.68 | 1.00 | 2.00 | 3.00 | 244/279 |
| XX.X.XXX.112 | XX.X.XXX.111 | 4.00  | 1.46 | 1.00 | 0.76 | 1.00 | 1.00 | 2.00 | 135/279 |
| XX.X.XXX.109 | XX.X.XXX.112 | 5.00  | 1.56 | 1.00 | 0.85 | 1.00 | 1.00 | 2.00 | 150/279 |
| XX.X.XXX.110 | XX.X.XXX.112 | 6.00  | 2.05 | 1.00 | 1.18 | 1.00 | 2.00 | 3.00 | 234/279 |
| XX.X.XXX.111 | XX.X.XXX.112 | 16.00 | 3.26 | 1.00 | 2.24 | 2.00 | 3.00 | 4.00 | 261/280 |
+--------------+--------------+-------+------+------+------+------+------+------+---------+

Most of the 450 Cluster Health checks inside of Prism, which come with automatic alerting:

CVM | CPU
CPU Utilization

Load Level

Node Avg Load – Critical

CVM | Disk
Boot RAID Health

Disk Configuration

Disk Diagnostic Status

Disk Metadata Usage

Disk Offline Status

HDD Disk Usage

HDD I/O Latency

HDD S.M.A.R.T Health Status

Metadata Disk Mounted Check

Metro Vstore Mount Status

Non SED Disk Inserted Check

Nutanix System Partitions Usage High

Password Protected Disk Status

Physical Disk Remove Check

Physical Disk Status

SED Operation Status

SSD I/O Latency

CVM | Hardware
Agent VM Restoration

FT2 Configuration

Host Evacuation Status

Node Status

VM HA Healing Status

VM HA Status

VMs Restart Status

CVM | Memory
CVM Memory Pinned Check

CVM Memory Usage

Kernel Memory Usage

CVM | Network
CVM IP Address Configuration

CVM NTP Time Synchronization

Duplicate Remote Cluster ID Check

Host IP Pingable

IP Configuration

SMTP Configuration

Subnet Configuration

Virtual IP Configuration

vCenter Connection Check

CVM | Protection Domain
Entities Restored Check

Restored Entities Protected

CVM | Services
Admin User API Authentication Check

CVM Rebooted Check

CVM Services Status

Cassandra Waiting For Disk Replacement

Certificate Creation Status

Cluster In Override Mode

Cluster In Read-Only Mode

Curator Job Status

Curator Scan Status

Kerberos Clock Skew Status

Metadata Drive AutoAdd Disabled Check

Metadata Drive Detached Check

Metadata Drive Failed Check

Metadata Drive Ring Check

Metadata DynRingChangeOp Slow Check

Metadata DynRingChangeOp Status

Metadata Imbalance Check

Metadata Size

Node Degradation Status

RemoteSiteHighLatency

Stargate Responsive

Stargate Status

Upgrade Bundle Available

CVM | Storage Capacity
Compression Status

Finger Printing Status

Metadata Usage

NFS Metadata Size Overshoot

On-Disk Dedup Status

Space Reservation Status

vDisk Block Map Usage

vDisk Block Map Usage Warning

Cluster | CPU
CPU type on chassis check

Cluster | Disk
CVM startup dependency check

Disk online check

Duplicate disk id check

Flash Mode Configuration

Flash Mode Enabled VM Power Status

Flash Mode Usage

Incomplete disk removal

Storage Pool Flash Mode Configuration

System Defined Flash Mode Usage Limit

Cluster | Hardware
Power Supply Status

Cluster | Network
CVM Passwordless Connectivity Check

CVM to CVM Connectivity

Duplicate CVM IP check

NIC driver and firmware version check

Time Drift

Cluster | Protection Domain
Duplicate VM names

Internal Consistency Groups Check

Linked Clones in high frequency snapshot schedule

SSD Snapshot reserve space check

Snapshot file location check

Cluster | Remote Site
Cloud Remote Alert

Remote Site virtual external IP (VIP)

Cluster | Services
AWS Instance Check

AWS Instance Type Check

Acropolis Dynamic Scheduler Status

Alert Manager Service Check

Automatic Dedup disabled check

Automatic disabling of Deduplication

Backup snapshots on metro secondary check

CPS Deployment Evaluation Mode

CVM same timezone check

CVM virtual hardware version check

Cassandra Similar Token check

Cassandra metadata balanced across CVMs

Cassandra nodes up

Cassandra service status check

Cassandra tokens consistent

Check that cluster virtual IP address is part of cluster external subnet

Checkpoint snapshot on Metro configured Protection Domain

Cloud Gflags Check

Cloud Remote Version Check

Cloud remote check

Cluster NCC version check

Cluster version check

Compression disabled check

Curator scan time elapsed check

Datastore VM Count Check

E-mail alerts check

E-mail alerts contacts configuration

HTTP proxy check

Hardware configuration validation

High disk space usage

Hypervisor version check

LDAP configuration

Linked clones on Dedup check

Multiple vCenter Servers Discovered

NGT CA Setup Check

Oplog episodes check

Pulse configuration

RPO script validation on storage heavy cluster

Remote Support Status

Report Generation Failure

Report Quota Scan Failure

Send Report Through E-mail Failure

Snapshot chain height check

Snapshots space utilization status

Storage Pool SSD tier usage

Stretch Connectivity Lost

VM group Snapshot and Current Mismatch

Zookeeper active on all CVMs

Zookeeper fault tolerance check

Zookeeper nodes distributed in multi-block cluster

vDisk Count Check

Cluster | Storage Capacity
Erasure Code Configuration

Erasure Code Garbage

Erasure coding pending check

Erasure-Code-Delay Configuration

High Space Usage on Storage Container

Storage Container RF Status

Storage Container Space Usage

StoragePool Space Usage

Volume Group Space Usage

Data Protection | Protection Domain
Aged Third-party Backup Snapshot Check

Check VHDX Disks

Clone Age Check

Clone Count Check

Consistency Group Configuration

Cross Hypervisor NGT Installation Check

EntityRestoreAbort

External iSCSI Attachments Not Snapshotted

Failed To Mount NGT ISO On Recovery of VM

Failed To Recover NGT Information

Failed To Recover NGT Information for VM

Failed To Snapshot Entities

Incorrect Cluster Information in Remote Site

Metadata Volume Snapshot Persistent

Metadata Volume Snapshot Status

Metro Availability

Metro Availability Prechecks Failed

Metro Availability Secondary PD sync check

Metro Old Primary Site Hosting VMs

Metro Protection domain VMs running at Sub-optimal performance

Metro Vstore Symlinks Check

Metro/Vstore Consistency Group File Count Check

Metro/Vstore Protection Domain File Count Check

NGT Configuration

PD Active

PD Change Mode Status

PD Full Replication Status

PD Replication Expiry Status

PD Replication Skipped Status

PD Snapshot Retrieval

PD Snapshot Status

PD VM Action Status

PD VM Registration Status

Protected VM CBR Capability

Protected VM Not Found

Protected VMs Not Found

Protected VMs Storage Configuration

Protected Volume Group Not Found

Protected Volume Groups Not Found

Protection Domain Decoupled Status

Protection Domain Initial Replication Pending to Remote Site

Protection Domain Replication Stuck

Protection Domain Snapshots Delayed

Protection Domain Snapshots Queued for Replication to Remote Site

Protection Domain VM Count Check

Protection Domain fallback to lower frequency replications to remote

Protection Domain transitioning to higher frequency snapshot schedule

Protection Domain transitioning to lower frequency snapshot schedule

Protection Domains sharing VMs

Related Entity Protection Status

Remote Site NGT Support

Remote Site Snapshot Replication Status

Remote Stargate Version Check

Replication Of Deduped Entity

Self service restore operation Failed

Snapshot Crash Consistent

Snapshot Symlink Check

Storage Container Mount

Updating Metro Failure Handling Failed

Updating Metro Failure Handling Remote Failed

VM Registration Failure

VM Registration Warning

VSS Scripts Not Installed

VSS Snapshot Status

VSS VM Reachable

VStore Snapshot Status

Volume Group Action Status

Volume Group Attachments Not Restored

Vstore Replication To Backup Only Remote

Data Protection | Remote Site
Automatic Promote Metro Availability

Cloud Remote Operation Failure

Cloud Remote Site failed to start

LWS store allocation in remote too long

Manual Break Metro Availability

Manual Promote Metro Availability

Metro Connectivity

Remote Site Health

Remote Site Network Configuration

Remote Site Network Mapping Configuration

Remote Site Operation Mode ReadOnly

Remote Site Tunnel Status

Data Protection | Witness
Authentication Failed in Witness

Witness Not Configured

Witness Not Reachable

File server | Host
File Server Upgrade Task Stuck Check

File Server VM Status

Multiple File Server Versions Check

File server | Network
File Server Entities Not Protected

File Server Invalid Snapshot Warning

File Server Network Reachable

File Server PD Active On Multiple Sites

File Server Reachable

File Server Status

Remote Site Not File Server Capable

File server | Services
Failed to add one or more file server admin users or groups

File Server AntiVirus – All ICAP Servers Down

File Server AntiVirus – Excessive Quarantined / Unquarantined Files

File Server AntiVirus – ICAP Server Down

File Server AntiVirus – Quarantined / Unquarantined Files Limit Reached

File Server AntiVirus – Scan Queue Full on FSVM

File Server AntiVirus – Scan Queue Piling Up on FSVM

File Server Clone – Snapshot invalid

File Server Clone Failed

File Server Rename Failed

Maximum connections limit reached on a file server VM

Skipped File Server Compatibility Check

File server | Storage Capacity
FSVM Time Drift Status

Failed To Run File Server Metadata Fixer Successfully

Failed To Set VM-to-VM Anti Affinity Rule

File Server AD Connectivity Failure

File Server Activation Failed

File Server CVM IP update failed

File Server DNS Updates Pending

File Server Home Share Creation Failed

File Server In Heterogeneous State

File Server Iscsi Discovery Failure

File Server Join Domain Status

File Server Network Change Failed

File Server Node Join Domain Status

File Server Performance Optimization Recommended

File Server Quota allocation failed for user

File Server Scale-out Status

File Server Share Deletion Failed

File Server Site Not Found

File Server Space Usage

File Server Space Usage Critical

File Server Storage Cleanup Failure

File Server Storage Status

File Server Unavailable Check

File Server Upgrade Failed

Incompatible File Server Activation

Share Utilization Reached Configured Limit

Host | CPU
CPU Utilization

Host | Disk
All-flash Node Intermixed Check

Host disk usage high

NVMe Status Check

SATA DOM 3ME Date and Firmware Status

SATA DOM Guest VM Check

SATADOM Connection Status

SATADOM Status

SATADOM Wearout Status

SATADOM-SL 3IE3 Wearout Status

Samsung PM1633 FW Version

Samsung PM1633 Version Compatibility

Samsung PM1633 Wearout Status

Samsung PM863a config check

Toshiba PM3 Status

Toshiba PM4 Config

Toshiba PM4 FW Version

Toshiba PM4 Status

Toshiba PM4 Version Compatibility

Host | Hardware
CPU Temperature Fetch

CPU Temperature High

CPU Voltage

CPU-VRM Temperature

Correctable ECC Errors 10 Days

Correctable ECC Errors One Day

DIMM Voltage

DIMM temperature high

DIMM-VRM Temperature

Fan Speed High

Fan Speed Low

GPU Status

GPU Temperature High

Hardware Clock Status

IPMI SDR Status

SAS Connectivity

System temperature high

Host | Memory
Memory Swap Rate

Ram Fault Status

Host | Network
10 GbE Compliance

Hypervisor IP Address Configuration

IPMI IP Address Configuration

Mellanox NIC Mixed Family check

Mellanox NIC Status check

NIC Flapping Check

NIC Link Down

Node NIC Error Rate High

Receive Packet Loss

Transmit Packet Loss

Host | Services
Datastore Remount Status

Node | Disk
Boot device connection check

Boot device status check

Descriptors to deleted files check

FusionIO PCIE-SSD: ECC errors check

Intel Drive: ECC errors

Intel SSD Configuration

LSI Disk controller firmware status

M.2 Boot Disk change check

M.2 Intel S3520 host boot drive status check

M.2 Micron5100 host boot drive status check

SATA controller

SSD Firmware Check

Samsung PM863a FW version check

Samsung PM863a status check

Samsung PM863a version compatibility check

Samsung SM863 SSD status check

Samsung SM863a version compatibility check

Node | Hardware
IPMI connectivity check

IPMI sel assertions check

IPMI sel log fetch check

IPMI sel power failure check

IPMI sensor values check

M10 GPU check

M10 and M60 GPU Mixed check

M60 GPU check

Node | Network
CVM 10 GB uplink check

Inter-CVM connectivity check

NTP configuration check

Storage routed to alternate CVM check

Node | Protection Domain
ESX VM Virtual Hardware Version Compatible

Node | Services
.dvsData directory in local datastore

Advanced Encryption Standard (AES) enabled

Autobackup check

BMC BIOS version check

CVM memory check

CVM port group renamed

Cassandra Keyspace/Column family check

Cassandra memory usage

Cassandra service restarts check

Cluster Services Down Check

DIMM Config Check

DIMMs Interoperability Check

Deduplication efficiency check

Degraded Node check

Detected VMs with non local data

EOF check

ESXi AHCI Driver version check

ESXi APD handling check

ESXi CPU model and UVM EVC mode check

ESXi Driver compatibility check

ESXi NFS heartbeat timeout check

ESXi RAM disk full check

ESXi RAM disk root usage

ESXi Scratch Configuration

ESXi TCP delayed ACK check

ESXi VAAI plugin enabled

ESXi VAAI plugin installed

ESXi configured VMK check

ESXi services check

ESXi version compatibility

File permissions check

Files in a stretched VM should be in the same Storage Container

GPU drivers installed

Garbage egroups check

Host passwordless SSH

Ivy Bridge performance check

Mellanox NIC Driver version check

NFS file count check

NSC (Nutanix Service Center) server FQDN resolution

NTP server FQDN resolution

Network adapter setting check

Non default gflags check

Notifications dropped check

PYNFS dependency check

RC local script exit statement present

Remote syslog server check

SMTP server FQDN resolution

Sanity check on local.sh

VM IDE bus check

VMKNICs subnets check

VMware hostd service check

Virtual IP check

Zookeeper Alias Check

localcli check

vim command check

Nutanix Guest Tools | VM
PostThaw Script Execution Failed

Other Checks
LWS Store Full

LWS store allocation too long

Recovery Point Objective Cannot Be Met

VM | CPU
CPU Utilization

VM | Disk
I/O Latency

Orphan VM Snapshot Check

VM | Memory
Memory Pressure

Memory Swap Rate

VM | Network
Memory Usage

Receive Packet Loss

Transmit Packet Loss

VM | Nutanix Guest Tools
Disk Configuration Update Failed

VM Guest Power Op Failed

iSCSI Configuration Failed

VM | Remote Site
VM Virtual Hardware Version Compatible

VM | Services
VM Action Status

VM | Virtual Machine
Application Consistent Snapshot Skipped

NGT Mount Failure

NGT Version Incompatible

Temporary Hypervisor Snapshot Cleanup Failed

VSS Snapshot Aborted

VSS Snapshot Not Supported

Host | Network
Hypervisor time synchronized

Sep
07

Windows Gets Some Love with #Docker EE 17.06

With the new release of Docker EE 17.06, Windows containers get lots of added features. First up is the ability to run Windows and Linux worker nodes in the same cluster. This is great because you have centralized security and logging across your whole environment. Your .NET and Java teams can live in peace, and you can consolidate your infrastructure instead of spinning up separate environments.

Continuous scanning for vulnerabilities in Windows images was added if you have an Advanced EE license. Not only does it scan images, it will also alert when new vulnerabilities are found in existing images.

Bringing everything together, you can use the same overlay networks to connect your application, for example SQL Server running on Windows and web servers running on Linux. Your developers can create a single compose file covering both the SQL and web servers, as sketched below.
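
As a rough sketch (not taken from the Docker announcement) of what a single stack file targeting both platforms could look like: the service names and image names below are placeholders, and the key parts are the shared overlay network and the per-service placement constraints that pin each service to the right operating system.

version: "3.3"

networks:
  appnet:
    driver: overlay

services:
  db:
    # Placeholder Windows-based SQL Server image; substitute your own
    image: microsoft/mssql-server-windows-developer
    networks:
      - appnet
    deploy:
      placement:
        constraints:
          - node.platform.os == windows

  web:
    # Placeholder Linux-based web tier image; substitute your own
    image: example/web:latest
    networks:
      - appnet
    deploy:
      placement:
        constraints:
          - node.platform.os == linux

Deployed with docker stack deploy against a mixed Swarm, the db service lands on Windows workers, the web service lands on Linux workers, and the two can reach each other by service name over the shared overlay network.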

Other New Windows related features in Docker 17.06:

Windows Server 2016 support
Windows 10586 is marked as deprecated; it will not be supported going forward in stable releases
Integration with Docker Cloud, with the ability to control remote Swarms from the local command line interface (CLI) and view your repositories
Unified login between the Docker CLI, Docker Hub, and Docker Cloud
Sharing a drive can be done on demand, the first time a mount is requested
Add an experimental DNS name for the host: docker.for.win.localhost
Support for client (i.e. “login”) certificates for authenticating registry access (fixes docker/for-win#569)
New installer experience

Aug
29

VMworld Attendees: Get to the Docker Booth to Save Money & Time Like Visa

The Docker booth is right beside the Nutanix booth at VMworld this year, so I have seen lots of people there, though not 23,000, and there should be. Docker has been a part of all the announcements, whether you realized it or not. There is lots of talk about Google with Kubernetes, but Kubernetes still requires Docker as the container engine, so whether it’s Swarm or Kubernetes you’re going to be using Docker. If you want enterprise support, Docker is the booth you want to be visiting to learn what they can do to help you develop better end-to-end software while saving you money.

Docker EE has been in production at Visa for over six months, and Visa is seeing improvements in a number of ways:

Provisioning time: Visa can now provision in seconds rather than days even while more application teams join the effort. They can also deliver just-in-time infrastructure across multiple datacenters around the world with a standardized format that works across their diverse set of applications.
Patching & maintenance: With Docker, Visa can simply redeploy an application with a new image. This also allows Visa to respond quickly to new threats as they can deploy patches across their entire environment at one time.
Tech Refresh: Once applications are containerized with Docker, developers do not have to worry about the underlying infrastructure; the infrastructure is invisible.
Multi-tenancy: Docker containers provide both space and time division multiplexing by allowing Visa to provision and deprovision microservices quickly as needed. This allows them to strategically place new services into the available infrastructure, which has allowed the team to support 10x the scale they could previously.

Visa moved a VM-based environment to containers running on bare metal and cut the time to provision and decommission its first containerized app by 50%. By saving time and money on the existing infrastructure and applications, organizations can reinvest the savings, both the time and the money, in transforming the business.

BTW, Nutanix can do bare metal or run AHV to provide a great experience for containers with our own Docker Volume plugin; a rough sketch of using it from a compose file follows.
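
As a quick, hypothetical sketch of what that looks like from the Docker side: the snippet below assumes the Nutanix Docker Volume plugin is installed and registered under the driver name "nutanix". The driver name, image, and volume name are assumptions for illustration only, so check the plugin documentation for the exact values your version expects.

version: "3.3"

services:
  db:
    # Placeholder workload; any stateful container works the same way
    image: postgres:9.6
    volumes:
      - dbdata:/var/lib/postgresql/data

volumes:
  dbdata:
    # Assumed driver name for the Nutanix Docker Volume plugin; the volume is
    # backed by Nutanix storage instead of the local host's disks
    driver: nutanix

The point is that the data lives on Nutanix storage and follows the container, rather than being tied to whichever host happened to run it.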

Aug
16

Move Your DBs From Cloud or 3-Tier Clunker to Nutanix with Xtract

Xtract for DBs enables you to migrate your Microsoft SQL Server instances from non-Nutanix infrastructures (source) to Nutanix Cloud Platform (target) with a 1-click operation. You can migrate both virtual and physical SQL Servers to Nutanix. Xtract captures the state of your source SQL Server environments, applies any recommended changes, recreates the state on Nutanix, and then migrates the underlying data.

Xtract is a virtual appliance that runs as a web application. It migrates your source SQL Server instances to Nutanix in the following four phases:

Scanning. Scans and discovers your existing SQL Server environments through application-level inspection.
Design. Creates an automated best practice design for the target SQL Servers.
Deployment. Automates the full-stack deployment of the target SQL Servers with best practices.
Migration. Migrates the underlying SQL Server databases and security settings from your source SQL Servers to the target SQL Servers.
Note: Xtract supports SQL Server 2008 R2 through SQL Server 2016 running on Windows 2008 through Windows 2012 R2.

Xtract first scans your source SQL Server instances, so that it can generate a best-practice design template for your target SQL Server environment. To scan the source SQL Server instances, Xtract requires the access credentials of the source SQL Server instances to connect to the listening ports.

You can group one or more SQL Server instances for migration. Xtract performs migrations at the instance level, which means that all databases registered to a SQL Server instance are migrated and managed as part of a single migration plan. Xtract allows you to create multiple migration plans to assist with a phased migration of different SQL Server instances over time.


Once the full restore is complete and transaction logs are in the process of getting replayed, you can perform the following actions on your SQL Server instances:

In the Migration Plans screen, you can perform one of the following:

Start Cutover
The cutover operation quiesces the source SQL Server databases by placing them in single-user mode, takes a final backup, restores the backup to the target server, and then brings all the databases in the target server online and ready for use. This action completes all migration activities for a migration plan.

Test

The test operation takes a point-in-time copy of the databases in the source instance and brings them online for testing in the target SQL Server instance. This action does not provide a rollback. Once a Test action has been initiated, you can perform tests on the copy. However, if you want to perform a cutover after the Test operation, you should begin again from the scanning phase.

Come to the Nutanix Booth at VMworld in Vegas to see it in action. One-click yourself out of your AWS bill.

Jun
30

The Down Low on Near-Sync On Nutanix

Nutanix refers to its current implementation of redirect-on-write snapshots as vDisk-based snapshots. Nutanix has continued to improve on its snapshot implementation by adding Light-Weight Snapshots (LWS) to provide near-sync replication. LWS uses markers instead of creating full snapshots for RPOs of 15 minutes and under. LWS further reduces the overhead of managing metadata and removes the overhead associated with the high number of frequent snapshots caused by long snapshot chains. The administrator doesn’t have to worry about setting a policy to choose between vDisk snapshots and LWS; the Acropolis Operating System (AOS) will transition between the two forms of replication based on the RPO and the available bandwidth. If the network can’t handle the low RPO, replication will transition out of near-sync. When the network can meet the near-sync requirements again, AOS will start using LWS again. In over-subscribed networks, near-sync can provide almost the same level of protection as synchronous replication without impacting the running workload.

The administrator only needs to set the RPO; no knowledge of near-sync is needed.

The tradeoff is that all changes are handled in SSD when near-sync is enabled. Due to this tradeoff, Nutanix reserves a percentage of SSD space to be used by LWS when it’s enabled.

[Figure: near-sync]

In the above diagram, a vDisk-based snapshot is taken first and replicated to the remote site. Once the full replication is complete, LWS will begin at the set schedule. If there is no remote site set up, LWS will happen locally right away. If you have the bandwidth available, life is good, but that’s not always the case in the real world. If you miss your RPO target repeatedly, replication will automatically transition back to vDisk-based snapshots. Once vDisk-based snapshots complete fast enough, it will automatically transition back to near-sync. Both transitioning out of and into near-sync are controlled by advanced settings called gflags.
On the destination side, AOS creates hydration points. A hydration point is a way for the LWS to transition into a vDisk-based snapshot. The process for inline hydration is:

1. Create a staging area for each VM (consistency group) that’s protected by the protection domain.
2. The staging area is essentially a directory with a set of vDisks for the VM.
3. Any new incoming LWSs are applied to the same set of vDisks.
4. The staging area can be snapshotted from time to time, which gives you individual vDisk-backed snapshots.

The source side doesn’t need to hydrate as a vDisk based snapshot is taken every hour.

Have questions? Please leave a comment.

Jun
29

ROBO Deployments & Operations Best Practices on Nutanix

The Nutanix platform’s self-healing design reduces operational and support costs, such as unnecessary site visits and overtime. With Nutanix, you can proactively schedule projects and site visits on a regular cadence, rather than working around emergencies. Prism, our end-to-end infrastructure management tool, streamlines remote cluster operations via one-click upgrades, while also providing simple orchestration for multiple cluster upgrades. Following the best practices in this new document ensures that your business services are quickly restored in the event of a disaster. The Nutanix Enterprise Cloud Platform makes deploying and operating remote and branch offices as easy as deploying to the public cloud, but with control and security on your own terms.

One section I would like to call out in the doc is how to seed your customer data if you’re dealing with poor WAN links.

Seed Procedure

The following procedure lets you use seed cluster (SC) storage capacity to bypass the network replication step. In the course of this procedure, the administrator stores a snapshot of the VMs on the SC while it’s installed in the ROBO site, then physically ships it to the main datacenter.

Install and configure application VMs on a ROBO cluster.
Create a protection domain (PD) called PD1 on the ROBO cluster for the VMs and volume groups.
Create an out-of-band snapshot S1 for the PD on ROBO with no expiration.
Create an empty PD called PD1 (same name used in step 2) on the SC.
Deactivate PD1 on the SC.
Create remote sites on the ROBO cluster and the SC.
Retrieve snapshot S1 from the ROBO cluster to the SC (via Prism on the SC).
Ship the SC to the datacenter.
ReIP the SC.
Create remote sites on the SC cluster and on the datacenter main cluster (DC1).
Create PD1 (same name used in steps 2 and 4) on DC1.
Deactivate PD1 on DC1.
Retrieve S1 from the SC to DC1 (via Prism on DC1). Prism generates an alert here, but although it appears to be a full data replication, the SC transfers metadata information only.
Create remote sites on DC1 and the ROBO cluster.
Set up a replication schedule for PD1 on the ROBO cluster in Prism.
Once the first scheduled replication is successful, you can delete snapshot S1 to reclaim space.

To get all of the best practices, please download the full document here: https://portal.nutanix.com/#/page/solutions/details?targetId=BP-2083-ROBO-Deployment:BP-2083-ROBO-Deployment

Jun
28

Rubrik and AHV: Say No to Proxies

For the last couple of years I have been a huge fan of backup software that removes the need for proxies. Rubrik provides a proxy-less backup solution by using the Nutanix Data Services Virtual IP address to talk directly to each individual virtual disk that it needs to back up.
Rubrik and Nutanix have some key advantages with this solution:
• AOS 5.1+ with the version 3 APIs provides changed region tracking, which allows for quick, efficient backups with no hypervisor-based snapshots.
• With AHV and data locality, Rubrik can grab the most recently changed data without flooding the network, which can happen when the copy and the VM don’t live on the same host. For Nutanix, the reads happen locally.
• Rubrik has access to every virtual disk by making an iSCSI connection, bypassing the need for proxies.
• AOS can redirect the second RF copy away from a node with its advanced data placement if the backup load becomes too great during a backup window, thus protecting your mission-critical apps that run 24/7.
• Did I mention no proxies? 🙂

Stop by the Rubrik booth and catch their session if you’re at .Next this week.