DATA-DRIVEN INSIGHTS FOR SMARTER IT

What’s Lurking in Your Virtualized Datacenter?

By Krishna Raj Raja (@esxtopGuru), CloudPhysics

Datacenters are fraught with hidden operational hazards. Because virtual infrastructure is extremely complex and dynamic, many vulnerabilities go undetected and are onerous to find. Known hazards, on the other hand, are often ignored because administrators underestimate or don't understand their scope, severity and risk. For example, we recently spoke with a well-known company that told us it had brought down a large portion of its datacenter during an upgrade because of incompatibilities between its PCI-E cards and ESX. A seemingly trivial issue resulted in hours of costly disruption to the business.

In the “spirit” of Halloween we created an infographic (see below) to shed some light on things that may be haunting your datacenter. Before I get into that, let me tell you about some tricks and a treat from CloudPhysics:

  • TRICKS: I’ve written a Halloween Cookbook for CloudPhysics users that compiles tips and tricks for using our analytics to roust goblins from your datacenter.
  • TREAT: Halloween Special: Let us find your spooks for you! Our data scientists will produce an Insights Report highlighting where hazards lurk in your datacenter. It’s free and only takes about 15 minutes of your time. Our analytics will do the rest. You can request this free report today (see box at right).

Here's the infographic, followed by an explanation of what we’ve found. The source for these stats is the CloudPhysics global data set, which has more than 50 trillion points of machine metadata from datacenters of all shapes and sizes around the world.

Halloween Special:
Let us find your spooks!

Request a free analysis and report highlighting where hazards lurk in your datacenter. It requires about 15 minutes of your time; our experts do the rest.

Please use your corporate email to qualify for the offer.
Available for datacenters running vSphere 4.1 and higher.
No purchase necessary.

Availability: Application Downtime

Applications are the lifeblood of an organization: when an application slows down or goes down, productivity, and often profitability, is severely compromised. Most organizations with a virtualized datacenter rely on hypervisor high-availability features to keep applications running. That makes it all the more surprising to find that 41% of VMware HA clusters do not have admission control enabled. Without admission control, HA cannot guarantee that all the VMs in the cluster will be successfully powered on in the event of a host failure.
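To make the admission-control point concrete, here is a minimal sketch of the underlying capacity question: if the largest host fails, do the survivors still have room for everything that is reserved? The function name, the host and VM names, and the CPU-only MHz figures are all assumptions for illustration. This is an aggregate check, not the slot-based or percentage-based policy that vSphere HA actually applies.

```python
from typing import Dict


def survives_host_failure(host_capacity_mhz: Dict[str, int],
                          vm_reservation_mhz: Dict[str, int],
                          failures_to_tolerate: int = 1) -> bool:
    """Return True if total reserved CPU still fits after the largest hosts fail.

    Simplified illustration: it ignores memory, per-host fragmentation,
    and the real HA admission control policies.
    """
    # Worst case: the hosts with the most capacity are the ones that fail.
    surviving = sorted(host_capacity_mhz.values())
    if failures_to_tolerate > 0:
        surviving = surviving[:-failures_to_tolerate]
    return sum(vm_reservation_mhz.values()) <= sum(surviving)


# Hypothetical three-host cluster with 20 GHz per host.
hosts = {"esx01": 20000, "esx02": 20000, "esx03": 20000}
vms = {"db01": 25000, "app01": 20000}  # 45 GHz reserved in total
print(survives_host_failure(hosts, vms))
```

With admission control off, nothing stops a cluster from drifting into the state shown here, where the reservations exceed what two surviving hosts can supply.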

Similarly, running out of disk space can wreak havoc and cause application downtime. Our analysis shows that every week 43% of organizations risk application downtime due to a “disk full” condition in the guest.

Software bugs are another source of downtime. Often vendors discover and document these issues in knowledge base articles. Every month VMware and other vendors release or update about 200 knowledge base articles and roughly 10 of these are for critical data loss or server outage issues. The sheer number of KB articles can overwhelm IT teams that have a backlog of critical tasks and projects. As a result, there are scores of issues that are simply ignored for lack of time, but are operational time bombs waiting to go off.

Another big problem for virtual datacenters is contention for storage resources, which causes application latency and unresponsiveness. Bully VMs – those that consume more than their fair share of shared resources – are hard to detect and hazardous. On average, each bully victimizes 5 other VMs, starving them of the resources they need to run properly. Contention is one of the toughest problems to pinpoint and troubleshoot because of the way virtualization scrambles storage I/O.
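One simple way to think about a bully is as a VM whose I/O is far above an even split of the datastore's load. The sketch below assumes you already have per-VM IOPS for one shared datastore. The function name, the sample numbers, and the "twice the fair share" threshold are hypothetical choices for illustration, not a description of how CloudPhysics detects bullies.

```python
from typing import Dict, List


def find_bullies(iops_by_vm: Dict[str, float],
                 fair_share_multiple: float = 2.0) -> List[str]:
    """Return VMs whose IOPS exceed a multiple of the even per-VM share."""
    total = sum(iops_by_vm.values())
    if not iops_by_vm or total == 0:
        return []
    fair_share = total / len(iops_by_vm)  # even split across all VMs
    return sorted(vm for vm, iops in iops_by_vm.items()
                  if iops > fair_share_multiple * fair_share)


# Hypothetical datastore: one database VM and five quieter neighbors.
datastore = {"db01": 9000, "web01": 300, "web02": 250,
             "app01": 200, "app02": 150, "test01": 100}
print(find_bullies(datastore))
```

A raw IOPS comparison like this only flags heavy consumers. Confirming that neighbors are actually being starved also requires looking at their latency over the same interval.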

Utilization: Dead Space and Zombie VMs

Storage is the biggest cost in the datacenter, and storage growth threatens to take on a life of its own. Today, lots of folks take comfort in using thin provisioning (either at the virtualization layer or at the storage array), thinking that it will reduce storage usage. Yet we found that 26% of the disk space used by virtual machines is dead space. What exactly is dead space? It is space that was previously allocated but whose contents have since been deleted, so it is no longer in use. Dead space exists at the virtualization layer as well as in the guest operating systems. Newer versions of ESX and some newer storage arrays can manage dead space at the virtualization layer, but the dead space in the guest operating system is invisible both to the virtualization layer and to the storage array. Why should you care? Because you can easily reclaim this space.
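The arithmetic behind a dead-space figure is straightforward: compare what each virtual disk has allocated on the datastore with what the guest file system reports as in use. Here is a minimal sketch under that assumption. The function names and the sample sizes are made up for illustration and are not taken from the CloudPhysics data set.

```python
from typing import Iterable, Tuple


def dead_space_gb(allocated_gb: float, guest_used_gb: float) -> float:
    """Space the disk still holds on the datastore but the guest no longer uses."""
    return max(0.0, allocated_gb - guest_used_gb)


def dead_space_fraction(disks: Iterable[Tuple[float, float]]) -> float:
    """Fraction of allocated space that is dead, over (allocated, guest_used) pairs."""
    disks = list(disks)
    allocated = sum(a for a, _ in disks)
    dead = sum(dead_space_gb(a, u) for a, u in disks)
    return dead / allocated if allocated else 0.0


# Two hypothetical thin disks: (GB allocated on the datastore, GB used in the guest).
sample = [(100.0, 70.0), (60.0, 50.0)]
print(f"{dead_space_fraction(sample):.0%} of allocated space is dead")
```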

Zombie VMs are another source of wasted space: 16% of VMs are powered off or suspended and never used again. They are the living dead in your datacenter. Why living dead? Because these virtual machines are not active and do not consume CPU or memory resources, yet they occupy valuable disk space on expensive storage arrays.
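You cannot know in advance that a VM will never be used again, so in practice a zombie hunt comes down to an idle threshold. The sketch below flags VMs that are powered off or suspended and have been idle longer than a cutoff. The `VM` record, its field names, the sample inventory, and the 90-day default are all assumptions for illustration. The power-state strings are modeled on the names vSphere uses.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class VM(NamedTuple):
    name: str
    power_state: str       # "poweredOn", "poweredOff" or "suspended"
    last_active: datetime  # last time the VM was running
    disk_gb: float         # disk space the VM occupies


def find_zombies(vms: List[VM], now: datetime, idle_days: int = 90) -> List[VM]:
    """Return powered-off or suspended VMs that have been idle past the cutoff."""
    cutoff = now - timedelta(days=idle_days)
    return [vm for vm in vms
            if vm.power_state in ("poweredOff", "suspended")
            and vm.last_active < cutoff]


def reclaimable_gb(zombies: List[VM]) -> float:
    """Total disk space held by the zombie VMs."""
    return sum(vm.disk_gb for vm in zombies)


# Hypothetical inventory.
inventory = [
    VM("old-test", "poweredOff", datetime(2014, 1, 15), 40.0),
    VM("web01", "poweredOn", datetime(2014, 10, 30), 60.0),
    VM("paused-build", "suspended", datetime(2014, 10, 1), 20.0),
]
zombies = find_zombies(inventory, now=datetime(2014, 10, 31))
print([vm.name for vm in zombies], reclaimable_gb(zombies))
```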

Vulnerabilities: Heartbleed and Shellshock

Organizations are constantly at risk from security vulnerabilities such as the well-known SSL Heartbleed and more recent Shellshock security bugs. Major vendors such as VMware quickly released patches for both issues, but patch adoption is surprisingly slow. Take, for instance, the SSL Heartbleed issue, which was patched in April by VMware. We examined our global dataset in July and discovered 50% of vulnerable ESX hosts were unpatched. After communicating this finding to our users and releasing a method for them to determine their vulnerability, we ran the same analysis this week and found that 22% of ESX hosts remain unpatched. While that is a substantial improvement, many hosts remain vulnerable.

Shellshock is more recent. This issue affects all Linux virtual machines, including virtual appliances, as well as hosts running older versions of ESX (4.1 and below). In our global dataset, we found 27% of all virtual machines run Linux and are therefore exposed to Shellshock. We also found 7% of ESX hosts are still running the ESX classic version 4.1 and below, which means they are also exposed.

End of Life: Unsupported Software

Running unsupported software is inherently risky. One commonly used operating system, Windows 2003, is headed for the graveyard when it reaches the end of its support life next year. Our analysis found that Windows 2003 accounts for 25% of the total Windows VMs running in the datacenter. Further, 5.4% of the Windows VMs run Windows XP, which already reached the end of its support life this year. In addition, over 6.2% of ESX hosts are running ESX version 4.1 and below, versions that have already reached the end of their VMware support life.
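Checking a host inventory against the "4.1 and below" cutoff is a simple version comparison. Here is a minimal sketch, assuming each host reports a dotted version string. The function names and the host list are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "4.1.0" into (4, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


def is_unsupported(version: str, cutoff: Tuple[int, int] = (4, 1)) -> bool:
    """True if the major.minor version is at or below the cutoff."""
    return parse_version(version)[:2] <= cutoff


def unsupported_hosts(versions_by_host: Dict[str, str]) -> List[str]:
    """Return the hosts running a version at or below the cutoff."""
    return sorted(host for host, version in versions_by_host.items()
                  if is_unsupported(version))


# Hypothetical inventory of ESX hosts and their reported versions.
hosts = {"esx01": "5.5.0", "esx02": "4.1.0", "esx03": "4.0.0", "esx04": "5.1.0"}
print(unsupported_hosts(hosts))
```

Comparing integer tuples rather than raw strings avoids the usual trap where "10.0" sorts before "4.1".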

Summary

These are just a few of the issues that could be haunting your datacenter, and the thought of trying to find all these creepy crawlies may seem downright frightening.

But don’t be scared, be prepared. Call in the experts at CloudPhysics to show you how our data-driven insights can help you quickly and easily find and exorcise operational hazards in your virtual datacenter.