I am a big fan of virtualization. My feeling is that many – if not most – workloads in small to medium-sized enterprises should be running as virtual machines on virtualization hosts. BUT please be very, very careful how you build those systems!
The truth about virtualization is that it is a platform with which you can provide a highly flexible computing environment. This includes a ton of wonderful features and benefits. But, before you go trip over your pants leg, here is a tip: highly available virtualization environments do sometimes fail! (Sooner or later, everything does.) So my recommendation is to be very careful in designing and protecting your HA solutions – provide two or more of everything. Virtualization technologies will save you boatloads of money if you build them right, so don’t scrimp on the details!
So here are my simple rules:
- Provide two or more Virtualization Hosts. Make sure they’re sized such that if one should fail, you have the capacity on the surviving host(s) to restart any critical workloads that are affected by the failure.
- Shared storage (e.g., a SAN) is a necessity for live migration (VMware calls it vMotion; Hyper-V calls it Live Migration), which allows you to move running virtual machines from one host to another, either to balance the workload or to unload a host so you can perform maintenance on it. It’s also what enables you to restart critical workloads on a surviving host if one should fail. But to keep the SAN itself from becoming a single point of failure, you should provide at least two SAN nodes that are configured to replicate your data.
- Back up your data and your VMs using tools that allow both image-based and folder-based backups. When recovering from a catastrophic failure, restoring a server image is often the fastest way to get things running again – but you don’t want to go to the trouble of restoring a complete server image if all you need are a couple of files. So a schedule that encompasses both kinds of backups is best.
- Make certain that you get data and server images offsite religiously. Rule #1 for Disaster Recovery / Business Continuance is to get the data out of the building.
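The sizing advice in the first rule is easy to sanity-check with a little arithmetic. Here is a minimal sketch in Python, assuming memory is the limiting resource and that you only care whether the survivors have enough total free memory. The host names, VM names, and sizes are made up for illustration, and a real capacity plan would also look at CPU, storage I/O, and whether each VM actually fits on a single host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    host: str        # host the VM normally runs on
    memory_gb: int
    critical: bool   # must be restarted elsewhere if its host fails


def survives_single_host_failure(host_memory_gb, vms):
    """For each host, report whether the surviving hosts have enough
    free memory (in aggregate) to restart that host's critical VMs.

    Simplification: VMs already running on the survivors keep their
    memory, and free memory is pooled across survivors.
    """
    results = {}
    for failed in host_memory_gb:
        survivors = [h for h in host_memory_gb if h != failed]
        total = sum(host_memory_gb[h] for h in survivors)
        in_use = sum(vm.memory_gb for vm in vms if vm.host in survivors)
        needed = sum(
            vm.memory_gb for vm in vms if vm.host == failed and vm.critical
        )
        results[failed] = bool(survivors) and needed <= total - in_use
    return results


# Hypothetical two-host setup, 128 GB each.
hosts = {"host-a": 128, "host-b": 128}
vms = [
    VM("db", "host-a", 48, critical=True),
    VM("test", "host-a", 32, critical=False),
    VM("mail", "host-b", 32, critical=True),
    VM("files", "host-b", 32, critical=True),
]

print(survives_single_host_failure(hosts, vms))
# {'host-a': True, 'host-b': False}
```

In this made-up example, losing host-a is fine, but losing host-b is not: host-a has only 48 GB free and host-b's critical VMs need 64 GB. That is exactly the kind of gap you want to find on paper rather than during an outage.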
These simple rules allow for a significant amount of reliability and flexibility. Even with inexpensive hardware and software (there are a number of excellent software products that are free to use), your systems can continue to run or be easily restarted within minutes of a hardware failure. In many cases, even the total loss of two servers (one virtualization host and one SAN node, for example) would be a minor event in terms of its impact on operations. If you are religious about taking your data and image backups offsite, your entire system could be up and running within a day even if you were not able to get to your main location for some reason.
Since a virtualized infrastructure is so resilient, you can afford to use computer systems that are not necessarily top-of-the-line, but you can’t afford not to build it right. A long-time customer (you know who you are) once told us, “The worst thing I could do would be to spend $25,000 on my new systems when I should have spent $30,000.” The dollar amounts aren’t the important thing here – it’s the concept that when you cut corners on something, the chances are high that sooner or later it will come back and bite you. You’ll never be sorry if you take the time and effort to make sure you do it right.

