Back with another Moose Logic video for your viewing pleasure. In this installment, our own Steve Parlee, Moose Logic’s Director of Engineering, talks about SAN storage repository design concepts, and the effects your design choices have on things like snapshots, disk usage, and overall performance. In the process, you’ll also learn what we consider to be “best practice,” and some of the reasons why. As always, your comments will be appreciated. Enjoy!

Back in the old days of minicomputers and mainframes, we used to joke about IBM’s ability to, for all intents and purposes, get the customer to sign a blank check. They were better than anybody I’ve ever seen at getting people to commit to a solution when they really had no idea what the ultimate cost would be – and they were successful because of another cliche (which became a cliche because it was so accurate): “Nobody ever got fired for buying from IBM.” The message was basically, “Yes, we may be more expensive than everybody else, but we’ll take care of you.”

For the most part, those days are long gone, which made it all the more amazing to me to read that VMware is adopting per-VM licensing for most of its management products.

The article nails the basic problem with this licensing approach:

You know how many processors you have on a system, and that’s a fixed number. But the number of VMs on one host — let alone throughout your entire infrastructure — is regularly in flux. How do you plan your purchasing around that? And how do you make sure you don’t violate your licensing terms?

Hey, it’s easy – you just let VMware tell you what to put on your check at the end of the year:

You estimate your needs for the next year and buy licenses to meet those needs. Over the course of those 12 months, vCenter Server calculates the average number of concurrently powered-on VMs running the software. And if you end up needing more licenses to cover what you used, you just reconcile with VMware at the end of the year.

And, before you ask, no, you don’t get money back if you use fewer licenses than you originally purchased.

Sounds to me like a sweet deal – for VMware.
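The year-end reconciliation quoted above can be modeled in a few lines of Python. This is a hypothetical model, not VMware’s published formula. It assumes that usage arrives as periodic samples of concurrently powered-on VMs and that those samples are averaged. It also assumes that a fractional average is rounded up to a whole license. The result is floored at zero, because unused licenses are not refunded.

```python
import math
from typing import Sequence


def true_up(purchased: int, powered_on_samples: Sequence[int]) -> int:
    """Extra licenses owed at year end under the assumed model.

    purchased:          licenses bought at the start of the year
    powered_on_samples: periodic counts of concurrently powered-on VMs
                        (hypothetical input; this post does not describe
                        how vCenter samples usage)
    """
    if not powered_on_samples:
        raise ValueError("need at least one usage sample")
    average = sum(powered_on_samples) / len(powered_on_samples)
    # Assumption: a fractional average is rounded up to a whole license.
    shortfall = math.ceil(average) - purchased
    # No money back for unused licenses, so never return a negative number.
    return max(0, shortfall)


if __name__ == "__main__":
    # Twelve made-up monthly samples for an environment that grew all year.
    monthly = [40, 42, 45, 50, 55, 60, 58, 62, 65, 70, 72, 75]
    print(true_up(50, monthly))  # prints 8
```

The asymmetry is the point. With the same usage, buying 70 licenses up front returns 0, and you get no credit for the 12 licenses beyond the rounded average of 58.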

By comparison, the most expensive version of XenServer is $5,000 per server (not per processor, not per VM), and all of the management functionality is included. And the basic version of XenServer, which includes live motion, is free, and still includes the XenCenter distributed management software. (Here’s a helpful comparison chart of which features are included in which version of XenServer.)

A number of years ago, I attended a seminar that discussed the product adoption curve, and how products moved from the “innovation” phase to the “commodity” phase. The inflection point for a particular market was referred to as the “point of most” – where most of the products met most of the needs of most of the customers most of the time. When this point is reached, additional feature innovation no longer justifies a premium price.

The fact is that XenServer and Hyper-V are rapidly achieving feature parity with VMware. If we haven’t reached the “point of most” yet, we certainly will before much more time goes by. So even if you have a substantial investment in VMware already, at some point you have to re-examine what it’s costing every year, don’t you? Or are you OK with just signing a check and letting them fill in the amount later?

I recently implemented both the new Citrix Access Gateway (CAG) VPX and the Branch Repeater VPX within our development lab. Both are “virtual appliances” designed to run directly on a XenServer host. Both are impressive products and work great – in fact, we can use “live motion” to move the CAG between XenServers while running video in a XenDesktop session with not even a pause in the video playback. The CAG moves with no interruption in service. NONE!

But this isn’t just a post to sing the praises of the virtual appliances. Rather, it’s about LICENSING!!! Specifically, licensing the Branch Repeater VPX.

As with many Citrix products, obtaining the license and getting it properly installed is not necessarily easy and intuitive…and in many cases (particularly with new products), we’ve found that the Citrix licensing support team does not know all the ins and outs of licensing a specific product either. That is not intended as a slam on this team. They do the best they can – but Citrix is a big company now, and sometimes it takes a while for information on new products to filter down to the front-line troops. In this case they worked with me for quite some time until we got this figured out (so there is at least one guy on the Citrix support team who now knows how this works).

So…now that I’ve gone through the pain, I thought I’d try to spare you from it if I can. (You’re welcome.)

One complication you’ll encounter is that, depending upon what you’re attempting to accomplish, these appliances may require one license or two. For example, with the CAG, if you are only going to use it for running secured sessions to a web interface (the equivalent of the legacy Citrix Secure Gateway), then you only need a “platform license.” However, if you also plan to run SSL VPN sessions through the CAG, you will need Access Gateway Universal licenses for your users, which will be rolled into a second license file.

Access Gateway licensing isn’t new and it’s pretty well understood. But what about the Branch Repeater? Just as with the CAG, the Branch Repeater may require one license or two, depending upon the functionality you need. If you are going to use the Branch Repeater VPX to connect to another (physical or virtual) Branch Repeater then you only need a platform license. However, if you want to take advantage of its ability to support client PCs that use the Branch Repeater Plug-in, you will need a second license to enable that feature. So we finally come to the topic of this post: how do you get the license file(s) onto your new Branch Repeater VPX?

First, you must log onto the “MyCitrix” web site with your account credentials, and access the Licensing Tool Box to activate and allocate the license. That part of the process is well documented, and if you’re a Citrix customer, you’ve probably done it at least once. The tricky part is what you have to do to download the VPX license file, what you need to enter in the Repeater itself, where to put it, and what you should see.

Here’s what we learned:

  1. On the Branch Repeater VPX Web-based management interface, access the “Manage Licenses” screen, and in the right panel, choose “local” as shown below, and click the “Apply” button.
    License Server Configuration

  2. Then click on the “License Information” tab, and you will see something similar to this next image. What you need from this screen is the “Local License Server Host Id.” Write it down; you will need it in the next step.
    Information Used for License Management

  3. Now you can download the license file from your “MyCitrix” portal. Save it to your PC, and make a note of where you saved it. As part of the download process, you must enter the license server ID. Traditionally, you would enter the name of the Citrix license server in this field (and it was case-sensitive, which tripped up a lot of users). In this case, however, the system expects the MAC address of the Branch Repeater VPX itself, which is the “Local License Server Host Id” you wrote down in Step 2. Another difference is that the License Server Host Type used to always be set to “HostName.” There is now a drop-down box with a second choice, “ETHERNET.” For the Branch Repeater VPX, select “ETHERNET,” and then enter that host ID:
    Downloading the License File from MyCitrix

    In case you’re wondering, the MAC address we’re using is the address of the first interface on the Branch Repeater VPX, as displayed in XenCenter. To find it in XenCenter, click on the VM in the left column, and then select the Network tab in the right pane. You should see it there:
    XenCenter Display

  4. Now that you have your license downloaded to your local PC, you need to add it to your Branch Repeater. Access the “Local Licenses” tab and click the Add button (note that you will not see all the content in the window as shown here until you’ve added your license):
    Local Licenses Display

    After you click Add, this screen will appear and you will need to browse to the location where you saved your license file, and click the “Install” button:
    Add License

    Now the “Local Licenses” tab should be populated with content:
    Local Licenses Display

    Next, go to the “Licensed Features” tab. You should see your features listed as shown below:
    Licensed Features

  5. As mentioned earlier, if you plan to support client PCs that have the Branch Repeater Plug-in, you will need another license to enable this feature. Obtain the Plug-in license from your MyCitrix portal using the same procedure you followed for the platform license. Then add it to the virtual appliance in the same way you added the platform license. When that’s done, click the down arrow under “Local Licenses,” and you will see both licenses:
    Manage Licenses Screen

    Finally, if you click the “Licensed Features” tab, both licenses should show up with the number of licenses available:
    Licensed Features

This should be all you need to get the Branch Repeater VPX licensed. Now you just need to get it configured correctly… but that’s another blog post.
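Step 3 depends on typing the right address. A small check that the host ID from Step 2 matches the MAC address shown in XenCenter can save a trip back to MyCitrix. This is a hypothetical helper, not a Citrix tool. It assumes the two screens may format the same address differently (colons, hyphens, or no separators; upper or lower case), so it compares only the 12 hex digits. The addresses in the example are made up.

```python
import re


def mac_digits(value: str) -> str:
    """Return the 12 hex digits of a MAC address, upper-cased, separators removed."""
    digits = re.sub(r"[\s:.\-]", "", value).upper()
    if not re.fullmatch(r"[0-9A-F]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {value!r}")
    return digits


def same_mac(host_id: str, xencenter_mac: str) -> bool:
    """True if both strings name the same MAC address, ignoring formatting."""
    return mac_digits(host_id) == mac_digits(xencenter_mac)


if __name__ == "__main__":
    # Made-up values: one without separators, one colon-separated.
    print(same_mac("163E5A0B2C91", "16:3e:5a:0b:2c:91"))  # prints True
```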

It’s 8 pm on a Sunday evening, and I get a panicked call from a customer who cannot connect to his XenServers™ via the XenCenter™ management tool. As near as he can tell, however, all of the hosted virtual machines are up and running and healthy. He has tried pointing XenCenter at another member of the XenServer pool, without success.

So what happened and how do you fix it?

This situation can happen for several reasons, but generally it happens when there are only two servers in the XenServer pool and the pool master suddenly fails. In essence, the surviving server (let’s just call it the “slave”) can no longer see its peer, the pool master, so it assumes it has been stranded and goes into emergency mode to protect its own VMs. There are other ways this can happen (an incorrectly configured pool with HA turned on, for example), but this is the most common cause I have personally experienced.

Depending upon the situation, you may not be able to ping the master server because it is actually down. Or you may be able to ping it but find that it is in an inconsistent, “locked up” state in which it cannot answer requests. If you are able to connect to the console of the master server, either directly with a monitor, keyboard, and mouse (the old-fashioned way) or through a remote management interface (DRAC, iLO, ILOM, etc.), the server may appear to be running, but you may not be able to do anything with it.

At this point you may be thinking, “This is no big deal – just reboot the machine and it will be fine.” If you are lucky that may actually solve the problem, but in many cases it will not. What you might see is that after the master reboots you will be able to connect to the master but you will not see the slave. Or it may be that your master is truly broken and you are not able to simply reboot it due to a system or hardware failure. But, of course, you’ve still got to get your pool online and working again regardless.

During this period, if you try to use a tool such as PuTTY to connect to the slave via its management interface, you may not be able to connect to it either. If you ping the slave on its management interface, you may not get any replies. But if you connect to the console of the slave (again, either the physical console or a remote management interface), you will probably see that the machine is running. If you look at xsconsole, however, the management interface will appear to be gone, because no IP address is showing. By now you’re probably scratching your head, because, strangely, all the VMs are still running.

So at this point your master appears to be down (or at least impaired), the slave has no management interface, your pool is broken, and you cannot manage the VMs. So what do you do?

Well, if this happens to you and your VMs are still up and running, the first thing you should do is take a deep breath, because more than likely it is not as bad as you might think. XenServer is a robust platform, and if the infrastructure is built correctly (and I’m going to quote a customer here), “you can really slam the things around and they still work.”

After you take a deep breath and let it out slowly, from the console of the slave server, you will need to access the command line and start by typing:

xe host-is-in-emergency-mode

If the server returns an answer of “True” then you’ve confirmed that the server has gone into emergency mode in order to protect itself and the VMs running on it. (If the server returns an answer of “False” then you can stop reading, because the rest of this post isn’t going to help you.)

Assuming you receive the answer “True,” the slave server is in emergency mode because it cannot see a master. Either the master is actually down, or the management interfaces are not working. Therefore, the next step is to promote the slave to master to get it out of emergency mode. We do this by typing:

xe pool-emergency-transition-to-master

At this point the slave server should take over as the pool master and the management interface should be available again. Now if you type the xe host-is-in-emergency-mode command again you should get an answer of “False”.

Now, open XenCenter again. It will first try to connect to the server that was the master, but after that attempt times out, it will try to connect to the new master server. Be patient; it will eventually connect (it may take several seconds), and you will again see your pool and be able to manage your VMs. If some of the VMs are down because they were on the server that failed, you’ll be able to start them on the remaining server (assuming you have shared backend storage and sufficient processor and memory resources).
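You may prefer to script the check-and-promote sequence above rather than type it under pressure. Here is a minimal Python sketch that wraps the same two xe commands. It assumes it runs in the slave’s console (dom0) with xe on the PATH, and that the emergency-mode query prints true or false, which the code compares case-insensitively. The injectable `run` parameter is only a device so the logic can be exercised without a real host. It is not part of any XenServer API.

```python
import subprocess
from typing import Callable, List

# A runner takes xe arguments and returns the command's standard output.
Runner = Callable[[List[str]], str]


def run_xe(args: List[str]) -> str:
    """Run an xe command in dom0 and return its trimmed standard output."""
    result = subprocess.run(["xe", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def in_emergency_mode(run: Runner = run_xe) -> bool:
    """Ask this host whether it has dropped into emergency mode."""
    return run(["host-is-in-emergency-mode"]).strip().lower() == "true"


def recover_slave(run: Runner = run_xe) -> bool:
    """Promote a stranded slave to master, following the steps above.

    Returns False (and changes nothing) if the host is not in emergency
    mode; returns True after a successful promotion.
    """
    if not in_emergency_mode(run):
        # The "False" case: this procedure does not apply.
        return False
    run(["pool-emergency-transition-to-master"])
    # Re-check, just as you would re-run the query by hand.
    if in_emergency_mode(run):
        raise RuntimeError("host is still in emergency mode after promotion")
    return True
```

On the slave, you would simply call `recover_slave()` and then reconnect XenCenter as described above. When the host is not in emergency mode, the function returns without issuing the promotion command.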

Now what about the master if it has totally failed? What do I do after I’ve fixed, say, a hardware problem in order to return it to my pool?

If the following two conditions are true:

  1. You are using shared storage so that your VMs are not stored on the XenServer local drives, and
  2. You have built your XenServers with HBAs (Fibre Channel or iSCSI) rather than using the Open iSCSI software initiator, which means the connectivity information for your backend SAN is stored within the HBA,

…then it may be much simpler and quicker just to reload the XenServer operating system. (If you do not have shared backend storage, meaning your VMs are on local storage, DO NOT DO THIS.) I can rebuild my XenServers from scratch in about 20 – 30 minutes and have them back in the pool and running.

If either of those two conditions is not true then, depending upon your situation, recovery may be significantly more difficult. It could be as simple as resetting your Open iSCSI settings and connecting back to your SAN (still easy but takes more time to accomplish) or it could be as painful as rebuilding your VMs because you lost your server drives. (OUCH!)

Real-world example: I recently had a NIC fail on the motherboard of my master server. Of course, since the NIC was on the motherboard, the whole motherboard had to be replaced, which significantly changed the hardware configuration of that server.

In this case, when I brought that XenServer back online, XenCenter still showed all the information about the old NICs, plus all the new NICs from the new hardware. Yes, I could have used xe pif-forget commands to remove the NICs that no longer existed and then reconfigured everything, but that would have taken a while to straighten out. Since I had iSCSI HBAs attached to a DataCore SAN (great product, by the way) for shared storage, all I did was reload XenServer on that machine, modify the multipath-enabled.conf file (a topic for another blog post), and rejoin the server to the pool. Because the HBAs already had all the iSCSI information saved on the card, the storage automatically reconnected all the LUNs, and the network interfaces took on the configuration of the pool. I was back online and running in less than 30 minutes.

After you repair the machine that failed and get it back online, you may want it to be the master server again. To do this, type:

xe host-list

You will get a list of available servers with their UUIDs. Record the UUID of the server that you want to designate as the new master, and then type:

xe pool-designate-new-master host-uuid=[the uuid of the host you want]

After you type this your pool will again disappear from XenCenter, but after about 20 – 30 seconds (be patient) it will reappear with the new server as the master. Your pool should now be healthy, and you should again be able to manage servers as normal.
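You can also script the UUID lookup and the redesignation. This sketch assumes the host you want as master has a unique name-label (the `"xs-host-01"` used in the usage note is hypothetical). It relies on two standard xe list options, `params=uuid` and `--minimal`, to return the bare UUID, so you don’t have to copy it from the full `xe host-list` output by hand. As in the previous sketch, the injectable `run` parameter exists only for testing.

```python
import subprocess
from typing import Callable, List

# A runner takes xe arguments and returns the command's standard output.
Runner = Callable[[List[str]], str]


def run_xe(args: List[str]) -> str:
    """Run an xe command in dom0 and return its trimmed standard output."""
    result = subprocess.run(["xe", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def host_uuid(name_label: str, run: Runner = run_xe) -> str:
    """Look up exactly one host UUID by name-label."""
    out = run(["host-list", f"name-label={name_label}", "params=uuid", "--minimal"])
    uuids = [u for u in out.split(",") if u]
    if len(uuids) != 1:
        raise LookupError(f"expected one host named {name_label!r}, found {len(uuids)}")
    return uuids[0]


def designate_master(name_label: str, run: Runner = run_xe) -> str:
    """Make the named host the pool master and return its UUID."""
    uuid = host_uuid(name_label, run)
    run(["pool-designate-new-master", f"host-uuid={uuid}"])
    return uuid
```

A call such as `designate_master("xs-host-01")` returns as soon as xe accepts the command. You still need to wait the 20 – 30 seconds for the pool to reappear in XenCenter.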

So, grasshopper, you have decided to take the plunge and virtualize your server infrastructure. Someone (perhaps us) explained the business benefits of virtualization, you decided that it made sense, and that it’s time to make the move. But do you know how virtualization will affect your Windows Server licensing model?

The first thing you need to know is that Windows Server licenses are assigned to physical hardware, not to server workloads. When you purchase a license, you must “assign” that license to a physical server. How do you do that? Well, in today’s world, there is no formal process for doing that, although if it makes you feel better, you can write it down somewhere.

You may assign more than one license to a physical server, but you may not assign the same license to more than one physical server. You may reassign a license from one physical server to another, but not more frequently than every 90 days, unless the server it was assigned to is being retired due to “permanent hardware failure.”
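If you track license assignments yourself, the reassignment rule reduces to a date check. This hypothetical helper encodes only the two conditions stated above. It treats exactly 90 days as allowed, which is our interpretation of “not more frequently than every 90 days.” The Product Use Rights document remains the authority.

```python
from datetime import date, timedelta

# Interpretation: at least 90 days must have passed since the last assignment.
MIN_GAP = timedelta(days=90)


def may_reassign(last_assigned: date, today: date, permanent_hw_failure: bool = False) -> bool:
    """Apply the two conditions described in this post to one license record."""
    if permanent_hw_failure:
        # Retiring the old server for permanent hardware failure lifts the 90-day limit.
        return True
    return today - last_assigned >= MIN_GAP


if __name__ == "__main__":
    assigned = date(2010, 1, 1)
    print(may_reassign(assigned, date(2010, 3, 1)))        # 59 days: prints False
    print(may_reassign(assigned, date(2010, 4, 1)))        # 90 days: prints True
    print(may_reassign(assigned, date(2010, 3, 1), True))  # hardware failure: prints True
```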

Sound reasonable so far? Of course it does. Right up until the license model runs head-on into one of the coolest features of virtualization: live motion. Most virtualization platforms, including Microsoft’s Hyper-V R2, allow you to easily move a virtual server from one physical host to another. Great feature, right? But if you do it, you may have just violated your Windows license agreement.

I say “may” because different versions of Windows Server come with different virtualization rights. For example, a Windows Server Standard license can be used to run one physical instance of Windows (and by “physical instance,” I mean Windows is installed directly on the hardware) or one virtual instance of Windows, but not both – unless the physical instance is being used solely to manage the virtual environment.

Let me say that another way: If you buy a single license for Windows Server Standard Edition with Hyper-V, you can install it directly on the hardware without bothering with the Hyper-V role. Or you can install the Hyper-V role, have one virtual Windows Server running on top of Hyper-V, and use the physical instance exclusively to manage the virtual instance. Of course, you haven’t really gained anything by doing that…but you can purchase additional copies of Windows Server Standard, assign them to the same physical host, and run more virtual servers on Hyper-V.

Thinking this scenario through, then, if you currently have a bunch of physical Windows Servers – each licensed with Windows Standard Edition – and you want to virtualize them all, that’s no problem. You can reassign your server licenses to your virtual hosts and be perfectly legal. As long as you don’t move a server from one host to another. But if all you own are Standard Edition licenses, and you move a server from one host to another, you’ve just violated the license agreement – unless you own a “spare” server license that you have “assigned” to the target server (the host you’re moving it to) but that is not being used.

Now, in the scenario I just described, it’s possible that the most cost-effective thing you could do is to just buy a few additional licenses as “spares” rather than re-licensing your entire environment. But let’s move ahead – once we’ve covered the other Windows editions that are available to you, you’ll be better able to decide what makes financial sense for your project.

Windows Server Enterprise Edition comes with expanded virtualization rights. Each Enterprise Edition license gives you the rights to run one physical instance and up to four virtual instances on the physical host to which it is assigned. Once again, if you want to run all four virtual instances, then the physical instance may only be used to manage the virtual environment. If you want to run other services on the physical instance – and that’s actually fairly common in a Hyper-V deployment – then you only get to run three virtual instances. And you may not split the license across multiple physical hosts.
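To see how these rules interact with live motion, here is a simplified Python model of the per-host rights described above. The simplification is ours. Standard grants one virtual instance per license and Enterprise grants four. A physical instance that does anything besides manage the virtual environment uses up one of those rights, which reproduces the “one becomes zero” (Standard) and “four becomes three” (Enterprise) cases above. Datacenter is left out because it is unlimited. The model may undercount some mixed multi-license cases, so treat it as an illustration rather than a licensing determination. The host names are hypothetical.

```python
from dataclasses import dataclass

# Virtual-instance rights per license, as described in this post.
VMS_PER_LICENSE = {"standard": 1, "enterprise": 4}


@dataclass
class Host:
    name: str
    standard: int = 0                     # Standard licenses assigned to this host
    enterprise: int = 0                   # Enterprise licenses assigned to this host
    physical_runs_services: bool = False  # physical instance does more than manage VMs
    running_vms: int = 0                  # Windows VMs currently on this host

    def licensed_vms(self) -> int:
        """Number of Windows VMs this host is licensed to run (simplified)."""
        rights = (self.standard * VMS_PER_LICENSE["standard"]
                  + self.enterprise * VMS_PER_LICENSE["enterprise"])
        if self.physical_runs_services:
            # Simplification: a general-purpose physical instance uses up one right.
            rights -= 1
        return max(0, rights)

    def can_accept_vm(self) -> bool:
        """True if live-migrating one more VM here stays within the licensed count."""
        return self.running_vms < self.licensed_vms()


if __name__ == "__main__":
    # A full Standard-only host vs. one holding a "spare" Standard license.
    full = Host("host-a", standard=3, running_vms=3)
    spare = Host("host-b", standard=4, running_vms=3)
    print(full.can_accept_vm(), spare.can_accept_vm())  # prints False True
```

Under this model, moving a VM is only legal if the target’s `can_accept_vm()` is true. That is exactly why a “spare” license assigned to the target host makes the move possible.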

The “estimated retail price” (just the license, no Software Assurance, assuming Open Business pricing) for Windows Enterprise is $2,358, vs. $726 for Windows Standard. So one copy of Enterprise is less expensive than four copies of Standard ($2,904). Therefore, if you need to buy new licenses (perhaps you’re upgrading from Server 2003 to Server 2008 as part of your virtualization project), it may make sense in a small environment to buy a copy of Enterprise Edition for each virtual host, and perhaps supplement it with a few spare copies of Standard Edition. Here’s an example:

Let’s say you have a total of nine physical servers today, and you want to virtualize them on three dual-processor virtualization hosts. (You could probably run them on two hosts, but if one failed, it might be a stretch to run all nine on one host. If you start with three hosts, and one fails, you still have two to carry the load.) You could buy nine new copies of Windows Standard Edition for $6,534, but you’d have no flexibility to use live motion to move things around. On the other hand, you could buy three copies of Enterprise Edition for your three hosts for $7,074, and effectively have one “spare” instance on each host that’s available for moving a virtual machine from one host to another.

Of course, that may not be quite enough if you want to completely unload one of your servers (perhaps to take it off-line for maintenance), because unless you’re prepared to shut down one VM completely, you’re going to need to run five VMs on one of your remaining servers. Since you may not know in advance which server needs to assume the extra VM workload, you could just buy three additional copies of Standard Edition, and assign one to each host. That would push your total license acquisition cost to $9,252, but you would then be licensed for five VMs on each of your hosts.

The ultimate in flexibility is Windows Server Datacenter Edition. Datacenter Edition is licensed per processor socket rather than per physical host, but includes unlimited virtualization rights. You can run as many VMs on your hosts as they’re capable of running, and move them around to your heart’s content. If you just don’t want to worry about what’s running where or whether or not it’s technically legal to move a given VM around, this is the license model to use.

Of course, this is also the most expensive edition of Windows. The estimated retail price for Datacenter Edition is $2,405 per processor socket (regardless of the number of cores per processor). So it would cost $14,430 to license three dual-processor servers with Datacenter Edition. This probably isn’t cost effective if you’re only virtualizing nine servers. However, if you have lots of servers, and many of them are fairly lightly loaded (in terms of processor utilization), the picture could change. If your average consolidation ratio reaches roughly four servers per physical processor, Datacenter Edition becomes the most cost-effective license.

In fact, if you’re even close to that 4:1 ratio, you should strongly consider Datacenter Edition, for two reasons:

  1. Windows environments inevitably grow. However many servers you have today, you’re probably going to have more of them a year from now. With Datacenter Edition, you can continue to fire up new servers to the limits of your hardware without having to buy more server licenses.
  2. AMD already has six-core processors. You know the “arms race” between Intel and AMD will continue. So the number of servers per processor that you can reasonably expect to support will continue to increase as the processors themselves become more powerful and contain more cores, and as this happens, Datacenter Edition will look better and better.
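You can check the arithmetic in the nine-server example and the Datacenter comparison in a few lines of Python. The prices are the estimated retail figures quoted in this post at the time of writing. `cheapest_per_host` assumes that no physical Windows instance is used (as on XenServer or VMware), and its brute-force search over Enterprise/Standard mixes is our own illustration.

```python
from typing import Tuple

# Estimated retail prices quoted in this post (license only, Open Business).
STANDARD = 726      # per license: 1 virtual instance
ENTERPRISE = 2358   # per license: up to 4 virtual instances
DATACENTER = 2405   # per processor socket: unlimited virtual instances

# The nine-server example: three dual-processor hosts.
HOSTS, SOCKETS_PER_HOST, VMS = 3, 2, 9
all_standard = VMS * STANDARD                                # nine Standard licenses
enterprise_only = HOSTS * ENTERPRISE                         # one Enterprise per host
enterprise_plus_spare = enterprise_only + HOSTS * STANDARD   # plus one Standard per host
datacenter = HOSTS * SOCKETS_PER_HOST * DATACENTER


def cheapest_per_host(vms: int, sockets: int = 2) -> Tuple[str, int]:
    """Cheapest way to license `vms` Windows VMs on one host."""
    best = ("", -1)
    # Try every useful number of Enterprise licenses; fill the rest with Standard.
    for enterprise in range(vms // 4 + 2):
        standard = max(0, vms - 4 * enterprise)
        cost = enterprise * ENTERPRISE + standard * STANDARD
        if best[1] < 0 or cost < best[1]:
            best = (f"{enterprise} Enterprise + {standard} Standard", cost)
    if sockets * DATACENTER < best[1]:
        best = ("Datacenter", sockets * DATACENTER)
    return best


if __name__ == "__main__":
    print(all_standard, enterprise_only, enterprise_plus_spare, datacenter)
    # prints: 6534 7074 9252 14430
    for vms in (7, 8, 9, 10):
        print(vms, cheapest_per_host(vms))
```

On a dual-socket host, this model has Datacenter winning from nine VMs per host upward. At exactly eight VMs (4:1), two Enterprise licenses come in just under it ($4,716 vs. $4,810), which is why being “close to” 4:1 is the point to start comparing.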

Note that everything we’ve discussed holds true if you’re virtualizing on XenServer or VMware rather than on Hyper-V. The only difference is that you won’t be using any of the allowed physical instances of Windows.

If you want to delve deeper into this issue, you can download a copy of the Microsoft Product Use Rights document from their Web site. Happy virtualizing!

Virtualization can mean different things depending on who you ask, so we are going to take a broad look at what virtualization is, the different forms it comes in, and why it is so popular.

This is going to be pretty basic stuff, so if you are looking for more advanced material, I promise we will cover it in future posts.

Virtualization has been getting a lot of buzz the last few years as it moved from being “bleeding edge” technology to becoming an industry standard. You may have even heard that there are lots of benefits to virtualizing your datacenter…but you may not be sure whether it’s for you, how it works, or even what it means.

There are several kinds of virtualization, including server virtualization, storage virtualization, application virtualization, network virtualization, and desktop virtualization. But when most folks talk about virtualization, they’re referring to server virtualization, so that’s what we will cover today.

So, what is server virtualization? Simply put, server virtualization is technology designed to allow multiple (virtual) servers to reside on a single piece of (physical) hardware and share the resources of the physical server. Each virtual server still maintains a separate operating environment, so a problem that crops up in one virtual server won’t affect the operation of others running on the same physical “host.” To help explain what this means, I’m going to use a house-and-condo analogy.

Let’s say you’re a land developer and you build residential property. You cut your land into smaller plots and build one house per plot. As part of the land development, you need to bring in all the utilities from the main street to each and every plot. All of this development costs money. To make matters worse, you know that your city’s population is growing, you’re running out of land to build on, and you also need to control the spiraling costs of building materials. How do you cut costs and provide more homes for a growing population on a limited amount of land?

Figure 1 - Typical cul-de-sac USA

Perhaps instead of building single-family homes and having one resident per plot you start building condominiums that hold several residents each. Now the utilities that are brought in to the condo complex are shared by all the residents and yet no one ever sees the other residents’ bills. You’re making more efficient use of the land you have and not wasting time and money bringing in utilities to each individual house. Plus one yard is easier to take care of than ten yards.

Figure 2 - 1 & 2bd Condos Available Now!!

So how does this relate to server virtualization?

Each plot of land is a physical server, the structure you build on that plot is a server “workload” (e.g., Exchange, SQL, file server, print server), and the city is your data center. The utilities are things like power, cooling, and network connectivity. When there is only one workload per physical server, a lot of space and resources get wasted. It’s common to see only 10-15% (if that) processor utilization on physical servers that run only one operating system and one application.

With server virtualization, we can now create several “virtual” servers on one physical piece of hardware; think of them as little “server condos” if you like. Just as you can have one-bedroom, two-bedroom, and three-bedroom units in a single building, you can allocate differing amounts of processing and memory resources to the virtual servers depending on the requirements of each individual workload. Each virtual server shares the physical resources of the host machine with the other virtual servers without ever knowing that it is sharing. In fact, each virtual server “thinks” it’s running on its own dedicated hardware platform. By doing this, you can utilize 80-90% of the processing power of the hardware you own, and cut down on the total amount of power, cooling, and floor space you need in your data center.

For example (just pulling numbers out of the air), let’s say that you’ve been paying an average of $5K each for servers that would handle a single workload. If you need four of them, that’s $20K in hardware cost. But if you can buy one server for $8 – 10K to virtualize these 4 machines, that’s a significant reduction in hardware cost. And with fewer machines to plug in and keep cool, your savings can be up to 40% on power consumption alone. (Did you know that we’ve now reached the point where, over the service life of a typical new server, it’s going to cost you more to keep it cool than it cost you to buy it?)
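The hardware arithmetic in that example is easy to rerun with your own numbers. This sketch uses the post’s deliberately made-up figures and counts purchase price only. Power, cooling, and licensing are left out.

```python
def hardware_savings(per_server: float, count: int, host_cost: float) -> float:
    """Fraction of hardware spend saved by replacing `count` single-workload
    servers costing `per_server` each with one virtualization host."""
    before = per_server * count
    return (before - host_cost) / before


if __name__ == "__main__":
    # The post's illustrative figures: four $5K servers vs. one $8-10K host.
    for host_cost in (8_000, 10_000):
        saved = hardware_savings(5_000, 4, host_cost)
        print(f"${host_cost:,} host: {saved:.0%} less hardware spend")
    # prints:
    # $8,000 host: 60% less hardware spend
    # $10,000 host: 50% less hardware spend
```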

Since the virtual servers are all located on one physical box you now have fewer pieces of hardware to maintain – allowing the IT staff to use their time more efficiently. You’ll save space in your data center. You’ll also cut down on the amount of waste (some of it hazardous) that must be recycled or disposed of when your hardware finally reaches its end-of-life.

You’ve also cut down the time needed to bring a new server online. In the past you would have had to acquire the hardware, assemble it, rack it, connect it to the network, install and patch the OS, install and configure the application, test it all, and finally put it into service. Now that the servers are virtual, they can be created, configured, and put into production in a few hours as opposed to the weeks it used to take. In some cases, by using templates for commonly-needed workloads, it can take only minutes. This makes for a much more flexible and scalable environment.

So server virtualization can:

  • Cut hardware costs
  • Cut energy costs (for both power and cooling)
  • Cut system maintenance time and costs
  • Create a very scalable and flexible data center
  • Save space
  • Create a more environmentally friendly data center (a.k.a. “green computing”)

These are the main reasons that server virtualization has become an industry standard. According to folks like Gartner, we’ve now reached the point where the majority of new servers placed into service are being virtualized, and the majority of enterprises have made it a standard practice to virtualize all new servers unless there is a compelling reason why a server can’t or shouldn’t be virtualized. Virtualization also makes it easier to implement things like high availability, disaster recovery, and business continuity, but that’s a subject for a future post.

Recently I had my first opportunity to create a Windows Server 2008 R2 virtual machine on Citrix XenServer 5.0. When I attempted to install the operating system, I ran into an interesting issue: the installation would hang right at the initial Windows install screen, with CPU usage pegged at 100%. Once that happened, no matter how long I waited, the installation never progressed beyond that point.

Of course, what did I do? I turned to Google and quickly found the following article, which provided a workaround for the issue I was having.

I followed the advice in the article, and after running the “xe vm-param-set uuid= platform:viridian=false” command it outlines, I was able to install Windows Server 2008 R2.

Two more things are worth mentioning here, which are not specifically addressed in the previously referenced article:

  1. With Windows Server 2008 R2, I was able to install the XenServer 5.0 tools without any of the problems other people are having with Windows 7 installations.
  2. This issue has been resolved in XenServer v5.5!

The TechTarget family of blog sites has a lot of great information. That’s why we have several of their sites linked in our Blogroll (under “Virtualization” in the right sidebar). But one thing that I don’t like about their sites is that – unlike this blog – there is no way to directly comment on their posts. That makes it difficult to respond to posts like the one last week on VMware’s High Availability (VMHA).

In that post, author David Davis opens by stating:

VMware’s High Availability (VMHA) provides high availability to any guest operating system at a potentially much lower cost than other HA options (as you don’t have to pay per virtual machines [VMs] or per server; VMHA is included in the price of vSphere).

I have a couple of problems with this statement.

First, I don’t know what “a potentially much lower cost” means. Is it less expensive than other HA options, or isn’t it? If it is, which other HA options are you comparing it to? If you’re going to throw that line out there, shouldn’t you give us the data on which the statement is based?

Second, it appears that the “lower cost” claim is primarily based on the fact that VMHA is included in the price of vSphere, rather than requiring a separate license. That’s a little like claiming that the high-end German sound system is less expensive if you get it in a Mercedes – because it’s standard equipment – whereas if you want one in your Malibu you have to buy it separately. What matters is the total amount of money I have to spend to get all the functionality I need, isn’t it?

It is true that with, say, Citrix XenServer, you have to purchase a Citrix Essentials for XenServer license to get HA functionality. That will cost you, at the suggested retail price (which nobody actually pays), $2,500 per XenServer for the Enterprise Edition. But the copy of XenServer you’re putting it on is free. On the other hand, vSphere 4 lists for $2,875 per processor, so if I’m using dual-processor servers, I’m looking at $5,750 for vSphere 4 compared to $2,500 for that copy of Essentials for XenServer. If I’m using quad-processor servers, vSphere 4 is going to run $11,500, but I still only need that single license for Essentials. And don’t forget the cost of VirtualCenter to control my vSphere environment, whereas XenCenter is, again, free, and runs on a workstation rather than requiring a dedicated server.
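Here is the same comparison as a small Python calculation, using the list prices quoted in the paragraph above. VirtualCenter is deliberately left out because no price is given for it here, so the vSphere totals understate the gap. The constant and function names are just illustrative.

```python
# List-price comparison from the post (suggested retail, which "nobody
# actually pays"). VirtualCenter is omitted because no price is given for it.

VSPHERE_PER_PROCESSOR = 2_875   # vSphere 4, licensed per processor
ESSENTIALS_PER_SERVER = 2_500   # Essentials for XenServer Enterprise, per server
XENSERVER_COST = 0              # XenServer itself is free

def vsphere_cost(processors_per_server, servers=1):
    """vSphere cost scales with the number of processors in each host."""
    return VSPHERE_PER_PROCESSOR * processors_per_server * servers

def xenserver_ha_cost(servers=1):
    """Essentials cost is per server, regardless of processor count."""
    return (XENSERVER_COST + ESSENTIALS_PER_SERVER) * servers

if __name__ == "__main__":
    for procs in (2, 4):
        print(f"{procs}-processor host: vSphere ${vsphere_cost(procs):,} "
              f"vs. XenServer + Essentials ${xenserver_ha_cost():,}")
```

Running it prints $5,750 vs. $2,500 for a dual-processor host and $11,500 vs. $2,500 for a quad-processor host, matching the figures in the text.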

The point of this post is not to argue the relative merits of vSphere vs. XenServer, nor of whose HA feature is better. In fact, if you follow this blog, you’ll know that we’ve raised some red flags regarding how to properly deploy XenServer HA without risking potentially “career-altering” disasters. The point is simply that the old adage “don’t believe everything you read” is particularly appropriate for stuff you read on the Internet. (But you already knew that, right?)

People who throw out unsubstantiated generalized statements need to be challenged. If the TechTarget site allowed comments, I would have challenged the statement there. Since they don’t, I’m challenging it here. If I’m missing something, David Davis (or anyone else, for that matter) is welcome to comment on this post and point out what it is.

Recently I wrote a post about the hazards of XenServer HA and how to avoid a couple of different pitfalls which lead to XenServer fencing. In that post I talked about the necessity of correctly setting the HA heartbeat timeout for your environment so that your XenServers will allow enough time for a storage failover to occur. The idea, of course, is to prevent your XenServer from going into a “fence” condition which can occur for many reasons. The reason we’re discussing here is triggered when the XenServer believes its storage has suddenly become unavailable and it is not able to recover its state quickly enough to prevent the HA timeout from fencing the server.

I frequently build environments that use a pair of replicated DataCore SANmelody nodes (two physical nodes) and configure my XenServers for multipathing. With this configuration, my XenServers see two active paths to their storage (the status of the multipath is shown in the image below), one path to each of the two nodes. If, for example, one of the SANmelody nodes goes offline, the other node will immediately take over. However, the XenServers have to be given enough time to fully recognize that a failover has occurred and that the storage is still available, in order to avoid a fence. The default HA timeout in XenServer is 30 seconds, which means that if it takes a XenServer more than 30 seconds to realize the storage is still healthy and available, the server will fence. If the storage was indeed still available, then more than likely there were still VM guests up and running on that XenServer, which have now been taken offline unnecessarily.

To test and tune this setting, I first make sure HA is enabled on the pool. Then I perform hard failover tests: using a DRAC or iLO card if I have one, I suddenly power cycle one of the storage servers and watch to see whether any XenServers fence. I run this hard power-cycle test because this specific problem never comes up with simple storage stops and restarts; it only shows up when a storage server actually goes down suddenly, or “hard,” as we say. I want to stress the system to simulate unfortunate events like power failures, sudden server reboots due to gremlins, and other things along those lines. If nothing happens, great: let’s go home, and we can sleep well knowing HA is working correctly. But what if one or more servers do fence because they believe their storage is gone when in fact it is not?

The last time this happened to me, I had to test my environment several times, using a different timeout setting for each successive run through the hard failover test. In the end I found that 120 seconds worked best for me. (Keep in mind I was doing this during a build, with no live production workloads running on any of these servers.)

So what is the downside of setting your timeout this high? Well, if a XenServer really fails (for whatever reason), it will take about 120 seconds for the pool to decide there is a problem and then take action to restart the VMs elsewhere, based on available resources and the restart priority of each VM. Personally, I’d rather wait the 120 seconds when something has really gone wrong than suffer an unnecessary fence/shutdown when all the VMs were actually still running fine.

So how did I set the timeout values? Like this:

Rather than enabling HA from the GUI, you’re going to have to do it from the command line. (I use PuTTY when I’m not actually at the XenServer console.) The command you will use is: xe pool-ha-enable heartbeat-sr-uuids=<your SR UUID> ha-config:timeout=<timeout in seconds>
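If you script your builds, a small Python wrapper can assemble that command so the custom timeout is always included when HA is enabled. This is a hypothetical helper, not part of any XenServer tooling: it assumes it runs somewhere the xe CLI can reach the pool master, and it defaults to a dry run that only prints the command.

```python
import subprocess

def pool_ha_enable_args(heartbeat_sr_uuid, timeout_seconds):
    """Build the argv for 'xe pool-ha-enable' with a custom HA timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        "xe", "pool-ha-enable",
        f"heartbeat-sr-uuids={heartbeat_sr_uuid}",
        f"ha-config:timeout={int(timeout_seconds)}",
    ]

def enable_ha(heartbeat_sr_uuid, timeout_seconds, dry_run=True):
    """Print the command (dry run) or execute it via the xe CLI.

    Assumption: xe is installed and talking to the pool master.
    """
    args = pool_ha_enable_args(heartbeat_sr_uuid, timeout_seconds)
    if dry_run:
        print(" ".join(args))
        return None
    # check=True raises CalledProcessError if xe reports a failure.
    return subprocess.run(args, check=True, capture_output=True, text=True)

if __name__ == "__main__":
    enable_ha("7a213624-1209-c467-42ed-6ef72a1b7699", 120)
```

Building an argument list rather than a single shell string avoids any quoting surprises if the helper is later extended.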

But in that command string, how do you know what the SR UUID is? I start in XenCenter and locate the SR (storage repository) that is going to be used for the heartbeat status disk. Then I find the SCSI ID of that SR and copy the number, as shown in this image:

Finding the SCSI ID of a Storage Repository



Once I have that number, I connect to the master XenServer using PuTTY (the master XenServer in a pool is always the top server shown in XenCenter) and run the following command, where the long hexadecimal value is the SCSI ID just copied from XenCenter: xe pbd-list device-config=SCSIid:\ 360030d903131325f48415f4865617274
Finding the sr-uuid



What is shown above is what the output should look like. You see three records in this example because there are three hosts in this pool; notice that the host-uuid values are all different. However, also notice that the sr-uuid value is the same in each record, and this is the number we are after. Take the sr-uuid you just found and enter it into a command like this: xe pool-ha-enable heartbeat-sr-uuids=7a213624-1209-c467-42ed-6ef72a1b7699 ha-config:timeout=120
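Picking the sr-uuid out of that output by eye is easy with three hosts, but it can also be scripted. The Python sketch below assumes the usual "name ( RO): value" layout of xe list output; the sample text is a trimmed, made-up example with placeholder host UUIDs rather than real output, and the function name is hypothetical.

```python
import re

# Matches lines like "               sr-uuid ( RO): 7a213624-...". xe pads the
# field names with spaces, so the pattern tolerates any amount of whitespace.
SR_UUID_LINE = re.compile(r"^\s*sr-uuid\s*\(\s*RO\s*\)\s*:\s*(\S+)", re.MULTILINE)

def heartbeat_sr_uuid(pbd_list_output):
    """Return the one sr-uuid shared by every PBD record in 'xe pbd-list' output."""
    uuids = set(SR_UUID_LINE.findall(pbd_list_output))
    if len(uuids) != 1:
        # Zero matches usually means the SCSI ID filter matched nothing;
        # several means it matched PBDs belonging to more than one SR.
        raise ValueError(f"expected exactly one sr-uuid, found {sorted(uuids)}")
    return uuids.pop()

# Trimmed, made-up sample for a three-host pool; the host UUIDs are placeholders.
SAMPLE_OUTPUT = """\
uuid ( RO)                  : 00000000-0000-0000-0000-00000000000a
             host-uuid ( RO): 11111111-1111-1111-1111-111111111111
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699


uuid ( RO)                  : 00000000-0000-0000-0000-00000000000b
             host-uuid ( RO): 22222222-2222-2222-2222-222222222222
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699


uuid ( RO)                  : 00000000-0000-0000-0000-00000000000c
             host-uuid ( RO): 33333333-3333-3333-3333-333333333333
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
"""

if __name__ == "__main__":
    print(heartbeat_sr_uuid(SAMPLE_OUTPUT))
```

Insisting on exactly one distinct value mirrors the manual check: every host's PBD should point at the same SR.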

It may take a bit of time for the command to complete. Once it does, you should be able to refresh XenCenter by running either the xe-toolstack-restart command or the service xapi restart command. Then, when you look at the HA tab at the pool level, you should see that HA is now turned on:

Verify that HA is now turned on



As I said previously, I found that 120 seconds worked best for me, but how did I determine that? Simple: I started by setting the HA timeout to 60 seconds (twice the default) and then ran the hard shutdown test again. One of the XenServers still fenced, so I went to 90 seconds, and then finally to 120 seconds. The point at which the XenServers no longer fence is where you want to stop. But don’t just run this test against one side of the storage! Recover your storage servers, and once everything is back online and healthy, run the same test again, but this time hard-shut down the other storage node. If none of the XenServers fence, then you are done…unless you disable and re-enable HA. As I pointed out in that earlier post, this manual timeout setting is not persistent: if you disable and re-enable HA on the pool, you will have to re-enable it from the command line to ensure that the timeout is set correctly. If it’s re-enabled from the GUI, the timeout will revert to the 30-second default.
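The tuning procedure boils down to a simple rule: a host fences when it needs longer than the HA timeout to recognize that its storage is still there, so you keep raising the timeout until no host fences during a hard failover of either storage node. Here is that rule as a Python sketch. The per-host recovery times are hypothetical inputs, and treating them as fixed from one run to the next is a simplification; in practice you re-run the hard power-cycle test at each setting.

```python
DEFAULT_TIMEOUT = 30                 # XenServer's default HA timeout, in seconds
CANDIDATE_TIMEOUTS = (60, 90, 120)   # the steps walked through in the post

def hosts_that_fence(detection_seconds, timeout):
    """Hosts that need more than `timeout` seconds to see their storage fence."""
    return [host for host, secs in detection_seconds.items() if secs > timeout]

def tune_timeout(tests_per_storage_node, candidates=CANDIDATE_TIMEOUTS):
    """First candidate timeout at which no host fences, whichever storage node
    was hard-failed. Returns None if no candidate is long enough."""
    for timeout in candidates:
        if all(not hosts_that_fence(results, timeout)
               for results in tests_per_storage_node):
            return timeout
    return None

if __name__ == "__main__":
    # Hypothetical seconds each XenServer took to recognize the failover.
    node_a_down = {"xs1": 45, "xs2": 100, "xs3": 20}
    node_b_down = {"xs1": 70, "xs2": 30, "xs3": 110}
    print(hosts_that_fence(node_a_down, DEFAULT_TIMEOUT))
    print(tune_timeout([node_a_down, node_b_down]))
```

With the sample data, 60 and 90 seconds both leave a host fencing, and 120 seconds is the first value that survives a failover of either node. The trade-off discussed earlier still applies: a larger timeout also means a genuinely failed host waits roughly that long before its VMs are restarted elsewhere.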

Citrix Provisioning Services, which evolved from their acquisition of the Ardence technology, enables some great concepts:

  • Since the first time a Citrix customer deployed more than one WinFrame server, we’ve struggled with the issue of change control – how do we ensure that, over time, all of the servers that are supposed to be identical do, in fact, remain identical? Booting and running them all from a single, read-only image is a great way to do that.
  • It gives you an “undo” option when you upgrade your server image. You can make a copy of your read-only image, set it to read/write, apply your patches, updates, etc., reboot one server from the new image, do your testing, then set the new image to read-only, reboot your servers, and ba-da-boom ba-da-bing (that’s a technical term), in the time it takes them to reboot, they’re all running from the new image. If you then discover that there’s something wrong with the new image, point them back at the old image and reboot them again, and, in the time it takes them to reboot again, you’ve just rolled back to the old image.
  • In a VDI scenario, not only do you enjoy the first two advantages, you also save a ton of expensive SAN storage. If your typical desktop image is, say, 10 GB, and you want to deploy 100 virtual desktops, with some vendors’ approaches you will consume a full terabyte of expensive SAN storage. By using Provisioning Services, you consume only the 10 GB required by the common image.
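The storage arithmetic from the VDI example, sketched in Python. This only counts the base image; it ignores whatever per-desktop write-cache space a real Provisioning Services deployment still needs, and the function name is illustrative.

```python
def base_image_storage_gb(image_gb, desktops, shared_image):
    """Base-image storage: one copy if all desktops share it, else one per desktop."""
    return image_gb if shared_image else image_gb * desktops

if __name__ == "__main__":
    full_copies = base_image_storage_gb(10, 100, shared_image=False)  # 1,000 GB, about 1 TB
    shared = base_image_storage_gb(10, 100, shared_image=True)        # 10 GB
    print(full_copies, shared)
```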

Unfortunately, when you convert a modern Microsoft OS image to a shared read-only image, it looks like a hardware change to the OS, and breaks the license activation. This is the case with Windows 2008, 2008 R2, Vista, and Windows 7.

Enter the KMS server. KMS stands for “Key Management Service,” and it’s one way to automate the activation of Microsoft volume licenses within an organization. There’s a pretty good video that you can download from Microsoft TechNet that walks through the process of configuring a KMS server to automatically activate servers and workstations, but it was made prior to the release of 2008 R2, so it omits a very important point (which we will get to in due time).

The concept is that as an un-activated copy of Server 2008, Vista, or Win7 boots, it looks in DNS for the service (SRV) record that a KMS host publishes, to see if there is a KMS server on the network. If there is, it contacts the KMS server for activation. However, for reasons that are not at all clear to me, the KMS server must be contacted by a minimum number of machines before it will actually activate anything. So, each time a different machine contacts the KMS server for activation, it is assigned a unique ID number, and the KMS server increments its counter by one. When it has been contacted by a total of five different systems, it will begin to activate servers. When it has been contacted by a total of 25 different systems, it will begin to activate workstations.

Before the release of Server 2008 R2, only physical systems would increment the counter – virtual systems would not. (Don’t ask me how the KMS server could tell the difference – that’s one of the ongoing mysteries of KMS.) And that’s the message you’ll hear when you watch the video referenced earlier. However, if KMS is running on a Windows 2008 R2 server, both physical and virtual systems will increment the counter. Note also that what matters is the aggregate number of all systems that have contacted the server for activation, regardless of whether they’re running Server 2008, 2008 R2, Vista, or Win7.
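Here is the threshold rule as described in this post, written as a small Python sketch. The constants come from the text; the function names are hypothetical, and the physical and virtual counts are meant to be aggregate numbers of distinct systems across all four operating systems.

```python
SERVER_THRESHOLD = 5        # distinct clients needed before servers activate
WORKSTATION_THRESHOLD = 25  # distinct clients needed before workstations activate

def counted_clients(physical, virtual, kms_host_is_2008_r2):
    """Clients that count toward the threshold, per the behavior described here."""
    # A KMS host on 2008 R2 counts virtual clients too; earlier hosts did not.
    return physical + virtual if kms_host_is_2008_r2 else physical

def will_activate(client_kind, physical, virtual, kms_host_is_2008_r2):
    """True if the KMS host has reached the threshold for this kind of client."""
    if client_kind == "server":
        threshold = SERVER_THRESHOLD
    elif client_kind == "workstation":
        threshold = WORKSTATION_THRESHOLD
    else:
        raise ValueError(f"unknown client kind: {client_kind!r}")
    return counted_clients(physical, virtual, kms_host_is_2008_r2) >= threshold
```

For example, with two physical servers and three virtual ones, an R2 KMS host reaches the server threshold of five, while a pre-R2 host would count only the two physical machines and keep waiting.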

If the threshold has not yet been reached, the system will not be activated, but it will still run…within the constraints of the built-in 30-day “grace period” for activation. (The nag messages do get pretty intrusive in the last three days of the grace period.) This, by the way, is good news if you’re looking at an evaluation or proof of concept that will involve fewer systems than it takes to meet the threshold: you should be OK as long as the evaluation doesn’t run longer than the 30-day grace period. The system will continue to check back in with the KMS server every two hours to see whether the threshold has been met. When it is met, all of the systems that have been waiting will be activated. Once activated, a system will attempt to check back in and renew its activation every 7 days. It must renew its activation within 180 days, or it will revert to an un-activated state.
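The client-side timers can be summarized the same way. This is a simplified Python model of the behavior described above, not of Microsoft's implementation: the boundary handling (treating exactly day 30 or day 180 as expired) is an assumption, and the function names are hypothetical.

```python
GRACE_PERIOD_DAYS = 30          # un-activated client keeps running this long
RETRY_INTERVAL_HOURS = 2        # un-activated client re-contacts the KMS host
RENEWAL_INTERVAL_DAYS = 7       # activated client tries to renew this often
ACTIVATION_VALIDITY_DAYS = 180  # activation lapses if not renewed in time

def client_state(days_since_install, days_since_last_activation=None):
    """Return 'activated', 'grace', or 'unactivated' for a KMS client.

    days_since_last_activation is None if the client was never activated.
    """
    if days_since_last_activation is not None:
        if days_since_last_activation < ACTIVATION_VALIDITY_DAYS:
            return "activated"
        return "unactivated"  # failed to renew within 180 days
    if days_since_install < GRACE_PERIOD_DAYS:
        return "grace"        # runs normally while waiting for the threshold
    return "unactivated"      # grace period used up before activation

def hours_until_next_contact(state):
    """How long the client waits before contacting the KMS host again."""
    if state == "activated":
        return RENEWAL_INTERVAL_DAYS * 24
    return RETRY_INTERVAL_HOURS
```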

The KMS server keeps track of the ID numbers of the systems that have contacted it for activation. If an activated system does not check back in within 30 days, its ID number is removed from the KMS server’s cache, and the counter is decremented. If the count falls back below the threshold, the KMS server will stop activating systems. To help guard against this, the KMS server’s cache size is set to 2x the threshold. In other words, if you’re only activating servers, the cache will contain the IDs of the last 10 servers that have contacted it for activation. If you’re activating workstations, or a combination of workstations and servers, the cache will contain the IDs of the last 50 systems that have contacted it for activation.
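Finally, the host-side cache. This toy Python model keeps each client ID with the day it last checked in, drops IDs not seen for more than 30 days, caps the cache at twice the threshold, and compares what remains to the threshold. The class and method names, and the choice to evict the least recently seen ID when the cache is full, are assumptions for illustration.

```python
from collections import OrderedDict

STALE_AFTER_DAYS = 30

class KmsHostCache:
    """Toy model of the KMS host's client-ID cache described above."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.capacity = 2 * threshold   # cache holds 2x the threshold
        self._last_seen = OrderedDict()  # client ID -> day of last contact, oldest first

    def contact(self, client_id, day):
        """Record a request (days assumed non-decreasing); True if activating."""
        self._last_seen.pop(client_id, None)  # re-insert so order tracks recency
        self._last_seen[client_id] = day
        self.expire(day)
        while len(self._last_seen) > self.capacity:
            self._last_seen.popitem(last=False)  # evict least recently seen ID
        return self.count >= self.threshold

    def expire(self, today):
        """Drop IDs that have not checked in within the last 30 days."""
        stale = [cid for cid, seen in self._last_seen.items()
                 if today - seen > STALE_AFTER_DAYS]
        for cid in stale:
            del self._last_seen[cid]

    @property
    def count(self):
        return len(self._last_seen)
```

In this model, once enough clients go quiet for more than 30 days the count drops back under the threshold and the host stops activating, which is exactly the situation the 2x cache size is meant to make less likely.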

The KMS service can be co-hosted with other services in your server infrastructure – you do not have to dedicate a server to this function. In fact, if all you care about are workstations, you can host the KMS service on a Win7 workstation. You’re going to want to have more than one KMS host running, to ensure that it doesn’t become a single point of failure in your infrastructure. And remember, unless you’re going to be activating enough physical systems to meet the KMS threshold, you need to be running KMS on Server 2008 R2. That will give you the ability to activate “any Windows operating system that supports Volume Activation,” (which today means the four operating systems we’ve been discussing here), and count both physical and virtual systems toward the required threshold.

So…wrapping back around to the beginning of this discussion, if you want to use Provisioning Services to provision XenApp servers on Server 2008 (and remember, XenApp does not yet work on 2008 R2 as of this writing), you’re going to need a couple of KMS servers. And unless you have five or more physical 2008 servers that it can activate, you’re going to need to have your KMS servers running on R2. And even then, you’re going to need a total of at least five machines to meet the threshold before KMS will activate anything.

Likewise, if you want to use Provisioning Services to provision Win7 desktops – and I’m ignoring Vista here, because, even though I personally liked Vista, I think Win7 is sufficiently superior that it just doesn’t make sense at this point not to go to Win7 – you’re also going to need a couple of KMS servers. And unless you have 25 or more physical systems (in aggregate, counting both servers and workstations), they’re going to need to be running on R2. And in any event, you’re going to need a total of at least 25 systems.

For more information on exactly how KMS works, I strongly recommend the TechNet Volume Activation Planning Guide for Windows 7 and Windows Server 2008 R2. Happy provisioning!