You are here: Home > Blog

If you’ve been following this blog for any length of time, you know that we’ve written extensively about XenDesktop, and spent a lot of time on best practices and problems to avoid. And one of the biggest problems to avoid is poor storage design resulting in poor VDI performance.

In a nutshell, the problem is that a Windows desktop OS uses disk far differently than a Windows server OS. Thanks to the way Windows uses the swap file, disk writes outnumber disk reads by about 2 to 1. You can build your virtual desktop infrastructure on the latest and greatest server hardware, with tons of processing power and insanely huge amounts of RAM, but if all of the disk I/O for all of those virtual desktops is hitting your SAN, you’ve got a scalability problem on your hands.

Provisioning Services (“PVS”) can help to mitigate this in two ways (assuming for sake of argument that you’re provisioning multiple virtual systems from a common, read-only image): First, if you build your Provisioning Servers correctly, you should be able to serve up most of the OS read operations from the Provisioning Server’s own cache memory. Second, you can use the virtualization host’s local disk storage as the required “write cache” – because all of those write operations have to go somewhere while the virtual system is running.

But XenDesktop 5 introduced a new way to provision desktops called “Machine Creation Services” (“MCS”). We wrote about this in the April edition of our Moose Views newsletter, so if you’re not familiar with all the pros and cons of MCS vs. PVS, I’d encourage you to take a brief time out and read that article. Suffice it to say that, despite all the advantages of MCS, the biggest downside of using MCS to provision pooled desktops was that all of the IOPS hit your SAN storage, which limited the scalability of an MCS-provisioned VDI deployment.

But all of that just changed, with the release of XenDesktop 5 Service Pack 1, which was made available for download a week ago (May 13). With SP1, XenDesktop 5 is now able to take advantage of the “IntelliCache” feature that was introduced as part of XenServer v5.6 Service Pack 2. Using MCS with the combination of XenDesktop 5 SP1 and XenServer SP2…

  • The first time a virtual desktop is booted on a given XenServer, the boot image is cached on that XenServer’s local storage.
  • Subsequent virtual desktops booted on that same XenServer will boot and run from that locally cached image.
  • You can use the XenServer’s local storage for the write cache as well.

The bottom line is that you can move as much as 90% of the IOPS off of the SAN and onto local XenServer storage, removing nearly all of the scalability limitations from an MCS-provisioned environment.

With most of the IOPS for running VMs taking place on local storage, it’s pretty straightforward to figure out how many VMs you can expect to support on a given virtualization host. Dan Feller’s blog post does a great job of walking through the process of calculating the functional IOPS that your local XenServer storage repository should be able to support, and inferring from that number how many light, normal, or power users you should be able to support as a result.

This also means that using XenServer as the hypervisor for your XenDesktop 5 deployment is going to yield a significant performance advantage over any other hypervisor, unless or until the other guys come out with similar local caching features. So, if you’re a VMware shop, my advice is this: Go ahead and virtualize all of the supporting XenDesktop server components on your VSphere infrastructure. Run your XenDesktop 5 VMs on XenServer hosts, and just don’t tell anyone! If you’re asked, just say, “Oh, yeah, these are my XenDesktop host systems – they’re completely separate from our VSphere infrastructure, because we don’t need the (insert favorite VSphere feature) function for these systems.” Your infrastructure will run better, and no one will know but you…

In this installment of the Moose Logic Video Series, Steve Parlee, our Director of Engineering, talks about:

  • Why we always use iSCSI HBAs in our Citrix XenServer deployments.
  • The possible risks of using HA in a two-server pool. (NOTE: Initial testing indicates that XenServer v5.6 may not present the same problems in a two-server pool as earlier versions. When we have completed our testing, we will post an update here.)
  • A useful utility for XenServer called “hostdevscan.”

It’s 8 pm on a Sunday evening, and I get a panicked call from a customer because he cannot connect to his XenServersTM via the XenCenterTM management tool. However, as near as he could tell, all of the hosted virtual machines were up and running and in a healthy state. He had unsuccessfully tried to point the XenCenter management tool at another member of the XenServer pool but was unsuccessful.

So what happened and how do you fix it?

This situation can happen for several reasons but generally it happens when there are only two servers in the XenServer pool, and the pool master suddenly fails. In essence, what happens is the surviving server (let’s just call it the “slave”) can no longer see its peer, the pool master, so it assumes it has been stranded and goes into emergency mode to protect its own VMs. There are other ways this can happen (an incorrectly configured pool with HA turned on for example), but this is the most common reason that I have personally experienced.

Depending upon the situation, you may not be able to ping the master server because it is actually down, or you may be able to ping the server but it is in an inconsistent, “locked up”, state such that it cannot answer requests to it. If you are able to connect to the console of the master server either directly with a monitor, keyboard, and mouse (the old fashioned way) or through a remote management interface (DRAC, ILO, ILOM, etc) the server may appear to be running, but you may not be able to do anything with it.

At this point you may be thinking, “This is no big deal – just reboot the machine and it will be fine.” If you are lucky that may actually solve the problem, but in many cases it will not. What you might see is that after the master reboots you will be able to connect to the master but you will not see the slave. Or it may be that your master is truly broken and you are not able to simply reboot it due to a system or hardware failure. But, of course, you’ve still got to get your pool online and working again regardless.

During this period of time, if you try to use a tool such as Putty to connect to the slave via its management interface, you may not be able to connect to it either. If you try to ping the slave on the management interface you may not get any replies. But if you connect to the console of the slave (again, either the physical console or via a remote management interface) you will probably see that the machine is running, but if you look at XSconsole it will appear that the management interface is gone because there will be no IP address showing. By now you’ll probably be scratching your head because the strange thing is all the VMs are running.

So at this point your master appears to be down, or at least impaired, you’ve got no management interface on the slave, your pool is broken and you cannot manage the VMs. So what do you do?

Well, if this happens to you and your VMs are still up and running the first thing you should do is take a deep breath, because more than likely it is not as bad as you might think. XenServer is a robust platform and if the infrastructure is built correctly (and I’m going to quote a customer), “you can really slam the things around and they still work”.

After you take a deep breath and let it out slowly, from the console of the slave server, you will need to access the command line and start by typing:

xe host-is-in-emergency-mode

If the server returns an answer of “True” then you’ve confirmed that the server has gone into emergency mode in order to protect itself and the VMs running on it. (If the server returns an answer of “False” then you can stop reading, because the rest of this post isn’t going to help you.)

Assuming you receive the answer of “True” the slave server is in emergency mode because it cannot see a master – either because the master is actually down, or because the management interface(s) is(are) not working. Therefore, the next step is to promote the slave to master to get it out of emergency mode. We do this by typing:

xe pool-emergency-transition-to-master

At this point the slave server should take over as the pool master and the management interface should be available again. Now if you type the xe host-is-in-emergency-mode command again you should get an answer of “False”.

Now, open XenCenter again. It will first try to connect to the server that was the master, but after it times out it will then attempt to connect to the new master server. Be patient, because eventually it will connect (it may take several seconds) and you will again see your pool and be able to manage your VM’s. If some of the VMs are down because they were on the server that failed you’ll be able to start them on the remaining server (assuming you have shared backend storage and sufficient processor and memory resources).

Now what about the master if it has totally failed? What do I do after I’ve fixed, say, a hardware problem in order to return it to my pool?

If the following two conditions are true:

  1. You are using shared storage so that your VMs are not stored on the XenServer local drives, and
  2. You have built your XenServers with HBAs (fiber or iSCSI) rather than using Open iSCSI, which means the connectivity information to your backend SAN will be stored within the HBA,

…then it may be much simpler and quicker just to reload the XenServer operating system. (If you do not have shared backend storage, which means your VMs are on local storage, DO NOT DO THIS). I can rebuild my XenServers from scratch in about 20 – 30 minutes and have them back in the pool and running.

If either of those two conditions is not true then, depending upon your situation, recovery may be significantly more difficult. It could be as simple as resetting your Open iSCSI settings and connecting back to your SAN (still easy but takes more time to accomplish) or it could be as painful as rebuilding your VMs because you lost your server drives. (OUCH!)

Real world example: I recently had a NIC fail on the motherboard of my master server. Of course since the NIC was on the motherboard it meant the whole motherboard had to be replaced which significantly modified the hardware configuration for that server.

In this case, when I brought that XenServer back online it still had all the information about the old NICs showing in XenCenter, plus it had all the new NICs from the new hardware. Yes I could have used some PIF forget commands to remove the NICs that no longer existed and reconfigure everything but that would have taken me a bit of time to straighten out. Since I had iSCSI HBAs attached to a Datacore SAN (great product, by the way) for shared storage, all I did was reload XenServer on that machine, modify the multipath-enabled.conf file (that is a different blog topic for another day), and rejoin the server to the pool. Because the HBAs already had all the iSCSI information saved in the card, the storage automatically reconnected all the LUNs, the network interfaces took the configuration of the pool, and I was back online and running in less than 30 minutes.

After you repair the machine that failed and get it back online, you may want it to once again be the master server. To do this type:

xe host-list

You will get a list of available servers with their UUID’s. Record the UUID of the server that you want to designate as the new master and then type:

xe pool-designate-new-master host-uuid=[the uuid of the host you want]

After you type this your pool will again disappear from XenCenter, but after about 20 – 30 seconds (be patient) it will reappear with the new server as the master. Your pool should now be healthy, and you should again be able to manage servers as normal.

Five or six years ago, when Citrix first announced the Citrix Access Gateway appliance, I remember scratching my head and thinking, “Wait a minute, Citrix is in the software business. Why in the world do they want to start selling hardware, with all of the warranty, repair, and support issues that come along with being a hardware manufacturer?” The answer, of course, was that in order to build out the complete Application Delivery solution they envisioned, they needed components that, at the time, couldn’t be delivered using software alone.

But the world turns, and time moves on, and today Citrix has a world-class virtualization platform that runs on off-the-shelf server hardware that is itself mind-bogglingly powerful compared to what was available five or six years ago. So it makes all the sense in the world for Citrix to turn all of those hardware devices into virtual appliances as quickly as they can.

Yesterday, they formally announced virtualized versions of both the Access Gateway and the Branch Repeater. We’ll get to the virtual Branch Repeater in another post, because we’ll have our hands full in this one just covering the things you need to know about the Access Gateway VPX.

First, you need to know that the Access Gateway VPX is fundamentally a virtualized version of the 2010 CAG Appliance – with some exceptions that we’ll get into in a moment. You can download it and use XenCenter to import it directly into your XenServer environment. The cost is only $995 (compared to $3,500 for the 2010 hardware appliance), with an ongoing Subscription Advantage cost of $129/year. Here’s where it gets cool:

  • It was difficult to come up with a good solution for redundancy and automatic failover with the 2010 appliance. Unless you wanted to put a load-balancer in front of it (and if you’re going to do that, you may as well just buy a NetScaler in the first place), redundancy depended on putting primary and secondary appliance URLs or IP addresses into the CAG client. And that did you no good at all if you were trying to run it in “CSG-replacement mode” just to provide secure Web access to a XenApp farm. But the VPX virtual appliance fully supports Live Motion, XenServer HA, and NIC bonding. So the VPX doesn’t have to be redundant, because the underlying XenServer infrastructure can provide the resilience you need.
  • If you were using a 2010 appliance, and wanted to use “SmartAccess,” you had to stand up a separate “Advanced Access Control” Web server in your protected network. Obviously, that added to the cost and complexity of the solution. The VPX appliance, on the other hand, supports SmartAccess policies directly.

    Edit July 27, 2010: Not sure now where I originally picked up this information, but it is incorrect. An Advanced Access Control Web server is still required to enable SmartAccess policies with the Access Gateway VPX.

NOTE: SmartAccess, in case you’re not familiar with the term, allows you to control, at a very granular level, what applications and information a user can access, and what they can do with that information, based on the access scenario. The same user, presenting the same authentication credentials, might get a totally different level of access depending on whether s/he is connecting from inside the corporate network, from outside the network using a company-owned laptop, from home using a personal PC, or from a hotel business center using a totally untrusted device. For more information on how SmartAccess works and why it’s cool, check out this video from Citrix TV:


  • The VPX appliance fully supports the latest generation of the Citrix Receiver, and works with Dazzle and the Merchandising Server.
  • You no longer need to buy VPN client licenses to run it in “CSG replacement” mode. This is a biggie. Citrix made it clear some time ago that they would not be putting any more development time and effort into enhancing the software “Citrix Secure Gateway.” But the CSG just wouldn’t die, for one simple reason: it’s free. If you own XenApp or XenDesktop licenses with current Subscription Advantage, you’ve got the rights to use the CSG software, and your only cost is a server to run it on…and that’s pretty low in today’s virtual world. Yes, it could be argued that the CAG appliance was somewhat more secure, since it ran on a hardened Linux-derived kernel. But it cost $3,500 plus roughly $100 per concurrent user. Hmmm… CSG, free, CAG appliance, several thousand dollars. That was an easy decision for a lot of users.

    Co-incident with the release of the VPX appliance, Citrix is announcing that they’re eliminating the Access Gateway Standard User Licenses. They will no longer be sold as of June 30. Instead, when you buy an Access Gateway (physical or virtual), you get a “platform license” that entitles you to use it to secure access to a XenApp or XenDesktop farm (i.e., what’s generally referred to as “CSG Replacement Mode”) at no additional charge. So now the equation is: CSG, free, but I’ve got to put it on a server, and if it’s a Windows Server, the OS is going to cost me $700 – $800 or so. CAG VPX, $995, but I import it directly into my XenServer infrastructure and don’t have to pay for anything else unless I want the advanced access functionality. Suddenly the value proposition looks a lot more attractive.

  • Speaking of the advanced access functionality, Citrix has made some licensing changes there as well. The Access Gateway Universal licensing model has been reduced from three tiers to two, and the prices have been lowered. So now, if you didn’t purchase the XenApp or XenDesktop Platinum Editions (which include Access Gateway Universal licenses), you can purchase the Access Gateway Universal licenses separately for $100/concurrent user in quantities up to 2,500, and $50/concurrent user for 2,500+ users.

What’s the down side? Well, I’m not sure there is one. The VPX appliance isn’t going to work well as a general-purpose SSL/VPN for thousands of concurrent users, but then neither did the 2010 hardware appliance. So if that’s what you need, or if you need the high-end enterprise features like Global Server Load Balancing to enable transparent failover to a Disaster Recovery site, then we need to have a conversation about NetScalers. But for basic CSG-like functionality, or a SmartAccess deployment for a few hundred concurrent users, the virtual appliance looks pretty darned attractive to me.

For more information on the Access Gateway VPX, including a demo of just how easy it is to import it into your XenServer environment and get it running, check out the following video from Citrix TV:

So, grasshopper, you have decided to take the plunge and virtualize your server infrastructure. Someone (perhaps us) explained the business benefits of virtualization, you decided that it made sense, and that it’s time to make the move. But do you know how virtualization will affect your Windows Server licensing model?

The first thing you need to know is that Windows Server licenses are assigned to physical hardware, not to server workloads. When you purchase a license, you must “assign” that license to a physical server. How do you do that? Well, in today’s world, there is no formal process for doing that, although if it makes you feel better, you can write it down somewhere.

You may assign more than one license to a physical server, but you may not assign the same license to more than one physical server. You may reassign a license from one physical server to another, but not more frequently than every 90 days, unless the server it was assigned to is being retired due to “permanent hardware failure.”

Sound reasonable so far? Of course it does. Right up until the license model runs head-on into one of the coolest features of virtualization: live motion. Most virtualization platforms, including Microsoft’s Hyper-V R2, allow you to easily move a virtual server from one physical host to another. Great feature, right? But if you do it, you may have just violated your Windows license agreement.

I say “may” because different versions of Windows Server come with different virtualization rights. For example, a Windows Server Standard license can be used to run one physical instance of Windows (and by “physical instance,” I mean Windows is installed directly on the hardware) or one virtual instance of Windows, but not both – unless the physical instance is being used solely to manage the virtual environment.

Let me say that another way: If you buy a single license for Windows Server Standard Edition with Hyper-V, you can install it directly on the hardware without bothering with the Hyper-V role. Or you can install the Hyper-V role, have one virtual Windows Server running on top of Hyper-V, and use the physical instance exclusively to manage the virtual instance. Of course, you haven’t really gained anything by doing that…but you can purchase additional copies of Windows Server Standard, assign them to the same physical host, and run more virtual servers on Hyper-V.

Thinking this scenario through, then, if you currently have a bunch of physical Windows Servers – each licensed with Windows Standard Edition – and you want to virtualize them all, that’s no problem. You can reassign your server licenses to your virtual hosts and be perfectly legal. As long as you don’t move a server from one host to another. But if all you own are Standard Edition licenses, and you move a server from one host to another, you’ve just violated the license agreement – unless you own a “spare” server license that you have “assigned” to the target server (the host you’re moving it to) but that is not being used.

Now, in the scenario I just described, it’s possible that the most cost-effective thing you could do is to just buy a few additional licenses as “spares” rather than re-licensing your entire environment. But let’s move ahead – once we’ve covered the other Windows editions that are available to you, you’ll be better able to decide what makes financial sense for your project.

Windows Server Enterprise Edition comes with expanded virtualization rights. Each Enterprise Edition license gives you the rights to run one physical instance and up to four virtual instances on the physical host to which it is assigned. Once again, if you want to run all four virtual instances, then the physical instance may only be used to manage the virtual environment. If you want to run other services on the physical instance – and that’s actually fairly common in a Hyper-V deployment – then you only get to run three virtual instances. And you may not split the license across multiple physical hosts.

The “estimated retail price” (just the license, no Software Assurance, assuming Open Business pricing) for Windows Enterprise is $2,358, vs. $726 for Windows Standard. So Enterprise is less expensive than four copies of Standard. Therefore, if you need to buy new licenses (perhaps you’re upgrading from Server 2003 to Server 2008 as part of your virtualization project), it may make sense in a small environment to buy a copy of Enterprise Edition for each virtual host, and perhaps supplement it with a few spare copies of Standard Edition. Here’s an example:

Let’s say you have a total of nine physical servers today, and you want to virtualize them on three dual-processor virtualization hosts. (You could probably run them on two hosts, but if one failed, it might be a stretch to run all nine on one host. If you start with three hosts, and one fails, you still have two to carry the load.) You could buy nine new copies of Windows Standard Edition for $6,534, but you’d have no flexibility to use live motion to move things around. On the other hand, you could buy three copies of Enterprise Edition for your three hosts for $7,074, and effectively have one “spare” instance on each host that’s available for moving a virtual machine from one host to another.

Of course, that may not be quite enough if you want to completely unload one of your servers (perhaps to take it off-line for maintenance), because unless you’re prepared to shut down one VM completely, you’re going to need to run five VMs on one of your remaining servers. Since you may not know in advance which server needs to assume the extra VM workload, you could just buy three additional copies of Standard Edition, and assign one to each host. That would push your total license acquisition cost to $9,252, but you would then be licensed for five VMs on each of your hosts.

The ultimate in flexibility is Windows Server Datacenter Edition. Datacenter Edition is licensed per processor socket rather than per physical host, but includes unlimited virtualization rights. You can run as many VMs on your hosts as they’re capable of running, and move them around to your heart’s content. If you just don’t want to worry about what’s running where or whether or not it’s technically legal to move a given VM around, this is the license model to use.

Of course, this is also the most expensive edition of Windows. The estimated retail price for Datacenter Edition is $2,405 per processor socket (regardless of the number of cores per processor). So it would cost $14,430 to license three dual-processor servers with Datacenter Edition. This probably isn’t cost effective if you’re only virtualizing nine servers. However, if you have lots of servers, and many of them are fairly lightly loaded (in terms of processor utilization), the picture could change. If your average consolidation ratio is greater than or equal to four servers per physical processor then Datacenter Edition becomes the most cost-effective license.

In fact, if you’re even close to that 4:1 ratio, you should strongly consider Datacenter Edition, for two reasons:

  1. Windows environments inevitably grow. However many servers you have today, you’re probably going to have more of them a year from now. With Datacenter Edition, you can continue to fire up new servers to the limits of your hardware without having to buy more server licenses.
  2. AMD already has six-core processors. You know the “arms race” between Intel and AMD will continue. So the number of servers per processor that you can reasonably expect to support will continue to increase as the processors themselves become more powerful and contain more cores, and as this happens, Datacenter Edition will look better and better.

Note that everything we’ve discussed holds true if you’re virtualizing on XenServer or VMware rather than on Hyper-V. The only difference is that you won’t be using any of the allowed physical instances of Windows.

If you want to delve deeper into this issue, you can download a copy of the Microsoft Product Use Rights document from their Web site. Happy virtualizing!

Recently I had my first opportunity to create a Windows 2008 R2 virtual machine on Citrix XenServer 5.0. When I attempted to install the operating system I ran into an interesting issue where the installation would hang right at the initial Windows install screen and the CPU usage pegged at 100%. Once that happened no matter how long I waited the installation never progressed beyond that point.

Of course what did I do? I turned to Google and quickly found the following article which provided a workaround for the issue I was having.

I followed the advice in the article and after running the “xe vm-param-set uuid= platform:viridian=false” command as outlined in the article was able to install Windows Server 2008 R2.

Two more things are worth mentioning here, which are not specifically addressed in the previously referenced article:

  1. With Windows 2008 R2 I was able to install the XenServer 5.0 tools with none of the problems others people are having with Windows 7 installations.
  2. This issue has been resolved in XenServer v5.5!

Recently I wrote a post about the hazards of XenServer HA and how to avoid a couple of different pitfalls which lead to XenServer fencing. In that post I talked about the necessity of correctly setting the HA heartbeat timeout for your environment so that your XenServers will allow enough time for a storage failover to occur. The idea, of course, is to prevent your XenServer from going into a “fence” condition which can occur for many reasons. The reason we’re discussing here is triggered when the XenServer believes its storage has suddenly become unavailable and it is not able to recover its state quickly enough to prevent the HA timeout from fencing the server.

I frequently build environments that use a pair of replicated DataCore SANmelody nodes (two physical nodes) and configure my XenServer in a multipath configuration. With this configuration my XenServers see two active paths to their storage (the status of the multipath is shown in the image below) – one path to each of the two nodes. If, for example, one of the SANmelody nodes goes off line, the other node will immediately take over. However, the XenServers have to be given enough time to fully recognize a failover has occurred, and the storage is still available, in order to avoid a fence. The default HA timeout in XenServer is 30 seconds which means if it takes a XenServer more than 30 seconds to realize the storage is still healthy and available then the server will fence. If the storage was indeed still available, then more than likely there were still VM guests up and running on the XenServer, which have now been taken offline unnecessarily.

To test and tune this setting I first make sure HA is enabled on the pool, then I perform hard failover tests where, using a DRAC or iLO card if I have one, I suddenly power cycle one of the storage servers and watch to see if any XenServers fence. I run this hard power cycle test because this specific problem never comes up with simple storage stops and restarts; rather it only shows up when a storage server actually goes down suddenly, or “hard,” as we say. So I run these tests because I want to stress the system to simulate unfortunate things like power failures, sudden server reboots due to gremlins, and other things along those lines. If nothing happens then great – let’s go home and we can sleep well knowing HA is working correctly. But what if you do have one or more servers which do fence because they believe their storage is gone when in fact it is not?

The last time I had this happen to me I had to test my environment several times, and with each successive run through the hard failover test I used a different timeout setting. In the end I found that 120 seconds worked best for me. (Keep in mind I am doing this during a build and there are no live production workloads running on any of these servers.)

So what is the downside of setting your timeout this high? Well, if a XenServer really fails (for whatever reason) it will take about 120 seconds for the Pool to decide there is a problem and then take action to restart the VMs elsewhere based upon available resources and the restart priority of each VM. Personally, I’d rather wait the 120 seconds when something has really gone wrong than suffer an unnecessary fence/shutdown when all the VMs were actually still running fine.

So how did I set the timeout values? Like this:

Rather than enable HA from the GUI you’re going to have to do it from a command line. I use PuTTY when I’m not actually at the XenServer console. The command you will use is xe pool-ha-enable heartbeat-sr-uuids=your uuid goes here ha-config:timeout=however many seconds you want.

But in that command string, how do you know what the sr-uuid is? The way I find it is to start with XenCenter and locate the SR (storage repository) which is going to be used for the heartbeat status disk. I locate the SCSI ID of that SR and copy the number as shown in this image (click picture to view full-size):

Finding the SCSI ID of a Storage Repository

Finding the SCSI ID of a Storage Repository


After I have that number I next connect to the master XenServer using PuTTY (the master XenServer in a pool is always the top server shown in XenCenter) and run this command xe pbd-list device-config=SCSIid:\ 360030d903131325f48415f4865617274 where the number in RED is the ID just copied from Xencenter:
Finding the sr-uuid

Finding the sr-uuid


What is shown above is what the output should look like. The reason you see three sequences in this example is because there are three hosts in this pool, notice the host-uuids are all different. However also notice the sr-uuid value is the same in each grouping and this is the number we are after. Take the sr-uuid you just found and enter it into a command like this: xe pool-ha-enable heartbeat-sr-uuids=7a213624-1209-c467-42ed-6ef72a1b7699 ha-config:timeout=120

It may take a bit of time for the command to actually complete but once it does you should be able to refresh your Xencenter by using either the xe-toolstack-restart or the service xapi restart command and then when you look at the pool level on the HA tab you should see that HA is now turned on:

Verify that HA is now turned on

Verify that HA is now turned on


As I said previously I found 120 seconds worked best for me – but how did I determine that? Simple: I started by setting the HA timeout to 60 seconds (twice the default) and then ran the hard shutdown test again. One of the XenServers still fenced so I went to 90 seconds, and then finally 120 seconds. The point at which the XenServers do not fence is where you want to stop. But don’t just do this test on one side of the storage! You will want to recover your storage servers and once everything is back online and healthy run the same test again – but this time hard-shutdown the other storage node. Now if none of the XenServers fence then you are done…unless you disable and re-enable HA. As I pointed out in that earlier post, this manual timeout setting is not persistent – if you disable and re-enable HA on the pool, you will have to re-enable it from the command line again to insure that the timeout is set correctly. If it’s done from the GUI, it will revert to the 30-second default.

NOTE: This was originally posted in October, 2009, and may not be a problem any more with current versions of XenServer, as some of the more recent comments would tend to verify – but we will keep the post active for historical purposes. (added by Moose Logic administrator, March 16, 2012)

The Level 1 HA (High Availability) feature that comes with Citrix Essentials for XenServer may be one of the best ways to crash your whole virtual infrastructure if you don’t understand how it works and don’t design in an appropriate level of redundancy. This of course will lead to hours of down time, unhappy management, possible data loss, and lots of extra work for you (most likely on a weekend).

The basics -
HA is designed to monitor the XenServer virtualization environment. When HA is enabled, the administrator can specify which virtual machines (VMs) need to be automatically restarted if the host server they’re running on should fail. If there is a failure of a host server, HA should then automatically restart its designated guest VMs on another host in the XenServer “resource pool.” Note that the HA function does not “live migrate” the guest VMs, because when a host fails the VMs on that host also fail. Rather, it selects another host server and restarts the VMs on that host. For all of this to happen correctly, Citrix’s HA requires two things to be true at all times:

  1. Each XenServer must be able to communicate with its peers in the pool.
  2. Each XenServer in the pool requires access at all times to the HA heartbeat disk, which is shared by all the XenServers in the pool.

If either of these two items is not true for any given XenServer in the pool, that server will “fence.” The short definition of “fencing” is that the XenServer suspects – although it’s not absolutely sure – that it is experiencing some kind of failure, so to protect against possible data corruption it shuts itself down – essentially sacrificing itself to protect the data – until a human comes along and sorts things out. If the fenced server is in a correctly configured HA pool, guest VMs that were configured for HA restart will be restarted on a surviving XenServer.

Considerations -
So… you have two XenServers all set up and all your VMs configured just the way you like them, and you decide to turn on HA. Everything appears to be working until one of the hosts suffers a failure and goes off line. (Murphy’s Law says this will happen on a Saturday evening right before your BBQ party is starting.) With HA enabled, you would expect, based on the whole “High Availability” concept, that everything would be OK. Critical VMs should get restarted on the other host and you should be able to deal with the failed host on Monday.

Oh, but wait, remember HA rule #1? The XenServer host that is still running suddenly does not have any peers to talk to. It no longer knows whether or not it’s healthy so, in the interest of protecting your data from corruption, it does what it’s designed to do – it fences, and now both of your XenServers are down. They may try to reboot, but you are now in an endless loop of fencing, and to get it resolved, you’re going to have to know how to use the “xe host-emergency-ha-disable force=true” command to resolve your problems. (And if you don’t understand that last sentence, you’re in for a long weekend.)

This results in a situation that we in IT refer to as “not good,” with a chance of “career altering,” and you’re going to miss your BBQ party.

Here’s another scenario that will spoil your party: What if both XenServers are actually healthy, and all the virtual servers are up and functioning, but the network link for the management communications between the XenServers fails? Again, each XenServer would think it was stranded from the pool and fence itself in an attempt to correct the issue. With both servers fencing, this would again create an endless loop of server fencing. In essence, one server would start to come back online and would still not see the other XenServer and would fence again, and so on, and so on.

So for those reasons a two-XenServer pool cannot successfully run HA! Just don’t do it – even though you can configure HA on a two-server pool the result can be disastrous and ruin your weekend…not to mention your next performance review.

HA In a Two-Server Pool - Just Don't Do It!

HA In a Two-Server Pool - Just Don't Do It!


Well, what about HA in a three node XenServer pool? Based upon the previously described scenarios, you now have a valid “pool,” in which HA will function. So you configure and enable HA, and when you test the HA functionality by killing one of the XenServers, everything works like it is supposed to. The guest VMs are restarted on the surviving XenServer hosts and you’re happy that everything is working correctly.

But here is another “gotcha!” If you have only one Ethernet interface per XenServer assigned to management, and they’re all plugged into one switch, what happens if the management link fails because a NIC fails – or even worse, the switch fails? If it’s just a NIC in one server, then that XenServer will fence – not too bad but still not what you want. If you were using a different set of NICs (as you always should) for the guest VMs to communicate with the rest of the world, then the guests on that server were probably up and working just fine until the server fenced. Sure, the critical ones will restart on the remaining servers, but you’ve lost a third of the resources in your pool unnecessarily.

Now let’s consider what would happen if the switch should fail and you had only single management ports on each XenServer all plugged into just that one switch. If this happens, it may be time to dust off the old resume, because you have just lost your entire XenServer pool. Why? Because when the switch went down, all the XenServers lost communication with one another, and each assumed that, because it was suddenly isolated from the pool, it must be experiencing some kind of failure. Therefore the whole pool fenced.

Non-Redundant Management Links - Don't Do This Either!

Non-Redundant Management Links - Don't Do This Either!


Conclusions -
Citrix’s HA does not work in a two host pool, period. With a pool of three or more XenServers you’ll be OK if you design the infrastructure correctly so that there is no single point of failure in your peer communications. How? Simply by bonding together two NICs, dedicating them to the management communication function, and then splitting the bonded pairs between two separate Ethernet switches. That way you’re protected against both a NIC failure and a switch failure.

But you’re not out of the woods yet! Don’t forget HA rule #2 – servers need to see the HA heartbeat disk. This is equally important, and you must consider the topology of that side of the network (iSCSI, Fiber, etc.) and be sure it is also redundant. And if you’re using iSCSI multi-pathing (e.g., with a pair of mirrored DataCore iSCSI SAN nodes), be sure to manually bump up the HA timeout interval so that if one of the SAN nodes should fail, the multi-pathing function has time to fail over to the other node before the XenServers all conclude that the HA heartbeat disk is gone – otherwise, again, they will all fence. Our testing indicates that a two minute timeout appears to have an adequate margin of safety. The default setting of one minute (oops – the default is actually 30 seconds) is definitely too short. Unfortunately, this setting does not appear to be persistent, so if you turn HA off and then back on, you’ll need to manually reset the timeout interval again. (This is probably a job for Workflow Studio, but we just haven’t had time to work through the process yet.)

NO Single Points of Failure
HA will do a fine job of protecting you, if you build the network correctly. So make sure you’ve built in enough redundancy that you have no single point of failure, and enjoy your BBQ.

The Right Way to Build an HA Environment

The Right Way to Build an HA Environment


P.S.: If you can’t justify more than two XenServers, but you still have one or more critical guests that need to be highly available, there is a solution: Marathon Technologies’ everRun VM. But that’s another post for another day.

Have you been considering moving from VMware ESX or vSphere to either Citrix® XenServer™ or Microsoft® Windows® Server 2008 Hyper-V™ – but been concerned about exactly how to go about it? Knowing what tools to use to make the migration go smoothly is often a major concern. Also, what kind of support can you get during the transition? And structured training on a new platform is not inexpensive, either. Now Citrix is trying to eliminate these obstacles with a new promotion that runs through March 31, 2010.

On October, 14, 2009, Citrix announced a new program called Project “Open Door”. Customers who switch existing VMware servers to XenServer or Hyper-V, and add Citrix Essentials™ for advanced virtualization management, will receive additional technical support, training, and conversion tools from Citrix at no additional cost.

The Project Open Door promotion will be effective worldwide from October 1 – March 31, 2010. Customers who decommission five or more VMware vSphere 4 or VI3 servers and replace them with XenServer or Hyper-V plus the Citrix Essentials solution, receive the following:

  • A free five incident support pack (5 by 8 hours) for every five servers converted
  • A voucher for six hours of online training for every five servers converted
  • Free migration tools for seamlessly transferring virtual machines from VMware to XenServer or Hyper-V

Check out http://www.citrix.com/opendoor for more information on the program. If you’re seriously considering making the switch, this just might be the time to do it.

Latest Blog Feeds
Testimonials
“Our business is all about process and margins; we rely on Moose Logic to install and manage network solutions that enable us to control both. Moose Logic created solutions that transformed our business relationships and processes.”
Ron Horowitz
Birchwood Park Homes
Read our Newsletter
Copyright © 2010 All rights reserved.
Wordpress Delicate template designed by NattyWP