Recently I wrote a post about the hazards of XenServer HA and how to avoid a couple of different pitfalls that lead to XenServer fencing. In that post I talked about the necessity of correctly setting the HA heartbeat timeout for your environment so that your XenServers will allow enough time for a storage failover to occur. The idea, of course, is to prevent your XenServer from going into a “fence” condition, which can happen for many reasons. The one we’re discussing here is triggered when a XenServer believes its storage has suddenly become unavailable and cannot recover its state quickly enough to prevent the HA timeout from fencing the server.
I frequently build environments that use a pair of replicated DataCore SANmelody nodes (two physical nodes) and configure my XenServers in a multipath configuration. With this configuration my XenServers see two active paths to their storage (the multipath status is shown in the image below), one path to each of the two nodes. If, for example, one of the SANmelody nodes goes offline, the other node will immediately take over. However, the XenServers have to be given enough time to fully recognize that a failover has occurred and the storage is still available in order to avoid a fence. The default HA timeout in XenServer is 30 seconds, which means that if it takes a XenServer more than 30 seconds to realize the storage is still healthy and available, the server will fence. If the storage was indeed still available, then more than likely there were still VM guests up and running on that XenServer, and they have now been taken offline unnecessarily.
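You can check the multipath state yourself from dom0 with multipath -ll. As a quick sanity check, a one-liner can count the active paths; the sample output below is a mock-up of the general format, not a capture from a live system, so your device names and vendor strings will differ:

```shell
#!/bin/sh
# Count lines reporting an active, ready path in `multipath -ll` output.
count_active_paths() {
    grep -c 'active ready'
}

# Illustrative sample only -- real output varies by device and vendor.
sample_output='360030d903131325f48415f4865617274 dm-0 DataCore,SANmelody
size=500G features=0 hwhandler=0
|-+- policy=round-robin 0 prio=1 status=active
| `- 1:0:0:1 sdb 8:16 active ready running
`-+- policy=round-robin 0 prio=1 status=enabled
  `- 2:0:0:1 sdc 8:32 active ready running'

# On a live host you would pipe the real command instead:
#   multipath -ll | count_active_paths
printf '%s\n' "$sample_output" | count_active_paths   # prints 2
```

With both SANmelody nodes healthy you should see two active paths; a count of one means a failover has already happened or a path is down.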
To test and tune this setting I first make sure HA is enabled on the pool, then I perform hard failover tests where, using a DRAC or iLO card if I have one, I suddenly power cycle one of the storage servers and watch to see if any XenServers fence. I run this hard power cycle test because this specific problem never comes up with simple storage stops and restarts; rather it only shows up when a storage server actually goes down suddenly, or “hard,” as we say. So I run these tests because I want to stress the system to simulate unfortunate things like power failures, sudden server reboots due to gremlins, and other things along those lines. If nothing happens then great – let’s go home and we can sleep well knowing HA is working correctly. But what if you do have one or more servers which do fence because they believe their storage is gone when in fact it is not?
The last time I had this happen to me I had to test my environment several times, and with each successive run through the hard failover test I used a different timeout setting. In the end I found that 120 seconds worked best for me. (Keep in mind I am doing this during a build and there are no live production workloads running on any of these servers.)
So what is the downside of setting your timeout this high? Well, if a XenServer really fails (for whatever reason) it will take about 120 seconds for the pool to decide there is a problem and then take action to restart the VMs elsewhere based on available resources and the restart priority of each VM. Personally, I’d rather wait the 120 seconds when something has really gone wrong than suffer an unnecessary fence/shutdown when all the VMs were actually still running fine.
So how did I set the timeout values? Like this:
Rather than enable HA from the GUI, you’re going to have to do it from a command line. I use PuTTY when I’m not actually at the XenServer console. The command you will use is: xe pool-ha-enable heartbeat-sr-uuids=<your-sr-uuid> ha-config:timeout=<seconds>
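Spelled out with example values (the UUID and timeout below are placeholders of my own, not values to copy), the command takes this shape:

```shell
#!/bin/sh
# Build the pool-ha-enable command from placeholder values.
# SR_UUID and TIMEOUT are examples -- substitute your own.
SR_UUID="7a213624-1209-c467-42ed-6ef72a1b7699"
TIMEOUT=120

CMD="xe pool-ha-enable heartbeat-sr-uuids=${SR_UUID} ha-config:timeout=${TIMEOUT}"
echo "$CMD"

# Run it only when you are actually ready to enable HA on the pool:
#   eval "$CMD"
```

The next step explains where the real SR UUID comes from.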
But in that command string, how do you know what the sr-uuid is? The way I find it is to start with XenCenter and locate the SR (storage repository) which is going to be used for the heartbeat status disk. I locate the SCSI ID of that SR and copy the number as shown in this image (click picture to view full-size):
After I have that number I next connect to the master XenServer using PuTTY (the master XenServer in a pool is always the top server shown in XenCenter) and run this command: xe pbd-list device-config=SCSIid:\ 360030d903131325f48415f4865617274 where the long number at the end is the SCSI ID just copied from XenCenter:
What is shown above is what the output should look like. The reason you see three groupings in this example is that there are three hosts in this pool; notice the host-uuids are all different. However, also notice that the sr-uuid value is the same in each grouping, and this is the number we are after. Take the sr-uuid you just found and use it in a command like this: xe pool-ha-enable heartbeat-sr-uuids=7a213624-1209-c467-42ed-6ef72a1b7699 ha-config:timeout=120
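Since the sr-uuid repeats once per host, a small pipeline can pull the single distinct value out of the xe pbd-list output. The sample text here only mimics the shape of the real output (the host UUIDs are made up), so treat this as a sketch:

```shell
#!/bin/sh
# Extract the unique sr-uuid from `xe pbd-list`-style output.
extract_sr_uuid() {
    awk '/sr-uuid/ {print $NF}' | sort -u
}

# Mock-up of the output format: three PBD groupings, one per host,
# all pointing at the same SR. Host UUIDs below are invented.
sample='host-uuid ( RO): 11111111-aaaa-bbbb-cccc-000000000001
     sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
host-uuid ( RO): 11111111-aaaa-bbbb-cccc-000000000002
     sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
host-uuid ( RO): 11111111-aaaa-bbbb-cccc-000000000003
     sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699'

# On a live pool master you would pipe the real command instead:
#   xe pbd-list device-config=SCSIid:\ <your-scsi-id> | extract_sr_uuid
printf '%s\n' "$sample" | extract_sr_uuid
```

If more than one UUID comes back, you matched more than one SR and should double-check the SCSI ID you copied from XenCenter.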
It may take a bit of time for the command to actually complete, but once it does you should be able to refresh your XenCenter by using either the xe-toolstack-restart or the service xapi restart command, and then when you look at the HA tab at the pool level you should see that HA is now turned on:
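You can also confirm this without XenCenter: ha-enabled is a standard pool parameter, so a check along these lines from dom0 should report true once HA is up. The helper function here is my own sketch; only the commented xe line is the actual query:

```shell
#!/bin/sh
# Interpret the pool's ha-enabled parameter value ("true"/"false").
ha_status_message() {
    read -r value
    if [ "$value" = "true" ]; then
        echo "HA is enabled"
    else
        echo "HA is NOT enabled"
    fi
}

# Live usage (requires a XenServer host, so commented out here):
#   xe pool-param-get uuid="$(xe pool-list --minimal)" param-name=ha-enabled | ha_status_message

# Demonstration with the value you should see after the steps above:
echo "true" | ha_status_message
```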
As I said previously, I found 120 seconds worked best for me – but how did I determine that? Simple: I started by setting the HA timeout to 60 seconds (twice the default) and ran the hard shutdown test again. One of the XenServers still fenced, so I went to 90 seconds, and then finally 120 seconds. The point at which the XenServers no longer fence is where you want to stop. But don’t just do this test on one side of the storage! You will want to recover your storage servers, and once everything is back online and healthy, run the same test again – but this time hard-shutdown the other storage node. Now if none of the XenServers fence then you are done…unless you disable and re-enable HA. As I pointed out in that earlier post, this manual timeout setting is not persistent – if you disable and re-enable HA on the pool, you will have to re-enable it from the command line again to ensure that the timeout is set correctly. If it’s done from the GUI, it will revert to the 30-second default.