
Many times, terms like “High Availability” and “Fault Tolerance” get thrown around as though they were the same thing. In fact, the term “fault tolerant” can mean different things to different people – and much like the terms “portal,” or “cloud,” it’s important to be clear about exactly what someone means by the term “fault tolerant.”

As part of our continuing efforts to guide you through the jargon jungle, we would like to discuss redundancy, fault tolerance, failover, and high availability, and we’d like to add one more term: continuous availability.

Our friends at Marathon Technologies shared the following graphic, which shows how IDC classifies the levels of availability:
[Graphic: IDC availability levels]

Redundancy is simply a way of saying that you are duplicating critical components in an attempt to eliminate single points of failure. Multiple power supplies, hot-plug disk drive arrays, multi-pathing with additional switches, and even duplicate servers are all part of building redundant systems.

Unfortunately, there are some failures, particularly if we’re talking about server hardware, that can take a system down regardless of how much you’ve tried to make it redundant. You can build a server with redundant hot-plug power supplies and redundant hot-plug disk drives, and still have the system go down if the motherboard fails – not likely, but still possible. And if it does happen, the server is down. That’s why IDC classifies this as “Availability Level 1” (“AL1” on the graphic)…just one level above no protection at all.

The next step up is some kind of failover solution. If a server experiences a catastrophic failure, the workloads are “failed over” to a system that is capable of supporting them. Depending on those workloads, and what kind of failover solution you have, that process can take anywhere from minutes to hours. If you’re at “AL2,” and you’ve replicated your data using, say, SAN replication or some kind of server-to-server replication, it could take a considerable amount of time to actually get things running again. If your servers are virtualized, with multiple virtualization hosts running against a shared storage repository, you may be able to configure your virtualization infrastructure to automatically restart a critical workload on a surviving host if the host it was running on experiences a catastrophic failure. That means your critical system is back up and online in the amount of time it takes the system to reboot, typically 5 to 10 minutes.

If you’re using clustering technology, your cluster may be able to fail over in a matter of seconds (“AL3” on the graphic). Microsoft server clustering is a classic example of this. Of course, it means that your application has to be cluster-aware, you have to be running Windows Enterprise Edition, and you may have to purchase multiple licenses for your application as well. And managing a cluster is not trivial, particularly when you’ve fixed whatever failed and it’s time to unwind all the stuff that happened when you failed over. And your application was still unavailable during whatever interval of time was required for the cluster to detect the failure and complete the failover process.

You could argue that a failover of 5 minutes or less equals a highly available system, and indeed there are probably many cases where you wouldn’t need anything better than that. But it is not truly fault tolerant. It’s probably not good enough if you are, say, running a security application that’s controlling the smart-card access to secured areas in an airport, or a video surveillance system that is sufficiently critical that you can’t afford to have a 5-minute gap in your video record, or a process control system where a 5-minute halt means you’ve lost the integrity of your work in process and potentially have to discard thousands of dollars’ worth of raw material and lose thousands more in productivity while you clean out your assembly line and restart it.

That brings us to the concept of continuous availability. This is the highest level of availability, and what we consider to be true fault tolerance. Instead of simply failing workloads over, this level allows for continuous processing without disruption of access to those workloads. Since there is no disruption in service there is no data loss, no loss of productivity and no waiting for your systems to restart your workloads.

So all this leads to the question of what your business needs.

Do you have applications that are critical to your organization? If those applications go down how long could you afford to be without access to them? If those applications go down how much data can you afford to lose? 5 minutes? An hour? And, most importantly, what does it cost you if that application is unavailable for a period of time? Do you know, or can you calculate it?

This is another way to ask what the requirements are for your “RTO” (“Recovery Time Objective” – i.e., how long, when a system goes down, do you have before you must be back up) and “RPO” (“Recovery Point Objective” – i.e., when you do get the system back up, how much data it is OK to have lost in the process). We’ve discussed these concepts in previous posts. These are questions that only you can answer, and the answers are significantly different depending on your business model. If you’re a small business, and your accounting server goes down, and all it means is that you have to wait until tomorrow to enter today’s transactions, it’s a far different situation from a major bank that is processing millions of dollars in credit card transactions.
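If you have never tried to put a number on an outage, a rough model is enough to start the conversation. The sketch below is only an illustration in Python: the cost categories, the field names, and the dollar figures are all hypothetical assumptions, not numbers from any real business. Downtime hours correspond to your RTO, and hours of lost data correspond to your RPO.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Inputs for a rough outage cost estimate. All figures are hypothetical."""
    revenue_lost_per_hour: float      # sales you cannot make while down
    idle_staff_cost_per_hour: float   # wages paid to people who cannot work
    rework_cost_per_lost_hour: float  # cost to re-enter one hour of lost data


def outage_cost(est: OutageEstimate, downtime_hours: float, data_loss_hours: float) -> float:
    """Cost of one outage: downtime relates to RTO, data loss relates to RPO."""
    downtime_cost = downtime_hours * (est.revenue_lost_per_hour + est.idle_staff_cost_per_hour)
    data_loss_cost = data_loss_hours * est.rework_cost_per_lost_hour
    return downtime_cost + data_loss_cost


if __name__ == "__main__":
    # Hypothetical small office: down for 8 hours, restored from a 24-hour-old backup.
    small_office = OutageEstimate(
        revenue_lost_per_hour=200.0,
        idle_staff_cost_per_hour=150.0,
        rework_cost_per_lost_hour=50.0,
    )
    print(outage_cost(small_office, downtime_hours=8, data_loss_hours=24))
```

Run the same function with the downtime and data loss you would expect at each availability level, and you have a starting point for comparing the cost of an outage with the cost of the protection.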

If you can satisfy your business needs by deploying one of the lower levels of availability, great! Just don’t settle for an AL1 or even an AL3 solution if what your business truly demands is continuous availability.

Color me skeptical when it comes to the “cloud computing” craze. Well, OK, maybe my skepticism isn’t so much about cloud computing per se as it is about the way people seem to think it is the ultimate answer to Life, the Universe, and Everything (shameless Douglas Adams reference). In part, that’s because I’ve been around IT long enough that I’ve seen previous incarnations of this concept come and go. Application Service Providers were supposed to take the world by storm a decade ago. Didn’t happen. The idea came back around as “Software as a Service” (or, as Microsoft preferred to frame it, “Software + Services”). Now it’s cloud computing. In all of its incarnations, the bottom line is that you’re putting your critical applications and data on someone else’s hardware, and sometimes even renting their Operating Systems to run it on and their software to manage it. And whenever you do that, there is an associated risk – as several users of Amazon’s EC2 service discovered just last week.

I have no doubt that the forensic analysis of what happened and why will drag on for a long time. Justin Santa Barbara had an interesting blog post last Thursday (April 21) that discussed how the design of Amazon Web Services (AWS), and its segmentation into Regions and Availability Zones, is supposed to protect you against precisely the kind of failure that occurred last week…except that it didn’t.

Phil Wainewright has an interesting post over at ZDnet.com on the “Seven lessons to learn from Amazon’s outage.” The first two points he makes are particularly important: First, “Read your cloud provider’s SLA very carefully” – because it appears that, despite the considerable pain some of Amazon’s customers were feeling, the SLA was not breached, legally speaking. Second, “Don’t take your provider’s assurances for granted” – for reasons that should be obvious.

Wainewright’s final point, though, may be the most disturbing, because it focuses on Amazon’s “lack of transparency.” He quotes BigDoor CEO Keith Smith as saying, “If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.” This was echoed in Santa Barbara’s blog post where, in discussing customers’ options for failing over to a different cloud, he observes, “Perhaps they would have started that process had AWS communicated at the start that it would have been such a big outage, but AWS communication is – frankly – abysmal other than their PR.” The transparency issue was also echoed by Andrew Hickey in an article posted April 26 on CRN.com.

CRN also wrote about “lessons learned,” although they came up with 10 of them. Their first point is that “Cloud outages are going to happen…and if you can’t stand the outage, get out of the cloud.” They go on to talk about not putting “Blind Trust” in the cloud, and to point out that management and maintenance are still required – “it’s not a ‘set it and forget it’ environment.”

And it’s not like this is the first time people have been affected by a failure in the cloud:

  • Amazon had a significant outage of their S3 online storage service back in July, 2008. Their northern Virginia data center was affected by a lightning strike in July of 2009, and another power issue affected “some instances in its US-EAST-1 availability zone” in December of 2009.
  • Gmail experienced a system-wide outage for a period of time in August, 2008, then was down again for over 1 ½ hours in September, 2009.
  • The Microsoft/Danger outage in October, 2009, caused a lot of T-Mobile customers to lose personal information that was stored on their Sidekick devices, including contacts, calendar entries, to-do lists, and photos.
  • In January, 2010, failure of a UPS took several hundred servers offline for hours at a Rackspace data center in London. (Rackspace also had a couple of service-affecting failures in their Dallas area data center in 2009.)
  • Salesforce.com users have suffered repeatedly from service outages over the last several years.

This takes me back to a comment made by one of our former customers, who was the CIO of a local insurance company, and who later joined our engineering team for a while. Speaking of the ASPs of a decade ago, he stated, “I wouldn’t trust my critical data to any of them – because I don’t believe that any of them care as much about my data as I do. And until they can convince me that they do, and show me the processes and procedures they have in place to protect it, they’re not getting my data!”

Don’t get me wrong – the “Cloud” (however you choose to define it…and that’s part of the problem) has its place. Cloud services are becoming more affordable, and more reliable. But, as one solution provider quoted in the CRN “lessons learned” article put it, “Just because I can move it into the cloud, that doesn’t mean I can ignore it. It still needs to be managed. It still needs to be maintained.” Never forget that it’s your data, and no one cares about it as much as you do, no matter what they tell you. Forrester analyst Rachel Dines may have said it best in her blog entry from last week: “ASSUME NOTHING. Your cloud provider isn’t in charge of your disaster recovery plan, YOU ARE!” (She also lists several really good questions you should ask your cloud provider.)

Cloud technologies can solve specific problems for you, and can provide some additional, and valuable, tools for your IT toolbox. But you dare not assume that all of your problems will automagically disappear just because you put all your stuff in the cloud. It’s still your stuff, and ultimately your responsibility.

Back at the end of January, DataCore announced the availability of a new product called SANsymphony-V. This product replaces SANmelody in their product line, and is the first step in the eventual convergence of SANmelody and SANsymphony into a single product with a common user interface.

Note: In case you’re not familiar with DataCore, they make software that will turn an off-the-shelf Windows server into an iSCSI SAN node (FibreChannel is optional) with all the bells and whistles you would expect from a modern SAN product. You can read more about them on our DataCore page.

We’ve been playing with SANsymphony-V in our engineering lab, and our technical team is impressed with both the functionality and the new user interface – but that’s another post for another day. This post is focused on the packaging and pricing of SANsymphony-V, which in many cases can come in significantly below the old SANmelody pricing.

First, we need to recap the old SANmelody pricing model. SANmelody nodes were priced according to the maximum amount of raw capacity that node could manage. The full-featured HA/DR product could be licensed for 0.5 Tb, 1 Tb, 2 Tb, 3 Tb, 4 Tb, 8 Tb, 16 Tb, or 32 Tb. So, for example, if you wanted 4 Tb of mirrored storage (two 4 Tb nodes in an HA pair), you would purchase two 4 Tb licenses. At MSRP, including 1 year of software maintenance, this would have cost you a total of $17,496. But what if you had another 2 Tb of archival data that you wanted available, but didn’t necessarily need it mirrored between your two nodes? Then you would want 4 Tb in one node, and 6 Tb in the other node. However, since there was no 6 Tb license, you’d have to buy an 8 Tb license. Now your total cost is up to $21,246.
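The rounding-up effect is easy to see in a few lines of Python. This is just a sketch: the function names are made up for illustration, and the per-license prices are not from a DataCore price list. They are backed out of the two totals quoted above ($17,496 / 2 = $8,748 for a 4 Tb license, and $21,246 - $8,748 = $12,498 for an 8 Tb license), so only those two tiers have a price here.

```python
from bisect import bisect_left

# SANmelody HA/DR license sizes in Tb, as listed above.
TIERS_TB = [0.5, 1, 2, 3, 4, 8, 16, 32]

# Per-license MSRP including 1 year of maintenance. ASSUMPTION: derived from
# the two totals quoted in the post, not taken from a price list.
PRICE_BY_TIER = {4: 8_748, 8: 12_498}


def smallest_license_tb(needed_tb):
    """Return the smallest license size that covers needed_tb."""
    i = bisect_left(TIERS_TB, needed_tb)
    if i == len(TIERS_TB):
        raise ValueError("no single license covers that much capacity")
    return TIERS_TB[i]


def pair_cost(node_a_tb, node_b_tb):
    """Total cost of a two-node pair, for tiers whose price is known here."""
    return sum(PRICE_BY_TIER[smallest_license_tb(tb)] for tb in (node_a_tb, node_b_tb))


if __name__ == "__main__":
    print(smallest_license_tb(6))  # a 6 Tb node has to round up to the next tier
    print(pair_cost(4, 4))         # two 4 Tb nodes
    print(pair_cost(4, 6))         # one 4 Tb node plus one 6 Tb node
```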

SANsymphony-V introduced the concept of separate node licenses and capacity licenses. The node license is based on the maximum amount of raw storage that can exist in the storage pool to which that node belongs. The increments are:

  • “VL1” – Up to 5 Tb – includes 1 Tb of capacity per node (more on this in a moment)
  • “VL2” – Up to 16 Tb – includes 2 Tb of capacity per node
  • “VL3” – Up to 100 Tb – includes 8 Tb of capacity per node
  • “VL4” – Up to 256 Tb – includes 40 Tb of capacity per node
  • “VL5” – More than 256 Tb – includes 120 Tb of capacity per node

In my example above, with 4 Tb of mirrored storage and 2 Tb of non-mirrored storage, there is a total of 10 Tb of storage in the storage pool: (4 x 2) + 2 = 10. Therefore, each node needs a “VL2” node license, since the total storage in the pool is more than 5 Tb but less than 16 Tb. We also need a total of 10 Tb of capacity licensing. We’ve already got 4 Tb, since 2 Tb of capacity were included with each node license. So we need to buy an additional six 1 Tb capacity licenses. At MSRP, this would cost a total of $14,850 – substantially less than the old SANmelody price.

The cool thing is, once we have our two VL2 nodes and our 10 Tb of total capacity licensing, DataCore doesn’t care how that capacity is allocated between the nodes. We can have 5 Tb of mirrored storage, we can have 4 Tb in one node and 6 Tb in the other, we can have 3 Tb in one node and 7 Tb in the other. We can divide it up any way we want to.

If we now want to add asynchronous replication to a third SAN node that’s off-site (e.g., in our DR site), that SAN node is considered a separate “pool,” so its licensing would be based on how much capacity we need at our DR site. If we only cared about replicating 4 Tb to our DR site, then the DR node would only need a VL1 node license and a total of 4 Tb of capacity licensing (i.e., a VL1 license + three additional 1 Tb capacity licenses, since 1 Tb of capacity is included with the VL1 license).
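The licensing arithmetic in the two examples above can be sketched in Python as follows. Treat it as an illustration only: the function name is hypothetical, the table is copied from the list of node licenses in this post, and treating each “up to” ceiling as inclusive is my assumption. It does not compute prices, since the post only gives a total.

```python
# (node license, pool ceiling in Tb, Tb of capacity included per node)
NODE_LICENSES = [
    ("VL1", 5, 1),
    ("VL2", 16, 2),
    ("VL3", 100, 8),
    ("VL4", 256, 40),
    ("VL5", float("inf"), 120),
]


def license_plan(pool_tb: int, nodes: int) -> tuple[str, int]:
    """Return (node license level, number of extra 1 Tb capacity licenses)."""
    for name, ceiling_tb, included_tb in NODE_LICENSES:
        # ASSUMPTION: a pool exactly at the ceiling still fits that level.
        if pool_tb <= ceiling_tb:
            extra = max(0, pool_tb - included_tb * nodes)
            return name, extra
    raise AssertionError("unreachable: VL5 has no ceiling")


if __name__ == "__main__":
    # HA pair: 4 Tb mirrored on both nodes plus 2 Tb non-mirrored = 10 Tb pool.
    print(license_plan(pool_tb=(4 * 2) + 2, nodes=2))
    # DR site: a single node that only needs to hold 4 Tb.
    print(license_plan(pool_tb=4, nodes=1))
```

For the HA pair this gives a VL2 license on each node plus six capacity licenses, and for the DR node a VL1 license plus three, matching the numbers worked out by hand.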

At this point, no new SANmelody licenses are being sold – although, if you need to, you can still upgrade an existing SANmelody license to handle more storage. If you’re an existing SANmelody customer with current software maintenance, rest assured that you will be entitled to upgrade to SANsymphony-V as a benefit of your software maintenance coverage. However, there will not be a mechanism that allows for an easy in-place upgrade until sometime in Q3. In the meantime, an upgrade from SANmelody to SANsymphony-V would entail a complete rebuild from the ground up. (Which we would be delighted to do for you if you just can’t wait for the new features.)

Most companies instinctively know that they need to be prepared for an event that will compromise business operations, but it’s often difficult to know where to begin.  We hear a lot of acronyms: “BC” (Business Continuity), “DR” (Disaster Recovery), “BIA” (Business Impact Analysis), “RA” (Risk Assessment), but not a lot of guidance on exactly what those things are, or how to figure out what is right for any particular business.

Many companies we meet with today are not really sure what components to implement or what to prioritize.  So what is the default reaction?  “Back up my Servers!  Just get the stuff off-site and I will be OK.”   Unfortunately, this can leave you with a false sense of security.  So let’s stop and take a moment to understand these acronyms that are tossed out at us.

BIA (Business Impact Analysis)
BIA is a process through which a business gains an understanding, from a financial perspective, of how and what to recover once a disruptive business event occurs. This is one of the more critical steps and should be done early on, as it directly impacts BC and DR. If you’re not sure how to get started, get out a blank sheet of paper, and start listing everything you can think of that could possibly disrupt your business. Once you have your list, rank each item on a scale of 1 – 3 on how likely it is to happen, and how severely it would impact your business if it did. This will give you some idea of what you need to worry about first (the items that were ranked #1 in both categories). Congratulations! You just performed a Risk Assessment!
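If your blank sheet of paper ends up in a spreadsheet or a script, the ranking step might look something like this Python sketch. The events and their rankings are hypothetical examples, and sorting by the sum of the two ranks (with impact as the tie-breaker) is one reasonable assumption about how to order the list, not a formal methodology.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    name: str
    likelihood: int  # 1 = most likely, 3 = least likely
    impact: int      # 1 = most severe, 3 = least severe


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Order risks so that items ranked #1 in both categories come first."""
    # ASSUMPTION: lower combined rank is more urgent; ties go to the worse impact.
    return sorted(risks, key=lambda r: (r.likelihood + r.impact, r.impact, r.name))


if __name__ == "__main__":
    # Hypothetical rankings for a small office.
    worksheet = [
        Risk("earthquake", likelihood=3, impact=1),
        Risk("extended power loss", likelihood=1, impact=1),
        Risk("network security breach", likelihood=2, impact=1),
        Risk("dusting of snow", likelihood=1, impact=3),
    ]
    for risk in prioritize(worksheet):
        print(risk.likelihood, risk.impact, risk.name)
```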

Now, before we go much further, you need to think about two more acronyms: “RTO” and “RPO.” RTO is the “Recovery Time Objective.” If one of those disruptive events occurs, how much time can pass before you have to be up and running again? An hour? A half day? A couple of days? It depends on your business, doesn’t it? I can’t tell you what’s right for you – only you can decide. RPO is the “Recovery Point Objective.” Once you’re back up, how much data is it OK to have lost in the recovery process? If you have to roll back to last night’s backup, is that OK? How about last Friday’s backup? Of course, if you’re Bank of America and you’re processing millions of dollars’ worth of credit card transactions, the answer to both RTO and RPO is “zero!” You can’t afford to be down at all, nor can you afford to lose any data in the recovery process. But, once again, most of our businesses don’t need quite that level of protection. Just be aware that the closer to zero you need those numbers to be, the more complex and expensive the solution is going to be!

BC (Business Continuity)
Business Continuity planning is the process through which a business develops a specific plan to assure survivability in the event of a disruptive business event: fire, earthquake, terrorist events, etc.  Ideally, that plan should encompass everything on the list you created – but if that’s too daunting, start with a plan that addresses the top-ranked items. Then revise the plan as time and resources allow to include items that were, say, ranked #1 in one category and #2 in the other, and so forth. Your plan should detail specifically how you are going to meet the RTO and RPO you decided on earlier.

And don’t forget the human factor. You can put together a great plan for how you’re going to replicate data off to another site where you can have critical systems up and running within a couple of hours of your primary facility turning into a smoking hole in the ground. But where are your employees going to report for work? Where will key management team members convene to deal with the crisis and its aftermath? How are they going to get there if transportation systems are disrupted, and how will they communicate if telephone lines are jammed?

DR (Disaster Recovery)
Disaster recovery is the process or action a business takes to bring the business back to a basic functioning entity after a disruptive business event. Note that BC and DR are complementary: BC addresses how you’re going to continue to operate in the face of a disruptive event; DR addresses how you get back to normal operation again.

Most small businesses think of disasters as events that are not likely to affect them. Their concept of “disaster” is that of a rare act of God or a terrorist attack. But in reality, there are many other things that would qualify as a “disruptive business event”: fire, long-term power loss, network security breach, swine flu pandemic, and in the case of one of my clients, a fire in the power vault of a building that crippled the building for three days. It is imperative not to overlook some of the simpler events that can stop us from conducting our business.

Finally, it is important to actually budget some money for these activities. Don’t try to justify this with a classic Return on Investment calculation, because you can’t. Something bad may never happen to your business…or it could happen tomorrow. If it never happens, then the only return you’ll get on your investment is peace of mind (or regulatory compliance, if you’re in a business that is required to have these plans in place). Instead, think of the expense the way you think of an insurance premium, because, just like an insurance premium, it’s money you’re paying to protect against a possible future loss.

These days, it seems everybody is talking about “cloud computing,” even if they don’t completely understand what it is. If you’re among those who are wondering what the “cloud” is all about and what it can do for you, maybe you should investigate moving your email to the cloud. You’ll find that there are several hosted Exchange providers (including ourselves) who would be very happy to help you do it.

Why switch to hosted Exchange?  Well,  it is fair to say that for most SMBs, email has become a predominant tool in our arsenal of communications.  The need for fast, efficient, and cost effective collaboration, as well as integration with our corporate environment and mobile devices, has become the baseline of operations – an absolute requirement for our workplace today.

So why not just get an Exchange Server or Small Business Server?  You can, but managing that environment may not be the best use of your resources.  Here are a few things to consider:

Low and Predictable Costs:
Hosted Exchange has become a low-cost enterprise service without the enterprise price tag. If you own the server and have it deployed on your own premises, it becomes your responsibility to prepare for a disruptive business event: fire, earthquake, flood, and in the Puget Sound Area, a dusting of snow. And it isn’t just an event in your own office space that you have to worry about:

  • A few years ago, there was a fire in a cable vault in downtown Seattle that caused some nearby businesses to lose connectivity for as long as four days.
  • Last year, wildfires in Eastern Washington interrupted power to the facility of one of our customers, and the recovery from the event was delayed because their employees were not allowed to cross the fire line to get to the facility.
  • If you are in a building that’s shared with other tenants, a fire or police action in a part of the building that’s unrelated to your own office space could still block access to the building and prevent your employees from getting to work.
  • Finally, even though it may be a cliche, you’re still at the mercy of a backhoe-in-the-parking-lot event.

The sheer cost of trying to protect yourself against all of these possibilities can be daunting, and many businesses would rather spend their cash on things that generate revenue instead.

Depending on features and needs, hosted Exchange plans can be as low as $5 per month per user – although to get the features most users want, you’re probably looking at $10 or so – and if you choose your hosting provider carefully, you’ll find that they have already made the required investments for high availability. Plus you’ll always have the latest version available to you without having to pay for hardware or software upgrades.

Simplified Administration:
For many small businesses, part of the turn-off of going to SBS or a full-blown Exchange server is the technical competency and cost associated with managing and maintaining the environment. While there are some advantages to having your own deployed environment, most customers I talk to today would rather not have to deal with the extra costs of administering backups and managing server licensing (and periodic upgrade costs), hardware refresh, security, etc. With a good hosted Exchange provider, you will enjoy all the benefits of an enterprise environment, with a simple management console.

Uptime:
Quality hosted Exchange providers will provide an SLA (“Service Level Agreement”) and uptime guarantees – and they have the manpower and infrastructure in place to assure uptime for their hundreds and thousands of users.

For deployed Exchange, you’ll need to invest in a robust server environment, power protection (e.g., an Uninterruptible Power Supply, or UPS, that can keep your server running long enough for a graceful shutdown – and maybe even a generator if you can’t afford to wait until your local utility restores power), data backup and recovery hardware and software, and the time required to test your backups.  (Important side note here: If you never do a test restore, you only think you have your data backed up. Far too often, the first time users find out that they have a problem is when they have a data loss and find that they are unable to successfully restore from their backup.) The cost/benefit ratio for a small business is simply not in favor of deployed.

Simple Deployment:
Properly setting up and configuring an Exchange environment and not leaving any security holes can be a daunting task for the non-IT Professional.  Most SMBs will need to hire someone like us to set up and manage the environment, and, although we love it when you hire us, and although the total cost of hiring us may be less than it would cost you to try to do it yourself (especially if something goes wrong), it is still a cost.

With a hosted environment, there is no complicated hardware and software setup.  In some cases, hosting providers have created a tool that you execute locally on your PC that will even configure the Outlook client for you.

A few questions to ask yourself:

  • Do we have the staff and technical competency to deploy and maintain our own Exchange environment?
  • What is the opportunity cost/gain by deploying our own?
  • What are the costs of upgrades/migration in a normal life-cycle refresh?
  • Is there a specific business driver that requires us to deploy?
  • What are the additional costs we will incur?  (Security, archiving, competency, patch management, encryption, licensing, etc.)

This is not to say that some businesses won’t benefit from a deployed environment, but for many – and perhaps most – businesses, hosted Exchange will provide a strong reliable service that will enable you to effectively communicate while having the peace of mind that your stuff is secure and available from any location where you have Internet access. Even if the ultimate bad thing happens and your office is reduced to a smoking crater, your people can still get to their email if they have Internet access at home or at the coffee shop down the street. If you’re as dependent on email as most of us are, there’s a definite value in that.

In this installment of the Moose Logic Video Series, Steve Parlee, our Director of Engineering, talks about:

  • Why we always use iSCSI HBAs in our Citrix XenServer deployments.
  • The possible risks of using HA in a two-server pool. (NOTE: Initial testing indicates that XenServer v5.6 may not present the same problems in a two-server pool as earlier versions. When we have completed our testing, we will post an update here.)
  • A useful utility for XenServer called “hostdevscan.”

This is the second of two videos addressing virtual storage and its benefits. In Part 1, we addressed thin provisioning and virtual volumes. In this video, Steve talks about multipathing, and how it contributes to a high availability storage solution:

This is the first of two videos addressing virtual storage and its benefits. There are a number of storage solutions out there on the market, but we have chosen to focus on DataCore for the purposes of this video. DataCore is an iSCSI SAN solution and you can learn more about their products here.

In part one, we address thin provisioning and virtual volumes. Watching this video will help you understand part 2 of “What is Storage Virtualization” where we talk about how multipathing relates to virtual volumes and contributes to a highly available SAN solution.

The TechTarget family of blog sites has a lot of great information. That’s why we have several of their sites linked in our Blogroll (under “Virtualization” in the right sidebar). But one thing that I don’t like about their sites is that – unlike this blog – there is no way to directly comment on their posts. That makes it difficult to respond to posts like the one last week on VMware’s High Availability (VMHA).

In that post, author David Davis opens by stating:

VMware’s High Availability (VMHA) provides high availability to any guest operating system at a potentially much lower cost than other HA options (as you don’t have to pay per virtual machines [VMs] or per server; VMHA is included in the price of vSphere).

I have a couple of problems with this statement.

First, I don’t know what “a potentially much lower cost” means. Is it less expensive than other HA options, or isn’t it? If it is, which other HA options are you comparing it to? If you’re going to throw that line out there, shouldn’t you give us the data on which the statement is based?

Second, it appears that the “lower cost” claim is primarily based on the fact that VMHA is included in the price of vSphere, rather than requiring a separate license. That’s a little like claiming that the high-end German sound system is less expensive if you get it in a Mercedes – because it’s standard equipment – whereas if you want one in your Malibu you have to buy it separately. What matters is the total amount of money I have to spend to get all the functionality I need, isn’t it?

It is true that with, say, Citrix XenServer, you have to purchase a Citrix Essentials for XenServer license to get HA functionality. That will cost you, at the suggested retail price (which nobody actually pays), $2,500 per XenServer for the Enterprise Edition. But the copy of XenServer you’re putting it on is free. On the other hand, vSphere 4 lists for $2,875 per processor, so if I’m using dual-processor servers, I’m looking at $5,750 for vSphere 4 compared to $2,500 for that copy of Essentials for XenServer. If I’m using quad-processor servers, vSphere 4 is going to run $11,500, but I still only need that single license for Essentials. And don’t forget the cost of VirtualCenter to control my vSphere environment, whereas XenCenter is, again, free, and runs on a workstation rather than requiring a dedicated server.
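If you want to run the same comparison for your own server count, the arithmetic above fits in a few lines of Python. This sketch uses only the two list prices quoted in this post; the function names are made up for illustration, and it deliberately leaves out VirtualCenter and any negotiated discounts because I have not given a figure for either.

```python
VSPHERE_PER_PROCESSOR = 2_875      # vSphere 4 list price quoted above
XEN_ESSENTIALS_PER_SERVER = 2_500  # Essentials for XenServer, Enterprise Edition


def vsphere_cost(servers: int, processors_per_server: int) -> int:
    """List price for vSphere 4, which is licensed per processor."""
    return servers * processors_per_server * VSPHERE_PER_PROCESSOR


def xenserver_cost(servers: int) -> int:
    """List price for Essentials, licensed per server; XenServer itself is free."""
    return servers * XEN_ESSENTIALS_PER_SERVER


if __name__ == "__main__":
    for processors in (2, 4):
        print(processors, vsphere_cost(1, processors), xenserver_cost(1))
```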

The point of this post is not to argue the relative merits of vSphere vs. XenServer, nor of whose HA feature is better. In fact, if you follow this blog, you’ll know that we’ve raised some red flags regarding how to properly deploy XenServer HA without risking potentially “career-altering” disasters. The point is simply that the old adage “don’t believe everything you read” is particularly appropriate for stuff you read on the Internet. (But you already knew that, right?)

People who throw out unsubstantiated generalized statements need to be challenged. If the TechTarget site allowed comments, I would have challenged the statement there. Since they don’t, I’m challenging it here. If I’m missing something, David Davis (or anyone else, for that matter) is welcome to comment on this post and point out what it is.

Recently I wrote a post about the hazards of XenServer HA and how to avoid a couple of different pitfalls which lead to XenServer fencing. In that post I talked about the necessity of correctly setting the HA heartbeat timeout for your environment so that your XenServers will allow enough time for a storage failover to occur. The idea, of course, is to prevent your XenServer from going into a “fence” condition which can occur for many reasons. The reason we’re discussing here is triggered when the XenServer believes its storage has suddenly become unavailable and it is not able to recover its state quickly enough to prevent the HA timeout from fencing the server.

I frequently build environments that use a pair of replicated DataCore SANmelody nodes (two physical nodes) and configure my XenServers for multipathing. With this configuration my XenServers see two active paths to their storage (the status of the multipath is shown in the image below), one path to each of the two nodes. If, for example, one of the SANmelody nodes goes offline, the other node will immediately take over. However, the XenServers have to be given enough time to recognize that a failover has occurred and that the storage is still available, in order to avoid a fence. The default HA timeout in XenServer is 30 seconds, which means that if it takes a XenServer more than 30 seconds to realize the storage is still healthy and available, the server will fence. If the storage was indeed still available, then more than likely there were still VM guests up and running on that XenServer, and they have now been taken offline unnecessarily.

To test and tune this setting I first make sure HA is enabled on the pool, then I perform hard failover tests where, using a DRAC or iLO card if I have one, I suddenly power cycle one of the storage servers and watch to see if any XenServers fence. I run this hard power cycle test because this specific problem never comes up with simple storage stops and restarts; rather it only shows up when a storage server actually goes down suddenly, or “hard,” as we say. So I run these tests because I want to stress the system to simulate unfortunate things like power failures, sudden server reboots due to gremlins, and other things along those lines. If nothing happens then great – let’s go home and we can sleep well knowing HA is working correctly. But what if you do have one or more servers which do fence because they believe their storage is gone when in fact it is not?

The last time I had this happen to me I had to test my environment several times, and with each successive run through the hard failover test I used a different timeout setting. In the end I found that 120 seconds worked best for me. (Keep in mind I am doing this during a build and there are no live production workloads running on any of these servers.)

So what is the downside of setting your timeout this high? Well, if a XenServer really fails (for whatever reason) it will take about 120 seconds for the Pool to decide there is a problem and then take action to restart the VMs elsewhere based upon available resources and the restart priority of each VM. Personally, I’d rather wait the 120 seconds when something has really gone wrong than suffer an unnecessary fence/shutdown when all the VMs were actually still running fine.

So how did I set the timeout values? Like this:

Rather than enable HA from the GUI, you’re going to have to do it from a command line. I use PuTTY when I’m not actually at the XenServer console. The command you will use is xe pool-ha-enable heartbeat-sr-uuids=<your SR UUID> ha-config:timeout=<timeout in seconds>, where the two values in angle brackets are placeholders for your own heartbeat SR UUID and the number of seconds you want.
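If you script your builds, the command can be assembled and sanity-checked before it is pasted into a console. The following is a minimal sketch, not part of XenServer: `build_ha_enable_command` is a hypothetical helper, and the UUID and 120-second value are the ones used later in this post.

```python
import shlex
import uuid


def build_ha_enable_command(sr_uuid, timeout_seconds):
    """Return the xe argument list for enabling HA with a custom timeout.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    # uuid.UUID raises ValueError unless the string holds 32 hex digits.
    uuid.UUID(sr_uuid)
    if int(timeout_seconds) <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        "xe", "pool-ha-enable",
        f"heartbeat-sr-uuids={sr_uuid}",
        f"ha-config:timeout={int(timeout_seconds)}",
    ]


cmd = build_ha_enable_command("7a213624-1209-c467-42ed-6ef72a1b7699", 120)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments means it could also be handed to `subprocess.run` on the pool master, although that step is not shown here.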

But in that command string, how do you know what the sr-uuid is? The way I find it is to start with XenCenter and locate the SR (storage repository) which is going to be used for the heartbeat status disk. I locate the SCSI ID of that SR and copy the number as shown in this image:

Finding the SCSI ID of a Storage Repository

After I have that number, I connect to the master XenServer using PuTTY (the master XenServer in a pool is always the top server shown in XenCenter) and run this command: xe pbd-list device-config=SCSIid:\ 360030d903131325f48415f4865617274 where the long number at the end is the SCSI ID just copied from XenCenter:
Finding the sr-uuid

What is shown above is what the output should look like. You see three groupings in this example because there are three hosts in this pool; notice that the host-uuids are all different. However, the sr-uuid value is the same in each grouping, and this is the number we are after. Take the sr-uuid you just found and enter it into a command like this: xe pool-ha-enable heartbeat-sr-uuids=7a213624-1209-c467-42ed-6ef72a1b7699 ha-config:timeout=120
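Picking the sr-uuid out of that output by eye is easy to get wrong, so here is a minimal sketch of doing it in code. The sample text is a hand-written approximation of the layout, not captured output. The pbd and host UUIDs in it are made-up placeholders, and `extract_sr_uuid` is a hypothetical helper. Only the sr-uuid and SCSI ID come from this post.

```python
import re

# Hand-written approximation of `xe pbd-list` output for a three-host pool.
# The pbd and host UUIDs are placeholders invented for this example.
SAMPLE_OUTPUT = """\
uuid ( RO)                  : 11111111-0000-0000-0000-000000000001
             host-uuid ( RO): aaaaaaaa-0000-0000-0000-000000000001
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
         device-config (MRO): SCSIid: 360030d903131325f48415f4865617274
    currently-attached ( RO): true


uuid ( RO)                  : 11111111-0000-0000-0000-000000000002
             host-uuid ( RO): aaaaaaaa-0000-0000-0000-000000000002
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
         device-config (MRO): SCSIid: 360030d903131325f48415f4865617274
    currently-attached ( RO): true


uuid ( RO)                  : 11111111-0000-0000-0000-000000000003
             host-uuid ( RO): aaaaaaaa-0000-0000-0000-000000000003
               sr-uuid ( RO): 7a213624-1209-c467-42ed-6ef72a1b7699
         device-config (MRO): SCSIid: 360030d903131325f48415f4865617274
    currently-attached ( RO): true
"""

# Matches only lines whose field name is exactly "sr-uuid" (not "host-uuid").
SR_UUID_LINE = re.compile(r"^\s*sr-uuid\s*\(.*?\)\s*:\s*(\S+)\s*$", re.MULTILINE)


def extract_sr_uuid(pbd_list_output):
    """Return the single sr-uuid shared by every grouping in the output."""
    found = set(SR_UUID_LINE.findall(pbd_list_output))
    if not found:
        raise ValueError("no sr-uuid field found in the output")
    if len(found) > 1:
        raise ValueError(f"expected one SR, found several: {sorted(found)}")
    return found.pop()


print(extract_sr_uuid(SAMPLE_OUTPUT))
```

The helper refuses to guess when the groupings disagree, since that would mean the SCSI ID filter matched more than one SR.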

It may take a bit of time for the command to actually complete, but once it does you should be able to refresh your XenCenter by using either the xe-toolstack-restart or the service xapi restart command. Then, when you look at the HA tab at the pool level, you should see that HA is now turned on:

Verify that HA is now turned on

As I said previously, I found 120 seconds worked best for me, but how did I determine that? Simple: I started by setting the HA timeout to 60 seconds (twice the default) and then ran the hard shutdown test again. One of the XenServers still fenced, so I went to 90 seconds, and then finally 120 seconds. The point at which the XenServers do not fence is where you want to stop. But don’t just do this test on one side of the storage! You will want to recover your storage servers and, once everything is back online and healthy, run the same test again, but this time hard-shutdown the other storage node. Now if none of the XenServers fence then you are done…unless you disable and re-enable HA. As I pointed out in that earlier post, this manual timeout setting is not persistent: if you disable and re-enable HA on the pool, you will have to re-enable it from the command line again to ensure that the timeout is set correctly. If it’s done from the GUI, it will revert to the 30-second default.
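The tuning procedure boils down to a simple search: step up through candidate timeouts and stop at the first one that survives a hard shutdown of each storage node in turn. Here is a minimal sketch of that logic. The `any_host_fenced` callback stands in for the manual test (set the timeout, power cycle one node, watch the pool). It and the node names are assumptions made for illustration, not XenServer calls.

```python
from typing import Callable, Iterable, Optional


def find_working_timeout(
    candidates: Iterable[int],
    storage_nodes: Iterable[str],
    any_host_fenced: Callable[[int, str], bool],
) -> Optional[int]:
    """Return the first timeout at which no host fences for any storage node.

    any_host_fenced(timeout, node) should report whether any XenServer
    fenced when `node` was hard-shut-down with HA set to `timeout` seconds.
    """
    nodes = list(storage_nodes)
    for timeout in candidates:
        # any() stops at the first node that causes a fence, so a failed
        # timeout is abandoned without testing the remaining nodes.
        if not any(any_host_fenced(timeout, node) for node in nodes):
            return timeout
    return None  # none of the candidates was long enough


# Stand-in for the manual test: pretend hosts fence below 120 seconds.
def simulated_test(timeout: int, node: str) -> bool:
    return timeout < 120


print(find_working_timeout([60, 90, 120], ["node-a", "node-b"], simulated_test))
```

A timeout is only accepted after passing on both nodes, which matches the advice above not to test just one side of the storage.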