
Red Cross Ready Rating Program

Ready Rating Program Seal


A few days ago, I spotted a headline in the local morning paper: “SBA Partners with the Red Cross to Promote Disaster Planning.” We’ve written some posts in the past that dealt with the importance of DR planning, and how to go about it, so this piqued my curiosity enough that I visited the Red Cross “Ready Rating” Web site. I was sufficiently impressed with what I found there that I wanted to share it with you.

Membership in the Ready Rating program is free. All you have to do to become a member is to sign up and take the on-line self-assessment, which will help you determine your current level of preparedness. And I’m talking about overall business preparedness, not just IT preparedness. The assessment rates you on your responses to questions dealing with things like:

  • Have you conducted a “hazard vulnerability assessment,” including identifying appropriate emergency responders (e.g., police, fire, etc.) in your area and, if necessary, obtaining agreements with them?
  • Have you developed a written emergency response plan?
  • Has that plan been communicated to employees, families, clients, media representatives, etc.?
  • Have you developed a “continuity of operations plan?”
  • Have you trained your people on what to do in an emergency?
  • Do you conduct regular drills and exercises?

That last point is more important than you might think. It’s not easy to think clearly when you’re in the middle of an earthquake, or when you’re trying to find the exit when the building is on fire and there’s smoke everywhere. The best way to ensure that everyone does what they’re supposed to do is to drill until the response is automatic. It’s why we had fire drills when we were in elementary school. It’s still effective now that we’re all grown up.

Once you become a member, your membership will automatically renew from year to year, as long as you take the self-assessment annually and can show that your score has improved from the prior year. (Once your score reaches a certain threshold, you’re only required to maintain that level to retain your membership.)

So, why should you be concerned about this? It’s hard to imagine that, after the tsunami in Japan and the flooding and tornadoes here at home, there’s anyone out there who still doesn’t get it. But, just in case, consider these points taken from the “Emergency Fast Facts” document in the members’ area:

  • Only 2 in 10 Americans feel prepared for a catastrophic event.
  • Close to 60% of Americans are wholly unprepared for a disaster of any kind.
  • 54% of Americans don’t prepare because they believe a disaster will not affect them – although 51% of Americans have experienced at least one emergency situation where they lost utilities for at least three days, had to evacuate and could not return home, could not communicate with family members, or had to provide first aid to others.
  • 94% of small business owners believe that a disaster could seriously disrupt their business within the next two years.
  • 15 – 40% of small businesses fail following a natural or man-made disaster.

If you’re not certain how to even get started, they can help there as well. Here’s a screen capture showing a partial list of the resources available in the members’ area:

Member Resources


And speaking of getting started, check this out: Just about everything I’ve ever read about disaster preparedness talks about the importance of having a “72-hour kit” – something that you can quickly grab and take with you that contains everything you need to survive for three days. Well, for those of you who haven’t got the time to scrounge up all of the recommended items and pack them up, you may find the solution at your local Costco. Here’s what I spotted on my most recent trip:

Pre-Packaged 3-day Survival Kit

Yep, it’s a pre-packaged 3-day survival kit. The cost at my local store (in Woodinville, WA, if you’re curious) was $69.95. That, in my opinion, is a pretty good deal.

So, if you haven’t started planning yet, consider this your call to action. Don’t end up as a statistic. You can do this.

Many times, terms like “High Availability” and “Fault Tolerance” get thrown around as though they were the same thing. In fact, the term “fault tolerant” can mean different things to different people – and much like the terms “portal,” or “cloud,” it’s important to be clear about exactly what someone means by the term “fault tolerant.”

As part of our continuing efforts to guide you through the jargon jungle, we would like to discuss redundancy, fault tolerance, failover, and high availability, and we’d like to add one more term: continuous availability.

Our friends at Marathon Technologies shared the following graphic, which shows how IDC classifies the levels of availability:

Graphic of Availability Levels

The Availability Pyramid



Redundancy is simply a way of saying that you are duplicating critical components in an attempt to eliminate single points of failure. Multiple power supplies, hot-plug disk drive arrays, multi-pathing with additional switches, and even duplicate servers are all part of building redundant systems.

Unfortunately, there are some failures, particularly if we’re talking about server hardware, that can take a system down regardless of how much you’ve tried to make it redundant. You can build a server with redundant hot-plug power supplies and redundant hot-plug disk drives, and still have the system go down if the motherboard fails – not likely, but still possible. And if it does happen, the server is down. That’s why IDC classifies this as “Availability Level 1” (“AL1” on the graphic)…just one level above no protection at all.

The next step up is some kind of failover solution. If a server experiences a catastrophic failure, the workloads are “failed over” to a system that is capable of supporting them. Depending on those workloads, and on what kind of failover solution you have, that process can take anywhere from minutes to hours. If you’re at “AL2,” and you’ve replicated your data using, say, SAN replication or some kind of server-to-server replication, it could take a considerable amount of time to actually get things running again. If your servers are virtualized, with multiple virtualization hosts running against a shared storage repository, you may be able to configure your virtualization infrastructure to automatically restart a critical workload on a surviving host if the host it was running on experiences a catastrophic failure – meaning that your critical system is back up and on-line in the amount of time it takes the system to reboot – typically 5 to 10 minutes.

If you’re using clustering technology, your cluster may be able to fail over in a matter of seconds (“AL3” on the graphic). Microsoft server clustering is a classic example of this. Of course, it means that your application has to be cluster-aware, you have to be running Windows Enterprise Edition, and you may have to purchase multiple licenses for your application as well. And managing a cluster is not trivial, particularly when you’ve fixed whatever failed and it’s time to unwind all the stuff that happened when you failed over. And your application was still unavailable during whatever interval of time was required for the cluster to detect the failure and complete the failover process.

You could argue that a failover of 5 minutes or less equals a highly available system, and indeed there are probably many cases where you wouldn’t need anything better than that. But it is not truly fault tolerant. It’s probably not good enough if you are, say, running a security application that’s controlling the smart-card access to secured areas in an airport, or a video surveillance system that is sufficiently critical that you can’t afford to have a 5-minute gap in your video record, or a process control system where a 5-minute halt means you’ve lost the integrity of your work in process and potentially have to discard thousands of dollars’ worth of raw material and lose thousands more in lost productivity while you clean out your assembly line and restart it.

That brings us to the concept of continuous availability. This is the highest level of availability, and what we consider to be true fault tolerance. Instead of simply failing workloads over, this level allows for continuous processing without disruption of access to those workloads. Since there is no disruption in service there is no data loss, no loss of productivity and no waiting for your systems to restart your workloads.

So all this leads to the question of what your business needs.

Do you have applications that are critical to your organization? If those applications go down how long could you afford to be without access to them? If those applications go down how much data can you afford to lose? 5 minutes? An hour? And, most importantly, what does it cost you if that application is unavailable for a period of time? Do you know, or can you calculate it?
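If you’ve never tried to put a number on that last question, here is a rough back-of-the-envelope sketch in Python. The cost model (lost revenue plus idle staff time plus a one-time recovery cost) and every dollar figure in it are illustrative assumptions, not numbers from any real business. Plug in your own.

```python
def downtime_cost(hours_down, revenue_per_hour, idle_staff,
                  loaded_hourly_wage, recovery_cost=0.0):
    """Estimate the cost of an outage (a deliberately simple, assumed model)."""
    lost_revenue = hours_down * revenue_per_hour
    lost_productivity = hours_down * idle_staff * loaded_hourly_wage
    return lost_revenue + lost_productivity + recovery_cost


# Hypothetical business: $2,000/hour in revenue, 10 people idled at $40/hour.
for hours in (5 / 60, 1, 8):
    cost = downtime_cost(hours, revenue_per_hour=2000.0,
                         idle_staff=10, loaded_hourly_wage=40.0)
    print(f"{hours:5.2f} h outage -> ${cost:,.2f}")
```

Comparing the result for a 5-minute outage against an 8-hour one is a quick way to see whether a higher availability level is worth paying for.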

This is another way to ask what the requirements are for your “RTO” (“Recovery Time Objective” – i.e., how long, when a system goes down, do you have before you must be back up) and “RPO” (“Recovery Point Objective” – i.e., when you do get the system back up, how much data it is OK to have lost in the process). We’ve discussed these concepts in previous posts. These are questions that only you can answer, and the answers are significantly different depending on your business model. If you’re a small business, and your accounting server goes down, and all it means is that you have to wait until tomorrow to enter today’s transactions, it’s a far different situation from a major bank that is processing millions of dollars in credit card transactions.

If you can satisfy your business needs by deploying one of the lower levels of availability, great! Just don’t settle for an AL1 or even an AL3 solution if what your business truly demands is continuous availability.

Color me skeptical when it comes to the “cloud computing” craze. Well, OK, maybe my skepticism isn’t so much about cloud computing per se as it is about the way people seem to think it is the ultimate answer to Life, the Universe, and Everything (shameless Douglas Adams reference). In part, that’s because I’ve been around IT long enough that I’ve seen previous incarnations of this concept come and go. Application Service Providers were supposed to take the world by storm a decade ago. Didn’t happen. The idea came back around as “Software as a Service” (or, as Microsoft preferred to frame it, “Software + Services”). Now it’s cloud computing. In all of its incarnations, the bottom line is that you’re putting your critical applications and data on someone else’s hardware, and sometimes even renting their Operating Systems to run it on and their software to manage it. And whenever you do that, there is an associated risk – as several users of Amazon’s EC2 service discovered just last week.

I have no doubt that the forensic analysis of what happened and why will drag on for a long time. Justin Santa Barbara had an interesting blog post last Thursday (April 21) that discussed how the design of Amazon Web Services (AWS), and its segmentation into Regions and Availability Zones, is supposed to protect you against precisely the kind of failure that occurred last week…except that it didn’t.

Phil Wainewright has an interesting post over at ZDnet.com on the “Seven lessons to learn from Amazon’s outage.” The first two points he makes are particularly important: First, “Read your cloud provider’s SLA very carefully” – because it appears that, despite the considerable pain some of Amazon’s customers were feeling, the SLA was not breached, legally speaking. Second, “Don’t take your provider’s assurances for granted” – for reasons that should be obvious.

Wainewright’s final point, though, may be the most disturbing, because it focuses on Amazon’s “lack of transparency.” He quotes BigDoor CEO Keith Smith as saying, “If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.” This was echoed in Santa Barbara’s blog post where, in discussing customers’ options for failing over to a different cloud, he observes, “Perhaps they would have started that process had AWS communicated at the start that it would have been such a big outage, but AWS communication is – frankly – abysmal other than their PR.” The transparency issue was also echoed by Andrew Hickey in an article posted April 26 on CRN.com.

CRN also wrote about “lessons learned,” although they came up with 10 of them. Their first point is that “Cloud outages are going to happen…and if you can’t stand the outage, get out of the cloud.” They go on to talk about not putting “Blind Trust” in the cloud, and to point out that management and maintenance are still required – “it’s not a ‘set it and forget it’ environment.”

And it’s not like this is the first time people have been affected by a failure in the cloud:

  • Amazon had a significant outage of their S3 online storage service back in July, 2008. Their northern Virginia data center was affected by a lightning strike in July of 2009, and another power issue affected “some instances in its US-EAST-1 availability zone” in December of 2009.
  • Gmail experienced a system-wide outage for a period of time in August, 2008, then was down again for over 1 ½ hours in September, 2009.
  • The Microsoft/Danger outage in October, 2009, caused a lot of T-Mobile customers to lose personal information that was stored on their Sidekick devices, including contacts, calendar entries, to-do lists, and photos.
  • In January, 2010, failure of a UPS took several hundred servers offline for hours at a Rackspace data center in London. (Rackspace also had a couple of service-affecting failures in their Dallas area data center in 2009.)
  • Salesforce.com users have suffered repeatedly from service outages over the last several years.

This takes me back to a comment made by one of our former customers, who was the CIO of a local insurance company, and who later joined our engineering team for a while. Speaking of the ASPs of a decade ago, he stated, “I wouldn’t trust my critical data to any of them – because I don’t believe that any of them care as much about my data as I do. And until they can convince me that they do, and show me the processes and procedures they have in place to protect it, they’re not getting my data!”

Don’t get me wrong – the “Cloud” (however you choose to define it…and that’s part of the problem) has its place. Cloud services are becoming more affordable, and more reliable. But, as one solution provider quoted in the CRN “lessons learned” article put it, “Just because I can move it into the cloud, that doesn’t mean I can ignore it. It still needs to be managed. It still needs to be maintained.” Never forget that it’s your data, and no one cares about it as much as you do, no matter what they tell you. Forrester analyst Rachel Dines may have said it best in her blog entry from last week: “ASSUME NOTHING. Your cloud provider isn’t in charge of your disaster recovery plan, YOU ARE!” (She also lists several really good questions you should ask your cloud provider.)

Cloud technologies can solve specific problems for you, and can provide some additional, and valuable, tools for your IT toolbox. But you dare not assume that all of your problems will automagically disappear just because you put all your stuff in the cloud. It’s still your stuff, and ultimately your responsibility.

This post requires two major disclaimers:

  1. I am not an engineer. I am a relatively technical sales & marketing guy. I have my own Small Business Server-based network at home, and I know enough about Microsoft Operating Systems to be able to muddle through most of what gets thrown at me. And, although I’ve done my share of friends-and-family-tech-support, you do not want me working on your critical business systems.
  2. I am not, by any stretch of the imagination, a Linux guru. However, I’ve come to appreciate the “LAMP” (Linux/Apache/MySQL/PHP) platform for Web hosting. With apologies to my Microsoft friends, there are some things that are quite easy to do on a LAMP platform that are not easy at all on a Windows Web server. (Just try, for example, to create a file called “.htaccess” on a Windows file system.)

Some months ago, I got my hands on an old Dell PowerEdge SC420. It happened to be a twin of the system I’m running SBS on, but didn’t have quite as much RAM or as much disk space. I decided to install CentOS v5.4 on it, turn it into a LAMP server, and move the four or five Web sites I was running on my Small Business Server over to my new LAMP server instead. I even found an open source utility called “ISP Config” that is a reasonable alternative – at least for my limited needs – to the Parallels Plesk control panel that most commercial Web hosts offer.

Things went along swimmingly until last weekend, when I noticed a strange, rhythmic clicking and beeping coming from my Web server. Everything seemed to be working – Web sites were all up – I logged on and didn’t see anything odd in the system log files (aside from the fact that a number of people out there seemed to be trying to use FTP to hack my administrative password). So I decided to restart the system, on the off chance that it would clear whatever error was occurring.

Those of you who are Linux gurus probably just did a double facepalm…because, in retrospect, I should have checked the health of my disk array before shutting down. The server didn’t have a hardware RAID controller, so I had built my system with a software RAID1 array – which several sources suggest is both safer and better performing than the “fake RAID” that’s built into the motherboard. Turns out that the first disk in my array (/dev/sda for those who know the lingo) had died, and for some reason, the system wouldn’t boot from the other drive.
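For the record, the check I skipped is not hard. On Linux, the kernel reports the state of software RAID arrays in /proc/mdstat, where a status like [2/1] [_U] means the array expects two members and only one is active. Here is a minimal Python sketch that flags degraded arrays. The sample text and device names are made up for illustration, and real mdstat output can contain extra lines (during a resync, for example) that this simple parser just ignores.

```python
import re

# Hypothetical /proc/mdstat contents for a two-array RAID1 setup.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      78019584 blocks [2/2] [UU]

md0 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/1] [_U]": expected members / active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(SAMPLE))
```

On a real server you would pass in the contents of /proc/mdstat instead of the sample string, ideally before you reach for the restart button.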

This is the point where I did a double facepalm, and muttered a few choice words under my breath. Not that it was a tragedy – all that server did was host my Web sites, and my Web site data was backed up in a couple of places. So I wouldn’t have lost any data if I had rebuilt the server…just several hours of my life that I didn’t really have to spare. So I did what any of you would have done in my place – I started searching the Web.

The first advice I found suggested that I should completely remove the bad drive from the system, and connect the good drive as drive “0.” Tried it, no change. The next advice I found suggested that I boot my system from the Linux CD or DVD, and try the “Linux rescue” function. That sounded like a good idea, so I tried it – but when the rescue utility examined my disk, it claimed that there were no Linux partitions present, despite evidence to the contrary: I could run fdisk -l and see that there were two Linux partitions on the disk, one of which was marked as a boot partition, but the rescue utility still couldn’t detect them, and the system still wouldn’t boot.

I finally stumbled across a reference to something called “SuperGRUB.” “GRUB,” for those of you who know as much about Linux as I did before this happened to me, is the “GNU GRand Unified Bootloader,” from the GNU Project. It’s apparently the bootloader that CentOS uses, and it was apparently missing from the disk I was trying to boot from. But that’s precisely the problem that SuperGRUB was designed to fix!

And fix it it did! I downloaded the SuperGRUB ISO, burned it to a CD, booted my Linux server from it, navigated through a quite intuitive menu structure, told it what partition I wanted to fix, and PRESTO! My disk was now bootable, and my Web server was back (albeit running on only one disk). But that can be fixed as well. I found a new 80 GB SATA drive (which was all the space I needed) on eBay for $25, installed it, cruised a couple of Linux forums to learn how to (1) use sfdisk to copy the partition structure of my existing disk to the new disk, and (2) use mdadm to add the new disk to my RAID1 array, and about 15 minutes later, my array was rebuilt and my Web server was healthy again.
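I didn’t write down the exact commands at the time, so treat the following as a sketch of the general shape of those two steps rather than a transcript. The small Python helper below only builds and prints the command lines; it does not run anything. The device names (/dev/sda as the surviving disk, /dev/sdb as the new one) and the two-array layout (md0 and md1) are assumptions for illustration, so check your own layout before running anything as root. The sfdisk step overwrites the partition table on the target disk.

```python
def raid1_rebuild_plan(good_disk, new_disk, arrays):
    """Build (but do not run) the commands to re-mirror onto a new disk.

    `arrays` maps each md device to the partition number that belongs to it,
    for example {"/dev/md0": 1, "/dev/md1": 2}. All names are hypothetical.
    """
    # Step 1: dump the partition table of the good disk and replay it
    # onto the new disk.
    plan = [f"sfdisk -d {good_disk} | sfdisk {new_disk}"]
    # Step 2: add each new partition to its array; the kernel then resyncs.
    for md_device, part in sorted(arrays.items()):
        plan.append(f"mdadm --manage {md_device} --add {new_disk}{part}")
    # Step 3: watch the rebuild progress.
    plan.append("cat /proc/mdstat")
    return plan


for command in raid1_rebuild_plan("/dev/sda", "/dev/sdb",
                                  {"/dev/md0": 1, "/dev/md1": 2}):
    print(command)
```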

There are two takeaways from this story:

First, the Internet is a wonderful thing, with amazing resources that can help even a neophyte like me to find enough information to pull my ample backside out of the fire and get my system running again.

Second, all those folks out there whom we sometimes make fun of and accuse of not having a life are actually producing some amazing stuff. I don’t know the guys behind the SuperGRUB project. They may or may not be stereotypical geeks. I don’t know how many late hours were burned, nor how many Twinkies or Diet Cokes were consumed (if any) in the production of the SuperGRUB utility. I do know that it was magical, and saved me many hours of work, and for that, I am grateful. (I’d even ship them a case of Twinkies if I knew who to send it to.) If you ever find yourself in a similar situation, it may save your, um, bacon as well.

Most companies instinctively know that they need to be prepared for an event that will compromise business operations, but it’s often difficult to know where to begin.  We hear a lot of acronyms: “BC” (Business Continuity), “DR” (Disaster Recovery), “BIA” (Business Impact Analysis), “RA” (Risk Assessment), but not a lot of guidance on exactly what those things are, or how to figure out what is right for any particular business.

Many companies we meet with today are not really sure what components to implement or what to prioritize.  So what is the default reaction?  “Back up my Servers!  Just get the stuff off-site and I will be OK.”   Unfortunately, this can leave you with a false sense of security.  So let’s stop and take a moment to understand these acronyms that are tossed out at us.

BIA (Business Impact Analysis)
BIA is the process through which a business comes to understand, from a financial perspective, what it needs to recover, and how, once a disruptive business event occurs. This is one of the more critical steps, and it should be done early because it directly shapes BC and DR. If you’re not sure how to get started, get out a blank sheet of paper, and start listing everything you can think of that could possibly disrupt your business. Once you have your list, rank each item on a scale of 1 – 3 on how likely it is to happen, and how severely it would impact your business if it did. This will give you some idea of what you need to worry about first (the items that were ranked #1 in both categories). Congratulations! You just performed a Risk Assessment!
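If a spreadsheet or a few lines of code suit you better than a sheet of paper, the same exercise can be sketched in Python. The risks and rankings below are hypothetical examples, and adding the two rankings together, with severity as the tie-breaker, is just one assumed way to order the list so that items ranked #1 in both categories come out on top.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 = most likely, 3 = least likely
    severity: int    # 1 = most severe, 3 = least severe


def prioritize(risks):
    """Sort risks so that items ranked #1 in both categories come first."""
    for risk in risks:
        if risk.likelihood not in (1, 2, 3) or risk.severity not in (1, 2, 3):
            raise ValueError(f"rankings must be 1, 2, or 3: {risk}")
    # Lower combined rank = worry about it sooner; severity breaks ties.
    return sorted(risks, key=lambda r: (r.likelihood + r.severity,
                                        r.severity, r.name))


# Hypothetical list for a small office.
risks = [
    Risk("Earthquake", likelihood=3, severity=1),
    Risk("Server disk failure", likelihood=1, severity=1),
    Risk("Extended power outage", likelihood=1, severity=2),
    Risk("Staff out with flu", likelihood=2, severity=3),
]

for risk in prioritize(risks):
    print(f"{risk.likelihood + risk.severity}  {risk.name}")
```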

Now, before we go much farther, you need to think about two more acronyms: “RTO” and “RPO.” RTO is the “Recovery Time Objective.” If one of those disruptive events occurs, how much time can pass before you have to be up and running again? An hour? A half day? A couple of days? It depends on your business, doesn’t it? I can’t tell you what’s right for you – only you can decide. RPO is the “Recovery Point Objective.” Once you’re back up, how much data is it OK to have lost in the recovery process? If you have to roll back to last night’s backup, is that OK? How about last Friday’s backup? Of course, if you’re Bank of America and you’re processing millions of dollars worth of credit card transactions, the answer to both RTO and RPO is “zero!” You can’t afford to be down at all, nor can you afford to lose any data in the recovery process. But, once again, most of our businesses don’t need quite that level of protection. Just be aware that the closer to zero you need those numbers to be, the more complex and expensive the solution is going to be!
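To make those two numbers concrete, here is a small Python sketch that checks a backup routine against an RTO and an RPO. It assumes the simplest possible model: the worst-case data loss equals the time between backups (the failure hits just before the next backup runs), and recovery time equals however long a restore takes. The schedule and objectives in the example are invented.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rto, rpo):
    """Check a backup routine against RTO and RPO under a simple model."""
    # Worst case: the failure happens just before the next backup would run,
    # so everything since the previous backup is lost.
    worst_case_data_loss = backup_interval
    return {
        "rto_met": restore_time <= rto,
        "rpo_met": worst_case_data_loss <= rpo,
    }


# Hypothetical shop: nightly backups, a 6-hour restore,
# and a stated need to be back up within 4 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=6),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
)
print(result)
```

In this made-up case the nightly backup satisfies the RPO but the restore is too slow for the RTO, which is exactly the kind of gap this exercise is meant to expose.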

BC (Business Continuity)
Business Continuity planning is the process through which a business develops a specific plan to assure survivability in the event of a disruptive business event: fire, earthquake, terrorist events, etc.  Ideally, that plan should encompass everything on the list you created – but if that’s too daunting, start with a plan that addresses the top-ranked items. Then revise the plan as time and resources allow to include items that were, say, ranked #1 in one category and #2 in the other, and so forth. Your plan should detail specifically how you are going to meet the RTO and RPO you decided on earlier.

And don’t forget the human factor. You can put together a great plan for how you’re going to replicate data off to another site where you can have critical systems up and running within a couple of hours of your primary facility turning into a smoking hole in the ground. But where are your employees going to report for work? Where will key management team members convene to deal with the crisis and its aftermath? How are they going to get there if transportation systems are disrupted, and how will they communicate if telephone lines are jammed?

DR (Disaster Recovery)
Disaster recovery is the process or action a business takes to bring the business back to a basic functioning entity after a disruptive business event. Note that BC and DR are complementary: BC addresses how you’re going to continue to operate in the face of a disruptive event; DR addresses how you get back to normal operation again.

Most small businesses think of disasters as events that are not likely to affect them.  Their concept of “disaster” is that of a rare act of God or a terrorist attack.  But in reality, there are many other things that would qualify as a “disruptive business event:” fire, long term power loss, network security breach, swine flu pandemic, and in the case of one of my clients, a fire in the power vault of a building that crippled the building for three days.  It is imperative to not overlook some of the simpler events that can stop us from conducting our business.

Finally, it is important to actually budget some money for these activities. Don’t try to justify this with a classic Return on Investment calculation, because you can’t. Something bad may never happen to your business…or it could happen tomorrow. If it never happens, then the only return you’ll get on your investment is peace of mind (or regulatory compliance, if you’re in a business that is required to have these plans in place). Instead, think of the expense the way you think of an insurance premium, because, just like an insurance premium, it’s money you’re paying to protect against a possible future loss.

These days, it seems everybody is talking about “cloud computing,” even if they don’t completely understand what it is. If you’re among those who are wondering what the “cloud” is all about and what it can do for you, maybe you should investigate moving your email to the cloud. You’ll find that there are several hosted Exchange providers (including ourselves) who would be very happy to help you do it.

Why switch to hosted Exchange?  Well,  it is fair to say that for most SMBs, email has become a predominant tool in our arsenal of communications.  The need for fast, efficient, and cost effective collaboration, as well as integration with our corporate environment and mobile devices, has become the baseline of operations – an absolute requirement for our workplace today.

So why not just get an Exchange Server or Small Business Server?  You can, but managing that environment may not be the best use of your resources.  Here are a few things to consider:

Low and Predictable Costs:
Hosted Exchange has become a low cost enterprise service without the enterprise price tag. If you own the server and have it deployed on your own premises, it now becomes your responsibility to prepare for a disruptive business event: fire, earthquake, flood, and in the Puget Sound Area, a dusting of snow. And it isn’t just an event in your own office space that you have to worry about:

  • A few years ago, there was a fire in a cable vault in downtown Seattle that caused some nearby businesses to lose connectivity for as long as four days.
  • Last year, wildfires in Eastern Washington interrupted power to the facility of one of our customers, and the recovery from the event was delayed because their employees were not allowed to cross the fire line to get to the facility.
  • If you are in a building that’s shared with other tenants, a fire or police action in a part of the building that’s unrelated to your own office space could still block access to the building and prevent your employees from getting to work.
  • Finally, even though it may be a cliche, you’re still at the mercy of a backhoe-in-the-parking-lot event.

The sheer cost of trying to protect yourself against all of these possibilities can be daunting, and many businesses would rather spend their cash on things that generate revenue instead.

Depending on features and needs, hosted Exchange plans can be as low as $5 per month per user – although to get the features most users want, you’re probably looking at $10 or so – and if you choose your hosting provider carefully, you’ll find that they have already made the required investments for high availability. Plus you’ll always have the latest version available to you without having to pay for hardware or software upgrades.

Simplified Administration:
For many small businesses, part of the turn-off of going to SBS or a full blown Exchange server is the technical competency and cost associated with managing and maintaining the environment.  While there are some advantages to having your own deployed environment, most customers I talk to today would rather not have to deal with the extra costs of administering backups and managing server licensing (and periodic upgrade costs), hardware refresh, security, etc.  With a good hosted Exchange provider, you will enjoy all the benefits of an enterprise environment, with a simple management console.

Up Time:
Quality hosted Exchange providers will provide an SLA (“Service Level Agreement”) and up time guarantees – and they have the manpower and infrastructure in place to assure up time for their hundreds and thousands of users.

For deployed Exchange, you’ll need to invest in a robust server environment, power protection (e.g., an Uninterruptible Power Supply, or UPS, that can keep your server running long enough for a graceful shutdown – and maybe even a generator if you can’t afford to wait until your local utility restores power), data backup and recovery hardware and software, and the time required to test your backups.  (Important side note here: If you never do a test restore, you only think you have your data backed up. Far too often, the first time users find out that they have a problem is when they have a data loss and find that they are unable to successfully restore from their backup.) The cost/benefit ratio for a small business is simply not in favor of deployed.
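A test restore doesn’t have to be elaborate. One simple approach, sketched below in Python, is to restore a folder to a scratch location and compare checksums against the originals. The folder layout and file names in the demo are made up, and keep in mind that files which legitimately changed after the backup ran will also show up as mismatches, so this works best on data that was not modified in between.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(rel.as_posix())
    return problems


# Demo with throwaway folders: one file restores cleanly, one is corrupted,
# and one is missing from the restore.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name, data in {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie"}.items():
        (Path(src) / name).write_bytes(data)
    (Path(dst) / "a.txt").write_bytes(b"alpha")
    (Path(dst) / "b.txt").write_bytes(b"brav0")  # simulated corruption
    print(verify_restore(src, dst))
```

An empty list back from verify_restore is what you want to see; anything else is a reason to look at your backup job before you actually need it.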

Simple Deployment:
Properly setting up and configuring an Exchange environment and not leaving any security holes can be a daunting task for the non-IT Professional.  Most SMBs will need to hire someone like us to set up and manage the environment, and, although we love it when you hire us, and although the total cost of hiring us may be less than it would cost you to try to do it yourself (especially if something goes wrong), it is still a cost.

With a hosted environment, there is no complicated hardware and software setup.  In some cases, hosting providers have created a tool that you execute locally on your PC that will even configure the Outlook client for you.

A few questions to ask yourself:

  • Do we have the staff and technical competency to deploy and maintain our own Exchange environment?
  • What is the opportunity cost/gain by deploying our own?
  • What are the costs of upgrades/migration in a normal life-cycle refresh?
  • Is there a specific business driver that requires us to deploy?
  • What are the additional costs we will incur?  (Security, archiving, competency, patch management, encryption, licensing, etc.)

This is not to say that some businesses won’t benefit from a deployed environment, but for many – and perhaps most – businesses, hosted Exchange will provide a strong, reliable service that lets you communicate effectively while having the peace of mind that your stuff is secure and available from any location where you have Internet access. Even if the ultimate bad thing happens and your office is reduced to a smoking crater, your people can still get to their email if they have Internet access at home or at the coffee shop down the street. If you’re as dependent on email as most of us are, there’s a definite value in that.

Moose Logic has been building and supporting networks for a long time. And during most of that time we’ve had a real love-hate relationship with most of the backup technologies we’ve implemented and/or recommended.

Tape backups – although they are arguably the best technology for long-term archival storage – are a pain to manage. Tapes wear out. Tape drives get dirty. People just don’t do test restores as often as they should. As a result, all too often, the first time you realize that you’ve got a problem with your backups is when you have a data loss, try to restore from your backups, and find out that they’re no good.

Add to that the astronomical growth in storage capacity, meaning that all the data you need to back up often won’t fit on one tape any more. So, unless you have someone working the night shift who can swap out the tape when it gets full, you’re faced with…

  • Buying multiple tape drives, which typically means you’re going to spend more on your backup software. And if your servers are virtualized, where are you going to install those tape drives?
  • Buying a tape library (a.k.a. autoloader), which can also get expensive.
  • Changing the tape when you come in the next morning, which means that your network performance suffers because you’re trying to finish the backup job(s) while people are trying to get work done.
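
If you want to see where you stand, the arithmetic is simple enough to sketch. The example below estimates how many tapes a full backup needs and whether the job can finish unattended overnight. Every number in it (data size, tape capacity, throughput, backup window) is a hypothetical placeholder, and the function names are mine; plug in the figures for your own drive and data.

```python
import math


def tapes_needed(data_gb, tape_capacity_gb):
    """Number of tapes a full backup needs, rounded up to whole tapes."""
    if data_gb < 0 or tape_capacity_gb <= 0:
        raise ValueError("data size must be >= 0 and tape capacity must be > 0")
    return math.ceil(data_gb / tape_capacity_gb)


def backup_hours(data_gb, throughput_mb_per_s):
    """Hours needed to stream the data at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be > 0")
    seconds = data_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


def fits_overnight(data_gb, tape_capacity_gb, throughput_mb_per_s, window_hours):
    """True only if the job needs a single tape and finishes inside the window."""
    one_tape = tapes_needed(data_gb, tape_capacity_gb) <= 1
    in_window = backup_hours(data_gb, throughput_mb_per_s) <= window_hours
    return one_tape and in_window


if __name__ == "__main__":
    # Hypothetical environment: 1,200 GB of data, 800 GB tapes,
    # 80 MB/s sustained throughput, 10-hour overnight window.
    print("Tapes needed:", tapes_needed(1200, 800))
    print("Unattended overnight job possible:", fits_overnight(1200, 800, 80, 10))
```

If the last check comes back False because of the tape count, you are looking at one of the three choices above.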

Then there’s the issue of getting a copy of your data out of the building. Typically, that’s done by having multiple sets of tapes, and a designated employee who takes one set home every Friday and brings the other set in. If s/he remembers. Or isn’t sick or on vacation.

Backing up to external hard drives is a reasonable alternative for some. It solves the capacity issue in most cases. But over the years, we’ve seen reliability issues with some manufacturers’ units. We’ve uncovered nagging little issues like some units that don’t automatically come back on line after a power interruption. And they’re not necessarily the best for long-term archival storage, unless you keep them powered on – or at least power them on once in a while – because hard disks that just sit for long periods of time may develop issues with the lubrication in their bearings and not want to spin back up.

But we’ve finally found an approach that we really, really like. One that, as one of our engineers said in an internal email thread, we actually enjoy managing. In fact, we like it so much we built a backup appliance around it. It’s Microsoft’s System Center Data Protection Manager (SCDPM).

In this installment of the Moose Logic Video Series, our own Scott Gorcester gives you a quick overview of SCDPM 2010:



For more detail on how it works, check out the description of our MooseSentry™ backup appliance.

A few days ago, in the post entitled “Seven things you need to do to keep your data safe,” we were talking primarily about some simple things that individuals can do to protect their data, even if (or especially if) they’re not IT professionals. In this post, we’re talking to you, Mr. Small Business Owner.

You might think that it’s intuitively obvious why you would need good backups, but according to an HP White Paper I recently discovered (which you should definitely download and read), as many as 40% of Small and Medium Sized Businesses don’t back up their data at all.

The White Paper is entitled Impact on U.S. Small Business of Natural and Man-Made Disasters. What kinds of disasters are we talking about? The White Paper cites statistics from a presentation to the 2007 National Hurricane Conference in New Orleans by Robert P. Hartwig of the Insurance Information Institute. According to Hartwig, over the 20-year period of 1986 through 2005, catastrophic losses broke down like this:

  • Hurricanes and tropical storms – 47.5%
  • Tornado losses – 24.5%
  • Winter storms – 7.8%
  • Terrorism – 7.7%
  • Earthquakes and other geologic events – 6.7%
  • Wind/hail/flood – 2.8%
  • Fire – 2.3%
  • Civil disorders, water damage, and utility services disruption – less than 1%

If you’re in Moose Logic’s back yard here in the great State of Washington, you probably went down that list and told yourself, with a sigh of relief, that you didn’t have to worry about almost three-quarters of the disasters, because we typically don’t have to deal with hurricanes and tornadoes. But you might be surprised, as I was, to learn that we are nevertheless in the top twenty states in terms of the number of major disasters, with 40 disasters declared in the period of 1955 – 2007. We’re tied with West Virginia for 15th place.

Sometimes, disasters come at you from completely unexpected directions. Witness the “Great Chicago Flood” of 1992. Quoting from the White Paper:

In 1899 the city of Chicago started work on a series of interconnecting tunnels located approximately forty feet beneath street level. This series of tunnels ran below the Chicago River and underneath the Chicago business district, known as The Loop. The tunnels housed a series of railroad tracks that were used to haul coal and to remove ashes from the many office buildings in the downtown area. The underground system fell into disuse in the 1940’s and was officially abandoned in 1959 and the tunnels were largely forgotten until April 13th, 1992.

Rehabilitation work on the Kinzie Street bridge crossing the Chicago River required new pilings and a work crew apparently drove one of those pilings through the roof of one of those long abandoned tunnels. The water flooded the basements of Loop office buildings and retail stores and an underground shopping district. More than 250 million gallons of water quickly began flooding the basements and electrical controls of over 300 buildings throughout the downtown area. At its height, some buildings had 40 feet of water in their lower levels. Recovery efforts lasted for over four weeks and, according to the City of Chicago cost businesses and residents, an estimated $1.95 billion. Some buildings remained closed for weeks. In those buildings were hundreds of small and medium businesses suddenly cut off from their data and records and all that it took to conduct business. The underground flood of Chicago proved to be one of the worst business disasters ever.

Or how about the disaster that hit Tessco Technologies, outside of Baltimore, in October of 2002? A faulty fire hydrant outside its Hunt Valley data center failed, and “several hundred thousand gallons of water blasted through a concrete wall leaving the company’s primary data center under several feet of water and left some 1400 hard drives and 400 SAN disks soaking wet and caked with mud and debris.”

How could you have possibly seen those coming?

And as if these disasters weren’t bad enough, other studies show that as much as 50% of data loss is caused by user error – and we all have users!

One problem, of course, as we’ve observed before, is that it’s difficult to build an ROI justification around the bad thing that didn’t happen. Unforeseen disasters are, well, unforeseen. There’s no guarantee that the big investment you make in backup and disaster recovery planning is going to give you any return in the next 12 – 24 months. It’s only going to pay off if, God forbid, you actually have a disaster to recover from. So it’s no surprise that, when a business owner is faced with the choice between making that investment and making some other kind of business investment that will have a higher likelihood of a short-term payback (or perhaps taking that dream vacation that the spouse has been bugging you about for the last five years), the backup / disaster recovery expenditure drops, once again, to the bottom of the priority list.

One solution is to shift your perspective, and view the expense as insurance. Heck, if it helps you can even take out a lease to cover the cost – then you can pretend the lease payment is an insurance premium! You wouldn’t run your business without business liability insurance – because without it you could literally lose everything. You shouldn’t run your business without a solid backup and disaster-recovery plan, either, and for precisely the same reason.

Please. Download the HP White Paper, read it, then work through the following exercise:

  • List all of the things that you can imagine that would possibly have an impact on your business. I mean everything – from the obvious things like flood, fire, and earthquake, to less obvious things, like a police action that restricts access to the building your office is in, or the pandemic that everyone keeps telling us is just around the corner.
  • For each item on your list, make your best judgment call, on a scale of 1 to 3 (where 1 is the highest and 3 is the lowest), of
    • How likely it is to happen, and
    • How severely it would affect your business if it did happen.

You now have the beginnings of a priority list. The items that you rated “3” in both columns (meaning not likely to happen, and not likely to have a severe effect on your business even if they did) you can push to the bottom of the priority list. The items that you rated “1” in both columns need to be addressed yesterday. The others fall somewhere in between, and you’re going to have to use your best judgment in how to prioritize them – but at least you now have some rationale behind your decisions.
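
If your list gets long, a few lines of code can do the first pass at sorting it. This is only a sketch of one way to do it. Adding the two ratings together and breaking ties on severity is my own assumption, not something the White Paper prescribes, and the sample risks and their ratings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 = most likely, 3 = least likely
    severity: int    # 1 = most severe, 3 = least severe

    def __post_init__(self):
        for value in (self.likelihood, self.severity):
            if value not in (1, 2, 3):
                raise ValueError("ratings must be 1, 2, or 3")


def prioritize(risks):
    """Sort most urgent first: lowest combined rating, ties broken by severity."""
    return sorted(risks, key=lambda r: (r.likelihood + r.severity, r.severity))


if __name__ == "__main__":
    # Hypothetical ratings for illustration only; yours will differ.
    example = [
        Risk("Pandemic", likelihood=3, severity=2),
        Risk("Earthquake", likelihood=2, severity=1),
        Risk("User deletes files", likelihood=1, severity=2),
        Risk("Police action blocks building access", likelihood=3, severity=3),
    ]
    for risk in prioritize(example):
        print(f"{risk.likelihood + risk.severity}  {risk.name}")
```

Treat the sorted output as a starting point for discussion. The items in the middle still need your judgment.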

The one thing you can’t afford to do is to keep putting it off. Hope is not a strategy, nor is it a DR plan.

Jeremy Moskowitz recently posted a great article entitled Backup Tips for the 21st Century: Backup procedures so easy, your Mom could (and should) do it. This is not directed at IT managers or anyone else who has to manage a business network, although there are certainly some common themes, which we’ll talk about a bit later. Rather, the article is targeted at the average home user – you know, those people who are always asking you to help them with some kind of computer problem, because you “know about computers.”

I’d strongly recommend that you click over and read his entire article, and share it with as many people as possible, because he goes into detail on why you should be doing each of these things. But just to give you a little taste of it, here are the seven things:

  1. Get an online backup service (e.g., Carbonite.com, Mozy.com, etc.)
  2. Get a full-disk backup program
  3. Back up to an external USB drive (in fact, get two or three – they’re cheap)
  4. Don’t keep all your backups in your house
  5. Rotate between at least two, possibly three USB drives
  6. Keep copies of your original disks, downloadables, keycodes, and drivers
  7. Test your restore procedure

Although he feels strongly that you should do all seven in order to be absolutely safe, he also points out that just doing one of them will make you better off than most people – who don’t do anything at all! (And if you only do one, he suggests #3.)

Why should people do these things? Because, in Jeremy’s words, “DISK DRIVES ALWAYS FAIL. ALWAYS. It’s a guarantee. Even the newest ones with no moving parts. They all fail. Eventually.” And he’s right. The only question is when. I’ve seen drives fail within days of being installed (not many, but some), and drives last for years. But eventually, they will wear out. When they do, the data on them is toast, so you’d better either have a backup or have deep pockets to pay someone who specializes in forensic data recovery, and who may or may not be able to recover your most precious data from the dead drive no matter how much you’re willing to pay.

So, how does this translate to sound business practice? Allow me to paraphrase his seven points, and combine a couple of them:

  • Make sure you’re getting a copy of your data out of the building. Use an on-line service, stream data to a repository at a branch office, or just take a copy home every Friday. But do something to get a copy out of the building.
  • Your backup strategy should encompass both machine images and file- and folder-based backups. If you lose an entire system, it’s a lot faster to restore from an image than to reinstall the OS from scratch and then restore the data files. On the other hand, if all you need is a single file, or a single email message or mailbox, you don’t want to have to restore an entire image just to get that one thing you need.
  • What he said about disks failing goes double (at least) for tapes. Tapes are far less reliable than hard disks. Their capacity is limited. They wear out quickly. The drives get dirty and are subject to a variety of mechanical problems. Unless you’ve either got an expensive autoloader or a night operator to swap tapes in the middle of the night, if your tape fills up you either cancel the job when you come in the next morning, or you finish the backup during working hours and live with the performance hit of doing that while users are trying to work. That’s why we believe so strongly in disk-to-disk backups.
  • Keep copies of your original disks, downloadables, keycodes, and drivers. (Not much I can add to that point.)
  • Test your restore procedure. (Not much I can add to that either.) If you don’t ever do a test restore, you only think you’re getting good backups. And if you’re not, you won’t know about it until you have a catastrophic failure and find out that your data is gone forever.
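
A test restore is only useful if you actually check what came back. As a minimal sketch, assuming you have restored a folder to a scratch location and still have the original to compare against, the script below walks the original and reports any file that is missing from the restored copy or whose contents differ. The function names are my own, and this is one simple way to do the comparison, not a feature of any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return a list of problems: files missing from, or different in, the restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"differs: {rel}")
    return problems
```

Keep in mind that files edited since the backup ran will legitimately show up as different, so this works best on a folder that doesn’t change much, or right after the backup job finishes.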

That’s all for today – you go read Jeremy’s post in full, I’m going to swing by the local office superstore and pick up a couple more USB hard drives…

I am a big fan of virtualization. My feeling is that many – if not most – workloads in small to medium sized enterprises should be running as Virtual Machines on Virtual Servers. BUT please be very, very careful how you build those systems!

The truth about virtualization is that it is a platform with which you can provide a highly flexible computing environment. This includes a ton of wonderful features and benefits. But, before you go trip over your pants leg, here is a tip: highly available virtualization environments do sometimes fail! (Sooner or later, everything does.) So my recommendation is to be very careful in designing and protecting your HA solutions – provide two or more of everything. Virtualization technologies will save you boatloads of money if you build them right, so don’t scrimp on the details!

So here are my simple rules:

  1. Provide two or more Virtualization Hosts. Make sure they’re sized such that if one should fail, you have the capacity on the surviving host(s) to restart any critical workloads that are affected by the failure.
  2. Shared storage (e.g., a SAN) is a necessity for live migration (VMware calls it vMotion, Citrix calls it XenMotion, and Microsoft calls it Live Migration), which allows you to move running virtual machines from one host to another, either to balance the workload or to unload a host so you can perform maintenance on it. It’s also what enables you to restart critical workloads on a surviving host if one should fail. But to keep the SAN itself from becoming a single point of failure, you should provide at least two SAN nodes that are configured to replicate your data.
  3. Back up your data and your VMs using tools that allow both image and folder-based backups. When recovering from a catastrophic failure, restoring a server image is often the fastest way to get things running again – but you don’t want to go to the trouble of restoring a complete server image if all you need are a couple of files. So a schedule that encompasses both kinds of backups is best.
  4. Make certain that you get data and server images offsite religiously. Rule #1 for Disaster Recovery / Business Continuance is to get the data out of the building.
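
The sizing test in rule #1 can be sketched in a few lines. This is a deliberately simplified model with assumptions of my own: it looks only at RAM, it adds up the spare memory across all surviving hosts instead of checking whether each VM fits on one particular host, and it assumes non-critical VMs on the survivors keep running. The class and function names are hypothetical, not part of any hypervisor’s tooling.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    ram_gb: int
    critical: bool = False


@dataclass
class Host:
    name: str
    ram_gb: int
    vms: list = field(default_factory=list)

    def free_ram_gb(self):
        """RAM not already committed to the VMs running on this host."""
        return self.ram_gb - sum(vm.ram_gb for vm in self.vms)


def hosts_at_risk(hosts):
    """Return names of hosts whose failure would strand critical VMs."""
    at_risk = []
    for failed in hosts:
        # RAM the failed host's critical VMs need somewhere else.
        needed = sum(vm.ram_gb for vm in failed.vms if vm.critical)
        # Aggregate spare RAM on every other host (simplifying assumption).
        spare = sum(h.free_ram_gb() for h in hosts if h is not failed)
        if needed > spare:
            at_risk.append(failed.name)
    return at_risk
```

An empty list means that, at least by this rough measure, any single host can fail and the rest can pick up its critical workloads. Anything else is a sign you need bigger hosts, more of them, or fewer workloads marked critical.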

These simple rules allow for a significant amount of reliability and flexibility. Even with inexpensive hardware and software (there are a number of excellent software products that are free to use), your systems can continue to run or be easily restarted within minutes of a hardware failure. In many cases even the total loss of two servers (one virtualization host and one SAN node, for example) would be a minor event in terms of its impact on operations. If you are religious about taking your data and image backups offsite, your entire system could be up and running within a day even if you were not able to get to your main location for some reason.

Since a virtualized infrastructure is so resilient, you can afford to use computer systems that are not necessarily top-of-the-line, but you can’t afford not to build it right. A long-time customer (you know who you are) once told us, “The worst thing I could do would be to spend $25,000 on my new systems when I should have spent $30,000.” The dollar amounts aren’t the important thing here – it’s the concept that when you cut corners on something, the chances are high that sooner or later it will come back and bite you. You’ll never be sorry if you take the time and effort to make sure you do it right.

Copyright © 2010 All rights reserved.