Your Service Level Agreement is a Disaster

As a DBA, you’re in charge of getting systems up and running quickly in the event of an emergency. This is all right and proper, right up until you start defining SLAs. Let’s see what went wrong.

Photo by JOSHUA COLEMAN on Unsplash

As a DBA, you’re in charge of keeping the systems healthy, and getting them back up and running quickly in the event of an emergency. This is perfectly right and proper, right up until you start defining a service level agreement.

A Service Level Agreement (SLA) defines the level of service you agree to provide, to get the system back up after a downtime. An SLA is usually expressed in terms of time. So, if you have a two-hour SLA, that means you agree that when there’s a grave issue, you’ll have the system back up within two hours.

But how did you get that two-hour SLA in the first place? Usually, it goes like this:

  1. The customer explains that they must have the database up within two hours of a problem.
  2. You don’t see any problems with that.
  3. Maybe there’s even a written agreement that you sign, knowing full well your database experience can easily handle a downed database in that time.

Like so many things, this sounds perfectly reasonable on the surface, but doesn’t hold up once it comes in contact with reality.

SLAs in the Real World

SLAs are often poorly thought-out, and very rarely tested. As a matter of fact, most companies don’t even have SLAs; they have SLOs. An SLO is a Service Level Objective – something you’re striving for, but don’t know for sure whether you can achieve. SLOs allow you to have some kind of metric, without bothering to test in advance whether the objective is even possible.

This lack of testing is the primary barrier to achievable SLAs. Lots of factors can impact your ability to get a system up and available after an issue. Let’s take a close look at just two of those factors: hardware failures, and data corruption.

Hardware Failures

When a hardware failure causes an outage, it can take a long time to get replacement parts. Companies work around this – sometimes – by keeping replacement components on hand for important systems, but that’s only sometimes. If your company has no replacements lying around, you are completely at the mercy of the hardware vendor. This can – no, will – demolish your SLA.

Now, maybe your company has an SLA with the parts vendor, and maybe they don’t. Typically, that vendor SLA will be something like a four-hour replacement window…but that’s just when they agree to show up! The vendor can’t promise that the parts will be installed, configured, and running in that time.

So, your two hour SLA won’t work with the four hours it takes just to get the hardware, and the two (or more) hours getting everything set up. Oh yes, plus the time to diagnose the issue in the first place, possibly reinstall SQL, restore the database, troubleshoot additional issues, or all the above. All of this puts you at least three times the SLA you agreed to support.

Conclusions:

  • Consider how long it takes to get replacement parts.
  • If your customer is important enough, keep replacement parts in-house.
  • Account for extra time, for troubleshooting, installation, configuration, restoring, more troubleshooting, and anything else that can come up.

Data Corruption

Database corruption is another outage scenario. Depending on the level of corruption, it may take anywhere from a few minutes to a few hours to diagnose and fix it.

For now, we’ll assume that it’s just a table that’s gotten corrupted. Now, is it the data, or just an index? If it’s an index, depending on the size of the table, it could be a quick fix, or it could be a couple hours. However, if it’s a table, then you may have to restore all or part of a database to bring it back. That brings another host of issues into the fray, like:

  • Do you have enough space to restore the database somewhere?
  • How long will it take?
  • Is the data even onsite?
  • How will you get the data back into the production table?
  • Do you bring the system down while you’re fixing it?

Of course, it could also be the entire database that’s down, in which case you will need a restore (assuming the corruption wasn’t present in the last backup). A few of the things you must consider:

  • Do you know how long that restore will take?
  • Have you done what is necessary to make sure you can restore quickly by tuning your backups, making sure the log is small, turning on IFI (Instant File Initialization), etc.?

Without some foresight, you could easily spend that two hour SLA window zeroing out 90GB of log file. Your 1.5 hours of data restore will put you quite a bit outside of your agreement.

Conclusions:

  • Make sure you have space to restore your largest database, somewhere off the production server.
  • Implement a good CheckDB solution, plus alerting.
  • Practice various recovery scenarios to see how long they take.
  • Make sure your backups are in order.
  • Practice database restores, and get your backups tuned. (Tuning your backups means you can tune your restores, too!)
  • Take these practice sessions into account when you make your SLAs.

The Solution

The conclusions above are a good start, but not at all the complete picture. If you’re keeping a single SLA for any given server, you’re doing yourself and your customers a huge disservice.

First, define separate SLAs for the different types of failure, and define what each specific failure looks like. For instance, if you define an SLA for database availability, define what ‘available’ means. Does it mean people can connect?  Does it mean that major functions are online?  Does it mean absolutely everything is online?  Does it include a performance SLA?  I’ve seen performance SLAs be included in downtime procedures because sometimes a database is so important, if the performance isn’t there, then it might as well be offline.

Next, review SLAs regularly. So, you’ve reasonably determined that you can accommodate a four-hour SLA for the DB1 database. What about, as the database grows?  Are you going to put in an allowance for the database tripling in size?  Surely you can’t be expected to hold the same SLA two years later that you did when the database was new.

Finally, test, test again, and then test one more time just to be sure. In fact, you should be testing your recovery procedures periodically so you can discover things that may go wrong, or lengthen the process. if you promised two-hour downtime and you can’t get your recovery procedures under that time, then you’ve got some re-working to do, don’t you?  Don’t just throw in the towel and say you can’t do it, because contracts may already be signed and you may have no choice but to see that it works. Maybe you’re really close to being able to hit the SLA, and you just have to be creative (and maybe, to work for a company that’s willing to spend the money).

5 thoughts on “Your Service Level Agreement is a Disaster”

  1. Great post!

    One issue I have seen with places that test the restores, is that it is quite often always the same person performing the test in the exact same way.

    Every member of the team (assuming there’s more than 1 of you) should be practicing a variety of disaster scenarios. If you really want to make the test realistic, enlist the help of some managers to get them to keep stopping by and asking them if it’s up yet while they run the test.

  2. I worked in a shop where we ran off of a different location every month. It was a live firedrill every month. When we started doing the switch-over it took us several hours to get everything online. It took us about 5mos, but we finally got the process down to about 30mins from start to finish. It can take quite a bit of trial and error for all the teams to get their code worked out and to get all the special cases taken care of.

    This is something that a lot of companies don’t count on.

  3. Yeah, we do full company Business Continuity tests regularly where we fail all applications over to another data center and run everything only from the 1 data center for that day.

    It’s just another day in the office for the DBA team because we fail over our stuff any time we need to. It’s the apps that still need some stuff worked out.

  4. Yep, I usually find that the DBA team is well ahead of all the others in these situations. Perhaps it’s because we’ve got built-in HADR, and apps and Windows, not so much.

  5. Great post. I would add that you should spend a little time planning for absolutely worst case scenarios. I once worked at a place with a good DR plan and regular tests, offsite DR, everything, but which then suffered double site outages for Prod. All of a sudden we were looking at how to restore our tape backups to our non-Prod infrastructure.

    You might not be able to guarantee to hit your SLA, but you should be able to give a reasonable estimate for when each bit of your system will be available in such a worst case scenario.

Leave a Reply

Your email address will not be published. Required fields are marked *