As a DBA, you're in charge of keeping the systems healthy, and getting them back up and running quickly in the event of an emergency. This is perfectly right and proper, right up until you start defining a service level agreement.
A Service Level Agreement (SLA) defines the level of service you agree to provide; for a DBA, that usually means how quickly you'll get the system back up after downtime. An SLA is usually expressed in terms of time. So, if you have a two-hour SLA, that means you agree that when there's a grave issue, you'll have the system back up within two hours.
But how did you get that two-hour SLA in the first place? Usually, it goes like this:
- The customer explains that they must have the database up within two hours of a problem.
- You don't see any problems with that.
- Maybe there's even a written agreement that you sign, knowing full well that, with your database experience, you can easily handle a downed database in that time.
Like so many things, this sounds perfectly reasonable on the surface, but doesn't hold up once it comes in contact with reality.
SLAs in the Real World
SLAs are often poorly thought out, and very rarely tested. As a matter of fact, most companies don't even have SLAs; they have SLOs. An SLO is a Service Level Objective: something you're striving for, but don't know for sure whether you can achieve. SLOs allow you to have some kind of metric, without bothering to test in advance whether the objective is even possible.
This lack of testing is the primary barrier to achievable SLAs. Lots of factors can impact your ability to get a system up and available after an issue. Let's take a close look at just two of those factors: hardware failures and data corruption.
When a hardware failure causes an outage, it can take a long time to get replacement parts. Companies sometimes work around this by keeping replacement components on hand for important systems, but only sometimes. If your company has no replacements lying around, you are completely at the mercy of the hardware vendor. This can, and eventually will, demolish your SLA.
Now, maybe your company has an SLA with the parts vendor, and maybe it doesn't. Typically, that vendor SLA will be something like a four-hour replacement window, but that's just when they agree to show up! The vendor can't promise that the parts will be installed, configured, and running in that time.
So your two-hour SLA won't survive the four hours it takes just to get the hardware, plus the two (or more) hours getting everything set up. Add the time to diagnose the issue in the first place, and possibly to reinstall SQL Server, restore the database, troubleshoot additional issues, or all of the above. All of this puts you at a minimum of three times the SLA you agreed to support.
- Consider how long it takes to get replacement parts.
- If your customer is important enough, keep replacement parts in-house.
- Account for extra time for troubleshooting, installation, configuration, restoring, more troubleshooting, and anything else that can come up.
Database corruption is another outage scenario. Depending on the level of corruption, it may take anywhere from a few minutes to a few hours to diagnose and fix it.
For now, we'll assume that it's just a table that's gotten corrupted. Now, is it the data, or just an index? If it's an index, depending on the size of the table, it could be a quick fix, or it could take a couple of hours (a quick triage sketch follows the list below). However, if it's the table itself, then you may have to restore all or part of a database to bring it back. That brings another host of issues into the fray, like:
- Do you have enough space to restore the database somewhere?
- How long will it take?
- Is the data even onsite?
- How will you get the data back into the production table?
- Do you bring the system down while you're fixing it?
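To make the index-vs-data distinction concrete, here's a minimal triage sketch in T-SQL. The database, table, and index names are placeholders; the point is that CHECKDB's error output tells you whether the damage is confined to a nonclustered index, which you can often rebuild without touching a backup at all.

```sql
-- Run integrity checks and show only the errors. In the output,
-- index ID 0 is a heap, 1 is the clustered index (the data itself),
-- and anything above 1 is a nonclustered index.
DBCC CHECKDB (N'DB1') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- If every error points at a nonclustered index, disabling and then
-- rebuilding it forces SQL Server to rebuild from the base table,
-- usually far faster than a restore. (Index and table names here
-- are hypothetical.)
ALTER INDEX IX_Orders_Customer ON dbo.Orders DISABLE;
ALTER INDEX IX_Orders_Customer ON dbo.Orders REBUILD;
```

If the errors land on index ID 0 or 1, though, the data itself is damaged, and you're into the restore questions above.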
Of course, it could also be the entire database that's down, in which case you will need a restore (assuming the corruption wasn't present in the last backup). A few of the things you must consider:
- Do you know how long that restore will take?
- Have you done what is necessary to make sure you can restore quickly by tuning your backups, making sure the log is small, turning on IFI (Instant File Initialization), etc.?
Without some foresight, you could easily spend that two-hour SLA window just zeroing out a 90 GB log file; the 1.5 hours of actual data restore that follows will put you well outside your agreement.
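You can check whether IFI is enabled with a quick query, on builds that expose the column (SQL Server 2016 SP1, 2014 SP2, 2012 SP4, and later). Keep in mind that IFI only skips zeroing for data files; log files are always zero-initialized, which is exactly why keeping the log small matters for restore time.

```sql
-- 'Y' means data files can be created and grown without zeroing,
-- so a restore can start writing pages almost immediately.
SELECT servicename,
       instant_file_initialization_enabled
FROM   sys.dm_server_services
WHERE  servicename LIKE N'SQL Server (%';
```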
- Make sure you have space to restore your largest database, somewhere off the production server.
- Implement a good CheckDB solution, plus alerting.
- Practice various recovery scenarios to see how long they take.
- Make sure your backups are in order.
- Practice database restores, and get your backups tuned. (Tuning your backups means you can tune your restores, too! A sketch follows this list.)
- Take these practice sessions into account when you make your SLAs.
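As a sketch of what "getting your backups tuned" can look like: striping the backup across multiple files and compressing it cuts the I/O in both directions, since a restore reads the same striped, compressed set back in parallel. The paths and the BUFFERCOUNT/MAXTRANSFERSIZE values below are illustrative, not recommendations; benchmark them on your own hardware.

```sql
-- Striped, compressed full backup. Two files on separate drives let
-- SQL Server write (and later read) in parallel; COMPRESSION shrinks
-- the amount of data that has to move.
BACKUP DATABASE DB1
TO DISK = N'G:\Backup\DB1_stripe1.bak',
   DISK = N'H:\Backup\DB1_stripe2.bak'
WITH COMPRESSION,
     BUFFERCOUNT = 16,
     MAXTRANSFERSIZE = 4194304,  -- 4 MB per transfer, the maximum
     STATS = 10;
```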
The conclusions above are a good start, but not at all the complete picture. If you're keeping a single SLA for any given server, you're doing yourself and your customers a huge disservice.
First, define separate SLAs for the different types of failure, and define what each specific failure looks like. For instance, if you define an SLA for database availability, define what 'available' means. Does it mean people can connect? Does it mean that major functions are online? Does it mean absolutely everything is online? Does it include a performance SLA? I've seen performance SLAs included in downtime procedures, because sometimes a database is so important that if the performance isn't there, it might as well be offline.
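One way to make 'available' concrete is a scripted probe that encodes your definition. The sketch below assumes a database named DB1, a placeholder stored procedure standing in for a major business function, and a two-second performance floor; all three are assumptions you'd replace with whatever your agreement actually says.

```sql
-- Hypothetical availability probe: "available" means the database is
-- ONLINE *and* a representative operation answers within two seconds.
DECLARE @start datetime2 = SYSDATETIME();

IF DATABASEPROPERTYEX(N'DB1', 'Status') <> N'ONLINE'
    RAISERROR(N'DB1 is not ONLINE.', 16, 1);
ELSE
BEGIN
    EXEC DB1.dbo.usp_SmokeTest;  -- placeholder for a key business query

    IF DATEDIFF(millisecond, @start, SYSDATETIME()) > 2000
        RAISERROR(N'DB1 is up, but below the agreed performance floor.', 16, 1);
END;
```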
Next, review SLAs regularly. So, you've reasonably determined that you can accommodate a four-hour SLA for the DB1 database. What happens as the database grows? Are you going to put in an allowance for the database tripling in size? Surely you can't be expected to hold the same SLA two years later that you did when the database was new.
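You don't have to guess at growth, either; backup history already records it. A query like this (assuming backup history is still retained in msdb) shows how DB1's full-backup size has trended, which is a concrete trigger for an SLA review.

```sql
-- Full-backup size over time. type = 'D' filters to full database
-- backups; backup_size is reported in bytes.
SELECT CONVERT(date, backup_start_date)                AS backup_date,
       CAST(backup_size / 1048576.0 AS decimal(12, 1)) AS size_mb
FROM   msdb.dbo.backupset
WHERE  database_name = N'DB1'
  AND  type = 'D'
ORDER  BY backup_start_date;
```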
Finally, test, test again, and then test one more time just to be sure. In fact, you should be testing your recovery procedures periodically so you can discover things that may go wrong or lengthen the process. If you promised a two-hour recovery window and you can't get your procedures under that time, then you've got some re-working to do, don't you? Don't just throw in the towel and say you can't do it, because contracts may already be signed, and you may have no choice but to see that it works. Maybe you're really close to being able to hit the SLA, and you just have to be creative (and maybe work for a company that's willing to spend the money).
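Testing doesn't have to be elaborate to be useful. A scheduled script along these lines (every name and path below is an assumption; check your logical file names with RESTORE FILELISTONLY) restores the latest backup to a scratch server and records how long it took, and that number is what your SLA lives or dies by.

```sql
-- Timed test restore onto a non-production server.
DECLARE @start datetime2 = SYSDATETIME();

RESTORE DATABASE DB1_RestoreTest
FROM DISK = N'\\backupshare\DB1\DB1_Full.bak'
WITH MOVE N'DB1'     TO N'E:\SQLData\DB1_RestoreTest.mdf',
     MOVE N'DB1_log' TO N'F:\SQLLog\DB1_RestoreTest.ldf',
     STATS = 10,  -- progress message every 10 percent
     REPLACE;

-- The duration you can actually promise in an SLA:
SELECT DATEDIFF(minute, @start, SYSDATETIME()) AS restore_minutes;
```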