Data folks talk a lot about uptime and “five nines,” a goal that means the system must be up 99.999% of the time. The issue is that many companies never define what downtime means to them, or which threats can cause it.
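
To make “five nines” concrete, here is a minimal sketch of the arithmetic. It assumes an average year of 365.25 days as the measurement window; the function name `downtime_budget_minutes` is illustrative, not part of any tool.

```python
# Minutes in an average year (365.25 days), used as the default window.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_pct: float,
                            window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the window at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window_minutes * (1 - availability_pct / 100)


# Compare the yearly downtime budget for two to five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} minutes per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is why the definition of “down” matters so much.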

Define downtime

Before we can address uptime, we have to define downtime. Downtime could mean:

A specific database is offline
A specific server is offline
System performance is slow enough to miss service level agreements (SLAs)
Scheduled maintenance (which may or may not count against your uptime)

Many shops with strict SLAs treat poor performance as downtime, and rightly so. An excellent example is the drug interaction database I used to manage. When a doctor issues a prescription at a hospital, code checks that the new medication doesn’t interact badly with the patient’s current medications. If that code runs too slowly to trigger an alert in time, someone could die.
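
One way to see how the definition changes the number is to count slow responses as downtime. The sketch below is only an illustration: the `Probe` record, the sample data, and the 500 ms threshold are all hypothetical, and a real shop would feed in its own monitoring data and SLA.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One health check: did the server answer, and how long did it take?"""
    responded: bool
    latency_ms: float


def availability_pct(probes: list[Probe], sla_latency_ms: float) -> float:
    """Percent of probes that responded AND met the latency SLA."""
    if not probes:
        raise ValueError("need at least one probe")
    good = sum(1 for p in probes
               if p.responded and p.latency_ms <= sla_latency_ms)
    return 100 * good / len(probes)


# Hypothetical sample: 8 fast responses, 1 slow response, 1 outright failure.
probes = [Probe(True, 120.0)] * 8 + [Probe(True, 2500.0), Probe(False, 0.0)]

strict = availability_pct(probes, sla_latency_ms=500.0)          # slow counts as down
lenient = availability_pct(probes, sla_latency_ms=float("inf"))  # only failures count
print(strict, lenient)
```

The same ten probes give two different availability figures depending on whether “too slow” counts as “down.”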

Define threats

Next, define what you need to protect against. Almost everyone overlooks this critical step, and instead jumps to a popular solution.

Do you want to protect against earthquakes, floods, tornadoes, and hardware failures? How about software failure? And what does software failure mean to you? And don’t overlook internal sabotage, whether intentional or not.

What does unintentional internal sabotage look like? Common causes from innocent sources include:

a self-inflicted denial of service from a user running a huge query that uses up all the resources on the server
a massive un-batched delete
an in-house application that doesn’t close its connections and exhausts the server’s memory
someone dropping an object or truncating an important table by accident
a DBA who over-tunes the backups, maxing out server resources

You get the idea. Which of these are you going to protect against?
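
As one example of guarding against an item on that list, the massive un-batched delete, here is a rough sketch of deleting in small committed batches. It uses Python’s built-in `sqlite3` module and an in-memory database purely as a stand-in so the example runs anywhere; the `audit_log` table and its columns are hypothetical. On SQL Server, the same idea is commonly written as a loop around `DELETE TOP (n)`.

```python
import sqlite3


def delete_in_batches(conn: sqlite3.Connection, cutoff: int,
                      batch_size: int = 1000) -> int:
    """Delete rows with created_day < cutoff in small committed batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audit_log WHERE rowid IN ("
            "  SELECT rowid FROM audit_log WHERE created_day < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit each batch so no single transaction grows huge
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Demo on an in-memory database with 10,000 hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, created_day INTEGER)")
conn.executemany("INSERT INTO audit_log (created_day) VALUES (?)",
                 [(d % 100,) for d in range(10_000)])
conn.commit()

deleted = delete_in_batches(conn, cutoff=50, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(deleted, remaining)
```

Each batch is its own short transaction, so other work gets a chance to run between batches instead of waiting behind one enormous delete.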

Define duration

The next big question is, “How long are you planning your downtime to be?” After all, you could run off the secondary server for weeks at a time, or just for a couple of hours while you fix the issue.

The answer will be different for each issue you’re protecting against. You can easily plan for a two-hour downtime to recover from a dropped table. It’s much harder to hold the downtime from a motherboard failure to two hours.
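
A per-threat duration can be captured as simple data and checked after an incident. This is a sketch only: the two-hour figure for a dropped table echoes the example above, while the eight-hour figure for a motherboard failure and all the names here are hypothetical placeholders for whatever your stakeholders agree on.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    threat: str
    max_outage: timedelta


# Hypothetical targets; the real numbers come from your stakeholders.
TARGETS = {
    t.threat: t
    for t in (
        RecoveryTarget("dropped table", timedelta(hours=2)),
        RecoveryTarget("motherboard failure", timedelta(hours=8)),
    )
}


def within_target(threat: str, actual_outage: timedelta) -> bool:
    """True if the outage stayed inside the agreed window for that threat."""
    if threat not in TARGETS:
        raise KeyError(f"no documented target for threat: {threat!r}")
    return actual_outage <= TARGETS[threat].max_outage


print(within_target("dropped table", timedelta(minutes=90)))
print(within_target("motherboard failure", timedelta(hours=12)))
```

Raising an error for an undocumented threat is a deliberate choice in this sketch: it surfaces the threats nobody planned for.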

You should have a quick way to recover dropped non-table objects, such as stored procedures and views. Done right, you shouldn’t need to fail over to get those objects back quickly.

Put it in writing!

Next, meet with all the stakeholders and put all requirements in writing. You want no misunderstandings and no ambiguities.

Document limitations, so everyone knows what types of failures would be catastrophic. If the group decides against something, document the reasoning behind that decision. Things will change in the future, and the team may want to add in an element that they rejected before.

It is crucial that you get everyone’s agreement on all this before you go any further.

The bottom line

What does “downtime” mean to you, and to your company?
What are the threats you need to protect against?
How long can an outage be, for each type of downtime?
Now, write all that down.

In the next article, we’ll cover the various types of HA solutions available, and what they’re actually for.
