SQL maintenance is a lifecycle, not an event!

You’ve heard me talk about this many times, in so many different ways, but it’s worth repeating: SQL maintenance lifecycles are important.

People who disagree, disagree because they spend too much time firefighting and don’t have time to really think about it. Don’t know anything about SQL maintenance lifecycles? Let’s touch on a couple of points…today, we’ll cover SQL backup.

SQL maintenance lifecycles: Backups

Backups have the biggest lifecycles of all, because the uses for backup files are so varied.

They’re used for restoring databases to the original server, sure.  But here’s a short list of the other kinds of things you need to do with backup files:

  • Restore to development, QA, etc.
  • Send to vendor
  • Initialize replication
  • Initialize Availability Groups
  • Restore to a secondary server to run CheckDB
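
Take that last item, for example.  Offloading CheckDB means restoring the most recent full backup to a secondary server and running the check there.  Here’s a minimal T-SQL sketch of that step, with invented server, database, and file-path names used purely for illustration:

    -- Restore the latest full backup to a scratch copy on the secondary,
    -- then run the integrity check there instead of on production.
    RESTORE DATABASE SalesDB_Check
        FROM DISK = N'X:\SQLBackups\SalesDB_FULL.bak'
        WITH MOVE N'SalesDB'     TO N'D:\Data\SalesDB_Check.mdf',
             MOVE N'SalesDB_log' TO N'L:\Logs\SalesDB_Check_log.ldf',
             REPLACE, STATS = 10;

    DBCC CHECKDB (N'SalesDB_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;

Multiply that by dev refreshes, vendor copies, replication, and AG seeding, and you can see that a backup file has a life well beyond the moment it was written.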

A backup is not a single event

A database backup isn’t a single event.  But unfortunately, nearly every vendor out there – and most DBAs – treat backups as a single event that has a start and a finish.  “Look, I took a SQL backup and produced a file, and now I’m finished!”  But that’s not remotely the case.  We use and shuffle SQL backup files around all day and all week long on our various servers.

This is why I’m so against those tools that I call “network backup tools”.  You know the ones: Backup Exec, AppAssure, etc.  Your network admin loves those tools, because he doesn’t understand SQL and he wants the same backup utility for the whole shop.

But network backup tools simply aren’t going to cut it for SQL maintenance lifecycles, because they take snapshots of the data and log files.  Quite often those two files aren’t fully synchronized when the snapshot is taken, so you wind up having to create a new log file and losing whatever data was in the old one.  Not only that, but about all you can do with those snapshot files is attach them somewhere, usually on the same server they came from.

Without the understanding that your maintenance has a lifecycle, you wind up with a lot of piecemeal solutions.  You write something to restore a database on a development box, then another DBA does the same thing, then another, then another.  Soon you have a huge mix of solutions that all need support, written by different DBAs over the years…an unmanageable web of one-offs, each with its own quirks and limitations.  That’s no way to run a shop, and it’s no way to support your data.

SQL maintenance lifecycles: SQL backup done right

This is why we’ve made Minion Backup the way we have.  Minion Backup is the only backup tool on the market that treats your environment holistically.  It understands your SQL maintenance lifecycle and works for you, providing built-in solutions that you just deploy.  There’s no need to write your own processes or manage anything separate, because it’s all built into Minion Backup by default.

Take, for instance, backing up your SQL certificates – because, you know, who doesn’t need to?  With Minion Backup you can easily back up every certificate, even to multiple locations – as many locations as you need.  This functionality is built in, and turning it on is a simple table setting.
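
For the curious, the plain T-SQL that a certificate backup boils down to looks something like this – a sketch with an example certificate name, share path, and placeholder password, not Minion Backup’s actual settings:

    -- Back up a certificate and its private key; repeat with a different
    -- path for each additional location you want a copy in.
    BACKUP CERTIFICATE TDECert
        TO FILE = N'\\BackupShare1\Certs\TDECert.cer'
        WITH PRIVATE KEY (
            FILE = N'\\BackupShare1\Certs\TDECert.pvk',
            ENCRYPTION BY PASSWORD = N'PlaceholderP@ssw0rd!');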

Not only that, but since we store the private key password encrypted and give it to you in the log, you can rotate your private key passwords on any schedule you like to tighten security.  If you change the private key password every month, you’ve always got access to the encrypted value in the log, so you can restore that certificate with ease.  This is what we mean by SQL maintenance lifecycle management.  All of our SQL maintenance products have this level of lifecycle management, and we’re improving them all the time.  We work closely with our users to give them built-in solutions that matter.

All of our SQL maintenance products are completely FREE, so download them and try them now.  And if you like what you see, then try our enterprise products and see the power of bringing your SQL maintenance lifecycles into the enterprise architecture.

Download our FREE maintenance modules below:

  • Minion Backup: Free SQL Server Backup utility with world-class enterprise features and full lifecycle management.
  • Minion Reindex: Free SQL Server Reindex utility with unmatched configurability.
  • Minion CheckDB: Free SQL Server CHECKDB that solves all of your biggest CHECKDB problems.

 

What Really Causes Performance Problems?

Every IT shop has its problems with performance: some localized, and some that span a server, or even multiple servers. Technologists tend to treat these problems as isolated incidents – solving one, then another, and then another. This happens especially when a problem is recurring but intermittent. When a slowdown or error happens every so often, it’s far too easy to lose the big picture.

Some shops suffer from these issues for years without ever getting to the bottom of it all. So, how can you determine what really causes performance problems?

First, a story

A developer in your shop creates an SSIS package to move data from one server to another. He makes the decision to pull the data from production using SELECT * FROM dbo.CustomerOrders.  This works just fine in his development environment, and it works fine in QA, and it works fine when he pushes it into production.  The package runs on an hourly schedule, and all is well.

What he doesn’t realize is that there’s a VARCHAR(MAX) column in that table that holds 2GB of data in almost every row…in production.
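
For illustration, here’s roughly what the pull could have looked like instead – naming only the columns the destination actually needs (the column names are invented for the example; only dbo.CustomerOrders comes from the story):

    -- Pull only the needed columns, and only the last hour of rows,
    -- leaving the VARCHAR(MAX) payload column behind.
    SELECT OrderID, CustomerID, OrderDate, OrderTotal
    FROM dbo.CustomerOrders
    WHERE OrderDate >= DATEADD(HOUR, -1, SYSUTCDATETIME());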

Things run just fine for a couple months.  Then without warning, one day things in production start to slow down.  It’s subtle at first, but then it gets worse and worse.  The team opens a downtime bridge, and a dozen IT guys get on to look at the problem.  And they find it!  An important query is getting the wrong execution plan from time to time.  They naturally conclude that they need to manage statistics, or put in a plan guide, or whatever other avenue they decide to take to solve the problem.  All is well again.

A couple of days later, it happens again.  And then again, and then again.  Then it stops.  A couple of weeks later they start seeing a lot of blocking.  They put together another bridge, and diagnose and fix the issue.  Then they start seeing performance issues on another server that’s completely unrelated to that production server.  There’s another bridge line, and another run through the whole process.

What’s missing here?

The team has been finding and fixing individual problems, but they haven’t gotten to the root of the issue: the SSIS package data pull is very expensive.  It ran fine for a while, but once the data grew (or more processes or more users came onto the server), the system was no longer able to keep up with demand.  The symptoms manifested differently every time.  While they’re busy blaming conditions on the server, or blaming the way the app was written, the real cause of the issues is that original data pull.

Now multiply this situation times several dozen, and you’ll get a true representation of what happens in IT shops all over the world, all the time.

What nobody saw is that the original developer should never have had access to pull that much data from production to begin with.  He didn’t need to pull all of the columns in that table, especially the VARCHAR(MAX) column.  By giving him access to prod – by not limiting his data access in any way – they allowed this situation to occur.
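
One hedged sketch of how that access could have been scoped (the view and login names here are invented): expose only the columns the export needs through a view, and grant read access on that view rather than on the whole table.

    -- Create a narrow export view and grant the ETL login read access
    -- to the view only, not the underlying dbo.CustomerOrders table.
    CREATE VIEW dbo.CustomerOrders_Export
    AS
    SELECT OrderID, CustomerID, OrderDate, OrderTotal
    FROM dbo.CustomerOrders;
    GO
    GRANT SELECT ON dbo.CustomerOrders_Export TO [ETL_ReadOnly];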

What Really Causes Performance Problems?

Just as too many cooks spoil the broth, too many people with access to production will cause instability.  Instability is probably the biggest performance killer.  But IT shops are now in the habit of letting almost anyone make changes as needed, and then treating the resulting chaos one CPU spike at a time.

This is why performance issues go undiagnosed in so many shops.  The people in the trenches need the ability to stand back and see the real root cause of issues beyond the single event they’re in the middle of, and that’s not an easy skill to develop.  It takes a lot of experience and it takes wisdom, and not everyone has both.  So, these issues can be very difficult to ferret out.

Even when someone does have this experience, they’re likely the only one in a company full of people who can’t make that leap.  Management quite often doesn’t understand enough about IT to see how these issues can build on each other and cause problems, so they’ll often refuse to make the necessary changes to policy.

So really, the problem is environmental, from a people point of view:

  • Too many people in production makes for an unstable shop.
  • It takes someone with vision to see that this is the problem, as opposed to troubleshooting the symptoms.
  • Most of the time, they’re overridden by others who only see the one issue.

What’s the ultimate solution?

In short: seriously limit the access people have in production. It’s absolutely critical to keep your production environments free from extra processes.

Security is one of those areas that must be constantly managed and audited, because it’s quite easy to inadvertently escalate permissions without realizing it.  This is where Minion Enterprise comes in: I spent 20 years in different shops, working out the best way to manage these permissions, and even harder, working out how to make sure permissions didn’t get out of control.
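
By way of illustration, this is the kind of per-instance check an audit keeps coming back to – a small query that lists every login holding sysadmin.  Running it by hand on one server is easy; keeping it current across hundreds of servers is the hard part.

    -- List every login that is a member of the sysadmin fixed server role.
    SELECT m.name AS login_name, m.type_desc
    FROM sys.server_role_members rm
    JOIN sys.server_principals m ON m.principal_id = rm.member_principal_id
    JOIN sys.server_principals r ON r.principal_id = rm.role_principal_id
    WHERE r.name = N'sysadmin';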

Minion Enterprise gives you a complete view of your entire shop to make it effortless to audit management conditions on all your servers.

That’s the difference between performance monitoring and management monitoring.  The entire industry thinks of performance as a single event, when in reality, performance is multi-layered.  It’s made up of many events – management-level events where important decisions have been ignored or pushed aside.  And those decisions build on each other.  One bad decision – giving developers full access to production – can have drastic consequences that nobody will realize for a long time.

Sign up for your trial of Minion Enterprise today.

The $2,000 cup of coffee

What’s the most expensive cup of coffee you’ve ever bought?  Was it $6, $8, $15?  Try a cup of coffee that cost $2,000.  And worse yet, it was made at home.  Let me tell you a tale of disaster recovery…

I know you’re asking what could possibly make a cup of coffee that expensive. Well, it’s not actually the coffee, it’s the container.  Usually, even the most expensive coffee is served in a nice cup, worthy of the brew. This coffee, the coffee my wife got recently, was served inside my laptop.

That’s right.  She made a lovely cup of coffee (I can only assume it was lovely, since I drink tea), and then proceeded to dump it all inside my laptop.  I never really got the full story, but I do know that before the coffee was consumed, it was in three containers.  First, it was in a cup, then in a Lenovo Yoga 2 Pro laptop, then in a Lenovo Yoga 2 Pro brick.

Coffee, and Disaster Recovery

I’m writing this blog to remind you all to make sure you’ve got your disaster plans in place and tested.  Make sure you have copies of everything you can’t afford to lose.  Some strategies to use are:

  1. Keep everything in as centralized a place as possible.  That means put as much as you can in your My Docs folder, so you only have one master folder to back up.
  2. Keep as much as possible online somewhere.  Whether you use O365, DropBox, or whatever else, keep it online.  Don’t even store stuff on your box that you can’t afford to lose.
  3. Don’t rely on backup software.  I’ve had backup software fail me many times when I went to restore, usually in the form of corrupt, inaccessible backups.  I prefer to set up a script to copy my files instead.  If I’m not storing my stuff online, then at the very least I’ll keep an external drive to copy files over to, once a week or so.
  4. Even if you have online storage, also keep an external drive, and keep it disconnected.  About two years ago, I got a virus that encrypted all of my files, including my OneDrive and DropBox content.  It encrypted everything my box could get to.  So keep an external drive, but keep it disconnected until it’s time to copy files to it; then disconnect it again.  This way, a piece of ransomware won’t be able to get to your offline files.
  5. Regularly back up databases or VM images, if you have those on your box.
  6. If you have code on your box that you’re working on, nothing beats an online code vault.  GitHub is probably the most popular.  I use it for every bit of my code, and I rest assured that if anything happens to my box, then at least my code is safe… to a reasonable point in time, that is.
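
On point 5, here’s a minimal sketch of the kind of local safety-net backup I mean, assuming a database called SandboxDB and an external drive mounted as E: (both just example names):

    -- A copy-only full backup to the external drive, so it doesn't disturb
    -- whatever backup chain the database may already have.
    BACKUP DATABASE SandboxDB
        TO DISK = N'E:\OfflineBackups\SandboxDB_FULL.bak'
        WITH COPY_ONLY, COMPRESSION, CHECKSUM, INIT;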

Be Vigilant

Those are a few steps to keep you from being caught unawares.  Be vigilant.  Don’t skip even once, and don’t relax…like I did.

Any other time, my laptop would have been closed, because I access it from another workstation via RDP.  But I was doing something on it, and I didn’t close the lid when I was finished.  I figured, “Oh, I’ll get to it in a bit.”  And then it happened.  For the past forever I’d been religious about keeping my laptop lid closed, and the one time I relaxed for an hour, this happens.  So you really can’t relax even once.  A disaster can strike at any time.

And some day it will.  Are you prepared?  How tested is your DR plan?  If someone were to pour coffee on your box right now, would it be an inconvenience, or would you be in big trouble?  Always ask yourself this when you’re gauging how ready you are for something to happen.

SQL Server permissions issues destabilize environments

Dear CIOs: I know why your environment is unstable.

The number one cause of instability in most database environments is SQL Server permissions.  And specifically, the problem is that developers, QA engineers, application teams, managers, VPs, etc. have more access to systems than they should.

I’ve been a lead DBA in large shops for twenty years, and I’ve seen this more times than I can count.  Let me walk you through a typical lockdown.

For a very short executive summary, skip to the bottom of this article.

1. The Lockdown Email

Often, a manager will get tired of all of the extra people in the database causing fires, and so he initiates a lockdown.  He tells the database team: “Everyone who doesn’t need to be in the database needs to go. And everyone else should have only what they need to do their job.”

The DBAs don’t know what permissions everyone needs, so they send out an email to explain what’s up. “It’s time for a lockdown on database access. Please send what permissions you need to do your job.”

This is the beginning of the end. It’s this first step that causes the entire project to fail.  We’ll get to why later, but for now let’s move on to the next step: the knee-jerk reaction.

2. The Knee-jerk Reaction

Now, everyone who got the email replies that they do indeed have the access they need to do their jobs.  Steve – who does nothing but write reports – definitely needs sysadmin*. And Carol, who does quality assurance, totally needs to run, alter, create, and delete every job on the system…even though she only manages two jobs.

The rest of the staff are pretty much the same.  Nobody wants the DBAs to trim their permissions.

*Sysadmin gives a user total control of the system, including the ability to drop entire databases and grant permissions to other users.
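
For contrast, here’s roughly what a report writer like Steve actually needs – sketched with an invented database and login name – which is read-only access to the reporting database, not sysadmin:

    -- Map the login into the reporting database and grant read-only access.
    USE ReportingDB;
    CREATE USER [DOMAIN\Steve] FOR LOGIN [DOMAIN\Steve];
    ALTER ROLE db_datareader ADD MEMBER [DOMAIN\Steve];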

3. The Heavy Hand

Next, the DBAs take a deep breath and explain that the lockdown is going to happen; everyone has to start being realistic about what permissions they actually need, and not just what they want.

It doesn’t matter how many meetings you hold, or how polite or rude the DBAs are about it: this is happening, because the environment is just too unstable.

Now the rest of this lockdown will continue in one of two ways.

4. The Pushback

Option 1: Everyone ignores the lockdown and hopes the problem will go away.

Option 2: The users all whine to their managers and higher ups that their permissions are being taken away. “We won’t be able to do our jobs! The whole place will come to a screeching halt! Customers won’t get what they need because we don’t have the permissions to service them properly!” The overall message is that they need to be in production, and they need to have full access. Because, of course, they never know what will come up and they’ll need to react at a moment’s notice.

5. The Implementation

Option 1: The managers, vice presidents, and maybe even the CIO get scared and scrap the whole project. They just can’t risk compromising production support.

Option 2: The DBAs are able to convince those in charge that the users will have the permissions they need and all will be well. The lockdown continues as scheduled.

6. The First Production Issue

If we made it all the way through to this point (following option 2 above), we still have the first post-lockdown production issue.  That will test the newly locked-down environment.

When the issue hits, some developer encounters a minor issue with his permissions. This is the perfect opportunity to completely blow the situation up, so he immediately runs to the VP and screams that customers are suffering! That the entire company is at stake!

The VP then freaks out, because this developer has been there for years and he’s never let them down before. He’s got to be telling the truth (and not exaggerating at all).  So, the VP orders the DBAs to reinstate the developer’s sysadmin access.

The rest of the security lockdown soon crumbles, because once one exception is made, the rest will soon follow.

I’ve seen this whole thing play out many, many times.

The Lockdown Analysis: What went wrong?

That’s a cautionary tale if I’ve ever seen one.  Now let’s take a minute to talk about what went wrong.

Remember when I said that the DBAs’ announcement email in step 1 was the big mistake?  Here’s why: the DBAs and their manager did not properly inform the vice presidents and CIO.  I have gone both routes, and it almost always fails when I do it as outlined above.  It always succeeds when I insert the following step.

Step 1A: The Upper Brass Email

Before sending out the email or letting anyone else know they’re thinking of trimming permissions, the DBA should meet – if possible, all at once – with all the managers, the VPs, and the CIO (you). At that meeting, the DBA should tell everyone what they’re planning to do and why.  Then they must explain that when they send out the email, there will be plenty of pushback.

This meeting’s talk should look something like this:

Developers don’t want to lose the access they’ve had for years, even though they never use most of it.  So they’re going to come up with all kinds of knee-jerk reactions.  They’re going to come to you guys and say that customers will suffer, and that deadlines are going to slip, and the quality of the company will generally be compromised. 

It’s okay. None of that will happen. 

All of that posturing is really just a tantrum, because people don’t want things to change.  And we don’t give in to tantrums. 

The success of this is going to rely on how good the information they give us is.  If they tell us all the permissions they need, then we’ll probably get it right the first time.  If they don’t, then there may be a couple of minor security hiccups that cause us to go back and give them another permission.  It only takes a couple minutes, and they can wait that long. 

They’ll tell you that the issue is so important that it can’t possibly wait that long.  “I have to handle it right now!” they’ll say.  But think about that: What if an issue comes in while a dev is in the bathroom, or at lunch?  It would have to wait for at least a few minutes, and probably a lot longer.  And this is no different. 

I just wanted you to know that we’re not only trimming permissions, but that the guys are going to come to you with doom and gloom and pleas for emergency permissions.  And you need to know that now so you don’t fall for it.  It’s nothing but a tantrum and this needs to happen.  

Right now we’ve got three major production issues a week, and we can trace them all back to someone making a mistake with something they shouldn’t even be able to do. These permissions are really important to them, but you know what? In a couple weeks they won’t even care anymore.  They’ll be able to do anything they need to, and they won’t know the difference. 

The only real difference is that after the lockdown, if they do something they shouldn’t, whether on purpose or by mistake, the system will stop them.

That is what your DBAs should be telling you.  And that is what you should tell yourself if they don’t.

The Dangers of an “Unlocked” Shop

We alluded to it earlier: production issues happen when someone who shouldn’t have access makes a mistake in production. I’ve seen dropped tables, altered schemas, renamed procedures, changed permissions, and even deleted databases in production – all coming directly from some dev or report writer mucking around in production, where they have no business being. And that’s just the tip of the iceberg.

You should also conduct this lockdown because of external threats. SQL Server permissions are nothing to scoff at.  Why do you think hackers steal data from companies so often?  Because too many people and applications have permissions they shouldn’t have.  Hackers are a very real threat, and once they get into your environment, they can only do what you’ve already given them permission to do.

You can’t look at it like you’re giving your own developers and applications SQL Server permissions. You have to look at it like you’re giving thousands of hackers those permissions.  Because if they get into your server, they’re going to use those permissions against you.

Trust Your Security Experts

Stop letting your developers run your shop.  Trust your DBAs.  Because here’s a secret that your developers aren’t going to tell you: they don’t know anything about security or permissions.  Your DBAs (hopefully) do.

Even though you’ve known this developer for five years, that doesn’t mean the developer knows what he’s talking about.  Remember that the loudest wheel gets the grease.  When it comes to production privileges, the developers are always the loudest wheels.  But instead of greasing them with tough love, you grease them with extra permissions.  And that’s why your environment is so unstable.

Executive summary: SQL Server shops are unstable because too many people have high level permissions. Have your DBAs conduct a lockdown, but first, warn all of the managers, VPs, and the CIO that people are going to complain and predict doom and gloom. There is no doom and gloom, just a lot of tantrums and dangerously unsecured servers.

Download our FREE maintenance modules below:

  • Minion Backup: Free SQL Server Backup utility with world-class enterprise features and full lifecycle management.
  • Minion Reindex: Free SQL Server Reindex utility with unmatched configurability.
  • Minion CheckDB: Free SQL Server CHECKDB that solves all of your biggest CHECKDB problems.