What Really Causes Performance Problems?

Every IT shop has its performance problems: some localized, and some that span an entire server, or even multiple servers. Technologists tend to treat these problems as isolated incidents – solving one, then another, and then another. This happens especially when a problem is recurring but intermittent: when a slowdown or error happens only every so often, it’s far too easy to lose sight of the big picture.

Some shops suffer from these issues for years without ever getting to the bottom of it all. So, how can you determine what really causes performance problems?

First, a story

A developer in your shop creates an SSIS package to move data from one server to another. He decides to pull the data from production using SELECT * FROM dbo.CustomerOrders. It works just fine in his development environment, it works fine in QA, and it works fine when he pushes it into production. The package runs on an hourly schedule, and all is well.

What he doesn’t realize is that there’s a VARCHAR(MAX) column in that table that holds 2GB of data in almost every row…in production.

Things run just fine for a couple of months. Then one day, without warning, things in production start to slow down. It’s subtle at first, but then it gets worse and worse. The team opens a downtime bridge, and a dozen IT guys get on to look at the problem. And they find it! An important query is getting the wrong execution plan from time to time. They naturally conclude that they need to manage statistics, or put in a plan guide, or take whatever other avenue they settle on to solve the problem. All is well again.

A couple of days later, it happens again. And then again, and again. Then it stops. A couple of weeks later, they start seeing a lot of blocking. They put together another bridge, and diagnose and fix the issue. Then they start seeing performance issues on another server that’s completely unrelated to that production server. There’s another bridge line, and another run through the same process.

What’s missing here?

The team has been finding and fixing individual problems, but they haven’t gotten to the root of the issue: the SSIS package data pull is very expensive.  It ran fine for a while, but once the data grew (or more processes or more users came onto the server), the system was no longer able to keep up with demand.  The symptoms manifested differently every time.  While they’re busy blaming conditions on the server, or blaming the way the app was written, the real cause of the issues is that original data pull.

Now multiply this situation by several dozen, and you’ll get a true picture of what happens in IT shops all over the world, all the time.

What nobody saw is that the original developer should never have had access to pull that much data from production to begin with.  He didn’t need to pull all of the columns in that table, especially the VARCHAR(MAX) column.  By giving him access to prod – by not limiting his data access in any way – they allowed this situation to occur.
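
To make the root cause concrete, here’s a minimal T-SQL sketch of the difference; dbo.CustomerOrders comes from the story, but the individual column names are assumptions for illustration:

  -- The original pull: drags every column across the wire on every run,
  -- including the huge VARCHAR(MAX) column.
  SELECT * FROM dbo.CustomerOrders;

  -- What the package actually needed: only the columns it uses.
  -- OrderID, CustomerID, OrderDate, and OrderTotal are hypothetical names.
  SELECT OrderID, CustomerID, OrderDate, OrderTotal
  FROM dbo.CustomerOrders;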

What Really Causes Performance Problems?

Just as too many cooks spoil the broth, too many people with access to production will cause instability. Instability is probably the biggest performance killer. But IT shops are now in the habit of letting almost anyone make changes as needed, and then treating the resulting chaos one CPU spike at a time.

This is why performance issues go undiagnosed in so many shops. The people in the trenches need the ability to stand back and see the real root cause behind the singular event they’re in the middle of, and that’s not an easy skill to develop. It takes a lot of experience and it takes wisdom, and not everyone has both. So, these issues can be very difficult to ferret out.

Even when someone does have this experience, they’re likely the only person in a company full of others who can’t make that leap. Management quite often doesn’t understand enough about IT to see how these issues build on each other and cause problems, so they’ll refuse to make the necessary changes to policy.

So really, the problem is environmental, from a people point of view:

  • Too many people in production make for an unstable shop.
  • It takes someone with vision to see that this is the problem, as opposed to troubleshooting the symptoms.
  • Most of the time, that person is overridden by others who only see the one issue.

What’s the ultimate solution?

In short: seriously limit the access people have in production. It’s absolutely critical to keep your production environments free from extra processes.
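
What does “seriously limit” look like in practice? Here’s a minimal T-SQL sketch, assuming a dedicated role for developers and the table from the story above; the role and column names are hypothetical:

  -- Put developers in a role instead of granting rights one login at a time.
  CREATE ROLE DevReader;

  -- Grant SELECT only on the columns a process actually needs...
  GRANT SELECT (OrderID, CustomerID, OrderDate, OrderTotal)
      ON dbo.CustomerOrders TO DevReader;

  -- ...and explicitly deny the expensive VARCHAR(MAX) column.
  DENY SELECT (Notes) ON dbo.CustomerOrders TO DevReader;

With column-level permissions like these, a careless SELECT * fails with a permission error instead of quietly dragging gigabytes across the wire every hour.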

Security is one of those areas that must be constantly managed and audited, because it’s quite easy to escalate permissions without realizing it. This is where Minion Enterprise comes in: I spent 20 years in different shops working out the best way to manage these permissions and, harder still, working out how to keep permissions from getting out of control.

Minion Enterprise gives you a complete view of your entire shop, making it effortless to audit these management conditions on all your servers.

That’s the difference between performance monitoring and management monitoring. The entire industry thinks of performance as a single event, when in reality, performance is multi-layered. It’s made up of many events: management-level events where important decisions have been ignored or pushed aside. And these decisions build on each other. One bad decision – giving developers full access to production – can have drastic consequences that nobody will realize for a long time.

Sign up for your trial of Minion Enterprise today.

The $2,000 cup of coffee

What’s the most expensive cup of coffee you’ve ever bought? Was it $6, $8, $15? Try a cup of coffee that cost $2,000. And worse yet, it was made at home. Let me tell you a tale of disaster recovery…

I know you’re asking what could possibly make a cup of coffee that expensive. Well, it’s not actually the coffee; it’s the container. Usually, even the most expensive coffee is served in a nice cup, worthy of the brew. This coffee, the coffee my wife made recently, was served inside my laptop.

That’s right. She made a lovely cup of coffee (I can only assume it was lovely, since I drink tea), and then proceeded to dump it all inside my laptop. I never really got the full story, but I do know that before it was all over, the coffee had been in three containers: first a cup, then a Lenovo Yoga 2 Pro laptop, then a Lenovo Yoga 2 Pro brick.

Coffee, and Disaster Recovery

I’m writing this blog to remind you all to make sure you’ve got your disaster plans in place and tested. Make sure you’ve backed up everything you can’t afford to lose. Some strategies to use are:

  1. Keep everything in as centralized a place as possible. That means put as much as you can in your My Docs folder, so you only have one master folder to back up.
  2. Keep as much as possible online somewhere. Whether you use O365, Dropbox, or whatever else, keep it online. Don’t store anything on your box that you can’t afford to lose.
  3. Don’t rely on backup software. Backup software has failed me many times when I went to restore, usually in the form of corrupt, inaccessible backups. I prefer to set up a script to copy my files instead. If I’m not storing my stuff online, then at the very least I’ll keep an external drive to copy files over to, once a week or so.
  4. Even if you have online storage, also keep an external drive, and keep it disconnected. About two years ago, I got a virus that encrypted all of my files, including my OneDrive and Dropbox content. It encrypted everything my box could get to. So keep the external drive disconnected until it’s time to copy files to it; then disconnect it again. This way, a piece of ransomware won’t be able to get to your offline files.
  5. Regularly back up databases or VM images, if you have those on your box (see the backup sketch after this list).
  6. If you have code on your box that you’re working on, nothing beats an online code vault.  GitHub is probably the most popular.  I use it for every bit of my code, and I rest assured that if anything happens to my box, then at least my code is safe… to a reasonable point in time, that is.
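
For item 5, here’s a minimal T-SQL sketch of a simple database backup; the database name and path are made up, and your schedule and retention needs will differ:

  -- Full backup with a checksum, so corruption is caught at backup time
  -- rather than discovered at restore time (the failure mode from item 3).
  BACKUP DATABASE MyImportantDB
      TO DISK = N'E:\Backups\MyImportantDB.bak'
      WITH INIT, CHECKSUM, STATS = 10;

  -- Verify that the backup is readable without actually restoring it.
  RESTORE VERIFYONLY
      FROM DISK = N'E:\Backups\MyImportantDB.bak'
      WITH CHECKSUM;

Point TO DISK at the disconnected external drive from item 4, and you’ve covered the ransomware case as well.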

Be Vigilant

Those are a few steps to prevent you from being caught unawares.  Be diligent.  Don’t skip even once, and don’t relax…like I did.

Any other time, my laptop would have been closed, because I access it from another workstation via RDP.  But I was doing something on it, and I didn’t close the lid when I was finished.  I figured, “Oh, I’ll get to it in a bit.”  And then it happened.  For the past forever I’d been religious about keeping my laptop lid closed, and the one time I relaxed for an hour, this happens.  So you really can’t relax even once.  A disaster can strike at any time.

And some day it will. Are you prepared? How tested is your DR plan? If someone were to pour coffee on your box right now, would it be an inconvenience, or would you be in big trouble? Ask yourself that whenever you’re telling yourself how ready you are for something to happen.