Definition

Downtime is the period during which a service is unavailable to its intended users. Although downtime can be planned months in advance, it typically is not, and it is often an unwelcome surprise. Most downtime events are either unplanned and caused by a failure, or triggered on short notice in an attempt to fix a service that is not performing at its optimal level.
 
Unplanned downtime, also known as an outage, falls into two categories:
1. The unplanned unplanned outage, which might be easier to remember if it were called an unscheduled unplanned outage. These outages occur when a service unexpectedly becomes unavailable because an IT asset fails on its own. For example, a server crashes, which causes an internal installation of Microsoft Exchange to become unavailable. Nobody knew that this was going to happen. It just did. Now, the entire company is without email.
2. The planned unplanned outage, which might be easier to remember if it were called a scheduled unplanned outage. In these scenarios, a system degradation incident is occurring: the service is so slow that it is having a negative effect on production, or a bug has been detected in an application and an emergency change needs to be deployed immediately to fix it. In either scenario, multiple tickets get logged with the Service Desk. IT determines a resolution plan that requires a planned unplanned outage.
 
Here is an example of a "planned" unplanned outage: It has been determined that the email system is crawling. The IT leader determines that the resolution requires rebooting the Microsoft Exchange server. The communication manager sends the downtime notice to the entire company: "Email will not be available between 1 and 1:30 p.m. while we execute an emergency maintenance initiative. We apologize for the inconvenience."
 
At 1 p.m., the IT team takes down the email service in a controlled manner. They restart the services and then tell the users that email is available again. The IT team did not plan to do this at the beginning of the day. But once they understood that the solution to the problem required downtime, they communicated the event to the users and executed it.

Best Practices: High-powered IT leaders love "planned" outages:
 
Planned or scheduled outages are proactive events scheduled well in advance. Their purpose is to execute preventative maintenance tasks and/or deploy approved changes. They are critical to keeping the systems secure and operating at peak performance. IT needs downtime. An IT leader can either proactively plan for it or reactively respond to it. Planned downtime allows you to maintain the systems and thereby reduce the volume of unplanned outages.
 

Signs & Symptoms

Downtime is the number one cause of financial harm, yet most IT leaders don't understand the signs and symptoms of an environment that has too much of it. It is easy to surmise that the systems are offline more than they should be, especially when management is enraged, but there are legitimate signs and symptoms that, once recognized, allow you to reduce the frequency and impact of unplanned outages.

  • Unauthorized Changes - This is one of the biggest telltale signs that an environment is exposed to a high number of unplanned outages. If people are allowed to implement changes without following a strict, management-sanctioned change management process, there will be lots of downtime and hours upon hours of human effort spent trying to figure out what went wrong and how to fix it.
  • High Amounts of Unplanned Work - A high percentage of unplanned work correlates directly with high levels of unplanned downtime. If the team is spending more than 5% of its time reacting to service failures, then the environment can be defined as reactive. A study has shown that high-powered IT departments spend less than 5% of their total effort on unplanned tasks.
  • Low Throughput of Effective Change - Since the majority of the team's effort is spent putting out fires, there is little time left to implement the changes needed to stay competitive. If you can't implement a high number of changes each year, the business will fall behind, which is why a lot of companies slide out of the Fortune 1000. Their IT staff just can't adapt. Low throughput of effective change is a symptom of downtime.
  • Server-to-Administrator Ratios < 100:1 - This is an informal metric that shows how many servers each administrator manages. If, on average, you require one administrator to manage fewer than 100 servers, chances are the administrators are spending too much time executing unplanned work as a result of too many outages or poor management. Keep in mind that there is a direct correlation between the server-to-administrator ratio and throughput of effective change.
  • Lack of Indicator Measurements - If you aren't measuring performance, you do not have the metrics needed to show results and justify management decisions. If measurements are in place but not understood, the same is true. If you have no measurements, then success is judged largely subjectively. In the information technology world, availability, reliability, maintainability and serviceability are critical indicators for managing success. High-powered leaders live by these and many other indicators.
  • SLA Commitment Breaches - Vendors are contracted to respond within a certain amount of time and in a certain way. If a vendor is being paid to keep a service available, then it needs to be held accountable to the defined SLA. Understanding how vendors are performing in relation to the contracted SLA is critical to your business's success. A breach can send an IT team into chaos, causing a high volume of unplanned work and low change throughput. Managing a vendor is just as important as choosing the right vendor.
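
Several of the indicators above reduce to simple arithmetic once the raw numbers are being collected. The following is a minimal sketch in Python, not a definitive implementation. The 5% and 100:1 thresholds come from the list above; the class, field, and function names are hypothetical. The availability formula (uptime divided by total time) is one common convention, assumed here. Whether planned maintenance windows count as downtime is a policy choice this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Hypothetical inputs for one reporting period (all names are illustrative)."""
    total_hours: float           # length of the reporting period
    downtime_hours: float        # hours the service was unavailable
    planned_work_hours: float    # team effort spent on planned tasks
    unplanned_work_hours: float  # team effort spent reacting to failures
    servers: int
    administrators: int


def availability_pct(s: PeriodStats) -> float:
    """Availability as uptime over total time (assumed convention)."""
    return 100.0 * (s.total_hours - s.downtime_hours) / s.total_hours


def unplanned_work_pct(s: PeriodStats) -> float:
    """Share of total team effort spent on unplanned work."""
    total_effort = s.planned_work_hours + s.unplanned_work_hours
    return 100.0 * s.unplanned_work_hours / total_effort


def servers_per_admin(s: PeriodStats) -> float:
    """Average number of servers managed by one administrator."""
    return s.servers / s.administrators


def warning_signs(s: PeriodStats) -> list[str]:
    """Flag the two threshold-based symptoms described in the list above."""
    signs = []
    if unplanned_work_pct(s) > 5.0:    # more than 5% of effort is reactive
        signs.append("unplanned work above 5% of effort")
    if servers_per_admin(s) < 100.0:   # fewer than 100 servers per administrator
        signs.append("fewer than 100 servers per administrator")
    return signs


# Example with made-up numbers for a 30-day month.
example = PeriodStats(
    total_hours=720, downtime_hours=3.6,
    planned_work_hours=1800, unplanned_work_hours=200,
    servers=240, administrators=4,
)
print(round(availability_pct(example), 2), warning_signs(example))
```

The functions assume a non-empty period, non-zero total effort, and at least one administrator. A real report would validate its inputs and would probably track these values over time rather than for a single period.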

Cause

Man or machine? A study by Gartner concluded that 80% of unplanned downtime incidents are caused by people and are usually rooted in the change management area. The IT leader's most important function is the ability to implement continual change, which is why it is critical that change management be management-sanctioned, i.e., the backbone of every IT department's culture.

Although people are well trained in technology, they are not trained very well in processes. Unfortunately, most companies' processes are weak. In addition, people naturally like to take shortcuts. When shortcut habits are combined with a poor change management culture, unplanned outages occur frequently. These types of "Cowboy" cultures are the high-powered IT leader's worst nightmare. Getting a grip on change management and following standard operating procedures are the first areas such a leader focuses on as a "new" IT Director.

Additional Causes:

  • Memory leaks, server response failures, hardware failures, storage failures, disk space, disk mirroring, application failures, data retrieval issues
  • Overloads, Usage Spikes/Surges
  • Bad dependencies
  • Uneven database sharding
  • Failure domains
  • Power Outages from Natural Disasters/Weather Events
  • Third-Party Supplier or Cloud Outages
  • Virus/Malware/Hacks