8 Ways to Prevent Unplanned Downtime

Prevent System Downtime

1. Establish a Proactive Maintenance Schedule 
Every operating system, database and application requires regular maintenance tasks. If these tasks are ignored, issues will arise, especially on a Windows platform. Implementing a proactive maintenance schedule is the most cost-effective way of executing these tasks and minimizing the risk of unplanned downtime. Whether done in-house or with the help of an outside support partner, establishing this program requires the following steps:

1. Inventory all equipment and software assets, categorizing them according to their importance.
2. Understand the tasks and categorize them by when they need to be performed: daily, weekly, monthly, quarterly, bi-annually and annually.
3. Create procedures.
4. Train your team on the new procedures.
5. Create schedules.
6. Track, adjust and improve over time.
 

2. Execute Pre-Business System Checks
Checking critical business systems before production hours begin is a cost-effective way to ensure services are available and ready for daily operations. These checks also help you detect unplanned outages or other unknown errors, and give the IT team a head start on correcting any problems before they have a negative effect on the organization.
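A pre-business check can be as simple as a script that runs before production hours and reports which services did not respond. The sketch below is one minimal way to do this in Python, assuming each critical system can be probed with a plain TCP connection. The names `tcp_port_open` and `run_checks` are illustrative and not part of any particular monitoring product.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def run_checks(checks):
    """Run each named check and return the names of the ones that failed.

    `checks` maps a service name to a zero-argument callable returning bool.
    A check that raises an exception is counted as a failure.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

In practice, `checks` might map a name such as "database" to `lambda: tcp_port_open("db.example.internal", 1433)` (a hypothetical host and port), and the returned list of failures would be sent to the IT team before the business day starts.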

3. Implement Measurements & Indicators
Effective IT leaders use four main indicators to measure and reduce the frequency and impact of unplanned downtime on their organization: Availability, Reliability, Maintainability and Serviceability. The fact that these indicators are being measured and used to make management decisions is a sign of a leader who has their house in order.

Availability: The lack of this measurement is a sign that unplanned downtime is probably happening more than it should. Effective IT leaders know it is critical to measure Availability, or "up-time", because it becomes an objective tool to be leveraged when under duress. Availability is calculated as the number of hours IT has committed to having the service available (A), minus the number of hours it was not available (B), divided by the committed hours (A), times 100. The formula looks like this: Availability = (A - B) / A * 100. It is communicated as a percentage, e.g. 99.999%.
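The availability formula can be sketched as a small Python function. The function name and the sample figures (720 committed hours, 3.6 hours of downtime) are made up for illustration.

```python
def availability_percent(committed_hours, downtime_hours):
    """Availability as a percentage: (A - B) / A * 100."""
    if committed_hours <= 0:
        raise ValueError("committed_hours must be positive")
    if not 0 <= downtime_hours <= committed_hours:
        raise ValueError("downtime_hours must be between 0 and committed_hours")
    return (committed_hours - downtime_hours) / committed_hours * 100


# Made-up example: a service committed for 720 hours in a 30-day month
# that was unavailable for 3.6 hours.
print(f"{availability_percent(720, 3.6):.1f}%")  # prints 99.5%
```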

Reliability: How long systems run before the next failure occurs is typically overlooked by an overburdened IT leader. This measurement is also known as Mean Time Between Failures (MTBF). It speaks volumes about an IT department's performance.

Maintainability: This is another indicator of an IT department's capabilities. This measurement is also called Mean Time To Repair (MTTR) and asks the question: on average, how long does it take your IT team to resolve a failure? Did you know 80% of the time spent during this downtime is wasted on unproductive activities? Effectively managing this time is key.
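As a rough sketch, both MTBF and MTTR can be derived from a simple outage log. The Python below assumes one common convention (MTBF is total uptime divided by the number of failures, MTTR is total downtime divided by the number of failures). The function name and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(period_start, period_end, outages):
    """Return (MTBF, MTTR) as timedeltas for one measurement period.

    `outages` is a list of (down_at, restored_at) datetime pairs that fall
    inside the period. Returns (None, None) when there were no outages.
    """
    if not outages:
        return None, None
    total_downtime = sum((end - start for start, end in outages), timedelta())
    total_uptime = (period_end - period_start) - total_downtime
    failures = len(outages)
    return total_uptime / failures, total_downtime / failures


# Made-up example: a 30-day period with a 2-hour and a 4-hour outage.
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
outages = [
    (datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 11)),
    (datetime(2024, 1, 20, 14), datetime(2024, 1, 20, 18)),
]
mtbf, mttr = mtbf_and_mttr(start, end, outages)
print(mtbf, mttr)  # MTBF is 357 hours, MTTR is 3 hours
```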

Serviceability: How about your vendor's ability to meet their Service Level Agreement? This is called Serviceability and the SLAs include availability, reliability and maintainability. If you do not measure your vendor, how do you know how well they are performing? SLA commitment breaches happen more often than you think. A reactionary-type environment will not catch these types of failures.

4. Adhere To A Strict Change Management Process

A management-sanctioned change management policy needs to be put in place, meaning no change goes into the production environment unless it is approved by a Change Advisory Board. Unauthorized changes must stop!

All changes must follow a predefined change management process that includes a proven rollback process, and there must be consequences for implementing unapproved changes.

5. Create & Maintain Standard Operating Procedures
SOPs increase availability. Since 70% of outages are caused by humans, SOPs can play a major role in reducing mistakes:

•  The creation and maintenance of an SOP library will ensure that procedures can be followed consistently.

•  A defined SOP change management process ensures that any changes to SOPs are understood by the team. At Allari, we use a BPC (Business Process Change) methodology to implement our customers' business process changes over multiple global shifts, which facilitates rapid awareness and adoption. Generally, this allows us to get a process change request disseminated to the global team in less than 24 hours.

6. Activate Event Monitoring
Automated monitoring tools can report specific critical events, providing advance warning that an unplanned outage is about to occur. They also allow you to respond to an outage before users even have to report it.

•  Monitoring memory utilization can prevent downtime. Constantly pushing the 90% threshold is a sure sign that the IT asset is becoming overloaded.

•  Monitoring disk space can prevent downtime and data loss. Server disk drives with over 90% utilization are usually a sign that the system will soon run out of storage, especially on database and backup servers.
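The 90% rule of thumb above can be sketched as a threshold check. This Python example uses only the standard library, so it covers disk space. Memory utilization usually needs a platform-specific or third-party source (for example the `psutil` package) and is left out here. The function names and the warning format are illustrative assumptions.

```python
import shutil


def disk_utilization(path="/"):
    """Return the used fraction (0.0 to 1.0) of the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_threshold(name, fraction, threshold=0.90):
    """Return a warning string if `fraction` is at or above `threshold`, else None."""
    if fraction >= threshold:
        return f"WARNING: {name} at {fraction:.0%} (threshold {threshold:.0%})"
    return None


# Check the disk that holds the current directory.
warning = check_threshold("disk", disk_utilization("."))
if warning:
    print(warning)
```

A scheduler such as cron or Windows Task Scheduler could run a script like this at a regular interval and forward any warning to the team, though a dedicated monitoring tool will normally offer more than this sketch does.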

7. Maintain A Knowledge Repository
Unplanned outages will occur. Learning from them is critical to reducing the impact they may have the second time. Thus, the creation and maintenance of a knowledge repository which stores the root cause and corrective action(s) taken to end the outage will be invaluable to the staff. Adding this information needs to become a process understood, respected and followed by the team.

8. Additional Preventive Tools
SLA commitment breaches, the percentage of unplanned work, the volume of effective change throughput and server-to-admin ratios are additional indicators that can help a leader prevent downtime.

 


 

Challenge

UNPLANNED DOWNTIME

Definition
Downtime is a term used to describe when a service is unavailable to its intended recipients. While downtime can be planned months in advance, it is typically not and is often a surprise.

Most downtime events are unplanned: they are either caused by a failure or triggered on short notice by an attempt to fix a service that is not performing at its optimal level.

Signs & Symptoms
Unplanned downtime is the number one cause of financial harm, yet most IT leaders don't understand the signs and symptoms of an environment that experiences too much unplanned downtime.

Sure, it's easy to surmise that the systems are offline more than they should be, especially when management is enraged, but there are legitimate signs and symptoms that will allow you to reduce the frequency and impact of unplanned outages.

  • Unauthorized Changes
  • High amounts of Unplanned Work
  • Low Throughput of Effective Change
  • Server to Administrator Ratios < 100:1
  • Lack of Indicator Measurements
  • SLA Commitment Breaches

Top 3 Ways To Prevent Downtime

1. Implement Preventive Maintenance Schedules

2. Execute Pre-Business System Checks

3. Implement Measurements & Indicators