8 Ways to Prevent Downtime

1. Set Preventative Maintenance Schedules
Every operating system, database and application requires regular maintenance tasks. If these tasks are ignored, issues will arise, especially on a Windows platform. If you have the internal capacity to track the tasks and execute them when they come due, follow these steps:

1. Inventory all equipment and software assets, categorizing them according to their importance.
2. Understand the tasks and categorize them by when they need to be performed: daily, weekly, monthly, quarterly, bi-annual and annual.
3. Create procedures and train your team.
4. Create schedules.
5. Track, adjust and improve over time.
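
As a rough illustration of steps 1, 2 and 4, the sketch below keeps a small task inventory and lists what is due on a given day. Everything in it is an assumption made for the example: the task names, the `priority` field (1 = most important), the interval lengths in days, and the reading of "bi-annual" as twice a year.

```python
from datetime import date, timedelta

# Hypothetical interval lengths, in days, for the frequencies listed above.
# "bi-annual" is assumed here to mean twice a year.
INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "bi-annual": 182,
    "annual": 365,
}


def due_tasks(tasks, today):
    """Return the names of tasks that are due on `today`, most important first."""
    due = []
    for task in tasks:
        interval = timedelta(days=INTERVAL_DAYS[task["frequency"]])
        if task["last_run"] + interval <= today:
            due.append(task)
    # Lower priority number = more important asset (an assumed convention).
    due.sort(key=lambda task: task["priority"])
    return [task["name"] for task in due]


# Hypothetical inventory; real entries would come from your own asset list.
TASKS = [
    {"name": "Rebuild database indexes", "frequency": "weekly",
     "priority": 1, "last_run": date(2024, 1, 1)},
    {"name": "Apply OS patches", "frequency": "monthly",
     "priority": 2, "last_run": date(2024, 1, 1)},
    {"name": "Test backup restore", "frequency": "quarterly",
     "priority": 1, "last_run": date(2024, 1, 1)},
]

for name in due_tasks(TASKS, date(2024, 1, 10)):
    print("Due:", name)
```

Updating `last_run` after each task is completed gives you the history needed for step 5.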

2. Execute Pre Business System Checks
Checking critical systems before business hours begin is a cost-effective way to ensure they are ready for the business day. Many outages are detected this way, giving the IT team a head start on restoring services before the outage has a negative effect on the organization.
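
One simple form of pre-business check is confirming that each critical service accepts a network connection. The sketch below is a minimal example under that assumption; the service names, host names and ports are hypothetical, and a real check would usually also verify logins, batch job results and so on.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def run_checks(services, timeout=3.0):
    """Map each service name to True (reachable) or False (needs attention)."""
    return {
        name: port_is_open(host, port, timeout)
        for name, (host, port) in services.items()
    }


# Hypothetical critical systems; replace with your own hosts and ports.
SERVICES = {
    "ERP web server": ("erp.example.internal", 443),
    "Database": ("db.example.internal", 1433),
}
```

Calling `run_checks(SERVICES)` from a scheduled job shortly before opening hours, and alerting on any `False` result, gives the team the head start described above.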

3. Implement Measurements & Indicators
There are four main indicators that high-powered IT leaders use to measure and reduce the frequency and impact of downtime on their organization: Availability, Reliability, Maintainability and Serviceability. Measuring these indicators and using them to make management decisions is a sign of a leader who has their house in order.

1. Availability - The lack of this measurement is a sign that unplanned downtime is probably happening more than it should. High-powered IT leaders know it is critical to measure Availability, or uptime, because it becomes an objective tool to be leveraged when under duress. Availability is calculated as the number of hours IT has committed to having the service available (A) minus the number of hours it was not available (B), divided by the committed hours (A), times 100. The formula is [(A - B) / A] * 100. It is communicated as a percentage, e.g. 99.999%.
2. Reliability - How long systems run before the next failure occurs is typically overlooked by an overburdened IT leader. This measurement is also known as Mean Time Between Failures (MTBF). It speaks volumes about an IT department's performance.
3. Maintainability - Maintainability is another indicator of an IT department's capabilities. This measurement is also called Mean Time To Repair (MTTR) and asks the question - On average, how long does it take your IT team to resolve a failure? Did you know 80% of the time spent during this downtime is wasted on unproductive activities? Effectively managing this time is key.
4. Serviceability - How about your vendor's ability to meet their Service Level Agreement? This is called Serviceability, and the SLAs include availability, reliability and maintainability. If you do not measure your vendor, how do you know how well they are performing? SLA commitment breaches happen more often than you think. A reactive environment will not catch these types of failures.
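
The first three indicators can be computed from a handful of numbers. The sketch below applies the availability formula given above; the MTBF and MTTR functions use the common textbook definitions (operating hours per failure and repair hours per failure), which is an assumption since the article does not spell them out. The sample figures are made up.

```python
def availability_pct(committed_hours, downtime_hours):
    """Availability as a percentage: [(A - B) / A] * 100."""
    return (committed_hours - downtime_hours) / committed_hours * 100


def mtbf_hours(committed_hours, downtime_hours, failures):
    """Mean Time Between Failures: operating hours divided by failure count.

    Assumes at least one failure occurred in the period.
    """
    return (committed_hours - downtime_hours) / failures


def mttr_hours(downtime_hours, failures):
    """Mean Time To Repair: total repair (downtime) hours divided by failure count.

    Assumes at least one failure occurred in the period.
    """
    return downtime_hours / failures


# Made-up example: a 30-day month of 24x7 service (720 hours),
# with 3 failures totalling 3.6 hours of downtime.
A, B, FAILURES = 720, 3.6, 3
print(f"Availability: {availability_pct(A, B):.2f}%")
print(f"MTBF: {mtbf_hours(A, B, FAILURES):.1f} hours")
print(f"MTTR: {mttr_hours(B, FAILURES):.1f} hours")
```

The same functions can be run against a vendor's reported figures to check Serviceability against the SLA.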

4. Adhere To A Strict Change Management Process

1. A management-sanctioned change management policy needs to be implemented, which means no changes go into the production environment unless approved by a Change Advisory Board. Unauthorized changes must stop!
2. All changes must follow a predefined change management process which includes a proven rollback process.
3. There must be consequences for implementing unapproved changes.

5. Create & Maintain Standard Operating Procedures
SOPs increase availability. Since 80% of outages are caused by humans, SOPs can play a major role in reducing mistakes:

1. The creation and maintenance of an SOP library will ensure that procedures can be followed consistently.
2. An SOP change management process ensures that changes to SOPs are understood by the team. The process will ensure urgent changes to any operational processes are quickly understood and executed accordingly by the team. At Allari, we have a BPC (Business Process Change) methodology to implement our customers' business process changes over multiple global shifts, which facilitates quick adoption of each customer's process. We can usually get a process change request published to the global team in less than 24 hours.

6. Activate Event Monitoring
Automated monitoring tools can report specific critical events, providing advance warning of an outage that is about to occur. They also allow you to respond to an outage before users have a chance to report it.

1. Monitoring memory utilization can prevent downtime. Memory utilization that constantly pushes the 90% threshold is a sure sign that the IT asset is becoming overloaded.
2. Monitoring disk space can prevent downtime and data loss. Server disk drives with over 90% utilization are usually a sign that the system will soon run out of storage, especially on database and backup servers.
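
A dedicated monitoring tool is the usual way to do this, but the idea behind the 90% rule can be sketched in a few lines. The example below uses only the Python standard library, so it covers disk space only; memory figures would need a platform-specific tool or a third-party package. The function names and the path checked are assumptions for illustration.

```python
import shutil

THRESHOLD_PCT = 90.0  # the 90% warning level mentioned above


def disk_used_pct(path):
    """Return the percentage of disk space used on the volume containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def over_threshold(used_pct, threshold=THRESHOLD_PCT):
    """Return True when utilization has reached the warning threshold."""
    return used_pct >= threshold


used = disk_used_pct(".")
if over_threshold(used):
    print(f"WARNING: disk is {used:.1f}% full")
else:
    print(f"OK: disk is {used:.1f}% full")
```

Run on a schedule against each database and backup volume, a check like this raises the warning before users notice a problem.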

7. Maintain A Knowledge Repository
Outages will occur. Learning from them is critical to reducing the impact they may have the second time. A knowledge repository that stores the cause of each outage and its resolution will therefore be invaluable to the staff. Adding this information needs to become a process that is understood, respected and followed by the team.

8. Additional Preventative Tools
Understanding SLA commitment breaches, the percentage of unplanned work, the volume of effective change throughput, and server-to-admin ratios can also help a leader prevent downtime.

DOWNTIME

Definition
Downtime is a term used to describe when a service is unavailable to its intended recipients. While downtime can be planned months in advance, it is typically not and is often a surprise.

Most downtime events are unplanned. They are either caused by a failure or triggered on short notice in an attempt to fix a service that is not performing at its optimal level.


Signs & Symptoms
Downtime is the number one cause of financial harm, yet most IT leaders don't understand the signs and symptoms of an environment that experiences too much unplanned downtime.

Sure, it's easy to surmise that the systems are offline more than they should be, especially when management is enraged, but there are legitimate signs and symptoms which, once recognized, will allow you to reduce the frequency and impact of unplanned outages.

  • Unauthorized Changes
  • High amounts of Unplanned Work
  • Low Throughput of Effective Change
  • Server to Administrator Ratios < 100:1
  • Lack of Indicator Measurements
  • SLA Commitment Breaches
