1. Establish a Proactive Maintenance Schedule
Every operating system, database and application requires routine maintenance tasks. If these tasks are ignored, issues will arise, especially on a Windows platform. A proactive maintenance schedule is the most cost-effective way to execute these tasks and minimize the risk of unplanned downtime. Whether done in-house or with the help of an outside support partner, establishing this program requires the following steps:
1. Inventory all equipment and software assets, categorizing them according to their importance.
2. Understand the tasks and categorize them by when they need to be performed: daily, weekly, monthly, quarterly, biannually and annually.
3. Create procedures.
4. Train your team on the new procedures.
5. Track, adjust and improve over time.
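As a rough illustration of step 2, the sketch below groups tasks by frequency and reports which ones are due. The task names, the tuple layout and the day counts per frequency (for example, 30 days for "monthly") are assumptions made for the example, not part of any particular tool.

```python
from datetime import date, timedelta

# Approximate interval, in days, for each frequency named in step 2.
# These day counts are illustrative assumptions.
INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "biannual": 182,
    "annual": 365,
}

def due_tasks(tasks, today):
    """Return the names of tasks whose next run date is on or before today.

    tasks: iterable of (name, frequency, last_done) tuples, where
    frequency is a key of INTERVAL_DAYS and last_done is a date.
    """
    due = []
    for name, frequency, last_done in tasks:
        next_run = last_done + timedelta(days=INTERVAL_DAYS[frequency])
        if next_run <= today:
            due.append(name)
    return due

# Hypothetical task list for demonstration only.
EXAMPLE_TASKS = [
    ("verify backups", "daily", date(2024, 2, 29)),
    ("rebuild indexes", "weekly", date(2024, 2, 20)),
    ("apply OS patches", "monthly", date(2024, 2, 15)),
]
```

Calling due_tasks(EXAMPLE_TASKS, date(2024, 3, 1)) would list the daily and weekly tasks but not the monthly one, which is not due until mid-March under the 30-day assumption.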
2. Execute Pre-Business System Checks
Checking critical business systems before production hours begin is a cost-effective way to ensure services are available and ready for daily operations. These checks also help you detect unplanned outages or other unknown errors, and they give the IT team a head start on correcting any problems before they have a negative effect on the organization.
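A pre-business check can be as simple as confirming that each critical service accepts a connection. The sketch below assumes every service can be probed over TCP; the service names, hosts and ports are hypothetical placeholders, and a real check would usually also verify logins, batch job results and so on.

```python
import socket

def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

def run_checks(services, probe=check_tcp):
    """Probe every (name, host, port) entry and return the names that failed."""
    return [name for name, host, port in services if not probe(host, port)]

# Hypothetical list of critical services; replace with your own inventory.
SERVICES = [
    ("erp-app", "erp.example.internal", 443),
    ("erp-db", "db.example.internal", 1433),
]
```

Scheduling run_checks(SERVICES) shortly before production hours and alerting on a non-empty result gives the team the head start described above. The probe parameter lets the same loop be reused with a different kind of check.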
3. Implement Measurements & Indicators
Effective IT leaders use four main indicators to measure and reduce the frequency and impact of unplanned downtime on their organization: Availability, Reliability, Maintainability and Serviceability. Measuring these indicators and using them to make management decisions is a sign of a leader who has their house in order.
Availability: If this measurement is missing, unplanned downtime is probably happening more often than it should. Effective IT leaders know it is critical to measure Availability, or "up-time", because it becomes an objective tool to lean on when under duress. Availability is calculated as the number of hours IT has committed to having the service available (A), minus the number of hours it was not available (B), divided by the committed hours (A), times 100. The formula is [(A - B) / A] * 100. It is expressed as a percentage, e.g. 99.999%.
Reliability: How long systems run before the next failure occurs is typically overlooked by an overburdened IT leader. This measurement is also known as Mean Time Between Failures (MTBF), and it speaks volumes about an IT department's performance.
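The Availability formula translates directly into code. The sketch below is a minimal illustration; the function names are made up for this example, and the second function simply rearranges the same formula to show how much downtime a given target allows.

```python
def availability_percent(committed_hours, unavailable_hours):
    """Availability = (A - B) / A * 100, with A committed and B unavailable hours."""
    if committed_hours <= 0:
        raise ValueError("committed_hours must be positive")
    if not 0 <= unavailable_hours <= committed_hours:
        raise ValueError("unavailable_hours must be between 0 and committed_hours")
    return (committed_hours - unavailable_hours) / committed_hours * 100

def allowed_downtime_hours(committed_hours, target_percent):
    """Hours of downtime permitted (B) for a target availability percentage."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return committed_hours * (1 - target_percent / 100)
```

For example, a service committed for 720 hours in a month that was unavailable for 3.6 hours works out to 99.5% availability. Going the other way, a 99.999% target over an 8,760-hour year leaves roughly 0.0876 hours, a little over five minutes, of allowable downtime.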
Maintainability: This is another indicator of an IT department's capabilities. This measurement is also called Mean Time To Repair (MTTR) and asks the question - On average, how long does it take your IT team to resolve a failure? Did you know 80% of the time spent during this downtime is wasted on unproductive activities? Effectively managing this time is key.
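Both MTBF and MTTR can be derived from a simple outage log. The sketch below assumes outages are recorded as (start, end) timestamps inside an observation window, and it uses one common convention: MTBF is uptime divided by the number of failures, and MTTR is total downtime divided by the number of failures. The function name and log layout are illustrative.

```python
from datetime import datetime

def mtbf_mttr_hours(window_start, window_end, outages):
    """Return (MTBF, MTTR) in hours for an observation window.

    outages: list of (start, end) datetime pairs, each inside the window.
    MTBF = (window length - total downtime) / number of failures
    MTTR = total downtime / number of failures
    """
    if not outages:
        raise ValueError("at least one outage is needed to compute MTBF and MTTR")
    window_hours = (window_end - window_start).total_seconds() / 3600
    down_hours = sum((end - start).total_seconds() for start, end in outages) / 3600
    failures = len(outages)
    return (window_hours - down_hours) / failures, down_hours / failures

# Hypothetical 30-day window with two outages (2 hours and 4 hours).
EXAMPLE_OUTAGES = [
    (datetime(2024, 4, 5, 9, 0), datetime(2024, 4, 5, 11, 0)),
    (datetime(2024, 4, 20, 14, 0), datetime(2024, 4, 20, 18, 0)),
]
```

With the example log and a window from April 1 to May 1, the 720-hour window contains 6 hours of downtime across two failures, giving an MTBF of 357 hours and an MTTR of 3 hours.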
Serviceability: This measures your vendor's ability to meet its Service Level Agreements (SLAs), which cover availability, reliability and maintainability. If you do not measure your vendor, how do you know how well it is performing? SLA breaches happen more often than you think, and a reactive environment will not catch them.
4. Adhere To A Strict Change Management Process
A management-sanctioned change management policy needs to be implemented, which means no change goes into the production environment unless it is approved by a Change Advisory Board. Unauthorized changes must stop!
All changes must follow a predefined change management process that includes a proven rollback procedure, and there must be consequences for implementing unapproved changes.
5. Create & Maintain Standard Operating Procedures
SOPs increase availability. Since 70% of outages are caused by humans, SOPs can play a major role in reducing mistakes:
• The creation and maintenance of an SOP library will ensure that procedures can be followed consistently.
• A defined SOP change management process ensures that any changes to SOPs are understood by the team. At Allari, we use a BPC (Business Process Change) methodology to implement our customers' business process changes over multiple global shifts which facilitates rapid awareness and adoption. Generally, this allows us to get a process change request disseminated to the global team in less than 24 hours.
6. Activate Event Monitoring
Automated monitoring tools can report specific critical events, providing advance warning of an unplanned outage that is about to occur. They also allow you to respond to an outage before users even have to report it.
• Monitoring memory utilization can prevent downtime. Constantly pushing the 90% threshold is a sure sign that the IT asset is becoming overloaded.
• Monitoring disk space can prevent downtime and data loss. Server disk drives with over 90% utilization are usually a sign that the system will soon run out of storage, especially on database and backup servers.
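The disk space check above can be sketched with the Python standard library alone. The 90% threshold comes from the text; the function names are illustrative, and a real monitoring tool would also handle alert routing and repeated sampling. Memory utilization is not covered here because the standard library has no portable call for it.

```python
import shutil

THRESHOLD_PERCENT = 90.0  # utilization level treated as a warning sign

def percent_used(used, total):
    """Return utilization as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used * 100 / total

def is_over_threshold(used, total, threshold=THRESHOLD_PERCENT):
    """Return True when utilization is above the threshold."""
    return percent_used(used, total) > threshold

def check_disk(path, threshold=THRESHOLD_PERCENT):
    """Return (percent_used, alert) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.used, usage.total)
    return pct, pct > threshold
```

Running check_disk on each monitored volume at a fixed interval, and raising an event when the alert flag is set, gives the early warning described above before the volume actually fills.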
7. Maintain A Knowledge Repository
Unplanned outages will occur. Learning from them is critical to reducing the impact they may have the second time. Thus, the creation and maintenance of a knowledge repository which stores the root cause and corrective action(s) taken to end the outage will be invaluable to the staff. Adding this information needs to become a process understood, respected and followed by the team.
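One lightweight way to start such a repository is a structured record per outage. The sketch below is only an illustration: the field names and the in-memory list are assumptions, and most teams would keep these records in a ticketing system or wiki instead.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class OutageRecord:
    """One knowledge repository entry for a resolved outage."""
    system: str
    summary: str
    root_cause: str
    corrective_actions: List[str] = field(default_factory=list)

def find_by_system(records, system):
    """Return past outage records for a system (case-insensitive match)."""
    return [r for r in records if r.system.lower() == system.lower()]

def to_json_lines(records):
    """Serialize records as one JSON object per line for simple storage."""
    return "\n".join(json.dumps(asdict(r)) for r in records)

# Hypothetical entry for demonstration only.
EXAMPLE_RECORDS = [
    OutageRecord(
        system="erp-db",
        summary="Database stopped accepting connections",
        root_cause="Transaction log volume ran out of disk space",
        corrective_actions=["Freed log space", "Added disk utilization alert"],
    ),
]
```

When the same system fails again, find_by_system(EXAMPLE_RECORDS, "ERP-DB") returns the earlier root cause and corrective actions so the team does not start from scratch.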
8. Additional Preventive Tools
SLA commitment breaches, the percentage of unplanned work, the volume of effective change throughput and server-to-admin ratios are also indicators that can help a leader prevent downtime.