1. Set Preventative Maintenance Schedules
Every operating system, database and application requires regular maintenance tasks. If these tasks are ignored, issues will arise, especially on a Windows platform. If you have the internal capacity to track these tasks and execute them when they come due, follow these steps:
1. Inventory all equipment and software assets, categorizing them according to their importance
2. Understand the tasks and categorize them by when they need to be performed: daily, weekly, monthly, quarterly, bi-annual and annual
3. Create Procedures. Train your team
4. Create schedules
5. Track, adjust and improve over time
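The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hypothetical `MaintenanceTask` record and approximate day counts for each frequency ("bi-annual" is assumed here to mean twice a year). The asset names and fields are invented for the example and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Approximate interval in days for each frequency named in step 2.
# Assumption: "bi-annual" means twice a year.
INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "bi-annual": 182,
    "annual": 365,
}


@dataclass
class MaintenanceTask:
    """Hypothetical record for one maintenance task on one asset."""
    asset: str
    name: str
    frequency: str      # one of the keys in INTERVAL_DAYS
    importance: int     # 1 = most critical asset (step 1)
    last_done: date

    def next_due(self) -> date:
        """Date on which the task should next be performed."""
        return self.last_done + timedelta(days=INTERVAL_DAYS[self.frequency])


def due_tasks(tasks, today):
    """Return tasks due on or before `today`, most important asset first."""
    due = [t for t in tasks if t.next_due() <= today]
    return sorted(due, key=lambda t: (t.importance, t.next_due()))
```

Running `due_tasks` each morning against the full task list gives the team a prioritized to-do list, and the `last_done` dates provide the history needed for step 5.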
2. Execute Pre-Business System Checks
Checking the critical systems before business hours begin is a cost-effective way to ensure systems are ready for the business. Many outages have been detected this way, giving the IT team a head start on restoring services before the outage has a negative effect on the organization.
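As an illustration only, a pre-business check can be as simple as confirming that each critical service accepts a network connection. The sketch below assumes the services are reachable over TCP. The host names and ports in `CRITICAL_SERVICES` are hypothetical placeholders, and a real check would usually go further (logging in, running a test transaction).

```python
import socket

# Hypothetical list of critical services: (label, host, TCP port).
CRITICAL_SERVICES = [
    ("ERP web front end", "erp.example.internal", 443),
    ("Database listener", "db.example.internal", 1521),
]


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def run_checks(services, check=is_reachable):
    """Return the labels of all services that failed the check."""
    return [label for label, host, port in services if not check(host, port)]
```

Scheduling `run_checks(CRITICAL_SERVICES)` an hour or so before opening time, and alerting on a non-empty result, gives the team the head start described above.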
3. Implement Measurements & Indicators
High-powered IT leaders use four main indicators to measure, and ultimately reduce, the frequency and impact of downtime on their organization: Availability, Reliability, Maintainability and Serviceability. Measuring these indicators and using them to make management decisions is a sign of a leader who has their house in order.
1. Availability - The lack of this measurement is a sign that unplanned downtime is probably happening more than it should. High-powered IT leaders know it is critical to measure Availability, or uptime, because it becomes an objective tool to be leveraged when under duress. Availability is calculated as the number of hours IT has committed to having the service available (A), minus the number of hours it was not available (B), divided by the committed hours (A), times 100. The formula looks like this: [(A-B)/A * 100]. It is communicated as a percentage, e.g. 99.999%.
2. Reliability - How long their systems run before the next failure occurs is typically unimportant to an overburdened IT leader. This measurement is also known as Mean Time Between Failures (MTBF). It speaks volumes about an IT department's performance.
3. Maintainability - Maintainability is another indicator of an IT department's capabilities. This measurement is also called Mean Time To Repair (MTTR) and asks the question - On average, how long does it take your IT team to resolve a failure? Did you know 80% of the time spent during this downtime is wasted on unproductive activities? Effectively managing this time is key.
4. Serviceability - How about your vendor's ability to meet their Service Level Agreement? This is called Serviceability, and the SLAs include availability, reliability and maintainability. If you do not measure your vendor, how do you know how well they are performing? SLA commitment breaches happen more often than you think. A reactive environment will not catch these types of failures.
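The first three indicators reduce to simple arithmetic. The sketch below follows the availability formula given above, [(A-B)/A * 100]. For MTBF and MTTR it assumes the common definitions (operating hours divided by number of failures, and average repair duration), which the text does not spell out. Function names are illustrative.

```python
def availability_pct(committed_hours, downtime_hours):
    """Availability = (A - B) / A * 100, where A is committed hours
    and B is the hours the service was not available."""
    if committed_hours <= 0:
        raise ValueError("committed_hours must be positive")
    return (committed_hours - downtime_hours) / committed_hours * 100


def mtbf_hours(operating_hours, failure_count):
    """Mean Time Between Failures, assuming the common definition:
    hours the system was running divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return operating_hours / failure_count


def mttr_hours(repair_durations):
    """Mean Time To Repair: average length of the given repairs, in hours."""
    if not repair_durations:
        raise ValueError("at least one repair duration is required")
    return sum(repair_durations) / len(repair_durations)
```

For example, a service committed for 720 hours in a month with three outages of 1.5, 0.5 and 2.0 hours has 4 hours of downtime: availability is 716/720, about 99.44%, MTBF is 716/3, about 238.7 hours, and MTTR is 4/3, about 1.33 hours. The same functions can be applied to a vendor's reported figures to check them against the SLA.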
4. Adhere To A Strict Change Management Process
1. A management-sanctioned change management policy needs to be implemented. This means no changes are made to the production environment unless approved by a Change Advisory Board. Unauthorized changes must stop!
2. All changes must follow a predefined change management process which includes a proven rollback process.
3. There must be consequences for implementing unapproved changes.
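One way to picture these rules is as a gate that every change must pass before it reaches production. The following is a rough sketch under assumed names: the `ChangeRequest` fields are hypothetical and do not correspond to any specific change management tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative."""
    change_id: str
    description: str
    cab_approved: bool = False      # approved by the Change Advisory Board
    rollback_plan: str = ""         # documented rollback steps
    rollback_tested: bool = False   # rollback has been proven to work


def blocking_reasons(change):
    """Return the reasons this change may not go to production (empty if none)."""
    reasons = []
    if not change.cab_approved:
        reasons.append("not approved by the Change Advisory Board")
    if not change.rollback_plan.strip():
        reasons.append("no rollback plan documented")
    elif not change.rollback_tested:
        reasons.append("rollback plan has not been proven")
    return reasons


def may_deploy(change):
    """True only when nothing blocks the change."""
    return not blocking_reasons(change)
```

Recording the output of `blocking_reasons` for every attempted deployment also leaves an audit trail, which supports the third rule.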
5. Create & Maintain Standard Operating Procedures
SOPs increase availability. Since 80% of outages are caused by humans, SOPs can play a major role in reducing mistakes:
1. The creation and maintenance of an SOP library will ensure that procedures can be followed consistently.
2. An SOP change management process ensures that changes to SOPs are understood by the team. The process will ensure urgent changes to any operational processes are quickly understood and executed accordingly by the team. At Allari, we have a BPC (Business Process Change) methodology to implement our customers' business process changes over multiple global shifts to facilitate the quick adoption of our customers' processes. We can usually get a process change request published to the global team in less than 24 hours.
6. Activate Event Monitoring
Automated monitoring tools can communicate specific critical events, providing advance warning of an outage that is about to occur. They also allow you to respond to an outage before the users have a chance to report it.
1. Monitoring memory utilization can prevent downtime. Constantly pushing the 90% threshold is a sure sign that the IT asset is becoming overloaded.
2. Monitoring disk space can prevent downtime and data loss. Server disk drives with over 90% utilization are usually a sign that the system will soon run out of storage, especially on database & backup servers.
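A threshold check of this kind is straightforward to sketch. The example below reads disk utilization with Python's standard `shutil.disk_usage` and flags readings at or above the 90% mark mentioned above. Memory readings would have to come from a monitoring agent or a third-party library, so they are passed in as plain numbers here. The function names and the three-sample rule for "constantly" are assumptions made for illustration.

```python
import shutil

THRESHOLD_PCT = 90.0  # the 90% mark mentioned in the text


def disk_percent_used(path="/"):
    """Percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def over_threshold(readings, threshold=THRESHOLD_PCT):
    """Given {name: percent_used}, return the names at or above threshold."""
    return [name for name, pct in readings.items() if pct >= threshold]


def sustained(samples, threshold=THRESHOLD_PCT, min_samples=3):
    """True if the most recent `min_samples` readings are all at or above
    threshold, i.e. the asset is constantly pushing the limit rather
    than spiking once. The choice of three samples is an assumption."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(s >= threshold for s in recent)
```

A scheduled job could collect `disk_percent_used` for each volume, pass the results to `over_threshold`, and raise an alert before the disk actually fills.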
7. Maintain A Knowledge Repository
Outages will occur. Learning from them is critical to reducing the impact they may have the second time. A knowledge repository which stores the cause of each outage and its resolution will therefore be invaluable to the staff. Adding this information needs to become a process understood, respected and followed by the team.
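A knowledge repository does not need to be elaborate to be useful. As a minimal sketch, the example below stores each outage's cause and resolution in a SQLite database using Python's standard `sqlite3` module. The table layout and function names are assumptions made for illustration, and a wiki or ticketing system could serve the same purpose.

```python
import sqlite3


def open_repository(path=":memory:"):
    """Open (or create) the repository; ':memory:' keeps it in RAM for demos."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outages ("
        "id INTEGER PRIMARY KEY, occurred_on TEXT, system TEXT, "
        "cause TEXT, resolution TEXT)"
    )
    return conn


def record_outage(conn, occurred_on, system, cause, resolution):
    """Store the cause and resolution of one outage."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO outages (occurred_on, system, cause, resolution) "
            "VALUES (?, ?, ?, ?)",
            (occurred_on, system, cause, resolution),
        )


def find_resolutions(conn, keyword):
    """Return past outages whose system or cause mentions `keyword`."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT occurred_on, system, cause, resolution FROM outages "
        "WHERE cause LIKE ? OR system LIKE ? ORDER BY occurred_on DESC",
        (pattern, pattern),
    )
    return rows.fetchall()
```

When a new outage hits, a quick `find_resolutions(conn, "disk")` style lookup lets the team see how a similar failure was resolved last time.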
8. Additional Preventative Tools
SLA commitment breaches, percentage of unplanned work, volume of effective change throughput and server-to-admin ratios are additional indicators which can help a leader prevent downtime.