Disaster Recovery Metrics

Unit: 7 Lesson: 4

I need a number that tells me how good we are at not imploding

We love a number that directly states how reliable a service on the Internet is.

Availability

Availability is the percentage of time that a system is online over a set period, usually a year. The opposite of availability is downtime. The main metric for downtime is Maximum Tolerable Downtime (MTD), which sets the upper limit on how long a service can be down. Downtime is the sum of all time lost to scheduled maintenance and unplanned outages.
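
The downtime budget for a given availability figure is plain arithmetic, so here is a minimal sketch of it. It assumes a 365-day year, and the function names `annual_downtime` and `availability` are made up for illustration.

```python
from datetime import timedelta

# Assumption: a 365-day year (no leap day).
YEAR = timedelta(days=365)

def annual_downtime(availability_percent: float) -> timedelta:
    """Return the downtime allowed per year at the given availability (%)."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return YEAR * (1 - availability_percent / 100)

def availability(downtime: timedelta, period: timedelta = YEAR) -> float:
    """Return availability (%) given the total downtime over a period."""
    return 100 * (1 - downtime / period)

# Print the yearly downtime budget for each "number of nines".
for nines in (99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{nines}% -> {annual_downtime(nines)}")
```

Going the other way, `availability(timedelta(hours=87, minutes=36))` comes out at roughly 99, which is the first row of the usual "nines" table.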

Availability (%)	Annual MTD
99	87h 36m
99.9	8h 45m 36s
99.99	52m 34s
99.999	5m 15s
99.9999	32s

Systems with almost no scheduled downtime or outages are extremely rare. They are said to have continuous availability. Continuous availability is needed not only in industrial settings but also anywhere a system failure could cause injury or loss of life. Such settings include:

life-support systems
air traffic control systems
communication satellites
networked self-driving vehicles
smart traffic signaling systems

Remember how MTD is the final upper limit for how long a service can remain unavailable due to scheduled maintenance or an unplanned outage? There are two more metrics that fit inside it. The Recovery Time Objective (RTO) states how long a service may remain unavailable following a disaster. RTO includes the time it takes to find out that something happened, confirm that it's actually a problem, and then recover by restoring from backup and/or switching to a backup system.

There's also the Work Recovery Time (WRT), which states how long it takes to reintegrate systems and do the other extra work needed to restore the service to working, operational condition.

The RTO and the WRT added together must not exceed the MTD!
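
That rule is easy to check with a few lines of code. This is only a sketch: the helper name `fits_within_mtd` and the hour figures are hypothetical, picked to show one plan that fits and one that doesn't.

```python
from datetime import timedelta

def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True when recovery time plus work recovery time fits inside the MTD."""
    return rto + wrt <= mtd

# Hypothetical figures for a single service.
mtd = timedelta(hours=8)
rto = timedelta(hours=5)

print(fits_within_mtd(rto, timedelta(hours=2), mtd))  # True: 5h + 2h = 7h, under 8h
print(fits_within_mtd(rto, timedelta(hours=4), mtd))  # False: 5h + 4h = 9h, over 8h
```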

There's also the Recovery Point Objective (RPO), which states how much data loss is acceptable for the system, measured in time. For example, if a virus destroyed a database, an RPO of 24 hours means the data can be recovered from a backup to a point no more than 24 hours before the loss. Anything written between that recovery point and the loss has to be accepted as lost or reconstructed. Pray you were using RAID.
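
To make the RPO idea concrete, here is a small sketch that measures the gap between the last good backup and the moment of loss, then compares it to the RPO. The backup schedule, the dates, and the names `data_loss_window` and `meets_rpo` are all assumptions made up for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable

def data_loss_window(backups: Iterable[datetime], failure: datetime) -> timedelta:
    """Time between the most recent backup before the failure and the failure."""
    earlier = [b for b in backups if b <= failure]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure - max(earlier)

def meets_rpo(backups: Iterable[datetime], failure: datetime, rpo: timedelta) -> bool:
    """True when the data lost since the last backup is within the RPO."""
    return data_loss_window(backups, failure) <= rpo

# Hypothetical: nightly backups at 02:00, database destroyed at 17:30 on day 3.
backups = [datetime(2024, 1, day, 2, 0) for day in (1, 2, 3)]
failure = datetime(2024, 1, 3, 17, 30)

print(data_loss_window(backups, failure))                # 15:30:00
print(meets_rpo(backups, failure, timedelta(hours=24)))  # True
```

With the same nightly schedule, an RPO of 12 hours would not be met, so a tighter RPO means backing up more often.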
