Businesses persistently struggle to determine and quantify an acceptable level of risk for their organization, and disaster recovery is no exception; RTO and RPO help reduce the ambiguity at an operational level, so IT teams have a pragmatic framework against which they can plan and execute.
Typically, organizations spend much of their time planning the “hows” of recovery: the technologies they’ll leverage, the features that technology includes, and the extent to which they can recover entire systems in one click. As a result, considerably less time is spent discussing the organizational ramifications of an exercise of this magnitude, including how much data to retain and how quickly it can be restored.
Key Terms Related to RTO and RPO
A number of key terms, including RTO and RPO, have emerged to help define business requirements and measure how well data protection solutions can satisfy these requirements. Note that these requirements will differ between applications.
Recovery Point Objective (RPO)
RPO defines how much data your business can afford to lose, measured in time (e.g., one hour’s worth of data). If a production system is impaired by data loss or data corruption, recovery is possible by reverting to a backup; RPO defines how far back you are willing to go, accepting the loss of all data written after the latest recovery point. This in turn defines the granularity, or frequency, of point-in-time copies. If your tolerance for data loss is low, you will need to increase the frequency of your backups and often dedicate more storage to house them.
RPO is derived directly from business requirements. Because different applications within the same organization carry different business value, RPO is fundamentally an application-specific attribute.
In determining RPO, companies should consider the risk of faulty backups. A single faulty point-in-time copy doubles the gap between usable recovery points, because recovery must fall back to the previous good copy. Companies should regularly test their backups to ensure that they are recoverable when needed. Even misconfigurations and lapsed licensing can wreak havoc on efforts to return to full production.
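The effect of faulty backups on the achievable recovery point can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names are hypothetical, and it assumes backups are taken at a fixed interval and that failures affect the most recent copies.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         consecutive_faulty: int = 0) -> timedelta:
    """Worst-case data loss when the newest `consecutive_faulty` backups are unusable.

    Assumes backups are taken at a fixed interval (an illustrative simplification).
    """
    if consecutive_faulty < 0:
        raise ValueError("consecutive_faulty must be non-negative")
    # Each unusable backup pushes the recovery point back by one full interval.
    return backup_interval * (consecutive_faulty + 1)


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              consecutive_faulty: int = 0) -> bool:
    """True if the worst-case loss still fits within the stated RPO."""
    return worst_case_data_loss(backup_interval, consecutive_faulty) <= rpo
```

With hourly backups and a one-hour RPO, `meets_rpo(timedelta(hours=1), timedelta(hours=1))` holds, but a single faulty backup doubles the worst-case loss to two hours and the objective is missed.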
Recovery Time Objective (RTO)
RTO defines how much time the organization can afford to lose after a disaster strikes until it is back in business. Generally, this covers the entire time it takes to become operational again. Depending on the disaster and protection scenario, it is composed of multiple factors, many of which are often overlooked:
- Disaster Declaration: Who in the organization is authorized to declare a disaster and commence recovery? What measures must they take before the red button is pushed?
- System Setup: In a disaster scenario, the production site is impaired. Consider the time it takes to set up an operational system at a secondary site.
- Recovery Execution: How long will it take to get the right people to execute recovery?
- Backup Access: How much time would it take to gain access to the backup data? Is it online, or does it require physical travel? If it is stored at a remote site, how do you gain connectivity if your primary site is down?
- Transfer: Add the time it takes to transfer the data. If the data is stored at the same site, transferring a 100 GB dataset takes about 1.5 minutes over a modern 10GbE network and nearly 15 minutes over a 1GbE network, allowing for some protocol overhead.
- System Restart: Take into account the time it takes to restart servers, launch applications, and load the data into production.
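The transfer figures in the list above can be checked with simple arithmetic. The sketch below is illustrative only: the function name and the `efficiency` parameter are assumptions introduced here, and real throughput also depends on storage speed, protocol, and contention.

```python
def transfer_time_seconds(dataset_gb: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Estimate the time to move `dataset_gb` gigabytes over a `link_gbps` link.

    `efficiency` is an assumed fraction of line rate actually achieved
    (1.0 means the theoretical maximum).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    dataset_gigabits = dataset_gb * 8  # 1 byte = 8 bits
    return dataset_gigabits / (link_gbps * efficiency)


# 100 GB at line rate, then at an assumed 90% efficiency.
for gbps in (10, 1):
    ideal = transfer_time_seconds(100, gbps)
    realistic = transfer_time_seconds(100, gbps, efficiency=0.9)
    print(f"{gbps} Gbps: {ideal / 60:.1f} min ideal, {realistic / 60:.1f} min at 90%")
```

At line rate the transfer takes 80 seconds on 10GbE and 800 seconds on 1GbE; an assumed 90% efficiency brings those to roughly 1.5 and 14.8 minutes, in line with the figures quoted above.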
It is important for businesses to analyze how each of these activities affects their recovery process so they can establish a realistic RTO. Companies that do not plan these processes correctly end up organizing and defining an action plan in real time, which means their actual recovery time won’t meet the designated objective.
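One way to make this analysis concrete is to add up an estimate for each phase and compare the total against the objective. The sketch below assumes the phases run one after another, and every duration in it is a hypothetical placeholder rather than a measured value; the names are introduced here for illustration.

```python
from datetime import timedelta

# Hypothetical per-phase estimates; replace with figures from your own recovery drills.
RECOVERY_PHASES = {
    "disaster_declaration": timedelta(minutes=30),
    "system_setup": timedelta(minutes=60),
    "recovery_execution": timedelta(minutes=45),
    "backup_access": timedelta(minutes=20),
    "transfer": timedelta(minutes=15),
    "system_restart": timedelta(minutes=30),
}


def estimated_recovery_time(phases: dict) -> timedelta:
    """Total recovery time, assuming the phases run sequentially."""
    return sum(phases.values(), timedelta())


def rto_shortfall(phases: dict, rto: timedelta) -> timedelta:
    """How far the estimate exceeds the RTO (zero if the RTO is met)."""
    return max(estimated_recovery_time(phases) - rto, timedelta())
```

With these placeholder numbers the estimate is 3 hours 20 minutes, so a 4-hour RTO is met while a 3-hour RTO is missed by 20 minutes.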
Technical RTO (TRTO)
Following this analysis, businesses can zoom in to correctly identify the Technical RTO. This refers to the time consumed within the boundaries of the data protection solution. Steps in this phase may include:
- Spinning up a new set of VMs hosting the application.
- Configuring the VMs correctly and establishing communication.
- Transferring the data from the backup medium to the production storage system.
- Launching the applications and loading the recovered data.
Any relaxation of the TRTO must translate into cost savings, for example by auto-tiering backup storage from SSDs to spinning disks.
When referring to RTO, keep in mind the difference between the overall recovery process and the technical recovery phase. Have a crisp definition of your TRTO, and make it clear which RTO you are referring to.
Retention Period
The retention period defines how long a business requires data copies to be stored before they may (or must) be discarded. Like RPO and RTO, retention periods are application-specific. In addition, the business value of data often decreases with time: the older the data, the less valuable it becomes. Recovery requirements for older data may therefore be relaxed accordingly. These business requirements may be captured using time tiers, as defined below.
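As a small illustration of how a retention period translates into a pruning decision, the sketch below splits a list of backup timestamps into those still within retention and those eligible for discarding. The function name and its arguments are hypothetical and not tied to any particular backup product.

```python
from datetime import datetime, timedelta


def split_by_retention(backup_times, retention: timedelta, now: datetime):
    """Return (keep, discard) lists of backup timestamps.

    A backup is kept while its age is within the retention period.
    """
    keep, discard = [], []
    for taken_at in backup_times:
        if now - taken_at <= retention:
            keep.append(taken_at)
        else:
            discard.append(taken_at)
    return keep, discard
```

Whether an expired copy may be discarded or must be discarded is a policy decision that code like this cannot make on its own.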
Service-Level Agreement (SLA) Tiers
Because requirements change as data ages, retention periods, TRTO, RTO, and RPO are specified using time tiers. These may be referred to as SLA tiers (SLA is, unfortunately, an overused term). For example, an organization may require the following tiers for its MySQL application:
|Data age|TRTO|RPO|What this means|
|---|---|---|---|
|<24 hours|5 minutes|1 hour|For the running 24 hours, your business application will be operational within minutes, utilizing hourly point-in-time backups.|
|1-30 days|1 hour|1 hour|For the rest of the month, you want to be able to complete technical recovery within an hour to the nearest hour.|
|30+ days|24 hours|24 hours|For anything older than a month, you will be able to restore a point-in-time from a particular day from your archiving system within 24 hours.|
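The tiers in the example could be captured as a small data structure and looked up by the age of the data being recovered. This is a sketch under stated assumptions: the row descriptions suggest the second column is the TRTO and the third is the RPO, and a copy exactly 30 days old is placed in the last tier. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaTier:
    max_age: Optional[timedelta]  # exclusive upper bound; None means no upper bound
    trto: timedelta               # technical recovery time objective
    rpo: timedelta                # recovery point objective


# Values taken from the example tiers for the MySQL application.
MYSQL_TIERS = [
    SlaTier(timedelta(hours=24), trto=timedelta(minutes=5), rpo=timedelta(hours=1)),
    SlaTier(timedelta(days=30), trto=timedelta(hours=1), rpo=timedelta(hours=1)),
    SlaTier(None, trto=timedelta(hours=24), rpo=timedelta(hours=24)),
]


def tier_for_age(age: timedelta, tiers=MYSQL_TIERS) -> SlaTier:
    """Return the first tier whose age bound covers `age` (tiers sorted youngest first)."""
    for tier in tiers:
        if tier.max_age is None or age < tier.max_age:
            return tier
    raise ValueError("no tier covers this data age")
```

For example, `tier_for_age(timedelta(days=10))` returns the middle tier, with a one-hour TRTO and a one-hour RPO.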
Consider specific use cases to help you define these requirements. The first tier in the above example addresses a common data damage scenario, such as when a VM is accidentally deleted, a file or folder is overwritten, or a database is corrupted. Unlike in a disaster scenario, all systems remain operational. The business can sustain some data loss but requires minimal downtime. For this scenario to be practical, there must be no operational overhead. This means that end users (tenants in a cloud environment) must be able to execute the recovery on their own, without administrative assistance.
Less demanding SLAs allow for cost reduction through the use of slower, lower-cost storage media or facilities. Your organization must become comfortable with the trade-offs of losing more data or having longer downtime after a disaster strikes.
The Impact of RTO and RPO
One of the (many) reasons Trilio supports tenant-driven recovery workflows is that they allow organizations to trim their RTO while still mandating SLA-driven RPOs at a management level. This balance allows administrators to define a protection schedule and policy for each workload, while giving tenants control to manage and restore point-in-time backups without requiring intervention.
Defining RTO and RPO helps to strike a balance between disaster preparation and cost efficiency, while ensuring the availability of the critical data needed to run the business. Data loss may occur even when the infrastructure is uninterrupted, and preparedness is yet another tool at our disposal to limit and mitigate the potential negative impacts of unexpected data loss.