AWS Elastic Disaster Recovery (AWS DRS)

AWS Elastic Disaster Recovery helps you recover quickly from failures by continuously replicating your servers’ data at the block level (the unit in which storage devices read and write data) and providing tools to restore your systems in AWS. It can achieve a Recovery Point Objective (RPO) of seconds and a Recovery Time Objective (RTO) of typically 5 to 20 minutes.

Recovery Point Objective (RPO)

  1. What is RPO?

    • RPO is the maximum amount of data loss you can tolerate if a disaster occurs. It’s expressed as time: the interval between the last recoverable copy of your data and the disaster.
    • For example, if your RPO is 5 minutes, the worst-case scenario is losing up to 5 minutes of data.
  2. How is RPO Measured?

    • RPO is determined by the latest point in time when data was successfully copied from your servers to AWS in a consistent state (meaning no data corruption).
  3. How Does AWS DRS Achieve an RPO of Seconds?

    • AWS DRS uses a replication agent that continuously watches for data changes and immediately copies new data to AWS. As a result, the copy in AWS usually lags the source by only a few seconds.
  4. Important Note on Crash-Consistent Recovery:

    • Data that hasn’t been saved from an application’s memory to storage won’t be recovered. Only data that was written to storage is guaranteed to be safe.
  5. Factors Impacting RPO:

    • Network Speed: Data must be copied at least as fast as it’s written. If your network can’t keep up, unreplicated data accumulates and the RPO increases.
    • Example: If your server writes 10 MB of data per second, your network needs to sustain at least 10 MB per second (about 80 Mbps) to maintain an RPO of seconds.
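
The network requirement above can be checked with simple arithmetic. The following is a minimal sketch, not part of AWS DRS: the function names and the constant-rate model (a steady write rate and steady usable bandwidth) are assumptions made for illustration. It estimates how much unreplicated data accumulates when writes outpace the network, and how many seconds of recent writes that backlog represents.

```python
def mb_per_s_to_mbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mb_per_s * 8


def replication_backlog_mb(write_rate_mb_s: float,
                           bandwidth_mb_s: float,
                           duration_s: float) -> float:
    """Data written but not yet replicated after duration_s seconds, in MB.

    Assumes constant rates (an illustrative simplification). The backlog
    only grows when the write rate exceeds the usable bandwidth.
    """
    deficit = write_rate_mb_s - bandwidth_mb_s
    return max(0.0, deficit * duration_s)


def replication_lag_s(backlog_mb: float, write_rate_mb_s: float) -> float:
    """Approximate age, in seconds, of the oldest unreplicated write.

    This is the data-loss window you would face if the source failed now.
    """
    if write_rate_mb_s <= 0:
        return 0.0
    return backlog_mb / write_rate_mb_s
```

With a 10 MB/s write rate and only 8 MB/s of usable bandwidth, the backlog grows by 2 MB every second. After one minute that is 120 MB, or roughly 12 seconds of writes that have not yet reached AWS, and the gap keeps widening for as long as the deficit lasts.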

Recovery Time Objective (RTO)

  1. What is RTO?

    • RTO is the maximum acceptable time to restore your system after a disaster. It’s measured from when recovery starts to when the system is up and running.
    • For example, if your RTO is 10 minutes, your system should be back online within 10 minutes of a failure.
  2. How is RTO Measured?

    • RTO starts when the recovery process begins and ends when the server is operational with network access in AWS.
  3. Factors Impacting RTO:

    • Operating System (OS) Type: Linux servers usually boot faster (typically within about 5 minutes) than Windows servers (which can take up to 20 minutes).
    • OS Configuration: Systems running more applications or services may take longer to boot.
    • Instance Performance: A higher-performance instance type (more CPU and RAM) generally shortens boot time and therefore recovery time.
    • Volume Performance: Storage volumes with more IOPS (input/output operations per second) generally shorten boot time as well.
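
The measurement window described above (from the start of recovery until the server is reachable over the network) can be timed with a small polling script. This is a sketch under stated assumptions, not an AWS DRS feature: the function name, the default timeout and poll interval, and the use of an open TCP port (for example SSH on 22 or RDP on 3389) as the signal that the server is operational are all illustrative choices.

```python
import socket
import time
from typing import Optional


def seconds_until_reachable(host: str,
                            port: int,
                            timeout_s: float = 1200.0,
                            poll_interval_s: float = 5.0) -> Optional[float]:
    """Poll a TCP port and return the seconds elapsed until it accepts a connection.

    Returns None if the port is still unreachable after timeout_s seconds.
    The clock starts when this function is called, so call it at the moment
    recovery is launched (an assumption of this sketch).
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            # A successful connect is treated as "server is up with network access".
            with socket.create_connection((host, port), timeout=poll_interval_s):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable, or timed out: wait and try again.
            time.sleep(poll_interval_s)
    return None
```

Comparing the returned value with your RTO target shows whether a recovery drill met the objective. A result of None means the server did not become reachable within the timeout, which in this sketch defaults to 20 minutes.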

Summary

  • RPO (Recovery Point Objective): Measures how much data you can afford to lose. AWS DRS aims for an RPO of seconds, meaning minimal data loss.
  • RTO (Recovery Time Objective): Measures how quickly you can restore operations. AWS DRS aims for an RTO of 5-20 minutes, meaning your system is typically back online within this timeframe once recovery begins.

Understanding these concepts and leveraging AWS DRS helps ensure your systems are resilient and can quickly recover from failures, minimizing both data loss and downtime.

For more details, you can refer to the AWS Elastic Disaster Recovery documentation.