Extreme Reliability
Use the principles of high availability and disaster recovery to design a cloud architecture for extreme reliability.
High availability (HA) systems are designed to maximize uptime and accessibility. To achieve HA, eliminate single points of failure so that the application remains running and available even if individual components fail.
A well-architected disaster recovery (DR) plan enables you to recover quickly from disasters and continue to provide services to your users. DR is the process of preparing for and recovering from a disaster. A disaster can be any event that puts your applications at risk, from network outages to equipment and application failures to natural disasters. To design for DR, deploy your mission-critical applications to multiple regions and use asynchronous replication across regions. Plan for DR at all layers of the stack, including Networking, Compute, Object Storage, Database, and Monitoring.
Architecture Recommendations for Extreme Reliability
We recommend a phased approach for extreme reliability. In the first phase, deploy an architecture that provides HA capabilities by leveraging the fault domains within an availability domain. If more resiliency is needed, in the second phase, deploy an architecture that spans multiple regions.
For more information about regions, availability domains, and fault domains, see Regions and Availability Domains.
Phase 1: Distribute Instances Across Fault Domains
High availability systems are designed to avoid single points of failure. One of the key principles for designing a high availability system is to distribute instances across multiple fault domains. By properly leveraging fault domains, you can increase the availability of applications running on Oracle Cloud Infrastructure.
Your application's architecture determines whether you separate or group instances by using fault domains.
Scenario A: Highly Available Application Architecture
In this scenario, you have a highly available application; for example, two web servers and a clustered database. Group one web server and one database node in one fault domain, and place the other web server and database node in another fault domain. This placement ensures that a failure of any one fault domain does not result in an outage for your application.
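The placement described in Scenario A can be sketched as a round-robin assignment. This is a minimal illustration, not an Oracle Cloud Infrastructure API: the instance names, the tier dictionary, and both helper functions are hypothetical, and the fault domain labels are used only as example strings.

```python
from itertools import cycle


def place_across_fault_domains(tiers, fault_domains):
    """Assign each tier's instances to fault domains in round-robin order.

    The cycle restarts for every tier so that each tier is spread
    across fault domains independently of the others.
    """
    placement = {}
    for instances in tiers.values():
        fd_cycle = cycle(fault_domains)
        for instance in instances:
            placement[instance] = next(fd_cycle)
    return placement


def survives_single_fd_failure(tiers, placement, fault_domains):
    """Return True if every tier keeps at least one instance when any one fault domain fails."""
    for failed in fault_domains:
        for instances in tiers.values():
            if all(placement[i] == failed for i in instances):
                return False
    return True


# Hypothetical application: two web servers and a two-node clustered database.
tiers = {"web": ["web-1", "web-2"], "db": ["db-1", "db-2"]}
fault_domains = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2"]

placement = place_across_fault_domains(tiers, fault_domains)
for instance, fd in placement.items():
    print(f"{instance} -> {fd}")
print("Survives any single fault domain failure:",
      survives_single_fd_failure(tiers, placement, fault_domains))
```

Each fault domain ends up holding one web server and one database node, so losing either fault domain leaves a complete half of the stack running.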
Scenario B: Single Web Server and Database Instance Architecture
In this scenario, your application architecture is not highly available; for example, you have one web server and one database instance. Place both the web server and the database instance in the same fault domain to minimize customer outages. With this placement, your application is affected only by a failure of that single fault domain. If the two instances were in different fault domains, a failure of either fault domain would cause an outage, so co-locating them provides greater application availability overall.
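To see why co-location helps in Scenario B, you can count which single fault domain failures would take the application down under each placement. The sketch below assumes an application that needs every component to be running and an availability domain with three fault domains; the function name and placement dictionaries are hypothetical.

```python
def outage_causing_fault_domains(placement, fault_domains):
    """List the fault domains whose failure alone takes the application down.

    Assumes a non-redundant application: it needs every component,
    so losing any fault domain that hosts a component is an outage.
    """
    used = set(placement.values())
    return [fd for fd in fault_domains if fd in used]


FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]

# Both instances in one fault domain (the recommended placement).
colocated = {"web": "FAULT-DOMAIN-1", "db": "FAULT-DOMAIN-1"}
# Instances split across two fault domains.
split = {"web": "FAULT-DOMAIN-1", "db": "FAULT-DOMAIN-2"}

print(outage_causing_fault_domains(colocated, FAULT_DOMAINS))
print(outage_causing_fault_domains(split, FAULT_DOMAINS))
```

The co-located placement is exposed to one fault domain, while the split placement is exposed to two, which doubles the number of single failures that can cause an outage.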
Composite SLAs
When services are used in combination, the overall system availability depends on the availability of each of the subsystems. To maximize the availability of a system with multiple subcomponents, you should minimize the dependencies that the subcomponents have on each other. This means that, depending on your application architecture, you might achieve the greatest reliability for a given amount of engineering effort by leveraging the fault domains within an availability domain, rather than spanning your resources across availability domains.
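As a rough illustration of how dependencies affect composite availability, the sketch below applies the standard formulas for independent components: availabilities multiply when every component is required, and unavailabilities multiply when components are redundant. The availability figures are made-up example values, not published SLA numbers, and the independence assumption might not hold in practice.

```python
import math


def serial_availability(*availabilities):
    """Availability of a system that needs ALL components (independent failures assumed)."""
    return math.prod(availabilities)


def redundant_availability(*availabilities):
    """Availability of a system that needs ANY ONE component (independent failures assumed)."""
    return 1 - math.prod(1 - a for a in availabilities)


# Hypothetical example values, not real SLA figures.
web = 0.999
database = 0.9995

# The application depends on both the web tier and the database.
two_dependencies = serial_availability(web, database)

# Adding a third hard dependency lowers the composite figure further.
three_dependencies = serial_availability(web, database, 0.999)

# Two redundant web servers raise the availability of the web tier.
redundant_web = redundant_availability(web, web)

print(f"web + database:            {two_dependencies:.7f}")
print(f"web + database + one more: {three_dependencies:.7f}")
print(f"redundant web tier:        {redundant_web:.6f}")
```

Every added hard dependency can only lower the composite figure, which is why minimizing dependencies between subcomponents matters as much as adding redundancy.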
Phase 2: Deploy Resources in Multiple Regions
To maximize the resiliency of your workloads, deploy them across multiple regions rather than across multiple availability domains.
Deploying to multiple regions lets you minimize the risks associated with a regional outage, because a regional outage can affect all availability domains in the region. Deploying to multiple regions also maximizes the value of the engineering effort that you invest in porting your workloads across multiple data centers.
Scenario C: Multiple Region Architecture
In this scenario, your architecture replicates the same stack across two regions.
To provide a consistent data source in both regions, use replication capabilities such as GoldenGate for the data layer, Autonomous Data Guard for the database layer, and Object Storage replication policies on the source bucket that identify the region and the bucket to replicate to.
For the front-end and application layer, create a load balancer and configure health check capabilities on the backend resources that are deployed across fault domains in both regions. Every time you deploy a new application to production, deploy the application to instances in both regions.
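The health-check behavior described for Scenario C can be sketched as a filter over backends in both regions. This is a simplified model, not the load balancer's actual algorithm: the Backend class, the routable_backends function, and the policy of preferring the local region before falling back to the other region are assumptions made for illustration, and the region identifiers are example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    region: str
    fault_domain: str
    healthy: bool  # result of the most recent health check


def routable_backends(backends, preferred_region):
    """Return healthy backends, preferring the local region.

    Assumed policy: route within the preferred region while any backend
    there passes its health check; otherwise use healthy backends elsewhere.
    """
    healthy = [b for b in backends if b.healthy]
    local = [b for b in healthy if b.region == preferred_region]
    return local or healthy


# Hypothetical deployment: the same stack in two regions, spread across fault domains.
backends = [
    Backend("web-a1", "us-ashburn-1", "FAULT-DOMAIN-1", healthy=True),
    Backend("web-a2", "us-ashburn-1", "FAULT-DOMAIN-2", healthy=False),
    Backend("web-p1", "us-phoenix-1", "FAULT-DOMAIN-1", healthy=True),
    Backend("web-p2", "us-phoenix-1", "FAULT-DOMAIN-2", healthy=True),
]

# One local backend is unhealthy, so traffic stays on the remaining local one.
print([b.name for b in routable_backends(backends, "us-ashburn-1")])

# Simulate a regional outage: every backend in the preferred region fails its check.
outage = [
    Backend(b.name, b.region, b.fault_domain, healthy=(b.region != "us-ashburn-1"))
    for b in backends
]
print([b.name for b in routable_backends(outage, "us-ashburn-1")])
```

Because the same application version is deployed to instances in both regions, the fallback backends can serve the same requests when the preferred region is unavailable.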
Explore More
Documentation: