Domain 3: Security Architecture Module 25 of 61

High Availability and Site Resilience

Security+ Domain 3 — Security Architecture B — Data Protection and Resilience 12–15 minutes

What the Exam Is Really Testing

Picture this: a hospital's electronic health records system goes down for six hours. Surgeries are delayed. Emergency staff cannot pull patient allergy information. Nobody stole any data — but the impact is real and immediate. That is why the CIA triad includes availability alongside confidentiality and integrity.

Availability is a security objective. A system that is down cannot serve its purpose, and downtime creates risk just as surely as a breach does.

This module is about designing systems that survive failures, and the exam will ask you to match resilience strategies to specific business requirements. The right answer depends on how much downtime is acceptable, how much data loss the organization can tolerate, and what the budget allows.


High Availability Concepts

High availability (HA) means designing systems to operate continuously with minimal downtime. HA eliminates single points of failure so that the failure of one component does not take down the entire service.

HA is measured in "nines":

  • 99.9% (three nines) — Up to 8.76 hours of downtime per year
  • 99.99% (four nines) — Up to 52.6 minutes of downtime per year
  • 99.999% (five nines) — Up to 5.26 minutes of downtime per year

Each additional nine cuts the allowed downtime by a factor of ten and sharply increases cost and complexity. The exam tests whether you can match availability requirements to business needs rather than always choosing the most expensive option.
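The downtime figures for each level of "nines" follow from simple arithmetic. The sketch below converts an availability percentage into the downtime it allows per year. It assumes a 365-day year, and the helper name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


for label, pct in [("three nines", 99.9),
                   ("four nines", 99.99),
                   ("five nines", 99.999)]:
    minutes = max_downtime_minutes(pct)
    print(f"{pct}% ({label}): {minutes:.2f} minutes per year")
```

Three nines works out to about 525.6 minutes (8.76 hours), and each extra nine divides that figure by ten.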


Load Balancing

Load balancers distribute incoming traffic across multiple servers. If one server fails, the load balancer redirects traffic to the remaining healthy servers.

Load Balancing Methods

  • Round-robin — Distributes requests sequentially across servers. Simple but does not account for server load.
  • Least connections — Sends traffic to the server with the fewest active connections. Better for uneven workloads.
  • Weighted — Assigns different weights to servers based on capacity. More powerful servers receive more traffic.
  • Health-based — Monitors server health and removes unhealthy servers from the pool automatically.
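The four methods above can be sketched in a few lines. This is an illustration only, not how any particular load balancer is implemented: the `Server` class and function names are hypothetical, and health is read once as a snapshot, whereas a real load balancer re-checks it continuously.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Server:
    name: str
    weight: int = 1
    active_connections: int = 0
    healthy: bool = True


def healthy_pool(servers):
    """Health-based filtering: only healthy servers are eligible."""
    return [s for s in servers if s.healthy]


def round_robin(servers):
    """Rotate through healthy servers in order, ignoring their load."""
    return cycle(healthy_pool(servers))


def least_connections(servers):
    """Pick the healthy server with the fewest active connections."""
    return min(healthy_pool(servers), key=lambda s: s.active_connections)


def weighted_rotation(servers):
    """Repeat each healthy server in the rotation according to its weight."""
    expanded = [s for s in healthy_pool(servers) for _ in range(s.weight)]
    return cycle(expanded)


pool = [
    Server("web1", weight=3, active_connections=12),
    Server("web2", weight=1, active_connections=4),
    Server("web3", healthy=False),  # removed from every rotation
]

rr = round_robin(pool)
print([next(rr).name for _ in range(4)])
print(least_connections(pool).name)
wr = weighted_rotation(pool)
print([next(wr).name for _ in range(4)])
```

In this example web3 never receives traffic because it is unhealthy, least connections selects web2, and the weighted rotation sends three of every four requests to web1.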

Security Benefits

  • Spreads traffic across multiple servers, so a DDoS attack is less likely to overwhelm any single one
  • Enables SSL/TLS offloading (load balancer handles encryption)
  • Provides a single point for implementing WAF rules
  • Facilitates rolling updates without downtime

Clustering

Clustering groups multiple servers to work together as a single system. If one node in the cluster fails, another node takes over its workload.

Active-Active

All nodes process traffic simultaneously. If one node fails, the remaining nodes absorb its load, so they need enough spare capacity to carry it. No capacity sits idle because every node is working.

Advantage: maximum utilization and performance. Disadvantage: more complex configuration and potential for split-brain scenarios.

Active-Passive

One node handles all traffic (active). The other nodes stand by (passive) and take over only if the active node fails.

Advantage: simpler configuration and clear failover behavior. Disadvantage: passive nodes are idle, representing unused capacity and cost.

The exam tests whether you can choose the right clustering model based on the scenario's requirements for utilization, complexity, and recovery time.
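Active-passive failover can be reduced to one decision: keep the active node while it is healthy, otherwise promote a standby. The sketch below shows that decision only. The function name, node names, and the idea of a fixed priority order are assumptions for illustration; real cluster software also handles heartbeats, quorum, and split-brain prevention.

```python
from typing import Optional


def choose_active(nodes: dict[str, bool], current: str) -> Optional[str]:
    """Return the node that should be active.

    `nodes` maps node name to health status, listed in priority order.
    """
    # Keep the current active node as long as it is healthy.
    if nodes.get(current, False):
        return current
    # Otherwise promote the first healthy standby in priority order.
    for name, healthy in nodes.items():
        if healthy:
            return name
    # No healthy node remains: the service is down.
    return None


cluster = {"db1": False, "db2": True, "db3": True}
print(choose_active(cluster, current="db1"))
```

Here db1 has failed, so the first healthy standby, db2, is promoted.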


Site Types

When a disaster takes out an entire facility, the organization needs an alternate site to resume operations. The three site types differ in readiness and cost:

Hot Site

A fully equipped facility with hardware, software, data, and network connectivity. Ready to take over operations within minutes to hours.

  • Data is replicated in near-real-time
  • Systems are pre-configured and running
  • Most expensive option
  • Shortest recovery time

Warm Site

A partially equipped facility with hardware and network connectivity but not current data. Takes hours to days to become operational.

  • Hardware is in place but may need configuration
  • Data must be restored from backups
  • Moderate cost
  • Moderate recovery time

Cold Site

A facility with basic infrastructure (power, cooling, network connections) but no hardware or data. Takes days to weeks to become operational.

  • Hardware must be procured and installed
  • Software must be configured from scratch
  • Data must be restored from offsite backups
  • Least expensive option
  • Longest recovery time

The tradeoff:

Hot sites minimize downtime but maximize cost. Cold sites minimize cost but maximize downtime. The right choice depends on how much downtime the business can tolerate.
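One way to picture the tradeoff is to pick the cheapest site whose recovery time still fits the business requirement. In the sketch below, the recovery times assigned to each site type are illustrative assumptions chosen from within the ranges described above, not fixed industry values, and the names are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Ordered from cheapest to most expensive.
# Worst-case recovery times are assumed values for illustration.
SITE_OPTIONS = [
    ("cold", timedelta(weeks=2)),
    ("warm", timedelta(days=2)),
    ("hot", timedelta(hours=2)),
]


def cheapest_site_for(tolerable_downtime: timedelta) -> Optional[str]:
    """Return the cheapest site type that recovers within the tolerance."""
    for name, worst_case_recovery in SITE_OPTIONS:
        if worst_case_recovery <= tolerable_downtime:
            return name
    # Even a hot site is too slow; something like automated failover is needed.
    return None


print(cheapest_site_for(timedelta(days=30)))
print(cheapest_site_for(timedelta(days=3)))
print(cheapest_site_for(timedelta(hours=2)))
```

Under these assumed figures, a 30-day tolerance is satisfied by a cold site, a 3-day tolerance needs a warm site, and a 2-hour tolerance needs a hot site.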

Geographic Dispersal and Platform Diversity

Geographic Dispersal

Distributing resources across multiple geographic locations protects against regional disasters (earthquakes, hurricanes, power grid failures). If one location is destroyed, operations continue at other locations.

Considerations:

  • Distance must be sufficient to avoid shared regional risks
  • Latency increases with geographic distance
  • Data sovereignty may restrict which locations can be used
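The latency cost of distance has a physical floor. As a rough sketch, assuming signals in optical fiber travel at about 200,000 km per second (roughly two-thirds of the speed of light in a vacuum) and ignoring routing and equipment delays, the minimum round-trip time between two sites can be estimated as follows. The function name is hypothetical.

```python
# Approximate signal speed in optical fiber (an assumption for estimation).
FIBER_KM_PER_SECOND = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time in milliseconds over a fiber path."""
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000


for km in (100, 1_000, 4_000):
    print(f"{km} km: at least {min_round_trip_ms(km):.1f} ms round trip")
```

Real-world latency is higher than this bound, which matters most for synchronous replication, where every write waits for the remote site to acknowledge it.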

Platform Diversity

Using different hardware vendors, operating systems, or cloud providers reduces the risk that a single vulnerability or outage affects all systems.

Example: running workloads on both AWS and Azure. If one provider experiences an outage, the other continues operating.

Security benefit: a vulnerability in one platform does not compromise the entire infrastructure.


Multi-Cloud Strategies

Multi-cloud uses multiple cloud providers for different workloads or redundancy. Benefits include:

  • Vendor independence — Avoids lock-in to a single provider
  • Resilience — Provider outages do not affect all workloads
  • Compliance flexibility — Different providers may offer regions that satisfy data sovereignty requirements
  • Best-of-breed services — Use each provider's strongest offerings

Challenges:

  • Increased operational complexity
  • Inconsistent security controls across providers
  • Staff must be trained on multiple platforms
  • Cost management across multiple billing systems

RPO and RTO

These two metrics drive every resilience and recovery decision:

Recovery Point Objective (RPO)

The maximum acceptable amount of data loss measured in time. RPO answers: how much data can the organization afford to lose?

An RPO of 1 hour means the organization can tolerate losing up to 1 hour of data. Backups or replication must capture data at least once every hour.

Recovery Time Objective (RTO)

The maximum acceptable amount of time to restore operations after a disruption. RTO answers: how long can the organization be down?

RTO of 4 hours means operations must be restored within 4 hours of a failure.

The relationship:

  • Lower RPO requires more frequent backups or real-time replication (more expensive)
  • Lower RTO requires hot sites, automated failover, and pre-configured recovery (more expensive)
  • RPO and RTO together determine the recovery strategy and site type
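Both objectives can be checked with simple comparisons. The sketch below uses hypothetical function names and treats the worst case for data loss as a failure that strikes just before the next backup, so the loss equals the full backup interval. The recovery estimate is a value the organization would have to measure or test for itself.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the interval between backups."""
    return backup_interval <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """The estimated time to restore operations must fit within the RTO."""
    return estimated_recovery <= rto


rpo = timedelta(hours=1)
rto = timedelta(hours=4)

print(meets_rpo(timedelta(minutes=30), rpo))  # backups every 30 minutes
print(meets_rpo(timedelta(hours=24), rpo))    # daily backups
print(meets_rto(timedelta(hours=6), rto))     # recovery estimated at 6 hours
```

With a 1-hour RPO, 30-minute backups pass and daily backups fail. With a 4-hour RTO, a 6-hour recovery estimate fails, which points toward a faster site type or automated failover.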

Capacity Planning

Capacity planning ensures that sufficient resources exist to handle both normal operations and disaster recovery scenarios.

Security considerations:

  • Surge capacity — Can the infrastructure handle sudden traffic spikes (DDoS mitigation, seasonal demand)?
  • Failover capacity — When a primary system fails, does the backup have enough capacity to handle the full workload?
  • Growth planning — Are resources scaled to meet projected future demand?
  • Resource exhaustion — If capacity is exceeded, does the system degrade gracefully or fail completely?
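The failover capacity question can be answered with a quick calculation. The sketch below assumes identical nodes and an evenly spread load, and the function name and request-rate figures are made up for illustration.

```python
def utilization_after_failure(total_load: float,
                              node_capacity: float,
                              nodes: int,
                              failed: int = 1) -> float:
    """Return utilization (1.0 = 100%) of the surviving nodes."""
    surviving = nodes - failed
    if surviving <= 0:
        raise ValueError("no surviving nodes to carry the load")
    return total_load / (surviving * node_capacity)


# Three nodes, each able to serve 1,000 requests per second,
# carrying a combined load of 2,400 requests per second.
normal = utilization_after_failure(2400, 1000, nodes=3, failed=0)
degraded = utilization_after_failure(2400, 1000, nodes=3, failed=1)

print(f"Normal utilization: {normal:.0%}")
print(f"After one node fails: {degraded:.0%}")
```

At 80% utilization the cluster looks healthy, but losing one node pushes the survivors to 120%, which means the design lacks failover capacity even though it has redundancy.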

Pattern Recognition

When you see availability and resilience scenarios on the exam:

  • Zero tolerance for downtime — The answer involves hot site, active-active clustering, or real-time replication
  • Budget-constrained recovery — The answer involves warm or cold sites with longer RTOs
  • Regional disaster scenario — The answer involves geographic dispersal
  • Single provider outage — The answer involves multi-cloud or platform diversity
  • "How much data can we lose?" — The answer involves RPO
  • "How fast do we need to recover?" — The answer involves RTO

Trap Patterns

Watch for these common traps:

  • "Always choose the hot site" — Hot sites are the most expensive. If the scenario mentions budget constraints or long acceptable recovery times, a warm or cold site may be more appropriate.
  • Confusing RPO and RTO — RPO is about data loss (backward-looking). RTO is about downtime (forward-looking). Read carefully to determine which the question asks about.
  • "Active-active is always better" — Active-active provides better utilization but is more complex. Active-passive is simpler and appropriate when complexity is a concern.
  • Ignoring capacity in failover — If the backup site has half the capacity of the primary, failover will result in degraded performance. Capacity planning matters.

Scenario Practice


Question 1

An e-commerce company determines that it cannot afford to lose more than 15 minutes of transaction data and must restore operations within 2 hours of a disaster.

Which site type BEST meets these requirements?

A. Cold site with weekly backup restoration
B. Warm site with daily backup tapes stored offsite
C. Hot site with near-real-time data replication
D. Cold site with monthly disaster recovery testing

Answer & reasoning

Correct: C

An RPO of 15 minutes requires near-real-time data replication (daily backups would lose up to 24 hours of data). An RTO of 2 hours requires a pre-configured environment ready to take over quickly. Only a hot site with near-real-time replication meets both requirements.


Question 2

A company runs its entire infrastructure on a single cloud provider. A multi-hour provider outage causes a complete business disruption.

What strategy would BEST prevent this from recurring?

A. Increase the SLA commitment with the existing provider
B. Implement a multi-cloud strategy with workload distribution across providers
C. Add more instances within the same provider and region
D. Switch to an on-premises data center exclusively

Answer & reasoning

Correct: B

Multi-cloud distributes workloads across multiple providers. A single provider outage no longer causes total business disruption because workloads continue running on the other provider.

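Adding instances in the same provider does not protect against provider-wide outages. A stronger SLA typically offers compensation after an outage rather than preventing one. Moving entirely on-premises eliminates cloud benefits and still leaves the company dependent on a single environment.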


Question 3

A company has two data centers in the same city. A regional power grid failure takes both offline simultaneously.

What resilience principle was missing?

A. Load balancing between the two data centers
B. Active-passive clustering for database failover
C. Geographic dispersal of data center locations
D. Higher capacity servers in the primary data center

Answer & reasoning

Correct: C

Both data centers were in the same geographic region and shared the same regional risk (power grid failure). Geographic dispersal places data centers in different regions so that a regional disaster cannot affect all locations simultaneously.


Key Takeaway

Resilience is a business decision, not just a technical one. RPO defines how much data loss is acceptable. RTO defines how much downtime is acceptable. Budget and risk tolerance determine which solution fits. Before answering availability questions, identify those four variables in the scenario — they point directly to the right answer. The most expensive option is not always the best answer. The right answer matches the recovery strategy to the business requirement.
