"We have high availability" is one of the most repeated — and least verified — claims in software architecture. Two servers behind a load balancer isn't high availability. It's basic redundancy. And basic redundancy fails exactly when you need it most: under real load, with correlated failures, at the worst possible time.
High availability is a design property, not a toggle. It requires architectural decisions that affect every layer of the system, from how you manage state to how you define what "available" means for your business.
The numbers: what each nine actually means
SLAs are expressed in percentages, but engineers should think in minutes of downtime. The difference between each "nine" is an order of magnitude in complexity and cost:
- 99.9% (three nines): ~8.7 hours of downtime per year. Sufficient for most internal applications. Achievable with basic redundancy and monitoring.
- 99.99% (four nines): ~52 minutes per year. Requires automated failover, aggressive health checks, and elimination of single points of failure. This is where most teams underestimate complexity.
- 99.999% (five nines): ~5 minutes per year. Requires multi-region active-active architecture, zero-downtime deployments, and an operational discipline few teams possess. Cost scales exponentially.
The right question isn't "how many nines do we want?" It's "how much downtime can the business tolerate, and how much are we willing to invest to reduce it?"
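The arithmetic behind those budgets is worth keeping at hand, so SLA conversations happen in minutes rather than percentages. A minimal sketch (the helper name is ours):

```python
# Convert an availability target into a yearly downtime budget.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_budget(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability) / 60

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget(target):,.1f} min/year")
```

Running this reproduces the figures above: roughly 526 minutes (8.8 hours) for three nines, 53 for four, and 5 for five.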
HA patterns: Active-Passive, Active-Active, Multi-region
Active-Passive is the most common pattern and the most deceptive. One primary node handles all traffic while a secondary sits in standby. Sounds simple. The problem: the passive node hasn't processed real traffic in weeks. When the primary fails and the passive takes over, you discover it has a different schema version, its connection pool wasn't warm, or a local cronjob was never configured on the secondary. The failover that worked in the runbook doesn't work in reality.
Active-Active solves that: both nodes process traffic simultaneously. If one goes down, the other absorbs the full load. But it introduces another challenge: data consistency. If both nodes can write, you need a conflict resolution strategy. That means decisions about synchronous vs. asynchronous replication, where eventual consistency is acceptable, and which mechanism resolves concurrent writes: last-write-wins, application-level merges, or CRDTs.
Multi-region takes active-active to the geographic level. Traffic is distributed across regions, and if an entire region goes down, the others absorb the load. Inter-region latency becomes the dominant factor here. Synchronous replication across continents is prohibitive in most cases.
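To make the CRDT option concrete: a grow-only counter is the simplest example of a data structure where concurrent writes on two active nodes never conflict. A minimal sketch (class and node names are ours):

```python
# Grow-only counter (G-counter), one of the simplest CRDTs.
# Each node increments only its own slot; merging takes the per-node
# maximum, so replicas converge regardless of message order.
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so duplicated or reordered replication messages are harmless.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("eu-west"), GCounter("us-east")
a.increment(3)   # writes land on node A...
b.increment(2)   # ...while node B takes writes concurrently
a.merge(b)
b.merge(a)
assert a.value == b.value == 5  # both replicas converge
```

The trade-off is expressiveness: CRDTs work beautifully for counters, sets, and registers, but not every business operation can be modeled as one.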
The invisible components: what actually holds HA together
Architecture diagrams show servers and arrows. What they don't show is what makes the system work:
- Health checks: Checking that a process responds isn't enough. A real health check validates database connectivity, access to critical services, disk space, and processing capacity. A shallow health check is worse than none — it gives you a false sense that the node is healthy.
- Heartbeats: The mechanism by which cluster nodes confirm they're alive. Heartbeat frequency defines how fast you detect failure. Too frequent and you saturate the network. Too infrequent and you react too slowly.
- Quorum: In a cluster of N nodes, how many must agree before the cluster acts? A strict majority (floor(N/2) + 1) prevents split-brain, because two disjoint majorities can't exist. This is also why clusters typically run an odd number of nodes: a fourth node raises the quorum without tolerating any additional failures. And it forces you to decide in advance how the minority side behaves during a network partition.
- Split-brain: The most dangerous HA scenario. Two nodes lose communication and both assume they're primary. Result: two sources of truth, inconsistent data, corruption. Solving it requires fencing (STONITH), quorum disks, or a third arbiter node.
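The quorum rule is just integer arithmetic, but it's worth writing down, because it explains why two partitioned halves can never both proceed. A minimal sketch:

```python
# Majority quorum: a cluster of n nodes needs floor(n/2) + 1 votes to
# act. Two disjoint majorities are impossible, which is the property
# that prevents split-brain.
def quorum(cluster_size: int) -> int:
    return cluster_size // 2 + 1

def has_quorum(votes: int, cluster_size: int) -> bool:
    return votes >= quorum(cluster_size)

# 5 nodes split 3/2 by a network partition: only the 3-node side may
# elect a primary; the 2-node side must stand down.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)

# Why odd clusters: 4 nodes need 3 votes, so they tolerate only one
# failure -- exactly the same as a 3-node cluster, at higher cost.
assert quorum(3) == 2 and quorum(4) == 3

# The no-split-brain guarantee: two quorums always overlap.
assert all(quorum(n) * 2 > n for n in range(1, 100))
```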
Design decisions: where to put state
State is the enemy of high availability. Stateless services scale and recover easily. Stateful services complicate everything.
- Sessions: Don't store them in server memory. Use an external store (Redis, database) or stateless tokens (JWT). If a node goes down, sessions shouldn't be lost.
- Cache: Decide whether cache is expendable or critical. If expendable, a cache miss after failover is acceptable. If critical, you need replication.
- Database: The most expensive decision. Synchronous replication guarantees consistency but adds latency. Asynchronous replication is faster but accepts data loss during failover. There's no universally correct option — there are trade-offs you must understand.
Something often forgotten: not everything needs replication. Local logs, temporary caches, derived data that can be recalculated — replicating everything multiplies complexity without proportional benefit.
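The stateless-token option for sessions can be sketched in a few lines. This is a simplified HMAC-signed token, not a full JWT implementation; the key handling and payload format are illustrative only:

```python
# Stateless session token signed with HMAC: any node can validate it
# without shared server memory, so failover loses no sessions.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"  # illustrative; load from a secrets manager in practice

def issue_token(user_id: str, ttl: int = 3600) -> str:
    payload = json.dumps({"sub": user_id, "exp": int(time.time()) + ttl})
    body = base64.urlsafe_b64encode(payload.encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify_token(token: str):
    """Return the user id, or None if the token is invalid or expired."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < time.time():
        return None  # expired
    return payload["sub"]

tok = issue_token("alice")
assert verify_token(tok) == "alice"       # any node can verify it
assert verify_token(tok + "0") is None    # tampering is detected
```

The trade-off, consistent with the point above about expendable state: tokens can't be revoked individually without reintroducing shared state, such as a revocation list.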
The anti-pattern: "PowerPoint HA"
There's a type of high availability that only works in slide decks. The diagram shows two data centers, replication arrows, and a global load balancer. But nobody has tested the failover. Nobody knows how long it takes. Nobody has validated that data replicates correctly under load.
I've seen "multi-region active-active" architectures where manual failover took 45 minutes because it required DNS changes, configuration updates, services restarted in a specific order, and manual data integrity verification. That's not HA. That's a slow recovery plan disguised as a resilient architecture.
The only way to know if your HA works is to test it. Chaos engineering, failover drills, game days. If you've never deliberately killed a primary node in production, you don't know whether your system survives a real failure.
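A failover drill doesn't need heavy tooling to be useful. A minimal sketch of the measurement loop, with the host names, URL, and kill command as placeholders for your environment:

```python
# Failover drill: kill the primary, then measure how long until the
# service answers again. That elapsed time is your real RTO.
import subprocess
import time
import urllib.request

SERVICE_URL = "https://app.example.com/health"  # illustrative
PRIMARY_HOST = "db-primary-1"                   # illustrative

def service_up(url: str, timeout: float = 2.0) -> bool:
    """True if the service answers its health endpoint with HTTP 200."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except Exception:
        return False

def failover_drill() -> float:
    # Deliberately stop the primary (placeholder command).
    subprocess.run(
        ["ssh", PRIMARY_HOST, "sudo", "systemctl", "stop", "postgresql"],
        check=True,
    )
    start = time.monotonic()
    while not service_up(SERVICE_URL):
        time.sleep(1)
    return time.monotonic() - start  # seconds of real downtime
```

If that number surprises you, better to learn it on a scheduled game day than during an incident.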
The cost nobody talks about
Each additional nine multiplies cost. Not just infrastructure — operational complexity, engineering hours, monitoring tools, on-call processes. A system with 99.999% availability requires a team that lives and breathes operations. It's not just a design. It's a culture.
The honest conversation with the business: "How much money do we lose per hour of downtime? How much does it cost to design and operate a system that reduces that downtime?" If the cost of HA exceeds the cost of downtime, you're over-engineering.
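That conversation reduces to back-of-the-envelope arithmetic. A sketch with purely illustrative numbers:

```python
# Break-even check: expected downtime cost at the current tier vs. the
# cost of buying the next nine. All figures here are illustrative.
def expected_downtime_cost(availability: float, revenue_per_hour: float) -> float:
    hours_down = 365.25 * 24 * (1 - availability)
    return hours_down * revenue_per_hour

# Moving from three nines to four nines at $10k/hour of downtime:
saved = (expected_downtime_cost(0.999, 10_000)
         - expected_downtime_cost(0.9999, 10_000))
# If the yearly cost of the four-nines architecture (infrastructure,
# engineers, on-call) exceeds `saved`, the extra nine is
# over-engineering by the criterion above.
print(f"Downtime cost avoided per year: ${saved:,.0f}")
```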
A system is not highly available until it has survived a real failure. Diagrams don't fail. Systems do. Design for failure, not for the presentation.