High Availability: Beyond the 99.9%

Availability isn't bolted on at the end. It's designed from the start.

"We have high availability" is one of the most repeated — and least verified — claims in software architecture. Two servers behind a load balancer isn't high availability. It's basic redundancy. And basic redundancy fails exactly when you need it most: under real load, with correlated failures, at the worst possible time.

High availability is a design property, not a toggle. It requires architectural decisions that affect every layer of the system, from how you manage state to how you define what "available" means for your business.

The numbers: what each nine actually means

SLAs are expressed in percentages, but engineers should think in minutes of downtime. The difference between each "nine" is an order of magnitude in complexity and cost:

99% ("two nines"): about 3.65 days of downtime per year.
99.9% ("three nines"): about 8.8 hours per year.
99.99% ("four nines"): about 53 minutes per year.
99.999% ("five nines"): about 5.3 minutes per year.

The right question isn't "how many nines do we want?" It's "how much downtime can the business tolerate, and how much are we willing to invest to reduce it?"
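To ground that conversation, it helps to translate an SLA percentage into a concrete downtime budget. A minimal sketch (the function name is illustrative):

```python
def downtime_budget_minutes(sla_percent: float, days: int = 365) -> float:
    """Minutes of allowed downtime over the period for a given SLA percentage."""
    total_minutes = days * 24 * 60
    return (1.0 - sla_percent / 100.0) * total_minutes

# Each extra nine shrinks the budget by an order of magnitude:
for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} min/year")
```

Running this makes the order-of-magnitude jump between nines obvious: roughly 5256 minutes a year at two nines, down to about 5 minutes at five.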

HA patterns: Active-Passive, Active-Active, Multi-region

Active-Passive is the most common pattern and the most deceptive. One primary node handles all traffic while a secondary sits in standby. Sounds simple. The problem: the passive node hasn't processed real traffic in weeks. When the primary fails and the passive takes over, you discover it has a different schema version, its connection pool wasn't warm, or a local cronjob was never configured on the secondary. The failover that worked in the runbook doesn't work in reality.
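Whatever the pattern, failover begins with failure detection, and naive detection causes flapping. A sketch of a threshold-based detector with hysteresis (names and thresholds are illustrative; in practice this logic lives in the load balancer's or orchestrator's health probes):

```python
class FailureDetector:
    """Marks a node unhealthy after N consecutive failed probes, and healthy
    again only after M consecutive successes, so a single blip doesn't flip state."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.recover_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

Note the asymmetry: three failures to demote, two successes to promote. Tuning those thresholds against probe interval is what determines your real detection time, and therefore part of your real recovery time.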

Active-Active solves that: both nodes process traffic simultaneously. If one goes down, the other absorbs the full load. But it introduces another challenge: data consistency. If both nodes can write, you need conflict resolution. This means decisions about synchronous vs. asynchronous replication, eventual consistency, and conflict-resolution strategies such as last-write-wins or CRDTs.
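To make the CRDT option concrete, here is a minimal G-counter, one of the simplest conflict-free data types: each node increments only its own slot, and merging takes the per-node maximum, so replicas converge regardless of message order. A sketch, not a production implementation:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever advances its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Merge is commutative, associative, and idempotent: max per node.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```

Both nodes can accept writes concurrently; after replication (merging in either direction, any number of times) both report the same total, with no coordination on the write path.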

Multi-region takes active-active to the geographic level. Traffic is distributed across regions, and if an entire region goes down, the others absorb the load. Inter-region latency becomes the dominant factor here. Synchronous replication across continents is prohibitive in most cases.
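The physics is easy to quantify: every synchronous commit pays at least one cross-region round trip. A rough model (the RTT figures are illustrative, not measurements):

```python
def sync_commit_floor_ms(local_commit_ms: float,
                         replica_rtts_ms: list[float],
                         quorum: int) -> float:
    """Lower bound on commit latency when waiting for `quorum` synchronous acks:
    the quorum-th fastest replica round trip dominates."""
    acks = sorted(replica_rtts_ms)
    return local_commit_ms + acks[quorum - 1]

# Illustrative RTTs from a US-East primary: US-West ~70 ms, Europe ~90 ms, Asia ~180 ms
print(sync_commit_floor_ms(2.0, [70, 90, 180], quorum=2))  # 92.0 ms per write
```

A 2 ms local commit becomes a ~92 ms commit the moment you require a second synchronous ack across an ocean, which is why most multi-region designs accept asynchronous replication and the consistency trade-offs that come with it.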

The invisible components: what actually holds HA together

Architecture diagrams show servers and arrows. What they don't show is the machinery that actually makes failover work: failure detection and health checks, quorum and leader election, replication lag monitoring, DNS TTLs and connection draining, and the automation and runbooks that tie them together.

Design decisions: where to put state

State is the enemy of high availability. Stateless services scale and recover easily. Stateful services complicate everything.

Something often forgotten: not everything needs replication. Local logs, temporary caches, derived data that can be recalculated — replicating everything multiplies complexity without proportional benefit.

The anti-pattern: "PowerPoint HA"

There's a type of high availability that only works in slide decks. The diagram shows two data centers, replication arrows, and a global load balancer. But nobody has tested the failover. Nobody knows how long it takes. Nobody has validated that data replicates correctly under load.

I've seen "multi-region active-active" architectures where manual failover took 45 minutes because it required DNS changes, configuration updates, services restarted in a specific order, and manual data integrity verification. That's not HA. That's a slow recovery plan disguised as a resilient architecture.

The only way to know if your HA works is to test it. Chaos engineering, failover drills, game days. If you've never deliberately killed a primary node in production, you don't know whether your system survives a real failure.
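A failover drill should produce a number, not a feeling: kill the primary, then measure how long until the system answers again. A simulated sketch (the `kill_primary` and `probe` callables are stand-ins for your real kill mechanism and health endpoint):

```python
import time

def measure_recovery_seconds(kill_primary, probe,
                             timeout_s: float = 300.0,
                             interval_s: float = 0.01) -> float:
    """Kill the primary, then poll `probe()` until it reports healthy.
    Returns elapsed seconds, or raises if recovery exceeds the timeout."""
    kill_primary()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    raise TimeoutError("system did not recover within the drill timeout")

# Simulated cluster: failover completes 50 ms after the primary dies.
state = {"killed_at": None}
kill = lambda: state.update(killed_at=time.monotonic())
probe = lambda: time.monotonic() - state["killed_at"] > 0.05
print(f"recovered in {measure_recovery_seconds(kill, probe):.3f} s")
```

Run this kind of drill regularly and chart the result. If recovery time trends toward your downtime budget, you find out in a game day instead of an outage.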

The cost nobody talks about

Each additional nine multiplies cost. Not just infrastructure — operational complexity, engineering hours, monitoring tools, on-call processes. A system with 99.999% availability requires a team that lives and breathes operations. It's not just a design. It's a culture.

The honest conversation with the business: "How much money do we lose per hour of downtime? How much does it cost to design and operate a system that reduces that downtime?" If the cost of HA exceeds the cost of downtime, you're over-engineering.
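That conversation is ultimately a small expected-value calculation. A sketch with entirely hypothetical figures:

```python
def ha_worth_it(downtime_cost_per_hour: float,
                current_downtime_hours_yr: float,
                target_downtime_hours_yr: float,
                annual_ha_cost: float) -> bool:
    """True if the downtime avoided per year is worth more than the HA investment."""
    avoided = (current_downtime_hours_yr - target_downtime_hours_yr) * downtime_cost_per_hour
    return avoided > annual_ha_cost

# Hypothetical: $20k/hour of downtime, going from 99.9% (~8.8 h/yr) to 99.99% (~0.9 h/yr)
print(ha_worth_it(20_000, 8.8, 0.9, annual_ha_cost=120_000))  # True: avoids ~$158k/yr
```

With those numbers the extra nine pays for itself; double the HA cost and it no longer does. The point is not the formula's precision but forcing both sides of the inequality to be written down.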

A system is not highly available until it has survived a real failure. Diagrams don't fail. Systems do. Design for failure, not for the presentation.

Jorel del Portal

Systems engineer specialized in enterprise software architecture and high availability platforms.