"We need more servers." That sentence, spoken urgently at 2 AM while a service goes down from saturation, is the symptom of a problem that should have been solved weeks earlier. Capacity planning isn't reacting when infrastructure collapses. It's the discipline of understanding how much capacity you have, how much you'll need, and making decisions before demand exceeds supply.
Most teams don't do capacity planning. They provision by gut feeling, scale by panic, and discover infrastructure limits when users are already suffering.
Capacity planning is not "add more servers when it gets slow"
Reactive scaling is expensive, slow, and risky. When you're already in a saturation incident, provisioning new resources takes time: minutes if your autoscaling is well configured, hours if you need budget approval and manual machine setup. During that window, users experience degradation or total outages.
Capacity planning is a continuous process with three phases: measure current capacity, project future demand, and decide when and how to scale. All three require data, not intuition.
The metrics that matter (and why averages lie)
The four fundamental resources are CPU, memory, disk, and network. But how you measure them determines whether your capacity planning is useful or not.
Average CPU at 45% looks healthy. But if p95 is at 92% and p99 hits 100%, you have a problem the average hides. High percentiles reveal the reality that averages conceal. For capacity planning, always use p95 and p99, never averages.
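To see how an average hides the tail, here's a minimal sketch with hypothetical CPU samples: 90% of them hover around 40% utilization, 10% are bursts near 100%. The data and the nearest-rank percentile helper are illustrative, not from any real service.

```python
import random

random.seed(42)
# Hypothetical samples: 900 readings around 40% CPU, 100 bursts near 100%
samples = [random.gauss(40, 5) for _ in range(900)] + \
          [random.gauss(98, 1) for _ in range(100)]

def percentile(data, p):
    """Nearest-rank percentile over a sorted copy of the data."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

mean = sum(samples) / len(samples)
print(f"mean: {mean:.1f}%")                       # looks healthy
print(f"p95:  {percentile(samples, 95):.1f}%")    # the tail the mean hides
print(f"p99:  {percentile(samples, 99):.1f}%")
```

The mean lands in the mid-40s while p95 and p99 sit near 100%: exactly the gap between "looks healthy" and "is about to saturate."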
- Throughput vs latency: a service might handle 10,000 requests per second at 50ms latency. When throughput climbs to 12,000, latency doesn't increase by a proportional 20%; it can multiply several times over. This non-linear curve is what makes saturation so dangerous.
- Saturation as a predictive metric: saturation measures how much of a resource is in use relative to its maximum. It's the most important metric for capacity planning because it tells you how much headroom you have before collapse. A resource at 80% saturation doesn't have 20% of usable headroom: performance degrades non-linearly past the 70-80% mark, so the real margin is much smaller.
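A toy queueing model makes that curve concrete. Assuming a single server that can sustain 10,000 requests per second and the classic M/M/1 formula for mean response time, W = 1 / (mu - lambda), the service rate and load levels below are illustrative:

```python
def response_time_ms(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue, in milliseconds."""
    if arrival_rate >= service_rate:
        return float("inf")  # past saturation the queue grows without bound
    return 1000.0 / (service_rate - arrival_rate)

service_rate = 10_000  # requests/second the service can sustain (assumed)
for load in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    w = response_time_ms(load * service_rate, service_rate)
    print(f"utilization {load:.0%}: {w:.2f} ms")
```

Going from 50% to 99% utilization multiplies response time by 50x in this model, even though "only" twice as much traffic arrived. That's why 80% saturation is not 20% headroom.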
Prediction models: three approaches that work
Predicting future demand doesn't require sophisticated machine learning models. It requires discipline and historical data.
- Trend-based (linear extrapolation): take consumption data from the last 3-6 months and project the trend. If your CPU saturation sits at 60% and grows 5 points per month, in 4 months you'll be at 80%. Simple, effective, and sufficient for most cases. The limitation: it doesn't capture sudden changes.
- Event-driven: product launches, marketing campaigns, seasonal peaks (Black Friday, end of year). These events generate spikes that linear extrapolation can't predict. For each known event, estimate the traffic multiplier based on similar past events and provision ahead of time.
- Load testing: don't predict — simulate. Run load tests that replicate the expected traffic pattern and measure where infrastructure breaks. It's the most accurate method, but requires investment in representative test environments and tools like k6, Gatling, or Locust.
In practice, all three complement each other. Extrapolation for baseline, events for peaks, and load testing to validate that your projections hold under real pressure.
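As a sketch of the trend-based approach, here's a least-squares line fit over hypothetical monthly saturation numbers that solves for when the trend crosses a threshold. The data points are made up for illustration:

```python
def months_until(history, threshold):
    """Fit y = a + b*x to (month_index, saturation) samples and return
    the month index where the trend crosses `threshold`.
    Returns None if the trend is flat or falling."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / \
        sum((x - mean_x) ** 2 for x in xs)
    if b <= 0:
        return None
    a = mean_y - b * mean_x
    return (threshold - a) / b

# Six months of peak CPU saturation (%), growing roughly 5 points/month
cpu = [52, 57, 61, 67, 71, 77]
print(f"projected to hit 80% around month {months_until(cpu, 80):.1f}")
```

The same function run quarterly over real metrics gives you a rough "months of runway" number per service; the event-driven multipliers and load tests then correct it for the spikes a straight line can't see.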
Vertical vs horizontal: when to scale in each direction
Not all scaling is equal, and choosing the wrong direction has consequences.
Scaling vertically (more CPU, more RAM, faster disk) is the first option for simplicity. No architecture changes required. A database that needs more memory for its working set benefits directly from a vertical upgrade. But it has a ceiling: eventually you hit the limits of available hardware, and each instance jump is disproportionately more expensive.
Scaling horizontally (more instances behind a load balancer) scales better long-term, but requires your application to support it: distributed state or statelessness, load distribution, eventual consistency in many cases. If your service was designed as a stateful monolith with in-memory state, scaling horizontally isn't simply "add more nodes."
The practical rule: vertical first, horizontal when vertical isn't enough. But always design with the assumption that you'll eventually need horizontal.
In practice: the quarterly process
Capacity planning isn't a one-time project. It's a recurring process. Here's how I implement it:
- Quarterly trend review: every quarter, review saturation metrics for all critical services. Identify which ones approach the 70-80% range and project when they'll hit the limit.
- Proactive saturation alerts: set alerts at 80% saturation for CPU, memory, disk, and connections. Not at 95% — at that point you're already degraded. 80% gives you room to act.
- Document decisions and assumptions: every sizing decision should be documented: why you chose that instance, what growth you assumed, when it should be reviewed. Without this record, six months later nobody knows why the service is the size it is.
- Pre-provision for known events: if you know Black Friday is coming, don't wait until the week before. Provision 2-3 weeks in advance and run load tests against the new capacity.
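The alerting step above can be sketched as a simple saturation check of the kind a review script or monitoring job might run. The resource names, numbers, and 80% threshold are illustrative assumptions:

```python
SATURATION_THRESHOLD = 0.80  # alert here, not at 95%: leaves room to act

def check_saturation(usage, capacity):
    """Return (saturation_ratio, should_alert) for one resource."""
    ratio = usage / capacity
    return ratio, ratio >= SATURATION_THRESHOLD

# Hypothetical inventory: (currently used, total capacity)
resources = {
    "cpu_cores":      (42.0, 64),
    "memory_gib":     (210.0, 256),
    "db_connections": (620.0, 1000),
}

for name, (used, total) in resources.items():
    ratio, alert = check_saturation(used, total)
    status = "ALERT" if alert else "ok"
    print(f"{name}: {ratio:.0%} {status}")
```

In a real setup this logic would live in your monitoring system (a Prometheus alert rule, a CloudWatch alarm), but the threshold reasoning is the same: fire while there is still headroom to provision, approve budget, or shed load.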
The anti-pattern: over-provisioning out of fear
The opposite extreme of under-provisioning is equally problematic. I've seen organizations provision 10x the needed capacity "just in case." The result: six-figure monthly cloud bills with average utilization at 8%.
Over-provisioning comes from fear, not data. It's the infrastructure version of "throw money at the problem." It's not capacity planning — it's the absence of planning with an infinite budget.
Real capacity planning seeks balance: enough headroom to absorb spikes without wasting resources on idle capacity. That balance requires data, not gut feelings.
The cost of not planning
Lack of capacity planning doesn't manifest as an error in a log. It shows up as latency that creeps up gradually, incidents that always happen during the same peak hours, teams living in firefighting mode instead of building features.
Every saturation incident that could have been prevented is wasted engineering time, lost users, and eroded trust. And unlike a code bug, saturation isn't fixed with a 5-minute hotfix.
Capacity planning is the difference between scaling with control and scaling with panic. Measure, project, decide. Before the system forces your hand.