Kubernetes in Production: Decisions Not in the Documentation

K8s isn't the answer to everything. Sometimes it's the wrong question.

Kubernetes became the default answer to container orchestration. Job postings require it, reference architectures assume it, conference talks revolve around it. If you're not running K8s, it feels like you're doing something wrong. But running it in production looks nothing like the minikube tutorials suggest.

I've operated K8s clusters in enterprise environments where operational complexity far exceeded the value the platform delivered. I've also seen deployments where Kubernetes was exactly the right call. The difference between those scenarios isn't technical. It's contextual.

When NOT to use Kubernetes

This is the question nobody asks during design phase and everyone asks six months later when the cluster needs a dedicated team just to stay alive.

If you're running a handful of services with a small team and traffic a couple of machines can absorb, ECS, Cloud Run, Nomad or even docker-compose on dedicated machines can solve the problem with a fraction of the complexity. The right question isn't "how do I use Kubernetes" but "what problem am I solving, and what's the simplest way to solve it."

Decisions the docs don't cover

Resource requests vs. limits

The docs explain the difference. What they don't tell you is how to set the right values. I've seen two extremes, both equally harmful: teams that set no requests at all and let the scheduler fly blind until nodes start evicting pods, and teams that set requests at peak consumption "just in case" and pay for clusters running at 30% utilization.

What actually works: requests based on real p95 consumption (not averages), limits at 1.5x-2x the request, and continuous monitoring to adjust. VPA (Vertical Pod Autoscaler) in recommendation mode gives you a solid starting point.
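As a sketch of that rule, assuming a hypothetical container whose measured p95 sits at 200m CPU and 400Mi memory:

```yaml
# Hypothetical measurements: p95 over a week = 200m CPU / 400Mi memory.
# Requests match the observed p95; limits sit at ~1.5x-2x the request.
resources:
  requests:
    cpu: 200m
    memory: 400Mi
  limits:
    cpu: 400m        # 2x request: CPU is throttled beyond this
    memory: 600Mi    # 1.5x request: the pod is OOM-killed beyond this
```

Revisit these numbers as traffic changes; VPA in recommendation mode will tell you when the measured p95 has drifted away from what you configured.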

Networking: service mesh, ingress and internal DNS

Installing Istio because "it's the right way to do service mesh" without a problem that justifies it is a recipe for pain. A service mesh adds a sidecar proxy to every pod, consumes resources, increases latency and makes debugging harder.

Before Istio, ask yourself: Do I need mTLS between services? Do I need advanced traffic splitting? Do I have more than 20 services communicating with each other? If the answer is no to all three, a standard ingress controller (nginx or Traefik) with Kubernetes network policies is enough.
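A minimal sketch of the network-policy route (namespace and labels are illustrative): deny all ingress by default, then allow exactly the flows you know about.

```yaml
# Default-deny all ingress in the namespace...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes: ["Ingress"]
---
# ...then explicitly allow frontend -> api on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```

Note that network policies need a CNI that enforces them (Calico, Cilium); on a CNI without policy support they are silently ignored.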

A problem that surfaces late and hurts a lot: internal DNS. CoreDNS works fine until it doesn't. When hundreds of pods are making constant resolution requests, CoreDNS throughput becomes an invisible bottleneck. Configure ndots correctly in your pods and use FQDNs where possible. That one-line change in dnsConfig can cut DNS queries by 80%.
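For reference, the default ndots in Kubernetes is 5, so a name like db.prod.svc gets tried against every search domain before the absolute query. A sketch of the pod-spec fragment that lowers it:

```yaml
# Pod-spec fragment. With ndots: 2, names with two or more dots are
# resolved as absolute first, skipping the search-domain expansion
# that multiplies every lookup into several queries.
dnsConfig:
  options:
    - name: ndots
      value: "2"
```

Combine this with FQDNs ending in a trailing dot (service.namespace.svc.cluster.local.) and resolution goes straight to one query.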

Storage: the pain of stateful workloads

Kubernetes was designed for stateless workloads. That doesn't mean it can't handle state, but every time you deploy a StatefulSet with PersistentVolumes you're swimming upstream.

Databases on K8s are possible. Recommended? It depends. If you have a team experienced with operators like CloudNativePG or Vitess and a solid storage backend (not the cloud provider's default), go ahead. If not, use a managed service. RDS, Cloud SQL or any DBaaS will be more stable than your Postgres on a StatefulSet managed by a team that doesn't know what to do when a PV gets stuck in Released state.

Observability in an ephemeral world

Pods die and respawn constantly. Logs from a pod that was evicted 5 minutes ago no longer exist if you're not shipping them somewhere. This is a basic problem that catches many teams off guard during their first serious incident.

Minimum viable setup: Fluentd or Fluent Bit as a DaemonSet shipping logs to an aggregator (Loki, Elasticsearch). Prometheus with adequate retention for metrics. Distributed traces if you have more than 3 services. Without this, you're operating blind.
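As a sketch of the log-shipping leg, a Fluent Bit OUTPUT section pointing at an in-cluster Loki (host and labels are illustrative, and this assumes the rest of the DaemonSet config handles the tail inputs):

```ini
# Ships everything Fluent Bit collects to Loki over the cluster network.
[OUTPUT]
    Name    loki
    Match   *
    Host    loki.monitoring.svc.cluster.local
    Port    3100
    Labels  job=fluent-bit, cluster=prod
```

Keep the label set small: every distinct label combination is a separate stream in Loki, and high-cardinality labels will hurt you later.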

Production debugging

kubectl exec is the first tool you reach for, and the last you should rely on as a solution. It's fine for spot checks: verifying environment variables, testing network connectivity, inspecting the filesystem. It's not a debugging strategy.

What works better: ephemeral containers (GA since K8s 1.25) let you attach a debug container to a running pod without modifying its spec. Combine them with crictl for runtime inspection and kubectl debug node for node-level issues.
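A sketch of both commands (pod, container and node names are hypothetical):

```shell
# Attach an ephemeral debug container with real tooling to a running pod.
# The pod spec is not modified and the app container keeps running.
kubectl debug -it mypod --image=busybox --target=app-container

# Node-level debugging: spawns a privileged pod on the node with the
# host filesystem mounted under /host.
kubectl debug node/worker-3 -it --image=busybox
```

The --target flag shares the process namespace with the named container, so you can inspect its processes without shipping debugging tools in your production image.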

But the most effective debugging is the kind you never need to do. If your observability is solid, most problems get diagnosed from metrics and logs without touching the cluster.

K8s in the enterprise: RBAC and multi-tenancy

In enterprise environments, the cluster is shared across teams. This is where design decisions become political in addition to technical.

Namespaces as boundaries: they work as logical separation, not real isolation. A namespace won't prevent a pod from consuming all resources on a node. For that you need ResourceQuotas and LimitRanges per namespace, and real enforcement -- not just YAML definitions that nobody reviews.
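A sketch of per-namespace enforcement (names and values are illustrative): a ResourceQuota caps the team's total footprint, and a LimitRange gives sane defaults to pods that declare nothing.

```yaml
# Caps the aggregate consumption of everything in the team-a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Defaults applied to any container that declares no resources of its
# own -- without this, a quota-enabled namespace rejects such pods.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 200m
        memory: 256Mi
```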

Real vs. fictional multi-tenancy: if you need real tenant isolation (for compliance, security or simply trust), namespaces aren't enough. You need separate clusters or solutions like vCluster that create virtual clusters inside a physical one. Soft multi-tenancy via namespaces works for teams within the same organization that trust each other. For everything else, it's an illusion.

RBAC looks simple until you have 15 teams with different needs. My advice: define roles at the namespace level (not cluster roles), use AD/LDAP groups instead of individual users, and audit permissions quarterly. Permissions accumulate. Nobody asks to have access removed.
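A sketch of that pattern (namespace, role and group names are illustrative): a namespace-scoped Role bound to a directory group rather than to individuals.

```yaml
# Namespace-scoped role: can manage Deployments in team-a, nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer
  namespace: team-a
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
# Bound to an AD/LDAP group, so access follows group membership and
# offboarding happens in the directory, not in YAML.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: "ad:team-a-deployers"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io
```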

Kubernetes is a tool. Not an architecture. Don't adopt K8s because it's what everyone uses. Adopt it when the problem it solves is bigger than the complexity it introduces. And when you do, invest in your team before you invest in the cluster.

Jorel del Portal

Systems engineer specialized in enterprise software architecture and high availability platforms.