The JVM is a remarkably efficient black box. It manages memory, threads, garbage collection, and JIT compilation without you having to think about it — until something goes wrong. When it does go wrong in production, the JVM doesn't volunteer information. You have to ask, and there are two tools for asking: thread dumps and heap dumps.
I've diagnosed deadlocks at 2 AM, memory leaks that took weeks to surface, and saturated thread pools that took down entire platforms. In every case, the answer was in a dump. The problem is that many engineers have never read one, and when they need to, they don't know where to start.
Thread Dumps: what every thread is doing
A thread dump is a snapshot of every thread in the JVM at a given instant. Each thread shows its name, its state, and its full stack trace. It's the fundamental tool for diagnosing concurrency problems, resource contention, and deadlocks.
Of the six states defined by java.lang.Thread.State, four dominate thread dumps, and each tells a different story:
- RUNNABLE: the thread is actively executing code or is ready to run. If many threads are RUNNABLE doing the same operation, you have a CPU hotspot.
- WAITING: the thread waits indefinitely for another thread to notify it. Typical of empty connection pools or queues with no producers.
- BLOCKED: the thread is trying to acquire a lock held by another thread. Many threads BLOCKED on the same monitor means serious contention.
- TIMED_WAITING: like WAITING, but with a timeout. Thread.sleep(), Object.wait(timeout), or I/O with a time limit.
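To see these states outside of a production incident, a small program can manufacture each one and read it back with Thread.getState(). A minimal sketch using only java.lang, with a short sleep to let the threads settle into their states:

```java
public class ThreadStates {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();

        // WAITING: parks indefinitely until another thread notifies it
        Thread waiting = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); } catch (InterruptedException ignored) {}
            }
        }, "waiting-thread");

        // TIMED_WAITING: sleeps with a timeout
        Thread timed = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, "timed-thread");

        // BLOCKED: tries to enter a monitor the main thread is holding
        Object contended = new Object();
        Thread blocked = new Thread(() -> {
            synchronized (contended) { /* never reached while main holds it */ }
        }, "blocked-thread");

        synchronized (contended) {
            waiting.start(); timed.start(); blocked.start();
            Thread.sleep(500); // give the threads time to reach their states
            System.out.println(waiting.getName() + ": " + waiting.getState());
            System.out.println(timed.getName() + ": " + timed.getState());
            System.out.println(blocked.getName() + ": " + blocked.getState());
        }
        waiting.interrupt(); // unblock the parked threads so the JVM can exit
        timed.interrupt();
    }
}
```

Run it and take a jstack of the process at the same time: the states printed match the states in the dump.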
Capturing thread dumps
Three main methods, all production-safe. They don't stop the JVM, cause no significant pause, and require no restart:
- jstack <PID>: the classic approach. Ships with the JDK. Fast and direct. If the JVM is unresponsive, use jstack -F to force it.
- kill -3 <PID>: sends SIGQUIT to the Java process. The dump goes to stdout (or the container log). Works even when jstack can't connect.
- jcmd <PID> Thread.print: the modern option. More flexible than jstack with better output formatting. My preference for recent JVMs.
A single thread dump shows one instant. To diagnose intermittent issues, capture three or four at 5-10 second intervals. Threads that appear in the same state and the same line of code across all captures are your suspects.
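The same snapshot is also available in-process through the standard ThreadMXBean API, which is handy for wiring a dump into your own diagnostics endpoint instead of shelling out to jstack. A minimal sketch (the output format here is my own, not jstack's):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class InProcessDump {
    // Render a jstack-like snapshot of every live thread.
    static String threadDump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // true, true = include locked monitors and ownable synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(threadDump());
    }
}
```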
Diagnostic patterns in thread dumps
After reading hundreds of thread dumps, the patterns repeat. These are the most common:
- Deadlock: two or more threads blocked waiting for a lock the other holds. The JVM detects these automatically and reports them at the end of the dump with "Found one Java-level deadlock". If you see this, review the lock acquisition order in your code.
- Thread starvation: the thread pool is exhausted. All workers are busy (RUNNABLE or BLOCKED) and new requests pile up in the queue. Happens when the pool is too small or when blocking operations should be asynchronous.
- Pool saturation: a variant specific to connection pools. Many threads in WAITING at getConnection() or borrowObject(). The database or downstream service can't keep up with demand.
- Lock contention: dozens of threads BLOCKED waiting on the same monitor. A single lock becomes a bottleneck. Fix: reduce the critical section, use finer-grained locks, or consider lock-free structures.
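The deadlock detector that writes "Found one Java-level deadlock" is also exposed programmatically via ThreadMXBean.findDeadlockedThreads(). A sketch that manufactures the textbook opposite-order deadlock and then reports who is holding what (daemon threads so the JVM can still exit):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) throws InterruptedException {
        Object lockA = new Object(), lockB = new Object();

        // Two threads acquire the same locks in opposite order: a textbook deadlock.
        spin("t1", lockA, lockB).start();
        spin("t2", lockB, lockA).start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 50 && ids == null; i++) {
            Thread.sleep(100); // poll until the deadlock has formed
            ids = mx.findDeadlockedThreads();
        }
        if (ids != null) {
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                System.out.println(info.getThreadName()
                        + " blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
    }

    static Thread spin(String name, Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                synchronized (second) { }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the deadlocked threads
        return t;
    }
}
```

This is the same detector the JVM runs when it appends the deadlock report to a thread dump; polling it from a watchdog thread is a cheap way to alert before users do.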
For automated analysis, tools like IBM Thread Analyzer and fastthread.io parse the dump, group threads by state, detect deadlocks, and visualize contention. fastthread.io is particularly useful — it works in the browser: upload the file and get an immediate report.
Heap Dumps: what's in memory
If the thread dump shows what the JVM is doing, the heap dump shows what it's holding. It's a complete capture of the heap: every object, its type, its size, and its references to other objects. It's the definitive tool for diagnosing memory leaks and garbage collection pressure.
A heap dump can be several gigabytes — proportional to the configured heap size. Capturing it causes a stop-the-world pause whose duration depends on heap size. In production, be aware of this impact.
Capturing heap dumps
- jmap -dump:format=b,file=heap.hprof <PID>: on-demand capture. Use when you need to diagnose an active problem.
- -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/: the most important flag. Always configure it in production. When the JVM runs out of memory, it automatically generates the dump before dying. Without this, you lose the evidence.
- jcmd <PID> GC.heap_dump /path/heap.hprof: modern alternative to jmap. Same functionality, better integration with management tools.
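A fourth, programmatic option: the HotSpotDiagnosticMXBean (HotSpot-specific, in com.sun.management) can write an .hprof file from inside the process, which is useful for triggering a dump from an admin endpoint. A sketch; note that dumpHeap refuses to overwrite an existing file, so the path must be fresh:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the target file already exists, so reserve a
        // unique temp name and delete the placeholder before dumping
        File out = File.createTempFile("heap", ".hprof");
        out.delete();
        // live=true dumps only objects reachable from GC roots,
        // which forces a full GC first (the stop-the-world pause)
        bean.dumpHeap(out.getAbsolutePath(), true);
        System.out.println("dump written: " + (out.exists() && out.length() > 0));
        out.delete(); // clean up the example's dump file
    }
}
```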
Analysis with Eclipse MAT
Eclipse Memory Analyzer Tool (MAT) is the standard tool for heap dump analysis. Free, robust, and capable of handling multi-gigabyte dumps. Three fundamental analyses:
- Dominator Tree: shows objects retaining the most memory. If a single object dominates 60% of the heap, that's your problem. The dominator tree tells you exactly which object prevents the garbage collector from freeing memory.
- Leak Suspects: MAT runs heuristics to identify potential leaks automatically. Not always right, but an excellent starting point. The report includes the reference chain from GC root to the suspect object.
- Histograms: list of all classes with instance count and total size. Compare histograms from two dumps taken at different times — classes whose count grows are leak candidates.
Common memory leak patterns
- Unbounded caches: a HashMap that grows indefinitely because entries are never removed. Fix: use caches with eviction policies (Caffeine, Guava Cache) or WeakHashMap.
- Unclosed connections: database connections, file streams, or HTTP clients that are opened and never closed. They accumulate in the heap and eventually exhaust OS file descriptors too.
- Class loader leaks: typical in application servers. The app is redeployed but the previous class loader isn't released because a thread or static reference holds it. The heap grows with each redeployment until the JVM crashes.
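As an illustration of the unbounded-cache fix, a bounded cache can be sketched with nothing but the JDK: LinkedHashMap in access order gives LRU eviction via removeEldestEntry, and a per-entry expiry timestamp gives a TTL checked on read. This is a sketch only; in production a library like Caffeine is the better choice:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bounded cache: size-capped LRU plus a TTL checked on read.
public class BoundedCache<K, V> {
    private static final class Entry<V> {
        final V value; final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final long ttlMillis;
    private final Map<K, Entry<V>> map;

    public BoundedCache(int maxSize, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true makes iteration order LRU rather than insertion order
        this.map = new LinkedHashMap<K, Entry<V>>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > maxSize; // evict the least recently used entry
            }
        };
    }

    public synchronized void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public synchronized V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAt) { map.remove(key); return null; }
        return e.value;
    }

    public synchronized int size() { return map.size(); }

    public static void main(String[] args) {
        // 1000-entry cap, 30-minute TTL: 5000 puts can never grow the heap unbounded
        BoundedCache<Integer, String> cache = new BoundedCache<>(1000, 30 * 60_000L);
        for (int i = 0; i < 5000; i++) cache.put(i, "doc-" + i);
        System.out.println("size after 5000 puts: " + cache.size());
    }
}
```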
Real case: memory leak in Sterling OMS
A production IBM Sterling environment ran business processes handling purchase orders. The heap grew steadily: 4 GB after startup, 6 GB after 48 hours, OutOfMemoryError after a week. The team restarted the JVM every 5 days as a "fix".
We configured -XX:+HeapDumpOnOutOfMemoryError and waited. When the OOM hit, we analyzed
the dump with Eclipse MAT. The Dominator Tree revealed a single java.util.ArrayList
instance retaining 2.3 GB of heap. That list lived inside a processed document cache that never
ran eviction.
MAT's Leak Suspects pointed directly to the chain: GC Root → ThreadLocal → BusinessProcessContext → DocumentCache → ArrayList with 1.2 million entries. Every processed document was added for "reuse" but never removed. The fix was a size limit and a 30-minute TTL. The heap stabilized at 3.5 GB.
Don't guess what's happening in the JVM. Measure it. A thread dump takes 2 seconds to capture and can save you hours of speculation. A heap dump is the difference between "I think there's a leak" and "I know exactly where the leak is".