Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions

If you try to monitor a distributed, masterless database like Cassandra using the same dashboard you use for a monolithic relational database, you will misdiagnose every single incident.

Situation

Apache Cassandra operates on fundamentally different assumptions than relational systems like PostgreSQL or MySQL. It is an AP system in the CAP theorem context: highly available, partition tolerant, and eventually consistent. Data is distributed across a ring of nodes, writes are appended to memory and disk sequentially, and deletes are executed by inserting a marker called a “tombstone.”

When teams adopt Cassandra, they often plug it into their existing monitoring stack. They set alerts on CPU utilization, disk space, and memory consumption. But in Cassandra, a node running at 80% CPU might be perfectly healthy and churning through background compaction, while a node at 20% CPU might be silently dropping mutations because it is overwhelmed by tombstones during read repair. Generic infrastructure metrics are insufficient; you must observe Cassandra’s internal state machine.

Symptoms

A Cassandra cluster experiencing distress exhibits unique failure modes that rarely trigger standard host-level alarms until it is too late:

The Tombstone Overwhelm: Read latency spikes for a specific table. CPU is low, but the application is timing out. The node is scanning and discarding thousands of deleted records (tombstones) to return a single live row.
The Compaction Debt: Disk usage begins climbing relentlessly. The node is writing data faster than the background compaction threads can merge the SSTables, leading to read latency degradation as queries must scan dozens of fragmented files.
The Partition Hotspot: One node in a 10-node cluster is pegged at 100% CPU while the other nine sit at 15%. A single customer or entity is receiving a disproportionate share of traffic, overwhelming the node responsible for that token range.
The Repair Drift: Nodes return inconsistent data depending on the consistency level (LOCAL_QUORUM vs ONE). Anti-entropy repair processes have fallen behind or failed, leading to stale reads.

First Five Checks

When a Cassandra pager alert fires—especially for p99 latency spikes—these are the five internal metrics you must check:

Check Pending Tasks (nodetool tpstats): This shows the thread pool statistics. The critical metrics are Pending and Dropped messages. If MutationStage or ReadStage have high pending counts, the node is saturated. If there are dropped mutations, data is not being written.
Evaluate Compaction Backlog (nodetool compactionstats): Look at pending tasks. A small number is normal. A number in the hundreds or thousands indicates compaction has fallen permanently behind the write rate.
Analyze Tombstone Ratios (Log inspection or JMX metrics): Check the system.log for warnings about Scanned over X tombstones. If this number exceeds the tombstone_warn_threshold, read queries are doing massive amounts of wasted work.
Verify Client Request Latency via JMX/Metrics: Look at ClientRequest.Latency.Read and ClientRequest.Latency.Write at the 99th percentile (p99). Cassandra is highly optimized for writes; if write latency spikes, disk I/O is usually the bottleneck.
Examine Partition Sizes (nodetool tablestats): Look for the Compacted partition maximum bytes. If a single partition exceeds 100MB, you have a data modeling problem causing a hotspot, not an infrastructure problem.

Decision Tree

When diagnosing a Cassandra latency spike, use the following operational flow:

flowchart TD
    A[p99 Latency Spike Detected] --> B{Is it Read or Write Latency?}
    B -->|Write| C[Check Pending Tasks]
    C --> C1{Are Mutations Dropping?}
    C1 -->|Yes| C2[Node is Overwhelmed: Add Capacity or Shed Load]
    C1 -->|No| C3[Check Disk I/O Wait]
    C3 -->|High| C4[Storage Bottleneck: Upgrade Disks]
    
    B -->|Read| D[Check Pending Tasks]
    D --> D1{Are ReadStages Pending?}
    D1 -->|No| D2[Check Tombstone Warnings in Logs]
    D2 -->|High| D3[Tombstone Overwhelm: Change Data Model or Lower GC Grace]
    D2 -->|Low| D4[Check Compaction Backlog]
    D4 -->|High| D5[Fragmented Reads: Tune Compaction Throughput]

Remediation Options

Tune Compaction Throughput (Medium Speed, Low Risk): If compaction is falling behind, you can dynamically increase compaction_throughput_mb_per_sec using nodetool setcompactionthroughput.
- Tradeoff: Compaction is highly I/O intensive. Increasing throughput might clear the backlog but can temporarily degrade read and write latencies.
Add Nodes to the Ring (Slow, Permanent Fix): If the entire cluster is legitimately saturated (high CPU, high pending tasks, dropping mutations across the ring), you must bootstrap new nodes.
- Tradeoff: Bootstrapping involves streaming data across the network, which adds load to the existing struggling nodes. Do not wait until the cluster is at 95% capacity to scale.
Lower gc_grace_seconds (Fast, High Risk): If tombstones are crushing read performance on a specific table, and you do not require a long window for resurrecting dead data via repair, you can lower gc_grace_seconds via ALTER TABLE.
- Tradeoff: If a node goes down for longer than the new gc_grace_seconds and misses a delete, that deleted data will “resurrect” when the node comes back online.

Rollback Plan

If you tune compaction throughput too aggressively and disk I/O saturates causing widespread query timeouts, revert compaction_throughput_mb_per_sec to its previous conservative value (e.g., 16 MB/s) using nodetool setcompactionthroughput 16. Note: setting the value to 0 removes the limit entirely — it does not pause compaction. If background compaction is actively destroying cluster stability, use nodetool stop COMPACTION to halt the specific running tasks until I/O pressure subsides.

Automation Opportunity

Deploy an automated script that polls JMX metrics for Dropped Mutations across all nodes. If a node begins dropping mutations for more than 5 minutes, automatically route application traffic away from that specific node’s local datacenter (if running multi-DC) or trigger a high-severity incident, because dropped mutations mean permanent data loss if not recovered via hinted handoff or repair.

Leadership Summary

Acknowledge the Cassandra Tax: Cassandra requires ongoing background maintenance (compaction and repair). You must provision your clusters so that they run at no more than 50-60% capacity during normal operations to leave headroom for this maintenance.
Data Modeling is Operations: 90% of Cassandra performance issues are caused by bad data models (large partitions or heavy deletes), not bad hardware.
Monitor the 99th Percentile: Cassandra is known for stable average latencies but terrifying tail latencies during JVM garbage collection or heavy compaction. Always alert on p99, never on the average.

What to Do Next

Problem: Cassandra’s most destructive failure modes — tombstone read amplification, compaction debt, hot partitions — don’t register on CPU or memory dashboards until the cluster is already in distress, because a node scanning 50,000 tombstones to return one row can run at 20% CPU while its read latency is at 10 seconds.
Solution: Ingest nodetool tpstats (pending and dropped task counts), nodetool compactionstats (pending compaction tasks), and tombstone scan warnings from system.log as time-series metrics alongside host metrics — these are the only signals that surface Cassandra-specific distress before it becomes visible to users.
Proof: Artificially generate thousands of deletes on a test table in staging and verify that read latency alerts fire before the problem appears on CPU charts — if CPU is the first signal, the monitoring doesn’t give enough lead time.
Action: Configure JMX metrics ingestion (Datadog JMX integration or Prometheus JMX exporter) this week and add a panel tracking ClientRequest.Latency.Read p99 and Pending CompactionExecutor tasks — these two metrics together explain most Cassandra incidents.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

What to Do Next

Rajiv

Related Posts

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

The Database Observability Baseline: What Every DBA Dashboard Must Show