Cassandra Write Path Fundamentals for Database Engineers
Cassandra’s write performance reputation is correct but incomplete — writes are fast because Cassandra converts random writes into sequential I/O, and the operational cost of that conversion is paid later through compaction, which can saturate disk throughput if the strategy does not match the workload.
Situation
Database engineers familiar with PostgreSQL or MySQL approach Cassandra expecting tunable durability, indexing flexibility, and a query optimizer. Cassandra’s durability and performance model works differently: the write path is optimized for sequential I/O at the cost of deferred merge work, and the query model is constrained by the partition key and clustering columns defined at schema creation.
Cassandra is used in production for workloads requiring high write throughput, time-series data, and geographic multi-region replication — systems where the write path’s operational characteristics are the primary design constraint.
The Problem
The fundamental problem Cassandra solves is random write throughput. Traditional relational databases perform writes by updating rows in-place on disk pages, which requires random I/O to locate the correct page. At high write rates across large datasets, this random I/O pattern saturates disk throughput.
Cassandra converts all writes into sequential operations: every write appends to the commit log (sequential disk write) and updates an in-memory structure (Memtable). When the Memtable exceeds a threshold, it is flushed to disk as an immutable SSTable (Sequential String Table) file. The database never updates SSTables in place — mutations are always new writes. This makes the write path fast, but it defers the cost of merging and garbage-collecting old data to compaction.
The core question: which compaction strategy minimizes the operational cost of the deferred merge work for the workload’s specific access pattern?
The Write Path
flowchart TD
A[write request — partition key and columns] --> B[commit log — sequential append — fsync]
B --> C[Memtable — in-memory sorted structure]
C --> D{Memtable full or flush triggered?}
D -->|no — within threshold| E[write acknowledged to client]
D -->|yes — threshold exceeded| F[flush Memtable to SSTable on disk]
F --> G[new immutable SSTable file]
G --> H{compaction threshold reached?}
H -->|no| I[multiple SSTables accumulate]
H -->|yes| J[compaction — merge SSTables — discard tombstones]
J --> K[fewer larger SSTables]
Commit Log
Every write is first appended to the commit log — a sequential append-only file on disk. Cassandra uses the commit log for crash recovery: if the process dies before the Memtable is flushed, the commit log replays the unwritten data on restart. The commit log is the durability guarantee.
Cassandra’s commitlog_sync setting controls when the commit log is fsynced to disk:
periodic(default): writes are acknowledged after being written to the OS buffer; an fsync happens periodically (default 10,000ms). This is fast but risks losing up to 10 seconds of writes if the node crashes.batch: fsync happens before the write is acknowledged. Durable but slower — adds the fsync latency to every write.
Most high-throughput production deployments use periodic mode with the understanding that a crash can lose up to commitlog_sync_period_in_ms of data.
Memtable
After the commit log append, the write is applied to the Memtable — an in-memory sorted data structure partitioned by the partition key and ordered by clustering columns. Multiple concurrent writes accumulate in the Memtable until it is flushed. Reads that target recently written data are served from the Memtable without hitting disk.
The Memtable is bounded by memtable_heap_space_in_mb and memtable_offheap_space_in_mb. When the Memtable exceeds the threshold or when a flush is triggered manually, Cassandra writes it to disk as an immutable SSTable and starts a new Memtable.
SSTable and Compaction
SSTables are immutable files. An update to an existing row writes a new SSTable entry with a higher timestamp — the old value is not removed. A delete writes a tombstone — a marker indicating the row was deleted. Tombstones accumulate in SSTables until compaction.
Reads must check all SSTables for the most recent version of a row (plus the Memtable). As SSTable count grows, read latency increases because more files must be checked. Compaction merges SSTables, applies the recency rule (highest timestamp wins), removes tombstones beyond the gc_grace_seconds threshold, and produces fewer, larger SSTables. This reduces read amplification at the cost of write amplification (new SSTable files written during compaction).
In Practice
Cassandra’s documentation describes three compaction strategies, each with different tradeoffs (Apache Cassandra compaction):
Size-Tiered Compaction Strategy (STCS) — the default. Groups SSTables of similar sizes into tiers and merges within each tier when the count exceeds a threshold (default 4). Write amplification is low — fewer bytes are rewritten per compaction cycle. Read amplification is higher because many SSTables can accumulate before a tier triggers. STCS is appropriate for write-heavy workloads where read latency is less critical.
Leveled Compaction Strategy (LCS) — maintains SSTables in levels where each SSTable in a level covers a disjoint key range. A given partition key exists in exactly one SSTable per level (except Level 0). This keeps read amplification low — finding a row requires checking at most one SSTable per level — but write amplification is significantly higher because SSTables are rewritten frequently to maintain the level invariant. LCS is appropriate for read-heavy workloads where predictable read latency is required.
Time Window Compaction Strategy (TWCS) — groups SSTables by time window and compacts within each window. SSTables from old, expired windows are compacted into a single file and then not recompacted. This is optimal for time-series data where old data is rarely updated, because it avoids repeatedly rewriting old SSTables. Cassandra’s TWCS documentation is specific about a key requirement: time-to-live (TTL) must be set consistently on all data in a TWCS table, or tombstones from rows without TTL will never be fully compacted away (TWCS documentation).
Tombstone accumulation as an operational hazard. In Cassandra’s documented behavior, tombstones for deleted rows accumulate across SSTables until compaction runs and gc_grace_seconds elapses. If a partition accumulates a large number of tombstones before compaction (due to high delete rates, low compaction throughput, or misconfigured gc_grace_seconds), reads on that partition must scan through all tombstones before returning results. Cassandra’s coordinator logs a warning at 1,000 tombstones per read and throws a TombstoneOverwhelmingException at 100,000. High tombstone counts are the most common cause of unexpected read latency on write-optimized Cassandra tables.
Where It Breaks
| Scenario | What breaks | Why |
|---|---|---|
| STCS on read-heavy workload | Read latency grows as SSTable count increases between compaction cycles | STCS allows many same-size SSTables to accumulate; reads must check each one |
| LCS on write-heavy workload | Compaction I/O saturates disk throughput | High write amplification from maintaining level invariants requires continuous rewriting |
| TWCS with mixed TTL and non-TTL data | Tombstones never fully compacted in old windows | Non-TTL rows in old time windows prevent old SSTable retirement |
commitlog_sync: batch at high write rate | Write throughput drops significantly | Each write waits for an fsync; batching does not fully absorb the overhead at high concurrency |
| Large partition with many updates | Read latency spikes; repair timeouts | Large partitions accumulate many SSTable entries; repair must process the full partition |
gc_grace_seconds set to 0 | Deleted rows reappear after node repair | Tombstones are the mechanism for propagating deletes during hinted handoff; removing them before repair risks resurrection |
| Unbounded Memtable heap | JVM GC pauses | Memtable allocation competes with JVM heap for Cassandra processes; excessive heap causes long GC pauses |
What to Do Next
- Problem: Cassandra’s sequential write path makes writes fast, but the deferred compaction cost creates a continuous background I/O load that can saturate disk and cause read latency spikes if the compaction strategy does not match the workload.
- Solution: Select STCS for write-heavy append workloads, LCS for read-heavy workloads with updates and point lookups, and TWCS for time-series tables with consistent TTL — and verify tombstone accumulation rates on high-delete tables using
nodetool cfstats. - Proof: Run
nodetool compactionstatsto see pending compaction tasks and measure live disk I/O during compaction; if compaction cannot keep up with write rate (pending task count grows continuously), the strategy or write rate is mismatched. - Action: Identify your highest-volume Cassandra tables this week, confirm which compaction strategy each uses, and check
nodetool cfstatsfor tombstone count — any table with tombstones per read above 1,000 warrants immediate investigation.