B-tree vs LSM Tree: The Storage Engine Tradeoff
The storage engine is the most consequential architectural decision in a database, and the core tradeoff has not changed in fifty years: B-trees are fast to read; LSM trees are fast to write. Your workload determines which penalty you can afford.
Situation
Most engineers working with relational databases have never chosen a storage engine — PostgreSQL uses a B-tree heap by default, and the choice was made for them. Engineers working with Cassandra, RocksDB, or FoundationDB are using LSM trees, often without knowing why the database was designed that way.
The two structures dominate modern database storage: B-trees (balanced tree indexes used in PostgreSQL, MySQL InnoDB, Oracle) and LSM trees (log-structured merge trees used in Cassandra, LevelDB, RocksDB, and HBase). Each trades read performance for write performance in a different direction.
The Problem
Choosing or operating a database without understanding the storage engine’s read/write tradeoffs leads to predictable operational failures. A B-tree database under sustained high-write workloads shows write amplification and fragmentation. An LSM-tree database that is read-heavy shows read amplification as the engine scans multiple levels of sorted files. You cannot tune your way out of the wrong structural choice.
What is the actual tradeoff, and when does each structure win?
The Structures
B-trees store data in a balanced tree of fixed-size pages, typically 8KB in PostgreSQL. An UPDATE modifies the page in place after finding it via the tree. Reads are efficient: traverse from root to leaf, read the page. Writes require finding the right page, potentially splitting it (causing write amplification), and updating parent pointers. B-trees are random-write structures — every update touches disk in place.
LSM trees never update in place. Writes go to an in-memory buffer (memtable), which is periodically flushed to an immutable sorted file (SSTable) on disk. Reads must check the memtable and potentially multiple SSTable levels to find the current version. Background compaction merges SSTables, reclaiming space and reducing the number of levels to check. LSM trees are sequential-write structures — disk writes are always sequential appends.
B-tree read: O(log n) — traverse tree, read page
B-tree write: O(log n) — find page, modify in place (random I/O)
LSM write: O(1) amortized — append to memtable, flush sequentially
LSM read: O(L) — check L levels of SSTables for latest version
| Attribute | B-tree | LSM tree |
|---|---|---|
| Write path | Random in-place page modification | Sequential append to memtable → SSTable flush |
| Read path | Tree traversal, one disk read at leaf | Multi-level SSTable scan (read amplification) |
| Write throughput | Good for balanced workloads | Excellent; consistently low write latency |
| Read throughput | Excellent for point lookups and range scans | Moderate; degrades as SSTable level count grows |
| Space overhead | Fragmentation accumulates; autovacuum reclaims | Space amplification during compaction windows |
| Background work | Autovacuum, checkpoint, bgwriter | Compaction (CPU and I/O intensive at peak) |
| Best workload | OLTP: balanced reads/writes, point lookups, range scans | Write-heavy: IoT, time-series, event streams |
| Databases | PostgreSQL, MySQL InnoDB, Oracle, SQLite | Cassandra, RocksDB, HBase, FoundationDB |
In Practice
PostgreSQL’s documented design uses heap files with B-tree indexes. The B-tree is the correct structure for OLTP workloads with balanced reads and writes, point lookups, and range scans. PostgreSQL’s MVCC model (dead tuples in the heap) means writes also accumulate page fragmentation that autovacuum must reclaim — the cost of in-place updates.
Cassandra’s documented design uses an LSM tree (via SSTables). Cassandra is optimized for write-heavy workloads: time-series, IoT, event streams, and any pattern where writes vastly outnumber reads. The tradeoff is that reads are more expensive (scanning multiple SSTables), and compaction consumes I/O bandwidth during which read latency can increase.
Where It Breaks
| Workload | B-tree result | LSM result |
|---|---|---|
| High write throughput | Write amplification; page splits; fragmentation | Sequential append; consistent write latency |
| Point lookups (read-heavy) | Fast; single tree traversal | Slower; must check multiple SSTable levels |
| Range scans | Fast; sorted pages | Moderate; sorted within SSTables, merge across levels |
| Compaction pressure | Autovacuum reclaims dead tuples continuously | Background compaction spikes I/O; read latency degrades |
What to Do Next
- Problem: Operating a write-heavy workload on a B-tree engine or a read-heavy workload on an LSM engine produces predictable performance degradation that cannot be tuned away.
- Solution: Classify your workload by read/write ratio, access pattern (point vs range), and acceptable latency variance before selecting an engine.
- Proof: On a B-tree database, measure write amplification via
pg_stat_bgwriter; on an LSM database, measure read amplification via SSTable level counts in the engine’s metrics. - Action: Identify your top three most write-intensive tables today and measure their dead tuple ratio — that is the B-tree’s write tax showing up as storage overhead.