B-tree vs LSM Tree: The Storage Engine Tradeoff

The storage engine is the most consequential architectural decision in a database, and the core tradeoff has not changed in fifty years: B-trees are fast to read; LSM trees are fast to write. Your workload determines which penalty you can afford.

Situation

Most engineers working with relational databases have never chosen a storage engine — PostgreSQL uses a B-tree heap by default, and the choice was made for them. Engineers working with Cassandra, RocksDB, or FoundationDB are using LSM trees, often without knowing why the database was designed that way.

The two structures dominate modern database storage: B-trees (balanced tree indexes used in PostgreSQL, MySQL InnoDB, Oracle) and LSM trees (log-structured merge trees used in Cassandra, LevelDB, RocksDB, and HBase). Each trades read performance for write performance in a different direction.

The Problem

Choosing or operating a database without understanding the storage engine’s read/write tradeoffs leads to predictable operational failures. A B-tree database under sustained high-write workloads shows write amplification and fragmentation. An LSM-tree database that is read-heavy shows read amplification as the engine scans multiple levels of sorted files. You cannot tune your way out of the wrong structural choice.

What is the actual tradeoff, and when does each structure win?

The Structures

B-trees store data in a balanced tree of fixed-size pages, typically 8KB in PostgreSQL. An UPDATE modifies the page in place after finding it via the tree. Reads are efficient: traverse from root to leaf, read the page. Writes require finding the right page, potentially splitting it (causing write amplification), and updating parent pointers. B-trees are random-write structures — every update touches disk in place.

LSM trees never update in place. Writes go to an in-memory buffer (memtable), which is periodically flushed to an immutable sorted file (SSTable) on disk. Reads must check the memtable and potentially multiple SSTable levels to find the current version. Background compaction merges SSTables, reclaiming space and reducing the number of levels to check. LSM trees are sequential-write structures — disk writes are always sequential appends.

B-tree read:  O(log n) — traverse tree, read page
B-tree write: O(log n) — find page, modify in place (random I/O)

LSM write:    O(1) amortized — append to memtable, flush sequentially
LSM read:     O(L) — check L levels of SSTables for latest version

Attribute	B-tree	LSM tree
Write path	Random in-place page modification	Sequential append to memtable → SSTable flush
Read path	Tree traversal, one disk read at leaf	Multi-level SSTable scan (read amplification)
Write throughput	Good for balanced workloads	Excellent; consistently low write latency
Read throughput	Excellent for point lookups and range scans	Moderate; degrades as SSTable level count grows
Space overhead	Fragmentation accumulates; autovacuum reclaims	Space amplification during compaction windows
Background work	Autovacuum, checkpoint, bgwriter	Compaction (CPU and I/O intensive at peak)
Best workload	OLTP: balanced reads/writes, point lookups, range scans	Write-heavy: IoT, time-series, event streams
Databases	PostgreSQL, MySQL InnoDB, Oracle, SQLite	Cassandra, RocksDB, HBase, FoundationDB

In Practice

PostgreSQL’s documented design uses heap files with B-tree indexes. The B-tree is the correct structure for OLTP workloads with balanced reads and writes, point lookups, and range scans. PostgreSQL’s MVCC model (dead tuples in the heap) means writes also accumulate page fragmentation that autovacuum must reclaim — the cost of in-place updates.

Cassandra’s documented design uses an LSM tree (via SSTables). Cassandra is optimized for write-heavy workloads: time-series, IoT, event streams, and any pattern where writes vastly outnumber reads. The tradeoff is that reads are more expensive (scanning multiple SSTables), and compaction consumes I/O bandwidth during which read latency can increase.

Where It Breaks

Workload	B-tree result	LSM result
High write throughput	Write amplification; page splits; fragmentation	Sequential append; consistent write latency
Point lookups (read-heavy)	Fast; single tree traversal	Slower; must check multiple SSTable levels
Range scans	Fast; sorted pages	Moderate; sorted within SSTables, merge across levels
Compaction pressure	Autovacuum reclaims dead tuples continuously	Background compaction spikes I/O; read latency degrades

What to Do Next

Problem: Operating a write-heavy workload on a B-tree engine or a read-heavy workload on an LSM engine produces predictable performance degradation that cannot be tuned away.
Solution: Classify your workload by read/write ratio, access pattern (point vs range), and acceptable latency variance before selecting an engine.
Proof: On a B-tree database, measure write amplification via pg_stat_bgwriter; on an LSM database, measure read amplification via SSTable level counts in the engine’s metrics.
Action: Identify your top three most write-intensive tables today and measure their dead tuple ratio — that is the B-tree’s write tax showing up as storage overhead.

Situation

The Problem

The Structures

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

CAP Theorem in Operational Terms

Caches, Queues, and Databases: When to Use Each

Consistency Models Your Application Actually Needs