Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness

Search drift is not a search problem first. It is a truth-management problem that becomes visible through search.

Situation

Most product systems keep their source of truth in a transactional database and serve discovery from a separate search index. The database is optimized for correctness, constraints, and writes. The index is optimized for ranking, tokenization, faceting, filtering, autocomplete, and latency.

That split is normal. PostgreSQL, MySQL, DynamoDB, Spanner, or another system owns the canonical record. Elasticsearch, OpenSearch, Solr, Vespa, Algolia, or a custom retrieval layer owns the read path for search. Between them sits a workflow that turns database mutations into index mutations.

The uncomfortable part is that the index is not merely a cache. Users treat search results as product truth. If a deleted document still appears, if a price update lags, if an access-control change is missing, or if a newly created object is absent, the failure is not described as “eventual consistency.” It is described as “the product is wrong.”

Search index drift is the gap between canonical state and searchable state. Some drift is expected. Unbounded drift is an incident.

The Problem

Teams usually discover drift after adopting one of three write patterns.

The first is application dual write: the request handler writes the database and then writes the search index. This looks simple until partial failure appears. The database commit succeeds but the index write times out, a retry lands after a newer write and overwrites it with stale data, or the process crashes between the two operations. If the two systems cannot share a transaction boundary, the application has accepted a consistency gap.

The second is asynchronous job indexing: writes enqueue work, and workers update the index later. This removes latency from the request path, but it creates a backlog system. Queue lag, poison messages, deploy bugs, and schema incompatibilities become search correctness risks.

The third is periodic rebuild: a scheduled job scans the database and recreates the index. Rebuilds are useful, but they are not a complete freshness strategy. A nightly rebuild can repair silent corruption, but on its own it leaves up to a full day of visible staleness, so it cannot provide minute-level correctness.

The core question is not “which tool indexes fastest?” It is: how do we bound, observe, repair, and communicate the difference between source-of-truth state and search-visible state?

Core Concept

The practical architecture combines four ideas: change capture, idempotent indexing, rebuildable indexes, and user-visible freshness controls.

flowchart TD
  A[primary database — canonical records] --> B[transaction log — ordered changes]
  B --> C[change capture workers — durable cursor]
  C --> D[index writer — idempotent updates]
  D --> E[active search index — user queries]
  A --> F[bulk rebuild job — full snapshot]
  F --> G[shadow search index — validation target]
  G --> H[index alias switch — controlled cutover]
  C --> I[drift monitor — lag and mismatches]
  I --> J[operator workflow — replay repair rebuild]
  E --> K[user interface — freshness signals]

The database remains the only source of truth. Search documents carry source version metadata: record ID, updated timestamp, logical sequence number, schema version, and deletion marker. Index writes are idempotent, so replaying the same change is safe. Out-of-order writes are rejected when the incoming version is older than the indexed version.
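
The version check described above can be sketched in a few lines. This is a minimal in-memory illustration, not a client for any real search engine: `IndexDoc`, `InMemoryIndex`, and the field names are hypothetical, and the source version is assumed to be a single integer that only increases per record. Some engines offer a comparable check natively (Elasticsearch's external versioning, for instance), which would replace the comparison done here by hand.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class IndexDoc:
    """Hypothetical search document carrying source version metadata."""
    record_id: str
    source_version: int   # assumed: logical sequence number from the source
    schema_version: int
    deleted: bool         # deletion marker (tombstone)
    body: Optional[dict]


class InMemoryIndex:
    """Stand-in for a search index: keeps the latest accepted doc per record."""

    def __init__(self) -> None:
        self.docs: Dict[str, IndexDoc] = {}

    def apply(self, incoming: IndexDoc) -> str:
        current = self.docs.get(incoming.record_id)
        # Reject out-of-order writes: the index already holds a newer version.
        if current is not None and incoming.source_version < current.source_version:
            return "rejected_stale"
        # Replaying an identical change leaves the index unchanged.
        if current == incoming:
            return "noop_replay"
        self.docs[incoming.record_id] = incoming
        return "applied"


index = InMemoryIndex()
v1 = IndexDoc("sku-1", 1, 1, False, {"price": 10})
v2 = IndexDoc("sku-1", 2, 1, False, {"price": 12})
outcomes = [index.apply(v2), index.apply(v1), index.apply(v2)]
```

In this run the newer write arrives first, the late older write is rejected, and the replayed write is a no-op, so the index ends with the version 2 document regardless of delivery order.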

Change data capture is the preferred steady-state path because it follows committed database changes rather than application intent. The application writes the database once. A CDC pipeline reads the transaction log and updates the index. This does not eliminate drift, but it moves drift into a measurable workflow: cursor lag, event age, failure rate, dead-letter volume, and version mismatch count.
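
To make the "measurable workflow" concrete, here is a small sketch of a consumer loop with a durable cursor and an event-age measurement. It assumes a simplified event shape (`ChangeEvent`) and an in-memory `CursorStore`; both are hypothetical stand-ins for whatever the CDC tool and offset storage actually provide.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ChangeEvent:
    position: int        # assumed: monotonically increasing log position
    record_id: str
    committed_at: float  # source commit time, epoch seconds


class CursorStore:
    """Stand-in for durable cursor storage (a table row, a committed offset)."""

    def __init__(self) -> None:
        self.position = 0


def consume(events: Sequence[ChangeEvent], cursor: CursorStore,
            write_to_index: Callable[[ChangeEvent], None]) -> int:
    """Apply events past the cursor; advance it only after a successful write."""
    applied = 0
    for event in events:
        if event.position <= cursor.position:
            continue  # already processed in an earlier run
        write_to_index(event)             # may raise; the cursor then stays put
        cursor.position = event.position  # so the same event is retried later
        applied += 1
    return applied


def oldest_unprocessed_age(events: Sequence[ChangeEvent], cursor: CursorStore,
                           now: float) -> Optional[float]:
    """Seconds since the oldest event not yet applied, or None if caught up."""
    pending = [e.committed_at for e in events if e.position > cursor.position]
    return (now - min(pending)) if pending else None
```

Because the cursor moves only after the index write succeeds, a crash produces redelivery rather than loss, which is why the index writer needs to be idempotent.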

Rebuilds remain mandatory. CDC preserves forward progress; rebuilds repair historical mistakes. A rebuild creates a shadow index from a consistent source snapshot, validates document counts and sampled records, warms query paths, then atomically moves an alias or routing pointer. The old index remains available for rollback until confidence is high.
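
A hedged sketch of the validate-then-switch step follows. The thresholds, function names, and index names are illustrative assumptions. The returned dictionary follows the shape of the request body that Elasticsearch's `_aliases` endpoint accepts, where the remove and add actions in one request are applied together; other engines would need a different cutover call.

```python
from typing import Dict, List


def validate_shadow(source_count: int, shadow_count: int,
                    sample_mismatches: int, sample_size: int,
                    max_count_delta: float = 0.001,
                    max_mismatch_rate: float = 0.0) -> bool:
    """Return True only if counts and sampled records are within tolerance."""
    if source_count == 0 or sample_size == 0:
        return False  # nothing to compare against, so refuse to cut over
    count_delta = abs(source_count - shadow_count) / source_count
    mismatch_rate = sample_mismatches / sample_size
    return count_delta <= max_count_delta and mismatch_rate <= max_mismatch_rate


def alias_swap_actions(alias: str, old_index: str,
                       new_index: str) -> Dict[str, List[dict]]:
    """Build one request body that moves the alias from old_index to new_index."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}


# Hypothetical index names; the old index is left in place for rollback.
if validate_shadow(source_count=1000, shadow_count=1000,
                   sample_mismatches=0, sample_size=200):
    body = alias_swap_actions("products", "products_v7", "products_v8")
```

Rolling back is the same call with the two index names swapped, which is one reason to keep the old index around until confidence is high.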

Dual writes are still useful in narrow places. For example, a product may write directly to search for low-risk preview experiences while CDC provides authoritative correction. But dual writes should not be the only correctness mechanism for objects where permissions, money, inventory, or deletion semantics matter.

User-visible staleness must be designed deliberately. Some systems can show “results updated a few seconds ago.” Others need read-after-write behavior for the author of a change, even if global search is eventually consistent. That can be handled by merging canonical database reads for the user’s own recent writes, routing a specific object lookup to the database, or hiding search results whose indexed version is older than a known permission version.
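
One way to sketch the first option, merging the user's own recent writes into search results, is shown below. The dictionary shapes (`id`, `version`, `deleted`) are assumptions, and the sketch assumes the caller has already filtered `recent_writes` down to rows that match the current query.

```python
from typing import Dict, List


def merge_recent_writes(search_hits: List[dict],
                        recent_writes: Dict[str, dict]) -> List[dict]:
    """Overlay the author's recent canonical rows on top of search hits.

    search_hits: ordered hits, each with "id" and "version".
    recent_writes: id -> canonical row with "version" and "deleted".
    """
    merged = []
    seen = set()
    for hit in search_hits:
        seen.add(hit["id"])
        canonical = recent_writes.get(hit["id"])
        if canonical is None or canonical["version"] <= hit["version"]:
            merged.append(hit)        # the index is already current
        elif not canonical["deleted"]:
            merged.append(canonical)  # show the author's newer version
        # Otherwise the author deleted it, so the stale hit is dropped.
    # Records the author created that the index has not picked up yet.
    missing = [row for rid, row in recent_writes.items()
               if rid not in seen and not row["deleted"]]
    return missing + merged
```

Placing unindexed records at the top is a product decision rather than a ranking result; the point is only that the author sees their own change while global search catches up.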

In Practice

Context: Elasticsearch documents its _reindex API and alias-based index management as operational mechanisms for copying documents into a new index and switching traffic through aliases. The documented pattern is that index structure changes and large repairs are handled by creating a new index, filling it, and moving the read alias rather than mutating the serving index in place.

Action: Apply that pattern to search drift recovery. Treat every serving index as replaceable. Keep index mappings and analyzers versioned. Build a shadow index from the canonical store, compare counts and sampled documents, then switch the alias when validation passes.

Result: Rebuilds become a normal maintenance operation instead of a one-off incident script. The system can repair missed CDC events, analyzer mistakes, mapping errors, and accidental partial deletes without taking search offline.

Learning: Rebuildability is a correctness property. If the index cannot be recreated from truth, then the index has quietly become truth.

Context: Debezium’s documented architecture captures database changes from transaction logs and emits ordered change events to downstream consumers. PostgreSQL logical decoding and MySQL binlog replication expose the same architectural principle: committed database changes can be read after the fact without placing a second write inside the application request path.

Action: Use CDC as the default index mutation source. Persist consumer offsets. Make index writes idempotent. Store source versions in documents. Send failed records to a dead-letter workflow that can be replayed after the bug is fixed.
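
The dead-letter step might look like the following sketch. The handler functions and record shape are hypothetical. Skipping a failed record and continuing is only safe when the index writer rejects stale versions, so that a replayed old record cannot overwrite a newer one that was processed in the meantime.

```python
from typing import Callable, List, Tuple

DeadLetter = Tuple[dict, str]  # (original record, error description)


def process_with_quarantine(records: List[dict],
                            handler: Callable[[dict], None],
                            dead_letters: List[DeadLetter]) -> int:
    """Run handler on each record; quarantine failures instead of stopping."""
    succeeded = 0
    for record in records:
        try:
            handler(record)
            succeeded += 1
        except Exception as exc:  # broad on purpose: any failure is quarantined
            dead_letters.append((record, repr(exc)))
    return succeeded


def replay(dead_letters: List[DeadLetter],
           handler: Callable[[dict], None]) -> List[DeadLetter]:
    """Re-run quarantined records with a fixed handler; return what still fails."""
    still_failing: List[DeadLetter] = []
    process_with_quarantine([rec for rec, _ in dead_letters], handler, still_failing)
    return still_failing
```

Keeping the original record next to the error is what makes the quarantine replayable rather than just a log of failures.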

Result: The indexing path becomes observable as a pipeline rather than hidden inside application handlers. Operators can measure lag, pause consumers, replay records, and distinguish source write failures from projection failures.

Learning: CDC does not make search strongly consistent. It makes inconsistency bounded, inspectable, and repairable.

Context: Amazon documents DynamoDB Streams as a time-ordered stream of item-level modifications, retained for 24 hours, that can trigger downstream processing. The documented pattern is not specific to search: one durable primary write can fan out to derived views.

Action: For key-value or document stores, use the database’s change stream as the trigger for index projection. Preserve deletion events, because missing tombstones are one of the most common sources of user-visible drift.

Result: The index can track creates, updates, and deletes from the same committed mutation source. Replays can reconstruct the projected state if the index writer is deterministic.

Learning: Deletes deserve first-class workflow design. A stale creation is annoying; a stale deletion can be a privacy, permission, or compliance failure.
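
As a small illustration of why tombstones need the same version treatment as updates, the sketch below keeps a versioned deletion marker in the index instead of removing the entry outright. The dictionary layout and function names are assumptions made for the example.

```python
from typing import Dict, Optional


def apply_change(index: Dict[str, dict], record_id: str, version: int,
                 body: Optional[dict]) -> bool:
    """Apply a create, update, or delete (body=None). Returns True if applied."""
    current = index.get(record_id)
    if current is not None and version <= current["version"]:
        return False  # stale or replayed change: ignore it
    # A delete is stored as a versioned tombstone rather than removed,
    # so a late-arriving older update cannot resurrect the document.
    index[record_id] = {"version": version, "deleted": body is None, "body": body}
    return True


def visible_documents(index: Dict[str, dict]) -> Dict[str, dict]:
    """What a query should be allowed to see: everything except tombstones."""
    return {rid: doc["body"] for rid, doc in index.items() if not doc["deleted"]}


index: Dict[str, dict] = {}
apply_change(index, "doc-1", 1, {"title": "draft"})
apply_change(index, "doc-1", 3, None)                        # delete arrives
late_applied = apply_change(index, "doc-1", 2, {"title": "edited"})  # late update
```

Without the retained tombstone, the late version 2 update would look like a fresh create and the deleted document would reappear in results.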

Where It Breaks

Failure mode	Why it happens	Mitigation
Out-of-order updates	Retries and parallel workers race	Store source versions and reject older writes
Missing deletes	Tombstones expire before indexing catches up	Retain delete events long enough for replay and rebuild
Rebuild cutover errors	Shadow index differs from serving assumptions	Use aliases, validation queries, and rollback windows
CDC backlog	Consumer deploy, poison event, or downstream throttling	Alert on event age, not only queue depth
Mapping drift	Application emits fields the index cannot parse	Version schemas and fail records into replayable quarantine
Permission staleness	Search document carries old access metadata	Version authorization data or verify sensitive results against truth
Silent corruption	Index accepts wrong but valid documents	Run sampled truth-versus-index audits continuously
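
The sampled truth-versus-index audit in the last row can be sketched as a comparison of version maps. This is a simplified illustration: it assumes both sides can be reduced to `record_id -> version` and that IDs from both stores can be enumerated, whereas a production audit would more likely sample each side separately.

```python
import random
from typing import Dict


def audit_sample(truth: Dict[str, int], indexed: Dict[str, int],
                 sample_size: int, rng: random.Random) -> dict:
    """Compare a random sample of record versions between source and index.

    A key missing from `truth` means the record is deleted at the source.
    """
    ids = sorted(set(truth) | set(indexed))
    sample = rng.sample(ids, min(sample_size, len(ids)))
    stale = sorted(i for i in sample
                   if i in truth and i in indexed and indexed[i] < truth[i])
    missing = sorted(i for i in sample if i in truth and i not in indexed)
    orphaned = sorted(i for i in sample if i not in truth and i in indexed)
    mismatches = len(stale) + len(missing) + len(orphaned)
    return {
        "sampled": len(sample),
        "stale": stale,        # indexed, but older than the source version
        "missing": missing,    # in the source, absent from the index
        "orphaned": orphaned,  # in the index, gone from the source
        "mismatch_rate": mismatches / len(sample) if sample else 0.0,
    }


report = audit_sample({"a": 3, "b": 2, "c": 1}, {"a": 3, "b": 1, "d": 7},
                      sample_size=10, rng=random.Random(0))
```

Separating stale, missing, and orphaned records matters operationally: the first two usually point to lag or dropped events, while orphans point to lost deletes.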

What to Do Next

Problem: Search drift becomes dangerous when nobody can say how stale the index is. Define freshness SLOs by product surface, not by infrastructure component.
Solution: Use CDC for steady-state propagation, idempotent writers for replay, shadow rebuilds for repair, and alias cutovers for controlled replacement.
Proof: Instrument source version, indexed version, CDC cursor lag, oldest unprocessed event age, dead-letter count, rebuild validation results (count deltas and sampled mismatches), and sampled mismatch rate.
Action: Start with one high-value entity. Add version metadata to its search document, build a truth-versus-index audit, and write the runbook for replay, rebuild, and rollback before the next drift incident.
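
A per-surface freshness SLO check could start as small as the sketch below. The surface names and thresholds are invented for illustration, and the input is assumed to be the oldest unprocessed event age per surface, measured in seconds.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical SLOs: maximum tolerated staleness in seconds per product surface.
FRESHNESS_SLO_SECONDS: Dict[str, float] = {
    "permissions": 5,
    "pricing": 60,
    "catalog_text": 900,
}


def slo_breaches(oldest_event_age: Dict[str, float],
                 slos: Dict[str, float]) -> List[Tuple[str, Optional[float], float]]:
    """Return (surface, observed_age, limit) for every surface out of SLO.

    A surface with no measurement counts as a breach, because staleness
    that cannot be stated is the dangerous case.
    """
    breaches = []
    for surface, limit in sorted(slos.items()):
        age = oldest_event_age.get(surface)
        if age is None or age > limit:
            breaches.append((surface, age, limit))
    return breaches


ages = {"permissions": 12.0, "pricing": 30.0}
breaches = slo_breaches(ages, FRESHNESS_SLO_SECONDS)
```

Here `permissions` breaches because 12 seconds exceeds its 5 second limit, and `catalog_text` breaches because nothing is measuring it.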

Rajiv
