Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness
Search drift is not a search problem first. It is a truth-management problem that becomes visible through search.
Situation
Most product systems keep their source of truth in a transactional database and serve discovery from a separate search index. The database is optimized for correctness, constraints, and writes. The index is optimized for ranking, tokenization, faceting, filtering, autocomplete, and latency.
That split is normal. PostgreSQL, MySQL, DynamoDB, Spanner, or another system owns the canonical record. Elasticsearch, OpenSearch, Solr, Vespa, Algolia, or a custom retrieval layer owns the read path for search. Between them sits a workflow that turns database mutations into index mutations.
The uncomfortable part is that the index is not merely a cache. Users treat search results as product truth. If a deleted document still appears, if a price update lags, if an access-control change is missing, or if a newly created object is absent, the failure is not described as “eventual consistency.” It is described as “the product is wrong.”
Search index drift is the gap between canonical state and searchable state. Some drift is expected. Unbounded drift is an incident.
The Problem
Teams usually discover drift after adopting one of three write patterns.
The first is application dual write: the request handler writes the database and then writes the search index. This looks simple until partial failure appears. The database commit succeeds but the index write times out, a retry lands after a newer update and overwrites it, or the process crashes between the two operations. If the two systems cannot share a transaction boundary, the application has accepted a consistency gap.
The second is asynchronous job indexing: writes enqueue work, and workers update the index later. This removes latency from the request path, but it creates a backlog system. Queue lag, poison messages, deploy bugs, and schema incompatibilities become search correctness risks.
The third is periodic rebuild: the team periodically scans the database and recreates the index. Rebuilds are useful, but they are not a complete freshness strategy. A nightly rebuild can repair silent corruption, but it cannot provide minute-level freshness. Relying on it alone means the product accepts up to a full day of visible staleness.
The core question is not “which tool indexes fastest?” It is: how do we bound, observe, repair, and communicate the difference between source-of-truth state and search-visible state?
Core Concept
The practical architecture combines four ideas: change capture, idempotent indexing, rebuildable indexes, and user-visible freshness controls.
flowchart TD
A[primary database — canonical records] --> B[transaction log — ordered changes]
B --> C[change capture workers — durable cursor]
C --> D[index writer — idempotent updates]
D --> E[active search index — user queries]
A --> F[bulk rebuild job — full snapshot]
F --> G[shadow search index — validation target]
G --> H[index alias switch — controlled cutover]
C --> I[drift monitor — lag and mismatches]
I --> J[operator workflow — replay repair rebuild]
E --> K[user interface — freshness signals]
The database remains the only source of truth. Search documents carry source version metadata: record ID, updated timestamp, logical sequence number, schema version, and deletion marker. Index writes are idempotent, so replaying the same change is safe. Out-of-order writes are rejected when the incoming version is older than the indexed version.
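The version rule in the paragraph above can be sketched in a few lines. This is a minimal in-memory illustration, not a client for any particular search engine: `IndexDoc`, `InMemoryIndex`, and the result strings are hypothetical names, and `source_version` is assumed to be a monotonically increasing number supplied by the source database.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class IndexDoc:
    record_id: str
    source_version: int      # assumed: monotonically increasing per record
    schema_version: int
    deleted: bool            # deletion marker (tombstone)
    body: Optional[dict]


class InMemoryIndex:
    """Hypothetical stand-in for a search index; a real one is a remote service."""

    def __init__(self) -> None:
        self.docs: Dict[str, IndexDoc] = {}

    def apply(self, incoming: IndexDoc) -> str:
        current = self.docs.get(incoming.record_id)
        # Reject writes that are older than what is already indexed.
        if current is not None and incoming.source_version < current.source_version:
            return "rejected_stale"
        # Replaying an identical change leaves the index untouched.
        if current == incoming:
            return "noop_replay"
        self.docs[incoming.record_id] = incoming
        return "applied"


index = InMemoryIndex()
v2 = IndexDoc("doc-1", 2, 1, False, {"title": "new"})
v1 = IndexDoc("doc-1", 1, 1, False, {"title": "old"})
# Apply v2, replay v2, then deliver v1 late.
results = [index.apply(v2), index.apply(v2), index.apply(v1)]
```

Because the comparison happens in the writer, the same event can be delivered any number of times and in any order without moving the indexed state backwards. In a real deployment the check would have to run inside the index itself (or under a per-record lock) so that two concurrent writers cannot both pass it.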
Change data capture is the preferred steady-state path because it follows committed database changes rather than application intent. The application writes the database once. A CDC pipeline reads the transaction log and updates the index. This does not eliminate drift, but it moves drift into a measurable workflow: cursor lag, event age, failure rate, dead-letter volume, and version mismatch count.
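As a rough sketch of what "measurable" can mean here, the function below derives cursor lag and oldest unprocessed event age from a list of pending change events. The `ChangeEvent` shape, the field names, and the idea that the log head position is known to the monitor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeEvent:
    sequence: int          # assumed: position in the transaction log
    committed_at: float    # commit time, seconds since epoch


def drift_metrics(pending: List[ChangeEvent], cursor: int, head: int, now: float) -> dict:
    """Summarize how far the index projection trails the transaction log."""
    unprocessed = [e for e in pending if e.sequence > cursor]
    # Age of the oldest event the consumer has not yet applied.
    oldest_age = max((now - e.committed_at for e in unprocessed), default=0.0)
    return {
        "cursor_lag": head - cursor,
        "unprocessed_count": len(unprocessed),
        "oldest_event_age_s": oldest_age,
    }


events = [ChangeEvent(101, 1000.0), ChangeEvent(102, 1030.0)]
metrics = drift_metrics(events, cursor=100, head=102, now=1060.0)
```

Event age is the number that maps most directly to what a user experiences: a backlog of two events is harmless if they are one second old and an incident if they are an hour old.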
Rebuilds remain mandatory. CDC preserves forward progress; rebuilds repair historical mistakes. A rebuild creates a shadow index from a consistent source snapshot, validates document counts and sampled records, warms query paths, then atomically moves an alias or routing pointer. The old index remains available for rollback until confidence is high.
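The validate-then-switch sequence might look like the following sketch. Everything here is in memory and hypothetical: `AliasRouter` stands in for whatever alias or routing pointer the search system provides, indexes are plain dictionaries, and query warming is left out. The point is only the control flow: the alias moves if and only if validation returns no problems, and the previous index stays reachable for rollback.

```python
import random
from typing import Dict, List, Optional


class AliasRouter:
    """Hypothetical stand-in for an alias or routing pointer."""

    def __init__(self, indexes: Dict[str, Dict[str, dict]], active: str) -> None:
        self.indexes = indexes
        self.active = active
        self.previous: Optional[str] = None

    def switch(self, target: str) -> None:
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        if self.previous is not None:
            self.active, self.previous = self.previous, self.active


def validate_shadow(source: Dict[str, dict], shadow: Dict[str, dict],
                    sample_size: int, rng: random.Random) -> List[str]:
    """Compare document counts and a random sample of records."""
    problems = []
    if len(source) != len(shadow):
        problems.append(f"count mismatch: source={len(source)} shadow={len(shadow)}")
    for record_id in rng.sample(sorted(source), min(sample_size, len(source))):
        if shadow.get(record_id) != source[record_id]:
            problems.append(f"document mismatch: {record_id}")
    return problems


def cut_over(router: AliasRouter, shadow_name: str, source: Dict[str, dict],
             sample_size: int = 100, seed: int = 0) -> List[str]:
    """Switch the alias only when validation finds nothing wrong."""
    problems = validate_shadow(source, router.indexes[shadow_name],
                               sample_size, random.Random(seed))
    if not problems:
        router.switch(shadow_name)
    return problems


source_docs = {"a": {"title": "A"}, "b": {"title": "B"}}
router = AliasRouter({"products_v1": {"a": {"title": "stale"}}}, active="products_v1")
router.indexes["products_v2"] = dict(source_docs)   # shadow built from a snapshot
problems = cut_over(router, "products_v2", source_docs)
```

After the call, `router.active` points at `products_v2` while `products_v1` is still present, so `router.rollback()` restores the old serving index without another rebuild.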
Dual writes are still useful in narrow places. For example, a product may write directly to search for low-risk preview experiences while CDC provides authoritative correction. But dual writes should not be the only correctness mechanism for objects where permissions, money, inventory, or deletion semantics matter.
User-visible staleness must be designed deliberately. Some systems can show “results updated a few seconds ago.” Others need read-after-write behavior for the author of a change, even if global search is eventually consistent. That can be handled by merging canonical database reads for the user’s own recent writes, routing a specific object lookup to the database, or hiding search results whose indexed version is older than a known permission version.
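One way to picture the first option, merging the author's own recent writes into search results, is the sketch below. The row shape (`id`, `version`, `deleted`), the `recent_writes` map, and the `matches` predicate are illustrative assumptions. A real implementation would also need to re-rank merged rows and decide how long a write counts as "recent".

```python
from typing import Callable, Dict, List


def merge_recent_writes(search_hits: List[dict],
                        recent_writes: Dict[str, dict],
                        matches: Callable[[dict], bool]) -> List[dict]:
    """Overlay a user's own recent canonical rows on possibly stale search hits."""
    merged, seen = [], set()
    for hit in search_hits:
        seen.add(hit["id"])
        fresh = recent_writes.get(hit["id"])
        if fresh is None or fresh["version"] <= hit["version"]:
            merged.append(hit)        # the index has already caught up
        elif not fresh["deleted"] and matches(fresh):
            merged.append(fresh)      # replace the stale hit with the canonical row
        # Otherwise the user deleted or changed the record: drop the stale hit.
    for record_id, fresh in recent_writes.items():
        # Records the user just created that are not indexed yet.
        if record_id not in seen and not fresh["deleted"] and matches(fresh):
            merged.append(fresh)
    return merged


hits = [
    {"id": "1", "version": 3, "title": "red shoe", "deleted": False},
    {"id": "2", "version": 1, "title": "red hat", "deleted": False},
]
recent = {
    "2": {"id": "2", "version": 2, "title": "red hat", "deleted": True},
    "3": {"id": "3", "version": 1, "title": "red scarf", "deleted": False},
}
merged = merge_recent_writes(hits, recent, lambda row: "red" in row["title"])
merged_ids = [row["id"] for row in merged]
```

Here the author no longer sees the hat they just deleted and does see the scarf they just created, even though the global index has processed neither change.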
In Practice
Context: Elasticsearch documents its _reindex API and alias-based index management as operational mechanisms for copying documents into a new index and switching traffic through aliases. The documented pattern is that index structure changes and large repairs are handled by creating a new index, filling it, and moving the read alias, rather than by modifying the serving index in place.
Action: Apply that pattern to search drift recovery. Treat every serving index as replaceable. Keep index mappings and analyzers versioned. Build a shadow index from the canonical store, compare counts and sampled documents, then switch the alias when validation passes.
Result: Rebuilds become a normal maintenance operation instead of a one-off incident script. The system can repair missed CDC events, analyzer mistakes, mapping errors, and accidental partial deletes without taking search offline.
Learning: Rebuildability is a correctness property. If the index cannot be recreated from truth, then the index has quietly become truth.
Context: Debezium’s documented architecture captures database changes from transaction logs and emits ordered change events to downstream consumers. PostgreSQL logical decoding and MySQL binlog replication expose the same architectural principle: committed database changes can be read after the fact without placing a second write inside the application request path.
Action: Use CDC as the default index mutation source. Persist consumer offsets. Make index writes idempotent. Store source versions in documents. Send failed records to a dead-letter workflow that can be replayed after the bug is fixed.
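A minimal sketch of that consumer loop follows, assuming the change log can be read as an ordered list and that the offset would be persisted durably in a real system. `CdcConsumer`, the event shape, and the `reject_titles` set (which simulates a projection bug) are hypothetical. Catching every exception is a simplification: a production consumer would usually retry transient errors and quarantine only records that fail repeatedly.

```python
from typing import Callable, Dict, List


class CdcConsumer:
    def __init__(self, write: Callable[[dict], None]) -> None:
        self.write = write
        self.offset = 0                      # assumed to be persisted durably
        self.dead_letters: List[dict] = []

    def poll(self, log: List[dict]) -> None:
        for event in log[self.offset:]:
            try:
                self.write(event)
            except Exception as exc:         # simplification: treat any failure as poison
                self.dead_letters.append({"event": event, "error": repr(exc)})
            self.offset += 1                 # advance only after apply or quarantine

    def replay_dead_letters(self) -> int:
        pending, self.dead_letters = self.dead_letters, []
        replayed = 0
        for item in pending:
            try:
                self.write(item["event"])
                replayed += 1
            except Exception as exc:
                self.dead_letters.append({"event": item["event"], "error": repr(exc)})
        return replayed


index: Dict[str, dict] = {}
reject_titles = {"bad"}                      # simulates a projection bug


def write(event: dict) -> None:
    if event["title"] in reject_titles:
        raise ValueError("cannot project title")
    current = index.get(event["id"])
    # Version check keeps late replays from overwriting newer state.
    if current is None or event["version"] >= current["version"]:
        index[event["id"]] = event


log = [{"id": "1", "version": 1, "title": "bad"},
       {"id": "2", "version": 1, "title": "ok"}]
consumer = CdcConsumer(write)
consumer.poll(log)                           # event 1 is quarantined, event 2 is indexed
reject_titles.clear()                        # "the bug is fixed"
replayed = consumer.replay_dead_letters()
```

Quarantining the failed record instead of blocking on it keeps the cursor moving. The replay is only safe because the writer compares versions: a dead letter is by definition applied later than the events that followed it.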
Result: The indexing path becomes observable as a pipeline rather than hidden inside application handlers. Operators can measure lag, pause consumers, replay records, and distinguish source write failures from projection failures.
Learning: CDC does not make search strongly consistent. It makes inconsistency bounded, inspectable, and repairable.
Context: Amazon DynamoDB Streams is documented as a time-ordered sequence of item-level modifications that can trigger downstream processing. The pattern is not specific to search: one durable primary write can fan out to derived views.
Action: For key-value or document stores, use the database’s change stream as the trigger for index projection. Preserve deletion events, because missing tombstones are one of the most common sources of user-visible drift.
Result: The index can track creates, updates, and deletes from the same committed mutation source. Replays can reconstruct the projected state if the index writer is deterministic.
Learning: Deletes deserve first-class workflow design. A stale creation is annoying; a stale deletion can be a privacy, permission, or compliance failure.
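To make the delete case concrete, the sketch below projects a small change stream into an index and keeps a tombstone for removed records so that a late update cannot resurrect them. The event shape is a simplification loosely modeled on insert, modify, and remove operations. The `version` field is an assumption, since a change stream may expose ordering differently, and the dictionary index is a stand-in.

```python
from typing import Dict


def project(event: dict, index: Dict[str, dict]) -> None:
    """Apply one change event; deletes are stored as versioned tombstones."""
    record_id, version = event["id"], event["version"]
    current = index.get(record_id)
    if current is not None and version <= current["version"]:
        return                                   # replayed or out-of-order event
    if event["op"] == "REMOVE":
        index[record_id] = {"version": version, "deleted": True}
    else:                                        # INSERT or MODIFY
        index[record_id] = {"version": version, "deleted": False, "doc": event["doc"]}


def visible(index: Dict[str, dict]) -> Dict[str, dict]:
    """Documents a search query is allowed to return."""
    return {rid: entry["doc"] for rid, entry in index.items() if not entry["deleted"]}


stream = [
    {"op": "INSERT", "id": "u1", "version": 1, "doc": {"name": "draft"}},
    {"op": "REMOVE", "id": "u1", "version": 3},
    {"op": "MODIFY", "id": "u1", "version": 2, "doc": {"name": "edited"}},  # arrives late
]
tombstone_index: Dict[str, dict] = {}
for change in stream:
    project(change, tombstone_index)
searchable = visible(tombstone_index)
```

Without the tombstone, the late `MODIFY` would find no document for `u1` and recreate it, which is exactly the stale-deletion failure the paragraph warns about. Tombstones can be purged once no older event for that record can still arrive, for example after the next validated rebuild.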
Where It Breaks
| Failure mode | Why it happens | Mitigation |
|---|---|---|
| Out-of-order updates | Retries and parallel workers race | Store source versions and reject older writes |
| Missing deletes | Tombstones expire before indexing catches up | Retain delete events long enough for replay and rebuild |
| Rebuild cutover errors | Shadow index mappings, analyzers, or contents differ from what the serving path expects | Use aliases, validation queries, and rollback windows |
| CDC backlog | Bad consumer deploy, poison event, or downstream throttling | Alert on event age, not only queue depth |
| Mapping drift | Application emits fields the index cannot parse | Version schemas and fail records into replayable quarantine |
| Permission staleness | Search document carries old access metadata | Version authorization data or verify sensitive results against truth |
| Silent corruption | Index accepts wrong but valid documents | Run sampled truth-versus-index audits continuously |
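The last mitigation in the table, a sampled truth-versus-index audit, can be sketched as follows. Both stores are represented as dictionaries keyed by record ID, which is an assumption. In practice the sample would be drawn by querying each system, and the comparison would run against the projected form of the canonical row. The mismatch labels are hypothetical.

```python
import random
from typing import Dict


def audit_sample(truth: Dict[str, dict], index: Dict[str, dict],
                 sample_size: int, rng: random.Random) -> dict:
    """Compare a random sample of records between the source of truth and the index."""
    ids = sorted(set(truth) | set(index))
    sample = rng.sample(ids, min(sample_size, len(ids)))
    mismatches = []
    for record_id in sample:
        canonical, indexed = truth.get(record_id), index.get(record_id)
        if canonical is None:
            mismatches.append((record_id, "stale_in_index"))      # deleted but still searchable
        elif indexed is None:
            mismatches.append((record_id, "missing_from_index"))
        elif indexed["version"] < canonical["version"]:
            mismatches.append((record_id, "index_behind"))
        elif indexed != canonical:
            mismatches.append((record_id, "content_mismatch"))    # same version, wrong content
    rate = len(mismatches) / len(sample) if sample else 0.0
    return {"sampled": len(sample), "mismatches": mismatches, "mismatch_rate": rate}


truth = {"1": {"version": 2, "title": "a"}, "2": {"version": 1, "title": "b"}}
indexed_docs = {"1": {"version": 1, "title": "a"}, "3": {"version": 1, "title": "gone"}}
report = audit_sample(truth, indexed_docs, sample_size=10, rng=random.Random(0))
```

Sampling from the union of both ID sets matters: sampling only from the database would never notice documents that exist in the index but not in truth. A record reported as `index_behind` may simply be in flight, so an audit like this is usually read together with CDC lag before it pages anyone.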
What to Do Next
- Problem: Search drift becomes dangerous when nobody can say how stale the index is. Define freshness SLOs by product surface, not by infrastructure component.
- Solution: Use CDC for steady-state propagation, idempotent writers for replay, shadow rebuilds for repair, and alias cutovers for controlled replacement.
- Proof: Instrument source version, indexed version, CDC cursor lag, oldest unprocessed event age, dead-letter count, rebuild validation count, and sampled mismatch rate.
- Action: Start with one high-value entity. Add version metadata to its search document, build a truth-versus-index audit, and write the runbook for replay, rebuild, and rollback before the next drift incident.
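As a starting point for freshness SLOs defined per product surface, a check could be as small as the sketch below. The surface names and thresholds are invented for illustration, and the observed ages are assumed to come from whatever lag instrumentation the pipeline already exposes.

```python
from typing import Dict, List

# Hypothetical SLOs: maximum tolerated age (seconds) of the oldest unindexed change.
FRESHNESS_SLO_SECONDS: Dict[str, float] = {
    "permissions": 5.0,
    "pricing": 30.0,
    "catalog_browse": 300.0,
}


def slo_breaches(oldest_event_age_s: Dict[str, float],
                 slo: Dict[str, float] = FRESHNESS_SLO_SECONDS) -> List[str]:
    """Return the product surfaces whose observed staleness exceeds their SLO."""
    return sorted(
        surface
        for surface, age in oldest_event_age_s.items()
        if surface in slo and age > slo[surface]
    )


observed = {"permissions": 12.0, "pricing": 4.0, "catalog_browse": 120.0}
breaches = slo_breaches(observed)
```

The same twelve seconds of lag is a breach for permissions and a non-event for catalog browsing, which is the reason to set the threshold per surface and not per cluster or queue.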