Most service catalogs fail because they are treated as databases to be filled in, not as operational systems that must earn trust every day.

Situation

Platform teams keep reaching for service catalogs because the failure mode is visible everywhere: nobody knows who owns a service, which repository deploys it, whether it is production critical, what runbook applies, or whether the dashboard linked from the wiki is still valid.

The promise is reasonable. A catalog should answer basic operational questions:

  • Who owns this service?
  • Where is the code?
  • How does it deploy?
  • What does it depend on?
  • What is the support path during an incident?
  • Is it production ready?

That promise becomes more attractive as organizations adopt internal developer platforms, CI automation, Kubernetes, incident management, policy checks, and golden paths. Once every team has dozens of services, infrastructure modules, queues, topics, dashboards, feature flags, and jobs, tribal memory stops scaling.

So the platform team creates a service catalog. They import repositories. They ask teams to add metadata. They connect ownership, lifecycle, tier, links, documentation, and dependencies. The first demo looks useful. The homepage has cards. Search works. Leadership sees a map of the estate.

Then the catalog starts to decay.

The Problem

The hard part is not building a catalog. The hard part is making teams believe it.

A service catalog has four common failure modes.

First, adoption is optional in practice even when required in policy. Teams will fill in metadata once if it unblocks a migration, audit, or launch review. They will not keep it current unless the catalog participates in workflows they already care about.

Second, trust collapses faster than coverage improves. One stale owner, one broken dashboard link, or one dependency graph that disagrees with production is enough to teach engineers that the catalog is decorative. After that, they return to Slack, source search, deployment logs, and incident history.

Third, freshness is usually assigned to humans instead of systems. Platform teams ask service owners to maintain YAML, forms, or portal fields. That works for intentional facts such as ownership intent or service tier. It fails for observed facts such as deploy frequency, runtime dependencies, last production change, error budget burn, or alert coverage.

Fourth, incentives are often backwards. Platform teams are measured on catalog completeness. Service teams are measured on shipping and reliability. If the catalog creates work but does not remove work, the rational service team treats it as a tax.

The question is not, “How do we get every team to fill out the service catalog?”

The better question is, “Which operational workflows should fail, warn, or improve based on catalog metadata, and which facts can be refreshed automatically?”

The Catalog as a Control Plane

A durable service catalog behaves less like an inventory spreadsheet and more like a control plane for engineering workflows.

It should have three layers of truth.

The first layer is declared truth: ownership, lifecycle, criticality, data classification, on-call path, and intended dependencies. These are human decisions and should live close to the service, usually in versioned configuration.

The second layer is observed truth: repositories, deployments, container images, runtime namespaces, cloud resources, dashboards, alerts, incidents, and dependency traces. These should be discovered from source systems rather than typed into a portal.

The third layer is enforced truth: policies and workflows that use catalog metadata to make engineering easier or safer. Examples include routing alerts to the declared owner, opening production readiness checks when a service declares a higher tier, generating scorecards from CI evidence, and blocking releases only when the failed check is objective and current.

flowchart TD
  A[service repository — declared metadata] --> B[catalog ingestion — validation]
  C[ci pipeline — build and deploy evidence] --> D[observed facts — recent activity]
  E[runtime platform — namespaces and workloads] --> D
  F[incident system — alerts and ownership] --> D
  B --> G[catalog graph — declared and observed truth]
  D --> G
  G --> H[developer portal — search and ownership]
  G --> I[automation workflows — routing and checks]
  G --> J[scorecards — freshness and readiness]
  I -->|creates pull request| A
  J -->|signals drift| A

The design principle is simple: humans should declare intent, systems should refresh evidence, and automation should close the loop when the two diverge.
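A minimal sketch of that principle, assuming each catalog fact is stored with its layer, the system that produced it, and the time it was last verified. The `Fact` type, the field names, and the source labels (`repo-descriptor`, `deploy-log`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    """One catalog fact plus its provenance (hypothetical model)."""
    field: str
    value: str
    layer: str             # "declared" or "observed"
    source: str            # system that produced the fact
    verified_at: datetime  # when the fact was last confirmed


def find_divergence(facts: list[Fact]) -> list[str]:
    """Return the fields where a declared value disagrees with an observed one."""
    declared = {f.field: f for f in facts if f.layer == "declared"}
    observed = {f.field: f for f in facts if f.layer == "observed"}
    diverged = []
    for name, d in declared.items():
        o = observed.get(name)
        # Fields with no observed counterpart are not treated as drift here.
        if o is not None and o.value != d.value:
            diverged.append(name)
    return sorted(diverged)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
facts = [
    Fact("owner", "team-a", "declared", "repo-descriptor", now),
    Fact("owner", "team-b", "observed", "deploy-log", now),
    Fact("tier", "1", "declared", "repo-descriptor", now),
]
print(find_divergence(facts))  # ['owner']
```

Keeping the layer and source on every fact is what lets a portal show where a value came from instead of presenting all fields as equally trustworthy.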

A catalog entry that says a service is “tier one” should not require a human to also remember every tier one requirement. The declaration should trigger checks for on-call coverage, runbook links, alert policy, rollback documentation, SLOs, and production dependency review.
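One way to picture this is a lookup from the declared tier to the checks it implies, so the declaration itself drives the checklist. The tier labels, check names, and the `missing_checks` helper below are assumptions for illustration; the check list mirrors the requirements named in the paragraph above.

```python
# Hypothetical mapping from a declared tier to the checks it triggers.
TIER_REQUIREMENTS: dict[str, list[str]] = {
    "tier-1": [
        "on_call_coverage",
        "runbook_link",
        "alert_policy",
        "rollback_documentation",
        "slo",
        "production_dependency_review",
    ],
    "tier-3": ["runbook_link"],
}


def missing_checks(tier: str, evidence: dict[str, bool]) -> list[str]:
    """List the checks the declared tier requires that have no passing evidence."""
    required = TIER_REQUIREMENTS.get(tier, [])
    return [check for check in required if not evidence.get(check, False)]


evidence = {"on_call_coverage": True, "runbook_link": True, "alert_policy": False}
print(missing_checks("tier-1", evidence))
# ['alert_policy', 'rollback_documentation', 'slo', 'production_dependency_review']
```

Raising a service from one tier to another then changes the required checks without anyone editing a separate checklist.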

A catalog entry that says a team owns a service should not be trusted forever. If the repository moved, the last ten deploys came from another team, and the on-call schedule no longer exists, the catalog should show drift.
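The three signals in that example can be combined into a simple drift check. This is a sketch under stated assumptions: the `OwnershipSignals` type and its fields are hypothetical, and in a real system each value would come from source control, the deploy pipeline, and the on-call tool respectively.

```python
from dataclasses import dataclass


@dataclass
class OwnershipSignals:
    """Observed signals about who actually operates a service (hypothetical)."""
    declared_owner: str
    repo_owner: str                  # team that currently administers the repository
    recent_deploy_teams: list[str]   # team behind each of the most recent deploys
    on_call_schedule_exists: bool


def ownership_drift(s: OwnershipSignals) -> list[str]:
    """Return human-readable reasons to doubt the declared owner."""
    reasons = []
    if s.repo_owner != s.declared_owner:
        reasons.append(f"repository is administered by {s.repo_owner}")
    # Flag only when none of the recent deploys came from the declared owner.
    if s.recent_deploy_teams and s.declared_owner not in s.recent_deploy_teams:
        reasons.append("no recent deploy came from the declared owner")
    if not s.on_call_schedule_exists:
        reasons.append("declared on-call schedule no longer exists")
    return reasons


signals = OwnershipSignals("team-a", "team-b", ["team-b"] * 10, False)
for reason in ownership_drift(signals):
    print(reason)
```

An empty result does not prove the declaration is right; it only means none of the observed signals contradict it.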

In Practice

Context: Spotify’s Backstage publicly popularized the internal developer portal pattern and includes a software catalog model for components, systems, APIs, resources, and owners. The documented pattern is not merely “store service metadata.” It is “centralize discoverability while integrating with the tools engineers already use.” See Spotify’s public Backstage materials and the Backstage software catalog documentation.

Action: The useful architectural move is to keep catalog metadata near the producer. Backstage commonly uses catalog-info.yaml files in repositories, then ingests those descriptors into the catalog. That makes review, ownership, and change history part of the normal engineering workflow instead of a separate portal update.

Result: The catalog becomes easier to audit because declared metadata has provenance. A change to ownership or lifecycle can be reviewed like code. The result is not automatic truth, but it is a stronger source of declared intent than a mutable web form with no review path.

Learning: Declared metadata should be versioned, reviewable, and owned by the team that owns the service. But declared metadata alone is not enough. A catalog that only mirrors YAML will still rot when production behavior changes outside the file.
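Because the descriptor lives in the repository, a CI step can validate it like any other file. The sketch below assumes the file has already been parsed into a dictionary and checks only a Component entity. The field layout (`kind`, `metadata.name`, `spec.type`, `spec.lifecycle`, `spec.owner`) follows the Backstage descriptor format as I understand it and should be confirmed against the Backstage documentation; `validate_descriptor` is a hypothetical helper, not a Backstage API.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_descriptor(doc: dict) -> list[str]:
    """Return validation errors for a parsed Component descriptor (sketch only)."""
    errors = []
    if doc.get("kind") != "Component":
        errors.append("this sketch only validates kind: Component")
    metadata = doc.get("metadata") or {}
    if not metadata.get("name"):
        errors.append("metadata.name is required")
    spec = doc.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            errors.append(f"spec.{field} is required")
    return errors


# Example descriptor content; the service and team names are made up.
descriptor = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payments-api"},
    "spec": {"type": "service", "lifecycle": "production"},
}
print(validate_descriptor(descriptor))  # ['spec.owner is required']
```

Running a check like this in pull requests keeps declared metadata reviewable, but it says nothing about whether the declaration still matches production.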

Context: Kubernetes controllers are a well-known architectural pattern for keeping actual state aligned with desired state. The Kubernetes documentation describes controllers as loops that watch cluster state and make changes to move current state toward desired state.

Action: Apply the same pattern to service catalogs. Treat missing metadata, broken links, orphaned resources, and owner drift as reconciliation problems. Instead of asking platform engineers to chase teams manually, generate pull requests, warnings, or scorecard deltas from observed facts.

Result: Freshness becomes a system property. The catalog can say, “This service declares Team A, but the current deployment namespace is administered by Team B,” or “This runbook link has failed validation for fourteen days.” That is more useful than a stale green check.

Learning: Catalog quality improves when drift is detected continuously and correction is routed to the people who can fix it.
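A reconciliation pass over link checks might look like the sketch below. It assumes a scheduled job has already recorded when each link started failing. The `LinkCheck` type, the thresholds, the action names, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkCheck:
    service: str
    owner: str
    url: str
    failing_since: Optional[date]  # None means the last validation passed


def reconcile(checks: list[LinkCheck], today: date,
              warn_after_days: int = 1, pr_after_days: int = 14) -> list[dict]:
    """Turn failing link checks into corrections routed to the service owner."""
    actions = []
    for c in checks:
        if c.failing_since is None:
            continue
        days = (today - c.failing_since).days
        if days >= pr_after_days:
            kind = "open_pull_request"  # long-standing drift: propose a fix
        elif days >= warn_after_days:
            kind = "warn"               # recent drift: surface it on the scorecard
        else:
            continue
        actions.append({
            "action": kind,
            "route_to": c.owner,
            "service": c.service,
            "detail": f"{c.url} has failed validation for {days} days",
        })
    return actions


today = date(2024, 3, 15)
checks = [
    LinkCheck("payments-api", "team-a", "https://example.com/runbooks/payments", date(2024, 3, 1)),
    LinkCheck("search-api", "team-b", "https://example.com/runbooks/search", date(2024, 3, 13)),
    LinkCheck("auth-api", "team-c", "https://example.com/runbooks/auth", None),
]
for action in reconcile(checks, today):
    print(action["action"], action["route_to"], action["detail"])
```

The same shape works for other drift types: observe, compare against the declaration, and emit a correction addressed to whoever can fix it.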

Context: Google’s public SRE writing emphasizes that reliability practices must be operationalized through measurable signals, automation, and clear ownership rather than wishful process. Production readiness is valuable only when it changes behavior before failure.

Action: Connect catalog fields to readiness workflows. If a service declares production criticality, require objective evidence: alert routing, rollback path, dashboard availability, SLO ownership, dependency visibility, and incident escalation. Use CI and platform integrations to collect the evidence where possible.

Result: The catalog stops being a phonebook and becomes a reliability interface. Engineers use it because it answers questions during deploys, reviews, and incidents.

Learning: Adoption follows usefulness. If the catalog saves time during real operational work, teams will maintain it. If it exists mainly for platform reporting, teams will route around it.
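The readiness workflow above depends on evidence that is both objective and current. One possible gate, sketched under the assumption that each piece of evidence records how and when it was collected, blocks only on a fresh, automated failure and downgrades everything else to a warning. The `Evidence` type, the seven-day freshness window, and the decision names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Evidence:
    check: str
    passed: bool
    collected_at: datetime
    automated: bool  # True when collected by CI or a platform integration


def gate_decision(evidence: list[Evidence], now: datetime,
                  max_age: timedelta = timedelta(days=7)) -> str:
    """Return 'block', 'warn', or 'allow' for a release."""
    decision = "allow"
    for e in evidence:
        fresh = now - e.collected_at <= max_age
        if not e.passed and e.automated and fresh:
            return "block"   # objective and current failure
        if not e.passed or not fresh:
            decision = "warn"  # stale or manually reported: do not block on it
    return decision


now = datetime(2024, 3, 15, tzinfo=timezone.utc)
recent = now - timedelta(days=1)
old = now - timedelta(days=30)

print(gate_decision([Evidence("alert_routing", False, recent, True)], now))  # block
print(gate_decision([Evidence("alert_routing", False, old, True)], now))     # warn
print(gate_decision([Evidence("alert_routing", True, recent, True)], now))   # allow
```

Treating stale passes as warnings is a design choice in this sketch: it avoids both a stale green check and a hard block based on outdated evidence.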

Where It Breaks

| Failure mode | Why it happens | Better design |
| --- | --- | --- |
| Low adoption | Teams see metadata as platform paperwork | Tie catalog entries to deploys, ownership routing, readiness checks, and generated docs |
| Stale ownership | Reorganizations happen faster than cleanup | Validate owners against identity systems, on-call schedules, and repository activity |
| Broken trust | Engineers find stale links during incidents | Show freshness timestamps, source provenance, and validation status |
| Manual dependency maps | Runtime relationships change continuously | Derive observed dependencies from traces, traffic, infrastructure, and deployment data |
| Overzealous gates | Platform team blocks delivery with weak checks | Gate only on objective, high-confidence evidence and provide automated repair paths |
| Catalog as reporting layer | Leadership wants completeness dashboards | Measure operational usefulness: routed alerts, fixed drift, successful lookups, readiness deltas |

The most dangerous version is the beautiful portal that nobody trusts. It creates the illusion of control while incidents still depend on whoever remembers the old system.

What to Do Next

  • Problem: Your catalog probably mixes declared intent, observed production facts, and aspirational policy in the same fields. Separate them. Make it obvious which system produced each fact and when it was last verified.

  • Solution: Store human-owned declarations in versioned files near the service. Ingest observed facts from CI, runtime platforms, incident systems, source control, and telemetry. Use reconciliation workflows to highlight drift.

  • Proof: Start with three operational questions: who owns this service, what changed last, and where does an incident go? If the catalog cannot answer those during a live incident, do not expand the taxonomy yet.

  • Action: Pick one workflow where catalog correctness matters this quarter. Alert routing, production readiness, service ownership review, or deployment scorecards are good candidates. Make the catalog useful there before asking every team to maintain twenty more fields.