A service catalog that only helps engineers find links is a directory. A service catalog that owns metadata, policy, workflow, and reconciliation is a platform control plane.

Situation

Platform engineering has been pulled into the same failure pattern that hurt earlier DevOps programs: every team wants autonomy, but the organization still needs predictable ownership, deployment safety, compliance evidence, and incident response. The first answer is usually a developer portal. It collects service pages, runbooks, dashboards, API docs, and deployment links behind one searchable interface.

That is useful. It is also insufficient.

The hard part of platform engineering is not discovery. The hard part is keeping thousands of services, pipelines, cloud resources, SLOs, identities, and ownership records aligned while teams continue to move independently. When the catalog is treated as a web UI, the platform becomes an index of stale facts. When it is treated as a control plane, it becomes the place where desired service state is declared, validated, and reconciled.

The Problem

Most catalogs start as convenience layers. A service page shows the owner, repository, deployment status, pager rotation, dependencies, dashboards, and recent incidents. The data is assembled from source control, CI, observability, incident management, and cloud APIs.

The complication is that none of those systems agree by default. Git knows the declared owner. The alerting system knows the current responder. The cluster knows what is actually running. The CI system knows the last artifact. The cloud account knows the runtime permissions. The compliance system knows the required controls. The developer portal knows whatever was imported last.

At small scale, humans correct the gaps. At platform scale, humans become the synchronization mechanism. That is where the portal model breaks.

The operational question is not, “Where can an engineer find the service page?” The real question is: what system decides whether a service is allowed to exist, change, deploy, drift, or page the wrong team?

Core Concept

A real service catalog should model services as managed resources. Each catalog entity needs a desired state, an observed state, policy checks, workflow bindings, and ownership semantics. The UI is only one client of that model. Much as a Kubernetes controller watches the API server and reconciles the desired replica count with the pods actually running, a catalog control plane continuously evaluates service intent against infrastructure reality.

flowchart TD
    A[service catalog — desired service state] --> B[policy engine — validation]
    A --> C[workflow broker — orchestration]
    B --> D[identity and ownership — authorization]
    B -->|allows change| C
    C --> E[deployment systems — rollout]
    C --> F[cloud APIs — provisioning]
    E --> G[observability — health and SLOs]
    F --> G
    G --> H[drift detector — observed state]
    H -->|reports drift| A

The catalog should answer four control-plane questions.

First, what is the desired state of this service? This requires a strict entity schema defining the owner, lifecycle, tier, runtime, deployment targets, dependency declarations, data classification, and SLOs. A database record is not enough; this state must be version-controlled, auditable, and exposed via an API.
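As a rough illustration of what a strict entity schema could look like, here is a minimal Python sketch. The class name `ServiceEntity`, the field names, the allowed lifecycle values, and the three-tier scale are all assumptions made for this example, not a schema defined by any particular catalog product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

# Assumed lifecycle vocabulary; a real catalog would define its own.
LIFECYCLES = {"experimental", "production", "deprecated"}


class Tier(Enum):
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3


@dataclass(frozen=True)
class ServiceEntity:
    """Desired state for one service (hypothetical schema)."""
    name: str
    owner: str                      # team identifier, e.g. "team-payments"
    lifecycle: str
    tier: Tier
    runtime: str
    deployment_targets: Tuple[str, ...] = ()
    depends_on: Tuple[str, ...] = ()
    data_classification: str = "internal"
    slo_availability: Optional[float] = None  # None means "not declared"

    def __post_init__(self) -> None:
        # Reject incomplete or malformed declarations at construction time.
        if not self.name or not self.owner:
            raise ValueError("name and owner are required")
        if self.lifecycle not in LIFECYCLES:
            raise ValueError(f"unknown lifecycle: {self.lifecycle!r}")
        slo = self.slo_availability
        if slo is not None and not 0 < slo < 1:
            raise ValueError("slo_availability must be between 0 and 1")
```

Because the entity is immutable and validated when it is built, any code that holds a `ServiceEntity` can rely on the required fields being present. Version control and API exposure would sit around a type like this rather than inside it.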

Second, who is authorized to change that state? Ownership is not a label for display. It is an authorization boundary enforced by policy engines like Open Policy Agent. It defines who can merge infrastructure changes, approve production access, or grant compliance exceptions.
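To make the idea of ownership as an authorization boundary concrete, the following sketch expresses one possible rule in plain Python. It is a stand-in for a rule that would normally live in a policy engine; it does not use or imitate the Open Policy Agent API. The action names, the `ChangeRequest` type, and the deny-by-default behavior are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical actions that only members of the owning team may perform.
OWNER_ONLY_ACTIONS = {
    "merge_infra_change",
    "approve_prod_access",
    "grant_compliance_exception",
}


@dataclass(frozen=True)
class ChangeRequest:
    service: str
    actor: str
    action: str


def is_allowed(
    request: ChangeRequest,
    owners: Dict[str, str],          # service name -> owning team
    members: Dict[str, Set[str]],    # team -> user names
) -> bool:
    """Return True only if the actor belongs to the team that owns the service."""
    team = owners.get(request.service)
    if team is None:
        return False  # a service without an owner cannot be changed
    if request.action not in OWNER_ONLY_ACTIONS:
        return False  # unknown actions are denied by default
    return request.actor in members.get(team, set())
```

The design choice worth noting is that a missing owner results in a denial, not a fallback. That is what turns the ownership field from a display label into something teams have an incentive to keep correct.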

Third, what controllers act on that state? The catalog does not execute jobs directly; it acts as an intent broker. A catalog entry should trigger repository scaffolding via CI automation, provision Kubernetes namespaces via GitOps operators, attach IAM and secrets policies, and register monitoring endpoints. The catalog binds service intent to downstream automation systems.

Fourth, how is drift detected? If a production workload runs without a matching catalog entity, or if a service in a tier that requires an SLO has none defined, a reconciliation loop must detect the mismatch. The platform should emit a drift signal, block deployments, or automatically open a remediation pull request, driving the system back to the declared state.
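The two drift cases named here can be sketched as a single comparison between declared and observed state. In this example the declared entities are plain dictionaries and the running workloads are a set of names; both shapes, the `DriftSignal` type, and the rule that only tier 1 requires an SLO are assumptions chosen to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DriftSignal:
    service: str
    kind: str  # "uncataloged_workload" or "missing_slo"


def detect_drift(declared: Dict[str, dict], running: Set[str]) -> List[DriftSignal]:
    """Compare catalog declarations with observed workloads and report mismatches."""
    signals: List[DriftSignal] = []

    # Case 1: something runs in production that the catalog does not know about.
    for name in sorted(running - set(declared)):
        signals.append(DriftSignal(name, "uncataloged_workload"))

    # Case 2: a declared service is in a tier that requires an SLO but has none.
    for name, entity in sorted(declared.items()):
        if entity.get("tier") == 1 and not entity.get("slo"):
            signals.append(DriftSignal(name, "missing_slo"))

    return signals
```

What happens with each signal (an alert, a blocked deployment, a remediation pull request) is a separate decision. Keeping detection pure, as above, makes it straightforward to run on a schedule and to test.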

This is the mental shift: service catalogs are not knowledge bases. They are typed inventories with reconciliation loops.

In Practice

Context: Backstage documents its Software Catalog as a centralized system for tracking ownership and metadata across software components, websites, libraries, and data pipelines. The documented pattern is not merely a set of bookmarks; it is a structured entity model with owners, systems, domains, APIs, and lifecycle metadata. See the Backstage Software Catalog documentation.

Action: Treat catalog descriptors as source-controlled service declarations. Require every production service to define ownership, lifecycle, system membership, dependency relationships, and operational links in a machine-readable format. Validate those descriptors in CI before they are admitted into the catalog.

Result: The catalog becomes a reliable input to other workflows. Search is still useful, but the stronger result is that automation can ask consistent questions: who owns this service, what system does it belong to, what APIs does it expose, and what operational maturity is expected?

Learning: The catalog only becomes authoritative when teams stop treating metadata as documentation and start treating it as deployable configuration.
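A CI check for descriptors might look roughly like the sketch below. It takes an already-parsed descriptor as a dictionary, using the `metadata` and `spec` layout of Backstage descriptors. Which fields are mandatory, and the rule that production services must declare links, are policy assumptions made for this example; this is not Backstage's own validator.

```python
from typing import List

# Assumed organizational policy: these spec fields must be present and non-empty.
REQUIRED_SPEC_FIELDS = ("owner", "lifecycle", "system")


def validate_descriptor(doc: dict) -> List[str]:
    """Return a list of human-readable errors; an empty list means the descriptor passes."""
    errors: List[str] = []
    metadata = doc.get("metadata") or {}
    spec = doc.get("spec") or {}

    if not metadata.get("name"):
        errors.append("metadata.name is required")

    for key in REQUIRED_SPEC_FIELDS:
        if not spec.get(key):
            errors.append(f"spec.{key} is required")

    # Assumed rule: production services must link to runbooks or dashboards.
    if spec.get("lifecycle") == "production" and not metadata.get("links"):
        errors.append("production services must declare operational links")

    return errors
```

A CI job could call `validate_descriptor` on each changed descriptor and fail the build when the returned list is non-empty, so that incomplete metadata is rejected at the same point as broken code.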

Context: Kubernetes describes controllers as control loops that watch cluster state and make changes to move observed state toward desired state. That pattern is the core operating model of modern infrastructure, not an implementation detail of Kubernetes alone. See the Kubernetes controller documentation.

Action: Apply the controller pattern to the service catalog. If the catalog says a tier-one service must have an SLO, an on-call rotation, deployment provenance, and rollback automation, then controllers should verify those facts continuously. Missing data should produce a platform signal, not a quarterly spreadsheet exercise.

Result: Compliance and reliability checks move from manual review to continuous reconciliation. The organization can still allow exceptions, but exceptions become explicit state with owners and expiry dates.

Learning: A catalog without reconciliation is an asset database. A catalog with reconciliation is a control plane.
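The tier-one requirements and the idea of exceptions as explicit, expiring state can be combined in one small check. This is a sketch under stated assumptions: the requirement keys mirror the four items named above, while the `PolicyException` type, the `observed` dictionary shape, and the signal format are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List

# The four tier-one requirements named in the text, as hypothetical keys.
TIER_ONE_REQUIREMENTS = (
    "slo",
    "on_call_rotation",
    "deployment_provenance",
    "rollback_automation",
)


@dataclass(frozen=True)
class PolicyException:
    service: str
    requirement: str
    owner: str         # who is accountable for the waiver
    expires_on: date   # last day on which the waiver applies

    def is_active(self, today: date) -> bool:
        return today <= self.expires_on


def platform_signals(
    service: str,
    observed: Dict[str, bool],
    exceptions: Iterable[PolicyException],
    today: date,
) -> List[str]:
    """List unmet tier-one requirements that are not covered by an active exception."""
    waived = {
        e.requirement
        for e in exceptions
        if e.service == service and e.is_active(today)
    }
    return [
        f"{service}: missing {req}"
        for req in TIER_ONE_REQUIREMENTS
        if not observed.get(req, False) and req not in waived
    ]
```

Because the expiry date is part of the check, a waiver that lapses turns back into a signal on the next run without anyone having to remember it.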

Context: Argo CD documents automated sync as a mechanism that detects differences between desired manifests in Git and live cluster state, then syncs the application when configured to do so. See the Argo CD automated sync documentation.

Action: Use the same desired-state contract for platform workflows. The catalog should not blindly launch jobs from buttons. It should declare intent, route the intent through policy, produce auditable changes, and let downstream systems converge. For deployment, GitOps tools can own cluster reconciliation. For service creation, repository and CI controllers can own scaffolding. For observability, monitoring controllers can own dashboard and alert registration.

Result: The platform has a chain of custody. A service change moves from catalog intent to policy decision to workflow execution to observed state. That makes failures diagnosable. If deployment succeeded but monitoring registration failed, the catalog can show the specific reconciliation gap.

Learning: The button is not the workflow. The workflow is the declared state transition plus the controllers that make it true.
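One way to surface a specific reconciliation gap is to have each downstream controller report its own status against the catalog entity. The sketch below assumes a hypothetical `ControllerStatus` record; the controller names and detail strings are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ControllerStatus:
    controller: str      # e.g. "deployment", "monitoring"
    reconciled: bool     # True when observed state matches declared intent
    detail: str = ""     # optional explanation supplied by the controller


def reconciliation_gaps(statuses: Iterable[ControllerStatus]) -> List[str]:
    """Return one line per controller that has not converged."""
    return [
        f"{s.controller}: {s.detail or 'not reconciled'}"
        for s in statuses
        if not s.reconciled
    ]
```

With statuses such as `ControllerStatus("deployment", True)` and `ControllerStatus("monitoring", False, "alert registration failed")`, the function reports only the monitoring gap, which matches the scenario described above.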

Context: Google SRE guidance frames SLOs as a reliability contract based on user-visible service behavior. See Google’s Service Level Objectives chapter.

Action: Attach SLO expectations to catalog entities by tier and user journey. Do not bury reliability requirements in runbooks. Make them part of the service model that deployment, incident, and observability systems can consume.

Result: Service criticality becomes operationally meaningful. A tier-one service can require stricter rollout policy, stronger alerting, and more complete ownership before production promotion.

Learning: Reliability metadata is only useful when it changes automation behavior.
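A small sketch of tier metadata changing automation behavior: a promotion gate that reads its rules from the service tier. The policy values, the `RolloutPolicy` fields, and the three-tier scale are assumptions for illustration; real thresholds would come from the organization's own reliability standards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPolicy:
    canary_required: bool
    min_approvals: int
    slo_required: bool


# Hypothetical policy table: stricter rules for more critical tiers.
ROLLOUT_BY_TIER = {
    1: RolloutPolicy(canary_required=True, min_approvals=2, slo_required=True),
    2: RolloutPolicy(canary_required=True, min_approvals=1, slo_required=True),
    3: RolloutPolicy(canary_required=False, min_approvals=1, slo_required=False),
}


def can_promote(tier: int, has_owner: bool, has_slo: bool, approvals: int) -> bool:
    """Decide whether a service may be promoted to production under its tier's policy."""
    policy = ROLLOUT_BY_TIER[tier]
    if not has_owner:
        return False  # every tier requires a valid owner
    if policy.slo_required and not has_slo:
        return False
    return approvals >= policy.min_approvals
```

The same catalog fields that describe a service now decide whether it ships, which is the difference between reliability metadata that is recorded and reliability metadata that is enforced.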

Where It Breaks

| Failure mode | Why it happens | Control-plane response |
| --- | --- | --- |
| Stale ownership | Teams reorganize faster than catalogs update | Sync ownership from identity systems and require valid owners in CI |
| Button-driven automation | Portal actions bypass policy and state review | Convert actions into declared state changes with approval and audit |
| Catalog sprawl | Every tool adds fields without a model | Define a small entity schema and version it deliberately |
| False authority | The catalog shows data it does not control or verify | Mark source, freshness, and reconciliation status per field |
| Workflow coupling | The catalog becomes a hard dependency for every deploy | Keep execution in downstream systems and use the catalog as intent and policy |
| Exception debt | Temporary waivers become permanent | Store exceptions as expiring entities with owners |
| UI-first design | Teams optimize pages instead of platform contracts | Design API, schema, and controllers before polishing portal views |

What to Do Next

Problem: Your service catalog probably knows many things about production, but it may not decide or reconcile anything. That makes it useful during discovery and weak during change.

Solution: Promote catalog entities into desired-state resources. Give them schemas, owners, lifecycle states, policy requirements, workflow bindings, and observed-state checks.

Proof: Backstage shows the value of structured software metadata, Kubernetes shows the durability of controller reconciliation, Argo CD shows how desired state can drive delivery, and SRE practice shows why reliability metadata must affect operational behavior.

Action: Pick one workflow and make the catalog authoritative for it. Service creation is the cleanest starting point: require a catalog descriptor, validate ownership and tier, create the repository and CI pipeline from that state, register observability, and continuously detect drift. Once that loop works, extend the pattern to deployment readiness, production access, SLO coverage, and incident ownership.