Service Catalog Incident Workflow: Find Owner, Blast Radius, Dependencies, and Last Change
The worst incident workflow starts with a human asking in Slack who owns a service while the customer impact is still expanding.
Situation
Modern production systems are no longer single applications with a clear pager, a single deploy pipeline, and a short dependency list. A customer-facing request may cross an edge proxy, identity service, feature flag evaluator, API gateway, queue, worker, data store, cache, and third-party integration before it succeeds. Each component may be deployed by a different team, described in a different repository, and observed through a different dashboard.
Platform teams usually respond by building a service catalog. At first, it looks like a directory: name, description, owner, repository, runbook, dashboard, and pager. That is useful for discovery, but insufficient for incidents. During an outage, responders do not need a prettier wiki page. They need an operational join across four questions:
- Who owns this service right now?
- What is the blast radius?
- What does it depend on, and what depends on it?
- What changed last?
A catalog that cannot answer those questions during an incident is inventory, not control-plane infrastructure.
The Problem
The complication is that every required fact lives in a different system of record.
Ownership often lives in a catalog descriptor, team database, or on-call tool. Runtime presence lives in Kubernetes, service mesh telemetry, cloud tags, or deployment manifests. Dependency edges live partly in static metadata, partly in tracing, partly in gateway configuration, and partly in the heads of engineers. Evidence of the last change lives in CI, CD, Git history, feature flag audit logs, infrastructure pipelines, and rollout controllers.
When responders stitch those systems manually, the workflow fails in predictable ways. The service name in the alert does not match the catalog entity. The owning team changed but the pager route did not. The dependency graph shows intended architecture but not production traffic. The last deployment was harmless, but a feature flag changed five minutes later. The Kubernetes workload has useful labels, but the incident tool never reads them. The result is slow triage and noisy escalation.
The core question is not whether a service catalog should exist. The question is whether the catalog can become the incident workflow’s first reliable read model.
Answer: Treat the Catalog as an Incident Join Graph
The service catalog should not own every fact. It should own identity and relationships, then join authoritative systems at incident time. The durable catalog entity becomes the anchor: service ID, owner, lifecycle, tier, repository, runbook, pager policy, declared dependencies, and expected runtime selectors. Around that anchor, the workflow queries live systems for current state.
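As a rough illustration of that anchor, the entity can be modeled as a small immutable record. This is a minimal sketch, not a Backstage schema: the class name, field names, example values, and `example.com` URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntity:
    """Durable anchor record. Field names are illustrative, not a standard schema."""

    service_id: str        # stable key shared by alerts, labels, and deploy records
    owner: str             # owning team reference
    lifecycle: str         # e.g. "production" or "deprecated"
    tier: int              # criticality tier; 1 is assumed to be most critical
    repository: str
    runbook: str
    pager_policy: str
    declared_dependencies: tuple[str, ...] = ()
    # Label selector expected to match this service's runtime workloads.
    runtime_selector: dict[str, str] = field(default_factory=dict)


# Hypothetical entity for the checkout-api service.
checkout = CatalogEntity(
    service_id="checkout-api",
    owner="team-payments",
    lifecycle="production",
    tier=1,
    repository="https://example.com/git/checkout-api",
    runbook="https://example.com/runbooks/checkout-api",
    pager_policy="payments-primary",
    declared_dependencies=("billing", "inventory"),
    runtime_selector={"app.kubernetes.io/name": "checkout-api"},
)
```

Everything outside this record (running workloads, recent rollouts, observed traffic) is deliberately absent: it is looked up at incident time rather than stored on the entity.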
flowchart TD
A[alert arrives — service signal] --> B[resolve catalog entity — owner and tier]
B --> C[fetch runtime inventory — clusters and regions]
B --> D[expand dependency graph — upstream and downstream]
B --> E[read deploy ledger — last successful change]
C --> F[compute blast radius — users and data paths]
D --> F
E --> G[attach change evidence — commit and rollout]
F --> H[incident brief — owner, radius, dependencies, change]
G --> H
H --> I[route escalation — owning team]
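The flow above can be sketched as a single function that takes one lookup per system of record. This is a simplified illustration: the function name, the dictionary shapes, and the stub values are assumptions, and real lookups would call the catalog, cluster, tracing, and delivery APIs.

```python
from typing import Callable


def build_incident_brief(
    alert_service: str,
    resolve_entity: Callable[[str], dict],
    fetch_runtime: Callable[[str], list],
    expand_dependencies: Callable[[str], dict],
    read_last_change: Callable[[str], dict],
) -> dict:
    """Join the catalog entity with live lookups into one response packet."""
    entity = resolve_entity(alert_service)        # owner and tier
    service_id = entity["service_id"]
    runtime = fetch_runtime(service_id)           # clusters and regions
    deps = expand_dependencies(service_id)        # upstream and downstream
    change = read_last_change(service_id)         # last successful change

    # Blast radius is derived from where the service runs and who calls it.
    blast_radius = {
        "active_regions": sorted({w["region"] for w in runtime}),
        "upstream_callers": deps["upstream"],
    }
    return {
        "owner": entity["owner"],
        "tier": entity["tier"],
        "blast_radius": blast_radius,
        "dependencies": deps,
        "last_change": change,
        "escalate_to": entity["owner"],
    }


# Stub lookups standing in for real systems of record.
brief = build_incident_brief(
    "checkout-api",
    resolve_entity=lambda name: {"service_id": name, "owner": "team-payments", "tier": 1},
    fetch_runtime=lambda sid: [
        {"cluster": "prod-us", "region": "us-east-1"},
        {"cluster": "prod-eu", "region": "eu-west-1"},
    ],
    expand_dependencies=lambda sid: {"upstream": ["web-frontend"], "downstream": ["billing"]},
    read_last_change=lambda sid: {"kind": "deploy", "ref": "rollout-123"},
)
```

Passing the lookups in as parameters keeps the join logic identical whether it is triggered from an alert, a CLI, or a chat command.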
The first design decision is identity. Alerts, traces, logs, Kubernetes workloads, deploy jobs, and catalog records need a shared service key. Without that, the workflow becomes fuzzy matching under stress. The catalog can tolerate aliases, but it should converge on one stable entity reference.
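One way to tolerate aliases while still converging on a single key is an explicit alias table with no fuzzy fallback. The sketch below assumes hypothetical alias names and a hypothetical `UnknownServiceError`; the point is that an unmatched name fails loudly instead of guessing.

```python
class UnknownServiceError(LookupError):
    """Raised when a signal's service name maps to no catalog entity."""


# Hypothetical canonical IDs and the aliases seen in alerts, logs, or dashboards.
CANONICAL_IDS = {"checkout-api", "billing"}
ALIASES = {
    "checkout": "checkout-api",
    "checkout_api": "checkout-api",
    "checkout-api-prod": "checkout-api",
}


def resolve_service_id(raw_name: str) -> str:
    """Map a raw service name to its stable catalog ID, or fail explicitly."""
    name = raw_name.strip().lower()
    if name in CANONICAL_IDS:
        return name
    if name in ALIASES:
        return ALIASES[name]
    # No fuzzy matching: an unmatched alert is surfaced as a confidence gap.
    raise UnknownServiceError(f"no catalog entity for {raw_name!r}")
```

An unresolved name is useful evidence in itself: it marks either a missing alias or a service that was never registered.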
The second decision is freshness. Ownership and repository links can be cached. Runtime inventory and last change should be queried live or from a recently updated index. Blast radius is time-sensitive: a service deployed in one region yesterday may be deployed globally today.
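A freshness policy like this can be expressed as a maximum age per kind of fact. The thresholds below are illustrative assumptions, not recommendations; the sketch only shows that slow-moving and fast-moving facts get different budgets.

```python
import time
from typing import Optional

# Hypothetical freshness budgets, in seconds, per kind of fact.
MAX_AGE_SECONDS = {
    "owner": 24 * 3600,        # slow-moving, safe to cache
    "repository": 24 * 3600,   # slow-moving, safe to cache
    "runtime_inventory": 60,   # must reflect what is running now
    "last_change": 60,         # must reflect the latest rollout or flag flip
}


def is_fresh(fact_kind: str, fetched_at: float, now: Optional[float] = None) -> bool:
    """Return True if a cached fact is still within its freshness budget."""
    now = time.time() if now is None else now
    return (now - fetched_at) <= MAX_AGE_SECONDS[fact_kind]
```

A caller would re-query the live system whenever `is_fresh` returns False, and could record the age of each fact in the brief so responders can see what was cached.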
The third decision is confidence. Incident automation should distinguish declared facts from observed facts. A declared dependency says the service is designed to call billing. A trace edge says production traffic actually called billing in the last window. A deployment record says a rollout completed. A runtime label says which workload is running now. These facts should appear together, but not be treated as equivalent.
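To keep declared and observed facts side by side without merging them into one undifferentiated list, each dependency edge can carry a provenance tag. This is a minimal sketch; the tag strings and service names are assumptions.

```python
def merge_dependency_edges(declared: set[str], observed: set[str]) -> dict[str, str]:
    """Tag each dependency with where the evidence for it comes from."""
    merged = {}
    for target in sorted(declared | observed):
        if target in declared and target in observed:
            merged[target] = "declared+observed"
        elif target in declared:
            merged[target] = "declared-only"   # designed, but no recent traffic seen
        else:
            merged[target] = "observed-only"   # real traffic, missing from metadata
    return merged


edges = merge_dependency_edges(
    declared={"billing", "inventory"},
    observed={"billing", "fraud-scoring"},
)
# edges == {"billing": "declared+observed",
#           "fraud-scoring": "observed-only",
#           "inventory": "declared-only"}
```

The two single-source tags point at different follow-ups: an observed-only edge suggests missing catalog metadata, while a declared-only edge may simply mean no traffic in the sampling window.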
A useful incident brief is short and evidence-backed:
- Owner: team, current on-call policy, escalation path
- Service: catalog entity, tier, lifecycle, repository
- Runtime: clusters, regions, namespaces, workload names
- Blast radius: entry points, customer paths, data domains, active regions
- Dependencies: upstream callers and downstream services, marked declared or observed
- Last change: deploy, config, flag, schema, infrastructure, and rollback link
- Confidence: missing labels, stale metadata, unmatched alerts, unknown owners
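The brief fields listed above could be rendered as plain text with missing sections called out rather than omitted. The sketch below assumes a hypothetical dictionary shape for the brief and covers only the rendering step.

```python
SECTIONS = ["owner", "service", "runtime", "blast_radius", "dependencies", "last_change"]


def render_brief(brief: dict) -> str:
    """Render a brief as text, marking empty or absent sections as UNKNOWN."""
    lines = []
    gaps = []
    for key in SECTIONS:
        label = key.replace("_", " ").capitalize()
        value = brief.get(key)
        if value:
            lines.append(f"{label}: {value}")
        else:
            lines.append(f"{label}: UNKNOWN")
            gaps.append(key)
    # The confidence line lists what could not be established.
    lines.append("Confidence gaps: " + (", ".join(gaps) if gaps else "none"))
    return "\n".join(lines)


text = render_brief({"owner": "team-payments", "service": "checkout-api"})
```

Printing UNKNOWN for an empty section keeps responders from reading a missing lookup as evidence of no impact.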
The workflow should be callable from an alert, incident channel, CLI, or chat command. The interface matters less than the invariant: the first response packet is generated from the same graph every time.
In Practice
Context. The public Backstage Software Catalog pattern treats software components as catalog entities with ownership and metadata, rather than scattering that context across repositories and docs. Backstage’s own documentation describes the catalog as a centralized system for tracking ownership and metadata across services, websites, libraries, and other software assets (see Backstage Software Catalog). Kubernetes also defines recommended application labels such as app.kubernetes.io/part-of, app.kubernetes.io/version, and app.kubernetes.io/managed-by, which provide a standard way to connect runtime objects back to application identity (see Kubernetes well-known labels).
Action. The documented pattern is to let the catalog hold the stable entity model, then use runtime labels, deployment metadata, and observability signals as join inputs. In Kubernetes, selectors and labels are already how controllers group objects. In a catalog-driven incident workflow, the same principle is applied across systems: a service entity points to runtime selectors, the selectors find workloads, the workloads point to versions, and the versions point back to deployment records.
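That chain (entity to selector, selector to workloads, workloads to versions, versions to deployment records) can be sketched with in-memory data. The example assumes the service ID is carried in the `app.kubernetes.io/name` label and the version in `app.kubernetes.io/version`; the workload list and deploy records are hypothetical stand-ins for cluster and delivery-system queries.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality-based label match: every selector pair must appear in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def join_runtime_to_deploys(service_id, selector, workloads, deploy_records):
    """Follow selector -> workloads -> versions -> deployment records."""
    rows = []
    for workload in workloads:
        if not matches(selector, workload["labels"]):
            continue
        version = workload["labels"].get("app.kubernetes.io/version")
        rows.append({
            "workload": workload["name"],
            "cluster": workload["cluster"],
            "version": version,
            # None means the running version has no known deployment record.
            "deploy": deploy_records.get((service_id, version)),
        })
    return rows


workloads = [
    {"name": "checkout-api", "cluster": "prod-eu",
     "labels": {"app.kubernetes.io/name": "checkout-api", "app.kubernetes.io/version": "1.42.0"}},
    {"name": "checkout-api", "cluster": "prod-us",
     "labels": {"app.kubernetes.io/name": "checkout-api", "app.kubernetes.io/version": "1.41.3"}},
    {"name": "billing", "cluster": "prod-eu",
     "labels": {"app.kubernetes.io/name": "billing", "app.kubernetes.io/version": "7.0.1"}},
]
deploy_records = {("checkout-api", "1.42.0"): {"rollout": "rollout-123", "commit": "abc1234"}}

rows = join_runtime_to_deploys(
    "checkout-api", {"app.kubernetes.io/name": "checkout-api"}, workloads, deploy_records
)
```

In this data the `prod-us` workload runs a version with no deployment record, which is exactly the kind of gap the brief should surface instead of hiding.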
Result. The result is not magic root cause analysis. It is a deterministic triage packet. If an alert names checkout-api, the workflow resolves the catalog entity, finds the owning group, reads current workloads in production, expands known and observed dependencies, and attaches the most recent rollout or configuration change. That packet gives responders a narrower search space before they open dashboards.
Learning. Google’s public SRE writing emphasizes that emergency response improves when incident procedures and tooling are refined, tested, and communicated clearly (see Google SRE Emergency Response). The service catalog contributes when it becomes part of that tested response path. A catalog page that humans may or may not open is documentation. A catalog-backed incident brief attached to every pager alert is operational infrastructure.
Where It Breaks
| Failure mode | Why it happens | Mitigation |
|---|---|---|
| Stale ownership | Teams rename, merge, or transfer services without updating metadata | Require ownership checks in repository and deploy workflows |
| Weak identity | Alert names, repository names, and workload labels drift apart | Enforce a stable service ID across catalog, telemetry, and deployment |
| Static dependency graph | Declared dependencies miss runtime behavior | Combine catalog declarations with traces, mesh telemetry, and gateway logs |
| Last change ambiguity | Deploys, flags, config, and schema changes live in separate tools | Build a change ledger keyed by service ID and time |
| Overconfident automation | The workflow treats missing data as proof of no impact | Show confidence and missing evidence explicitly |
| Catalog as bottleneck | Every tool waits on the catalog team to model new fields | Keep the core schema small and allow owned extensions |
| No incident feedback loop | Responders fix metadata locally but not at the source | Add post-incident catalog corrections as tracked remediation |
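The change-ledger mitigation in the table can be illustrated with a flat list of change events keyed by service ID and timestamp. This is a sketch under assumptions: the event fields, the two-hour window, and the sample data are hypothetical, and a real ledger would be fed by CI/CD, flag, config, and schema tooling.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(ledger, service_id, incident_start, window=timedelta(hours=2)):
    """Return changes to one service inside the window, most recent first."""
    earliest = incident_start - window
    hits = [
        event for event in ledger
        if event["service_id"] == service_id and earliest <= event["at"] <= incident_start
    ]
    return sorted(hits, key=lambda event: event["at"], reverse=True)


def at(hour, minute):
    """Helper for building sample timestamps on one hypothetical day (UTC)."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


ledger = [
    {"service_id": "checkout-api", "kind": "schema", "at": at(9, 0), "ref": "migration-17"},
    {"service_id": "checkout-api", "kind": "deploy", "at": at(13, 40), "ref": "rollout-123"},
    {"service_id": "checkout-api", "kind": "flag", "at": at(13, 45), "ref": "new-pricing"},
    {"service_id": "billing", "kind": "deploy", "at": at(13, 50), "ref": "rollout-456"},
]

changes = recent_changes(ledger, "checkout-api", incident_start=at(14, 0))
```

Because deploys and flag changes share one timeline, the flag flip five minutes after the deploy shows up first instead of being hidden behind the last rollout.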
The most common failure is pretending the catalog is the source of truth for facts it only mirrors. Runtime state belongs to runtime systems. Deploy state belongs to delivery systems. Ownership may belong to an identity or team-management system. The catalog’s job is to provide the common identity graph and make the joins cheap.
The second common failure is optimizing for browsing instead of response. Search, tags, and polished profile pages help engineers discover services. Incidents need narrower behavior: resolve this signal, identify this owner, expand this graph, show this change, and expose uncertainty.
What to Do Next
- Problem: Incident responders lose time because ownership, blast radius, dependencies, and last change are split across tools. Make the service catalog responsible for joining those facts, not merely displaying them.
- Solution: Define a stable service ID, require it in catalog descriptors, runtime labels, telemetry, and deployment records, then generate an incident brief from that shared identity.
- Proof: Backstage demonstrates the catalog entity pattern for ownership and metadata, Kubernetes demonstrates label-based runtime grouping, and SRE practice emphasizes tested emergency workflows over ad hoc response.
- Action: Start with one critical service tier. Enforce service identity in CI, add runtime label checks in deployment, index the last successful rollout, and wire the incident tool to produce the owner, blast radius, dependency, and last-change packet automatically.
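As a starting point for the identity and label checks, a CI step could compare the catalog descriptor with the labels on the workload manifest. The sketch below is illustrative only: the required field list, the choice of `app.kubernetes.io/name` as the identity label, and the dictionary inputs are assumptions, and a real check would parse the actual descriptor and manifest files.

```python
REQUIRED_DESCRIPTOR_FIELDS = ("service_id", "owner", "tier")


def check_service_identity(descriptor, workload_labels, label_key="app.kubernetes.io/name"):
    """Return a list of identity problems; an empty list means the check passes."""
    problems = []
    for name in REQUIRED_DESCRIPTOR_FIELDS:
        if descriptor.get(name) in (None, ""):
            problems.append(f"descriptor missing {name}")

    service_id = descriptor.get("service_id")
    label_value = workload_labels.get(label_key)
    if label_value is None:
        problems.append(f"workload missing label {label_key}")
    elif service_id and label_value != service_id:
        problems.append(
            f"label {label_key}={label_value!r} does not match service_id {service_id!r}"
        )
    return problems


ok = check_service_identity(
    {"service_id": "checkout-api", "owner": "team-payments", "tier": 1},
    {"app.kubernetes.io/name": "checkout-api"},
)
drifted = check_service_identity(
    {"service_id": "checkout-api", "owner": "", "tier": 1},
    {"app.kubernetes.io/name": "checkout"},
)
```

A pipeline would fail the build when the returned list is non-empty, which catches identity drift before it reaches an incident.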