Internal Developer Platform Reference Architecture: Catalog, IaC, CI/CD, Policy, and Observability
An internal developer platform fails when it becomes a portal in front of the same old manual delivery system. The useful platform is not a website, a template repository, or a Kubernetes wrapper. It is a control plane for software ownership, infrastructure intent, delivery evidence, policy decisions, and operational feedback.
Situation
Most engineering organizations reach for platform engineering after the same pattern repeats across teams. Application teams can ship code, but every production change requires a scattered sequence of tickets, tribal knowledge, Slack approvals, copied Terraform, fragile pipeline YAML, and post-release dashboard archaeology.
The result is not just slowness. It is inconsistent risk. One team gets a hardened deployment path with rollback, ownership metadata, and useful telemetry. Another team deploys through a hand-edited workflow with unclear runtime dependencies and no obvious service owner. Both are “using the platform,” but only one is operating inside a reliable delivery system.
The internal developer platform changes the unit of abstraction. Instead of exposing every infrastructure primitive directly, it exposes a productized path from service creation to production operation. The platform owns the boring and dangerous glue: catalog registration, infrastructure provisioning, delivery workflows, policy enforcement, secrets boundaries, observability defaults, and lifecycle metadata.
The Problem
The common failure mode is building the platform as a collection of disconnected tools.
A service catalog knows who owns a service, but the CI system does not use that metadata. Terraform provisions infrastructure, but policy runs later during a security review. CI produces artifacts, but deployment has no proof of the source commit, test run, or approval path. Observability exists, but dashboards are not created until after an incident. The developer portal looks coherent while the delivery path remains stitched together by convention.
This creates five operational problems.
First, ownership is advisory instead of executable. If ownership metadata does not drive routing, approvals, scorecards, and incident escalation, it decays.
Second, infrastructure intent is separated from application lifecycle. Teams can create cloud resources without making those resources visible in the catalog, measurable in cost reports, or connected to service health.
Third, CI/CD becomes a permission bypass. Pipelines accumulate special cases until deployment safety depends on who copied which YAML file two years ago.
Fourth, policy arrives too late. A platform that finds encryption, network, image provenance, or runtime issues after merge has already converted engineering feedback into organizational friction.
Fifth, observability is treated as inspection rather than contract. Dashboards and alerts created by hand are symptoms of an architecture that did not define production readiness at service creation time.
The core question is: how should an internal developer platform connect catalog, IaC, CI/CD, policy, and observability so the golden path is both easier and safer than the manual path?
Core Concept
The answer is a platform control plane with the catalog as the system of record and automation as the enforcement mechanism.
flowchart TD
A[developer request — service change] --> B[service catalog — ownership and scorecards]
B --> C[golden paths — templates and paved workflows]
C --> D[repository — app code and platform contract]
D --> E[CI pipeline — build test attest]
E --> F[IaC plan — environment intent]
F --> G[policy checks — risk and compliance gates]
G --> H[CD controller — progressive delivery]
H --> I[runtime platform — Kubernetes and managed services]
I --> J[observability — traces metrics logs]
J --> B
I --> K[incident workflow — SLO and ownership]
K --> B
The catalog is not a wiki. It is the platform inventory and ownership API. Each service entry should carry owner, lifecycle, tier, runtime, repository, deployment targets, dependencies, runbooks, dashboards, SLOs, and compliance classification. Backstage popularized this model with a software catalog and templates that connect ownership metadata to developer workflows.
The golden path starts with templates, but templates are only the first transaction. A good service template creates the repository, catalog descriptor, CI workflow, IaC module binding, deployment configuration, observability baseline, and operational documentation stub. A better template also creates the first pull request, forcing all generated platform contracts to pass normal review.
IaC is the environment contract. It should express what the service needs, not every low-level resource choice. Platform teams should publish opinionated modules for common patterns: HTTP service, event consumer, scheduled job, private data store, object storage bucket, queue, and cache. The module interface is where the platform encodes defaults for encryption, network placement, backup policy, tagging, and cost attribution.
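A minimal sketch of what such a module interface could look like, written in Python for readability rather than in any particular IaC language. The names `HttpServiceIntent`, `PLATFORM_DEFAULTS`, `LOCKED_KEYS`, and `resolve_module_inputs` are hypothetical, and the default values are assumptions chosen only to illustrate the idea that teams declare intent while the platform supplies and protects the defaults.

```python
from dataclasses import dataclass, field

# Hypothetical platform defaults; a real platform would source these from policy.
PLATFORM_DEFAULTS = {
    "encryption_at_rest": True,
    "network_placement": "private",
    "backup_retention_days": 7,
}

# Settings a team may not change through overrides (an assumed rule).
LOCKED_KEYS = {"encryption_at_rest"}


@dataclass(frozen=True)
class HttpServiceIntent:
    """What a team declares: needs and identity, not low-level resources."""
    service: str
    owner: str
    cost_center: str
    environment: str
    overrides: dict = field(default_factory=dict)


def resolve_module_inputs(intent: HttpServiceIntent) -> dict:
    """Merge declared intent with platform defaults and derive tags."""
    blocked = LOCKED_KEYS.intersection(intent.overrides)
    if blocked:
        raise ValueError(f"overrides not allowed for: {sorted(blocked)}")
    unknown = set(intent.overrides) - set(PLATFORM_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")

    resolved = {**PLATFORM_DEFAULTS, **intent.overrides}
    # Tags come from service identity so cost attribution needs no manual step.
    resolved["tags"] = {
        "service": intent.service,
        "owner": intent.owner,
        "cost_center": intent.cost_center,
        "environment": intent.environment,
    }
    return resolved
```

The escape hatch here is the `overrides` field: a team can raise backup retention, but it cannot switch off encryption without changing the module itself, which keeps that decision in platform review.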
CI is the evidence factory. It should produce build artifacts, test results, vulnerability scans, software bills of materials where required, provenance attestations, and policy evaluation output. CI should not be the only place where policy lives, but it is the earliest useful place to give developers fast feedback.
CD is the release controller. It should consume evidence from CI, environment intent from IaC, and policy decisions from the platform. Progressive delivery, automatic rollback, deployment windows, and approval rules belong here because they depend on runtime context. A deployment to a low-tier internal service and a deployment to a customer-facing payment path should not have the same gates.
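One way to picture tier-dependent gates is a small decision function that takes CI evidence and a service tier and returns a release decision. This is a sketch under assumptions: the `Evidence` fields, the tier names, and the `GATES_BY_TIER` mapping are hypothetical and do not correspond to any specific CD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Evidence produced by CI for a single commit (hypothetical shape)."""
    commit_sha: str
    tests_passed: bool
    provenance_attested: bool
    critical_vulnerabilities: int


# Hypothetical mapping from service tier to the gates the CD controller applies.
GATES_BY_TIER = {
    "tier-1": {"require_provenance": True, "require_approval": True, "canary": True},
    "tier-3": {"require_provenance": False, "require_approval": False, "canary": False},
}


def release_decision(tier: str, evidence: Evidence, approved: bool) -> dict:
    """Decide whether a release may proceed and with which rollout strategy."""
    gates = GATES_BY_TIER[tier]
    reasons = []

    # Baseline checks apply to every tier.
    if not evidence.tests_passed:
        reasons.append("tests did not pass")
    if evidence.critical_vulnerabilities > 0:
        reasons.append("critical vulnerabilities present")

    # Tier-specific gates.
    if gates["require_provenance"] and not evidence.provenance_attested:
        reasons.append("missing provenance attestation")
    if gates["require_approval"] and not approved:
        reasons.append("missing approval")

    return {
        "allowed": not reasons,
        "strategy": "canary" if gates["canary"] else "rolling",
        "reasons": reasons,
    }
```

With this shape, the same evidence can pass for a low-tier internal service and fail for a customer-facing one, and the returned `reasons` give the developer something actionable instead of a bare rejection.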
Policy should be centralized in authorship and distributed in execution. The same rule should be runnable during local validation, pull request checks, IaC planning, admission control, and runtime audit. Kubernetes dynamic admission control and policy engines such as Open Policy Agent Gatekeeper demonstrate the pattern: reject unsafe changes before they become live state, then continuously detect drift.
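The idea of authoring a rule once and running it at several enforcement points can be sketched as a plain function plus a small evaluator. This is illustrative Python, not Rego and not the Gatekeeper API; the resource shape, the stage names, and the `evaluate` helper are assumptions made for the example.

```python
from typing import Callable, List, Optional

# A rule inspects one resource and returns a violation message or None.
Rule = Callable[[dict], Optional[str]]


def require_encrypted_storage(resource: dict) -> Optional[str]:
    """Single authored rule: storage buckets must enable encryption."""
    if resource.get("kind") == "bucket" and not resource.get("encrypted", False):
        return f"bucket {resource.get('name', '?')} must enable encryption"
    return None


RULES: List[Rule] = [require_encrypted_storage]


def evaluate(resources: List[dict], stage: str, enforce: bool) -> dict:
    """Run every rule at a given stage; only block when enforce is True."""
    violations = []
    for resource in resources:
        for rule in RULES:
            message = rule(resource)
            if message is not None:
                violations.append(message)
    return {
        "stage": stage,
        "violations": violations,
        "blocked": enforce and bool(violations),
    }


if __name__ == "__main__":
    planned = [{"kind": "bucket", "name": "invoices", "encrypted": False}]
    # Same rule, different stages: advisory in the pull request, blocking at admission.
    print(evaluate(planned, stage="pull_request", enforce=False))
    print(evaluate(planned, stage="admission", enforce=True))
```

Because the stage only changes the `enforce` flag, a new rule can ship in warning mode everywhere first and be switched to blocking per stage once teams have had time to fix existing violations.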
Observability closes the loop. The platform should create default telemetry wiring, service dashboards, alert routes, SLO templates, and dependency views at service birth. Google SRE’s SLO framing is useful here: reliability targets are not decorative metrics; they are decision inputs for release speed, paging, and error budget policy.
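To make "decision input" concrete, here is a small sketch of an error budget calculation and a release policy derived from it. The arithmetic follows the usual definition (allowed failures are the complement of the SLO target over the window); the function names and the 25% threshold in `release_policy` are assumptions for illustration, not values taken from the SRE material.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of error budget left in the window.

    1.0 means the budget is untouched, 0.0 means it is exactly spent,
    and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events == 0:
        return 1.0  # no traffic, so no budget consumed
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def release_policy(remaining: float) -> str:
    """Map remaining budget to a release posture (thresholds are assumptions)."""
    if remaining <= 0:
        return "freeze"
    if remaining < 0.25:
        return "slow"
    return "normal"
```

For example, a 99.9% target over 1,000,000 requests allows roughly 1,000 failures; 500 observed failures leaves about half the budget, so releases continue at normal speed, while 2,000 failures overspends the budget and the policy returns `freeze`.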
In Practice
Context: Spotify’s Backstage documentation describes a software catalog model where components, ownership, documentation, and templates are part of the developer portal system. The documented pattern is that catalog-info.yaml entity descriptors become a shared interface for discovering and operating software, not merely a manually maintained service list.
Action: Use catalog descriptors as code. Require every service to declare ownership, lifecycle, repository, runtime type, and operational links in version control. Generate the descriptor during service creation, then validate it in CI and expose it through the portal.
Result: The platform gains a stable join key between repositories, deployments, dashboards, incidents, and scorecards. This result follows from the catalog pattern itself: once components have durable identities, other systems can attach delivery and operations data to those identities.
Learning: Treat catalog quality as production hygiene. Metadata that does not drive automation will rot; metadata that gates deployment, routes alerts, and powers scorecards tends to stay accurate.
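A CI check for descriptors can be very small. The sketch below validates an already-parsed descriptor (a Python dict, so no YAML library is needed) against the fields a Backstage-style Component normally carries: `kind`, `metadata.name`, and `spec.type`, `spec.lifecycle`, `spec.owner`. The annotation keys under `example.com/` are hypothetical platform conventions, not Backstage-defined names.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")

# Hypothetical platform-specific annotations for operational links.
REQUIRED_ANNOTATIONS = ("example.com/runbook-url", "example.com/dashboard-url")


def validate_descriptor(entity: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []

    if entity.get("kind") != "Component":
        errors.append("kind must be 'Component'")

    metadata = entity.get("metadata") or {}
    if not metadata.get("name"):
        errors.append("metadata.name is required")

    spec = entity.get("spec") or {}
    for field_name in REQUIRED_SPEC_FIELDS:
        if not spec.get(field_name):
            errors.append(f"spec.{field_name} is required")

    annotations = metadata.get("annotations") or {}
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            errors.append(f"metadata.annotations['{key}'] is required")

    return errors
```

Running this on every pull request that touches the descriptor turns missing ownership or missing runbook links into a failed check rather than a surprise during an incident.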
Context: Kubernetes documents dynamic admission control: validating webhooks, registered through a ValidatingWebhookConfiguration, intercept API requests before objects are persisted. OPA Gatekeeper applies policy-as-code to that admission path for Kubernetes resources by evaluating Rego policies against incoming requests.
Action: Run policy in multiple places with the same intent: fast checks in pull requests through CI, plan checks against Terraform plans, admission checks at the cluster boundary, and audit checks against live state.
Result: Policy moves from late review to continuous feedback. The documented Kubernetes pattern supports pre-persistence enforcement, while audit mode covers objects that already exist or were created before a rule became mandatory.
Learning: Do not make CI the only enforcement point. CI can be bypassed, misconfigured, or skipped for emergency paths. Runtime admission and audit give the platform a second line of defense.
Context: Google’s SRE material defines SLOs as explicit reliability objectives derived from user expectations and system behavior. An SLO sets a target for a Service Level Indicator (SLI), such as availability or latency, measured over a defined window.
Action: Make observability part of the service template. Generate dashboards, alert routes, SLO placeholders, and runbook links when the service is created. Require higher-tier services to define SLIs before production promotion.
Result: Production readiness becomes reviewable before launch. The platform can compare service tier, alerting, SLO presence, and deployment policy as part of a scorecard.
Learning: Observability is a platform contract. If a team must discover its telemetry model during an incident, the platform delivered infrastructure but not operability.
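The promotion rule described above can be sketched as a readiness check over service metadata. The field names (`tier`, `alert_route`, `dashboard_url`, `slis`) and the set of tiers that require SLIs are assumptions for illustration; a real scorecard would read them from the catalog.

```python
# Tiers that must define SLIs before production promotion (an assumed policy).
TIERS_REQUIRING_SLIS = {"tier-1", "tier-2"}


def readiness_gaps(service: dict) -> list:
    """List the observability gaps that block production promotion."""
    gaps = []
    if not service.get("alert_route"):
        gaps.append("no alert route")
    if not service.get("dashboard_url"):
        gaps.append("no dashboard")
    if service.get("tier") in TIERS_REQUIRING_SLIS and not service.get("slis"):
        gaps.append("tier requires at least one SLI")
    return gaps


def can_promote(service: dict) -> bool:
    """A service may be promoted only when it has no readiness gaps."""
    return not readiness_gaps(service)
```

Returning the gaps as a list, rather than a single pass or fail flag, lets the same function feed both the promotion gate and the scorecard view.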
Where It Breaks
| Failure mode | Why it happens | Mitigation |
|---|---|---|
| Portal without enforcement | The catalog is disconnected from CI, CD, and runtime | Make catalog identity required for deployment |
| Template sprawl | Every team forks the golden path | Version templates and publish migration paths |
| Policy backlash | Rules block delivery without useful feedback | Run rules in warn mode before enforce mode |
| IaC abstraction leakage | Modules hide too much or expose cloud internals | Provide opinionated modules with escape hatches |
| CI/CD exception paths | Urgent releases bypass platform controls | Define break-glass workflows with audit trails |
| Dashboard drift | Observability is created manually | Generate telemetry assets from service metadata |
| Scorecard theater | Metrics measure compliance but not risk | Tie scorecards to operational outcomes and tiers |
What to Do Next
- Problem: Your platform likely has the right tools but weak connective tissue. Catalog, IaC, CI/CD, policy, and observability are useful only when they share service identity and lifecycle state.
- Solution: Put the catalog at the center, make golden paths generate complete production contracts, and run policy at pull request, plan, admission, and audit time.
- Proof: Use documented patterns from Backstage-style catalogs, Kubernetes admission control, OPA Gatekeeper, and SRE SLO practice instead of inventing a bespoke governance model.
- Action: Pick one service archetype, such as an HTTP API, and build the full path end to end: template, catalog descriptor, IaC module, CI evidence, CD policy, dashboards, alerts, and scorecard. Then make that path easier than filing a ticket.