#ci-cd

58 posts

Aug 12, 2025 7 min read

L2 Deep Dive

The Platform Automation Maturity Model: Scripts, Modules, Catalogs, Pipelines, Control Planes

How platform automation matures from one-off scripts to a governed control plane — and where most teams get stuck between modules and catalogs.

#automation #platform #ci-cd

Jul 15, 2025 7 min read

L2 Deep Dive

Cloud & Platform

Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality

How to roll back automation safely when it misfires — the four-stage playbook: disable the automation, revert the change, repair state, and reconcile system reality with declared intent.

#automation #platform #ci-cd

Jun 10, 2025 7 min read

L2 Deep Dive

Cloud & Platform

DB Team Automation Roadmap: Backups, Patching, Refreshes, Provisioning, and Guardrails

A sequenced roadmap for database teams to automate backups, patching, refreshes, and provisioning — with guardrails that prevent automation from becoming a risk multiplier.

#automation #platform #ci-cd

May 13, 2025 8 min read

L2 Deep Dive

Cloud & Platform

SRE Automation Backlog: How to Rank Toil by Risk, Frequency, and Recoverability

Ranking SRE toil by recoverability, blast radius, and frequency reveals which manual failure paths deserve automation investment before the next incident.

#automation #platform #ci-cd

Mar 11, 2025 7 min read

L2 Deep Dive

Cloud & Platform

From Python Script to Platform Capability: Versioning, Ownership, Support, and Release Notes

A Python script becomes a platform liability when it gains organizational dependencies without versioning, an owner, or a defined support contract.

#automation #platform #ci-cd

Feb 11, 2025 7 min read

L2 Deep Dive

Cloud & Platform

Secrets and Credentials in Python Automation: Local Dev, CI, Cloud, and Rotation

Credential handling in Python automation breaks at the boundaries between local dev, CI pipelines, and cloud execution when rotation is an afterthought.

#automation #platform #ci-cd

Jan 14, 2025 7 min read

L2 Deep Dive

Cloud & Platform

Building a Safe Python Migration Runner for Operational Data Changes

A Python migration runner for live operational data needs idempotency guards, dry-run modes, and rollback hooks that schema migrations skip by default.

#automation #platform #ci-cd

Dec 17, 2024 7 min read

L2 Deep Dive

Cloud & Platform

The Deployment Control Plane: CI/CD, Catalog, Policy, Observability, and Human Approval

CI/CD, service catalog ownership, policy gates, and SLO observability wired into a control plane that authorizes each deployment before it ships.

#automation #platform #ci-cd

Dec 10, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Python Database Maintenance Jobs: Safety Checks, Locks, Batches, and Rollback

Python database maintenance jobs that skip lock checks, batch limits, and replication lag awareness can corrupt data or starve live queries under load.

#automation #platform #ci-cd

Nov 19, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Progressive Delivery Reference Architecture: CI, GitOps, Flags, SLOs, and Rollback

GitOps, feature flags, and SLO-gated rollback wired into a CI pipeline that treats deploy, release, verification, and rollback as separate stages.

#automation #platform #ci-cd

Nov 12, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Testing Python Automation: Unit Tests, Contract Tests, Fakes, and Cloud Sandboxes

Four testing layers for Python automation — unit, contract, fakes, and cloud sandboxes — targeting the API drift and retry failures that local CI misses.

#automation #platform #ci-cd

Oct 8, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Python Package Layout for Internal Automation Modules

Filesystem layout, entry points, and dependency isolation when Python automation crosses from script origins to production-critical shared infrastructure.

#automation #platform #ci-cd

Sep 10, 2024 8 min read

L2 Deep Dive

Cloud & Platform

Structured Logging for Automation: The Debug Trail You Need at 2 AM

JSON schemas, correlation IDs, and log-level policies that make automation failures forensically legible before the on-call page arrives at 2 AM.

#automation #platform #ci-cd

Aug 20, 2024 7 min read

L2 Deep Dive

Cloud & Platform

GitHub Actions for Platform Teams: Reusable Workflows, OIDC, Environments, and Audit

GitHub Actions reusable workflows, OIDC credential federation, and environment approval gates — preventing per-repo credential sprawl across a platform.

#automation #platform #ci-cd

Aug 13, 2024 7 min read

L2 Deep Dive

Cloud & Platform

SDK Wrappers: How to Hide Cloud Provider Mess Without Hiding Risk

Cloud SDK wrapper design: how to abstract provider credential and retry complexity without obscuring blast radius or making dangerous operations look safe.

#automation #platform #ci-cd

Jul 9, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Python CLIs for Ops Teams: Arguments, Config, Dry Run, and Exit Codes

Python CLI design for ops scripts: argument parsing, config layering, dry-run modes, and exit codes that make automation safe to run in production.

#automation #platform #ci-cd

Jun 18, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Terraform in CI/CD: Plan, Review, Apply, Lock, and Rollback Boundaries

Terraform in CI/CD requires different gates than application deployments: plan review thresholds, apply lock design, environment promotion, and a rollback boundary that actually works when state diverges.

#automation #platform #ci-cd

Jun 11, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Idempotent Python Jobs: The Difference Between Retry and Duplicate Damage

Python jobs without idempotency guards turn retries into duplicate database writes or double charges — the design patterns that make re-execution safe.

#automation #platform #ci-cd

May 21, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Feature Flags vs Deployments: Separating Release From Risk

Feature flags separate the deploy event from the release decision, letting you control which users absorb new behavior without reverting a deployment.

#automation #platform #ci-cd

Apr 16, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Pipeline Secrets: Why CI Is Often Your Weakest Production Boundary

CI pipelines carry production credentials with less access modeling than the services they deploy, making build pipelines a common source of credential exposure.

#automation #platform #ci-cd

Apr 9, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Why Service Catalogs Fail: Adoption, Trust, Freshness, and Platform Team Incentives

Service catalogs fail when treated as static registries instead of operational systems that enforce ownership and freshness continuously.

#automation #platform #ci-cd

Feb 20, 2024 6 min read

L2 Deep Dive

Cloud & Platform

GitOps Is Reconciliation, Not Just YAML in Git

GitOps breaks when the control loop is never implemented: teams treat YAML in Git as the destination instead of treating the reconciliation loop as the product.

#automation #platform #ci-cd

Feb 13, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Service Catalog Incident Workflow: Find Owner, Blast Radius, Dependencies, and Last Change

Service catalog fields for owner, dependency graph, blast radius, and last deploy that cut incident triage time before Slack threads spiral.

#automation #platform #ci-cd

Jan 23, 2024 8 min read

L2 Deep Dive

Cloud & Platform

CI/CD Pipeline Design: Fast Feedback vs Safe Promotion

Structuring CI/CD pipelines so unit tests give fast feedback without sacrificing the promotion gates that prevent bad builds from reaching production.

#automation #platform #ci-cd

Jan 9, 2024 7 min read

L2 Deep Dive

Cloud & Platform

Catalog-to-CI Integration: Ownership, Deployment History, SLOs, and Change Risk

Linking a service catalog to CI gates enables change risk scoring from ownership, SLO status, and deployment history — beyond pipeline pass/fail alone.

#automation #platform #ci-cd

Dec 12, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Platform Scorecard Rollout: Standards Without Turning the Catalog Into Shelfware

Rolling out a platform scorecard without tying it to CI gates and team OKRs turns engineering standards into documentation that nobody reads.

#automation #platform #ci-cd

Nov 14, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Service Lifecycle Workflow: Create, Promote, Deprecate, Archive, Delete

Service lifecycle management — from creation through deprecation and safe deletion — requires a control system beyond the deployment pipeline.

#automation #platform #ci-cd

Oct 10, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Self-Service Database Provisioning: Catalog Request, Terraform Module, Policy, and Audit

Database provisioning via catalog request and Terraform module: the policy and audit gates that make self-service trustworthy to security and operations.

#automation #platform #ci-cd

Sep 19, 2023 7 min read

L2 Deep Dive

Cloud & Platform

OpenTofu vs Terraform: What Platform Teams Should Actually Evaluate

OpenTofu vs. Terraform on licensing risk, provider supply chain compatibility, state safety, and the migration cost platform teams actually absorb.

#automation #platform #ci-cd

Sep 12, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Service Catalog Data Model: Services, Systems, Resources, Owners, and Dependencies

How services, systems, resources, owners, and dependency edges compose into a service catalog schema that supports incident response and delivery tracing.

#automation #platform #ci-cd

Aug 8, 2023 9 min read

L2 Deep Dive

Cloud & Platform

Backstage, Port, Cortex, and AWS Service Catalog: Different Tools, Different Control Planes

Backstage, Port, Cortex, and AWS Service Catalog compared on control-plane model — which tools provision, which only display, and where each abstraction breaks down.

#automation #platform #ci-cd

Jul 11, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Ownership Metadata: The Small Catalog Field That Fixes Incidents

Ownership fields in the service catalog make the responsible team discoverable at alert time — the missing link that shortens incident duration.

#automation #platform #ci-cd

Jun 13, 2023 6 min read

L2 Deep Dive

Cloud & Platform

Software Templates: Where Developer Portals Become Delivery Systems

Developer portal templates become a delivery system when they enforce scaffolding, CI wiring, and ownership at service creation, rather than documenting them afterward.

#automation #platform #ci-cd

May 9, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Scorecards: Turning Platform Standards Into Visible Engineering Debt

Scorecards turn platform standards into per-service debt that owners can see, dispute, and retire — the mechanism that makes wiki-page rules enforceable.

#automation #platform #ci-cd

Apr 11, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Golden Paths: The Platform Contract Behind Self-Service Engineering

Golden paths work when the platform publishes a contract — opinionated defaults, SLO guarantees, and upgrade boundaries — not just a curated toolbox.

#automation #platform #ci-cd

Mar 14, 2023 7 min read

L2 Deep Dive

Cloud & Platform

What Belongs in a Service Catalog and What Does Not

Service catalogs work when they enforce ownership, runbooks, and deploy targets — not when they duplicate documentation already in code or wikis.

#automation #platform #ci-cd

Feb 14, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Multi-Account Terraform Architecture: State, IAM, Network, and Promotion Boundaries

Multi-account Terraform design: isolating state, IAM, and network boundaries per environment so a single misconfiguration cannot cross promotion gates.

#automation #platform #ci-cd

Jan 10, 2023 7 min read

L2 Deep Dive

Cloud & Platform

Terraform for Kubernetes Operators: Installing the Platform Without Owning Every App

Terraform boundary design for Kubernetes operators separates control-plane installation from application delivery to prevent ownership and state conflicts.

#automation #platform #ci-cd

Dec 13, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Terraform for RDS and Aurora: What Should Be Automated and What Should Stay Manual

Database automation should encode the repetitive safety controls and leave judgment-heavy decisions to humans — what to automate in RDS and Aurora Terraform modules and what must stay gated on human review.

#automation #platform #ci-cd

Nov 8, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Testing Terraform Modules: Static Checks, Plan Tests, Local Emulators, and Sandboxes

Terraform modules fail because tests are placed at the wrong layer: too late to be cheap, too mocked to be truthful — how to combine static analysis, plan-level assertions, and sandbox environments for reliable module testing.

#automation #platform #ci-cd

Oct 11, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Policy as Code for Terraform: OPA, Sentinel, Checkov, and Human Review

Terraform review fails when humans rediscover the same constraints in every PR — how OPA, Sentinel, and Checkov encode policy gates that catch public storage buckets, unencrypted databases, and missing tags at plan time.

#automation #platform #ci-cd

Aug 9, 2022 6 min read

L2 Deep Dive

Cloud & Platform

Terraform Import Workflow: Bringing Existing Cloud Resources Under Control

Terraform import's dangerous moment is not the command — it is when a team mistakes 'now in state' for 'now under control.' A safe import workflow covering targeted plans, drift checks, and state file validation before any apply.

#automation #platform #ci-cd

Jul 12, 2022 8 min read

L2 Deep Dive

Cloud & Platform

Terraform Drift Triage Workflow: Detect, Classify, Reconcile, Prevent

Terraform drift is a control-plane integrity problem. How to detect it, classify whether it is an emergency or an acceptable deviation, reconcile state safely, and prevent future divergence between state and reality without blocking legitimate out-of-band changes.

#automation #platform #ci-cd

Jun 14, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Terraform Module Design Checklist for Database Infrastructure

Database Terraform modules fail when they hide operational decisions behind convenient defaults — a checklist covering parameter groups, backup policies, encryption, and the boundaries that must never be automated away.

#automation #platform #ci-cd

May 10, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Remote State, Locks, and Backends: The Hidden Database Behind IaC

Infrastructure as Code becomes operationally safe only when the state store has concurrency control, durability, auditability, and documented recovery procedures — treating Terraform backends as production databases, not build artifacts.

#automation #platform #ci-cd

Apr 12, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Variables, Locals, and Outputs: The API Surface of Infrastructure Modules

Infrastructure modules fail as software interfaces before they fail as infrastructure — how Terraform variables, locals, and outputs define the API surface that determines whether a module is reusable or a maintenance burden.

#automation #platform #ci-cd

Mar 8, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Terraform Plan Review: What Senior Engineers Look For

Terraform plan review is not a syntax check — it is the last cheap place to catch a production architecture mistake before an API turns intent into infrastructure. What senior engineers actually look for in a plan output.

#automation #platform #ci-cd

Feb 8, 2022 6 min read

L2 Deep Dive

Cloud & Platform

Terraform Workspaces vs Separate State: The Environment Isolation Decision

Most Terraform environment failures come from placing the wrong isolation boundary around state, credentials, approvals, and blast radius. When to use workspaces, and when separate state files with separate backends are the correct choice.

#automation #platform #ci-cd

Jan 11, 2022 7 min read

L2 Deep Dive

Cloud & Platform

Terraform Modules: Reuse Boundary or Organizational Trap

The first Terraform module removes duplication. The fiftieth reveals the real architecture: who owns infrastructure decisions, who absorbs breaking changes, and whether the platform is a product or a shared pile of HCL.

#automation #platform #ci-cd

Dec 14, 2021 7 min read

L2 Deep Dive

Cloud & Platform

Automation Incident Review: When the Tool Worked and the System Failed

The hardest automation incidents are not broken tools — they happen when every tool executes exactly as asked while the surrounding system loses the ability to evaluate whether that action is still safe.

#automation #platform #ci-cd

Nov 9, 2021 8 min read

L2 Deep Dive

Cloud & Platform

Runbook to Pipeline: How to Convert Manual Operations Without Creating Risk

Converting a runbook into an automated pipeline is not a transcription exercise — a human operator can stop at bad preconditions, and a pipeline must explicitly encode every check that was previously implicit in that judgment.

#automation #platform #ci-cd

Oct 12, 2021 7 min read

L2 Deep Dive

Cloud & Platform

The Approval Boundary: What Should Humans Still Decide in Automated Delivery

Delivery automation fails not when machines make too many decisions, but when teams forget which decisions still require human judgment — how to draw and enforce the approval boundary without blocking delivery.

#automation #platform #ci-cd

Sep 14, 2021 7 min read

L2 Deep Dive

Cloud & Platform

Automation Readiness Review: Inputs, State, Permissions, Rollback, and Audit

A five-question checklist before running automation in production: are inputs bounded, is state understood, are permissions scoped, is rollback credible, and is the audit trail durable enough to reconstruct what happened.

#automation #platform #ci-cd

Aug 10, 2021 7 min read

L2 Deep Dive

Cloud & Platform

Drift Is Not a Terraform Problem. It Is an Ownership Problem

Terraform drift is not a tooling failure; it is an ownership failure. How to tell unauthorized changes made by competing systems apart from legitimate out-of-band fixes, and why reconciliation requires policy before it requires automation.

#automation #platform #ci-cd

Jun 8, 2021 7 min read

L2 Deep Dive

Cloud & Platform

Platform Engineering Starts With Golden Paths, Not Kubernetes

Platform engineering fails when teams start with Kubernetes, service mesh, and GitOps before building the paved path that makes repository creation, CI, secrets, and production deployment discoverable for every service team.

#automation #platform #ci-cd

Apr 13, 2021 6 min read

L2 Deep Dive

Cloud & Platform

Python Automation Scripts Become Products Faster Than Teams Admit

The moment a useful automation script gains dependents, it becomes an undocumented product — and most teams miss the transition until compatibility expectations, support load, and undocumented behavior have already accumulated.

#automation #platform #ci-cd

Feb 9, 2021 6 min read

L2 Deep Dive

Cloud & Platform

Terraform State Is a Production Dependency

Terraform state is not a build artifact — it is the database your infrastructure control plane reads on every plan. How to treat it with the same backup, locking, and recovery discipline as production data.

#automation #platform #ci-cd

Jan 12, 2021 7 min read

L2 Deep Dive

Cloud & Platform

Automation Fails When It Only Replaces Typing

Why automation that encodes manual steps without changing ownership, feedback, and state management produces fragile scripts rather than reliable platform capabilities.

#automation #platform #ci-cd