Cloud & Platform

Terraform in CI/CD: Plan, Review, Apply, Lock, and Rollback Boundaries

Terraform in CI/CD requires different gates than application deployments: plan review thresholds, apply lock design, environment promotion, and a rollback boundary that actually works when state diverges.

Mar 12, 2024 8 min read

L2 Deep Dive

Internal Developer Platform Reference Architecture: Catalog, IaC, CI/CD, Policy, and Observability

Reference architecture for an IDP as a control plane—connecting service catalog, IaC, CI/CD pipelines, policy enforcement, and observability feedback.

Deep Dives

L2 and L3 posts with architecture, reliability, and tradeoff detail.

Dec 16, 2025 8 min read

L2 Deep Dive

The 2026 Automation Roadmap for SRE, DevOps, and Database Teams

The 2026 automation priorities for SRE, DevOps, and database teams: what to finish, what to stop maintaining manually, and where agent workflows are actually production-ready.

Aug 12, 2025 7 min read

L2 Deep Dive

The Platform Automation Maturity Model: Scripts, Modules, Catalogs, Pipelines, Control Planes

How platform automation matures from one-off scripts to a governed control plane — and where most teams get stuck between modules and catalogs.

Jul 15, 2025 7 min read

L2 Deep Dive

Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality

How to roll back automation safely when it misfires — the four-stage playbook: disable the automation, revert the change, repair state, and reconcile system reality with declared intent.

Jun 10, 2025 7 min read

L2 Deep Dive

DB Team Automation Roadmap: Backups, Patching, Refreshes, Provisioning, and Guardrails

A sequenced roadmap for database teams to automate backups, patching, refreshes, and provisioning — with guardrails that prevent automation from becoming a risk multiplier.

May 13, 2025 8 min read

L2 Deep Dive

SRE Automation Backlog: How to Rank Toil by Risk, Frequency, and Recoverability

Ranking SRE toil by recoverability, blast radius, and frequency shows which manual failure paths deserve automation investment before the next incident.

Mar 11, 2025 7 min read

L2 Deep Dive

From Python Script to Platform Capability: Versioning, Ownership, Support, and Release Notes

A Python script becomes a platform liability when it gains organizational dependencies without versioning, an owner, or a defined support contract.

All Cloud & Platform Posts

Apr 1, 2026 2 min read

L1 Field Note

The Math Behind Database Reserved Instances: When to Wait

Why committing to 3-year database reserved instances too early locks in architectural waste.

#cloud #architecture

Mar 18, 2026 2 min read

L1 Field Note

BigQuery Cost Optimization: On-Demand vs Slot Commitments

How to stop runaway BigQuery costs by analyzing query scans, enforcing partitions, and moving to capacity-based pricing.

#cloud #architecture #checklist

Feb 11, 2026 2 min read

L1 Field Note

Database Licensing Cost Across AWS, Azure, GCP, and OCI

A framework for managing commercial database licensing costs across the four major cloud providers.

#databases #cloud #architecture

Feb 4, 2026 3 min read

L1 Field Note

Cloud Database Cost Engineering: How to Reduce Database, Data Warehouse, and Licensing Spend Across Azure, AWS, GCP, and OCI

A comprehensive framework for reining in cloud database costs, focusing on licensing, right-sizing, and architectural tradeoffs.

#databases #cloud #architecture #checklist

Dec 16, 2025 8 min read

L2 Deep Dive

The 2026 Automation Roadmap for SRE, DevOps, and Database Teams

The 2026 automation priorities for SRE, DevOps, and database teams: what to finish, what to stop maintaining manually, and where agent workflows are actually production-ready.

Aug 12, 2025 7 min read

L2 Deep Dive

The Platform Automation Maturity Model: Scripts, Modules, Catalogs, Pipelines, Control Planes

How platform automation matures from one-off scripts to a governed control plane — and where most teams get stuck between modules and catalogs.

Jul 15, 2025 7 min read

L2 Deep Dive

Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality

How to roll back automation safely when it misfires — the four-stage playbook: disable the automation, revert the change, repair state, and reconcile system reality with declared intent.

Jun 10, 2025 7 min read

L2 Deep Dive

DB Team Automation Roadmap: Backups, Patching, Refreshes, Provisioning, and Guardrails

A sequenced roadmap for database teams to automate backups, patching, refreshes, and provisioning — with guardrails that prevent automation from becoming a risk multiplier.

May 13, 2025 8 min read

L2 Deep Dive

SRE Automation Backlog: How to Rank Toil by Risk, Frequency, and Recoverability

Ranking SRE toil by recoverability, blast radius, and frequency shows which manual failure paths deserve automation investment before the next incident.

Mar 11, 2025 7 min read

L2 Deep Dive

From Python Script to Platform Capability: Versioning, Ownership, Support, and Release Notes

A Python script becomes a platform liability when it gains organizational dependencies without versioning, an owner, or a defined support contract.

Feb 11, 2025 7 min read

L2 Deep Dive

Secrets and Credentials in Python Automation: Local Dev, CI, Cloud, and Rotation

Credential handling in Python automation breaks at the boundaries between local dev, CI pipelines, and cloud execution when rotation is an afterthought.

Jan 14, 2025 7 min read

L2 Deep Dive

Building a Safe Python Migration Runner for Operational Data Changes

A Python migration runner for live operational data needs idempotency guards, dry-run modes, and rollback hooks that schema migrations skip by default.

Dec 17, 2024 7 min read

L2 Deep Dive

The Deployment Control Plane: CI/CD, Catalog, Policy, Observability, and Human Approval

CI/CD, service catalog ownership, policy gates, and SLO observability wired into a control plane that authorizes each deployment before it ships.

Dec 10, 2024 7 min read

L2 Deep Dive

Python Database Maintenance Jobs: Safety Checks, Locks, Batches, and Rollback

Python database maintenance jobs that skip lock checks, batch limits, and replication lag awareness will corrupt data or starve live queries under load.

Nov 19, 2024 7 min read

L2 Deep Dive

Progressive Delivery Reference Architecture: CI, GitOps, Flags, SLOs, and Rollback

GitOps, feature flags, and SLO-gated rollback wired into a CI pipeline that treats deploy, release, verification, and rollback as separate stages.

Nov 12, 2024 7 min read

L2 Deep Dive

Testing Python Automation: Unit Tests, Contract Tests, Fakes, and Cloud Sandboxes

Four testing layers for Python automation — unit, contract, fakes, and cloud sandboxes — targeting the API drift and retry failures that local CI misses.

Oct 15, 2024 7 min read

L2 Deep Dive

CI/CD Observability: Queue Time, Flake Rate, Lead Time, Failure Domains, and Change Risk

Queue time, flake rate, lead time, failure domains, and change risk as CI/CD signals that reveal whether a delivery system is becoming safer or just busier.

Oct 8, 2024 7 min read

L2 Deep Dive

Python Package Layout for Internal Automation Modules

Filesystem layout, entry points, and dependency isolation when Python automation crosses from script origins to production-critical shared infrastructure.

Sep 27, 2024 9 min read

L3 Reference Guide

AWS vs Azure vs GCP vs OCI for Database-Backed Systems: Decision Framework

How to choose between AWS, Azure, GCP, and OCI for database-backed systems by matching managed database failure behavior to your system's dominant recovery requirement.

#architecture #cloud #databases

Sep 17, 2024 6 min read

L2 Deep Dive

Argo CD Deployment Workflow: Sync Waves, Health Checks, Rollbacks, and Drift

Argo CD sync waves, health check gates, rollback triggers, and drift detection — the four mechanisms that separate GitOps deployments from applied YAML.

Sep 10, 2024 8 min read

L2 Deep Dive

Structured Logging for Automation: The Debug Trail You Need at 2 AM

JSON schemas, correlation IDs, and log-level policies that make automation failures forensically legible before the on-call page arrives at 2 AM.

Aug 20, 2024 7 min read

L2 Deep Dive

GitHub Actions for Platform Teams: Reusable Workflows, OIDC, Environments, and Audit

GitHub Actions reusable workflows, OIDC credential federation, and environment approval gates — preventing per-repo credential sprawl across a platform.

Aug 13, 2024 7 min read

L2 Deep Dive

SDK Wrappers: How to Hide Cloud Provider Mess Without Hiding Risk

Cloud SDK wrapper design: how to abstract provider credential and retry complexity without obscuring blast radius or making dangerous operations look safe.

Jul 9, 2024 7 min read

L2 Deep Dive

Python CLIs for Ops Teams: Arguments, Config, Dry Run, and Exit Codes

Python CLI design for ops scripts: argument parsing, config layering, dry-run modes, and exit codes that make automation safe to run in production.

Jun 18, 2024 7 min read

L2 Deep Dive

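Terraform in CI/CD: Plan, Review, Apply, Lock, and Rollback Boundaries

Terraform in CI/CD requires different gates than application deployments: plan review thresholds, apply lock design, environment promotion, and a rollback boundary that actually works when state diverges.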

Jun 11, 2024 7 min read

L2 Deep Dive

Idempotent Python Jobs: The Difference Between Retry and Duplicate Damage

Python jobs without idempotency guards turn retries into duplicate database writes or double charges — the design patterns that make re-execution safe.

Jun 10, 2024 5 min read

L3 Reference Guide

pgcrypto vs KMS vs HSM: Decision Framework

Engineers often over-rotate to Hardware Security Modules (HSMs) for non-regulatory workloads or under-rotate to database extensions. How to map data classification to the right cryptographic tier.

#architecture #cloud #security

May 21, 2024 7 min read

L2 Deep Dive

Feature Flags vs Deployments: Separating Release From Risk

Feature flags separate the deploy event from the release decision, letting you control which users absorb new behavior without reverting a deployment.

May 14, 2024 7 min read

L2 Deep Dive

Python Automation Needs an API Contract, Not a Folder of Scripts

Python automation without an explicit API contract gives callers no compatibility guarantees, no error contract, and no safe path to evolve behavior.

Apr 16, 2024 7 min read

L2 Deep Dive

Pipeline Secrets: Why CI Is Often Your Weakest Production Boundary

CI pipelines carry production credentials with less access modeling than the services they deploy, making build pipelines a common source of credential exposure.

Apr 9, 2024 7 min read

L2 Deep Dive

Why Service Catalogs Fail: Adoption, Trust, Freshness, and Platform Team Incentives

Service catalogs fail when treated as static registries instead of operational systems that enforce ownership and freshness continuously.

Mar 19, 2024 7 min read

L2 Deep Dive

Environment Promotion: Why Dev, Stage, and Prod Drift Apart

Dev-stage-prod drift accumulates when promotion workflows lack enforcement: config, secrets, and infrastructure each follow independent mutation paths.

Mar 12, 2024 8 min read

L2 Deep Dive

Internal Developer Platform Reference Architecture: Catalog, IaC, CI/CD, Policy, and Observability

Reference architecture for an IDP as a control plane—connecting service catalog, IaC, CI/CD pipelines, policy enforcement, and observability feedback.

Feb 20, 2024 6 min read

L2 Deep Dive

GitOps Is Reconciliation, Not Just YAML in Git

GitOps breaks when the control loop is never implemented—treating YAML-in-Git as the destination instead of the reconciliation loop as the product.

Feb 13, 2024 7 min read

L2 Deep Dive

Service Catalog Incident Workflow: Find Owner, Blast Radius, Dependencies, and Last Change

Service catalog fields for owner, dependency graph, blast radius, and last deploy that cut incident triage time before Slack threads spiral.

Jan 23, 2024 8 min read

L2 Deep Dive

CI/CD Pipeline Design: Fast Feedback vs Safe Promotion

Structuring CI/CD pipelines so unit tests give fast feedback without sacrificing the promotion gates that prevent bad builds from reaching production.

Jan 16, 2024 7 min read

L2 Deep Dive

Checkout Failure Triage: Payment, Inventory, Order Write, or Downstream Event

Triage checklist for isolating checkout failures across payment gateway, inventory reservation, order write, and event propagation boundaries.

Jan 9, 2024 7 min read

L2 Deep Dive

Catalog-to-CI Integration: Ownership, Deployment History, SLOs, and Change Risk

Linking a service catalog to CI gates enables change risk scoring from ownership, SLO status, and deployment history — beyond pipeline pass/fail alone.

Dec 17, 2023 7 min read

L2 Deep Dive

Event Sourcing for Orders: Useful Pattern or Audit Log Theater

Event sourcing on an order service is justified when you need point-in-time state reconstruction, not just an append-only audit trail that nobody queries.

Dec 12, 2023 7 min read

L2 Deep Dive

Platform Scorecard Rollout: Standards Without Turning the Catalog Into Shelfware

Rolling out a platform scorecard without tying it to CI gates and team OKRs turns engineering standards into documentation that nobody reads.

Nov 17, 2023 7 min read

L2 Deep Dive

Payment Idempotency: How to Avoid Double Charges and Missing Orders

Payment idempotency keys and atomic state transitions prevent the double-charge failure where a transaction succeeds while surrounding systems log failure.

Nov 14, 2023 7 min read

L2 Deep Dive

Service Lifecycle Workflow: Create, Promote, Deprecate, Archive, Delete

Service lifecycle management — from creation through deprecation and safe deletion — requires a control system beyond the deployment pipeline.

Oct 18, 2023 8 min read

L2 Deep Dive

Inventory Reservation: Why Simple Counters Fail Under Promotions

Under promotion load, inventory counters fail not from arithmetic errors but from the gap between read-check-decrement cycles and promises already made.

Oct 17, 2023 7 min read

L2 Deep Dive

The Terraform Platform Operating Model: Modules, Catalogs, CI, Policy, and Support

Terraform platform failures trace to operating model drift — how modules, catalogs, CI gates, and policy enforcement should be owned at the platform layer.

#cloud #architecture #failures

Oct 10, 2023 7 min read

L2 Deep Dive

Self-Service Database Provisioning: Catalog Request, Terraform Module, Policy, and Audit

Database provisioning via catalog request and Terraform module: the policy and audit gates that make self-service trustworthy to security and operations.

Sep 19, 2023 7 min read

L2 Deep Dive

OpenTofu vs Terraform: What Platform Teams Should Actually Evaluate

OpenTofu vs. Terraform on licensing risk, provider supply chain compatibility, state safety, and the migration cost platform teams actually absorb.

Sep 12, 2023 7 min read

L2 Deep Dive

Service Catalog Data Model: Services, Systems, Resources, Owners, and Dependencies

How services, systems, resources, owners, and dependency edges compose into a service catalog schema that supports incident response and delivery tracing.

Aug 8, 2023 9 min read

L2 Deep Dive

Backstage, Port, Cortex, and AWS Service Catalog: Different Tools, Different Control Planes

Backstage, Port, Cortex, and AWS Service Catalog compared on control-plane model — which tools provision, which only display, and where each abstraction breaks down.

Jul 11, 2023 7 min read

L2 Deep Dive

Ownership Metadata: The Small Catalog Field That Fixes Incidents

Ownership fields in the service catalog make the responsible team discoverable at alert time — the missing link that shortens incident duration.

Jun 13, 2023 6 min read

L2 Deep Dive

Software Templates: Where Developer Portals Become Delivery Systems

Developer portal templates become a delivery system when they enforce scaffolding, CI wiring, and ownership at service creation — not documentation after.

May 9, 2023 7 min read

L2 Deep Dive

Scorecards: Turning Platform Standards Into Visible Engineering Debt

Scorecards turn platform standards into per-service debt that owners can see, dispute, and retire — the mechanism that makes wiki-page rules enforceable.

May 6, 2023 6 min read

L2 Deep Dive

GCP Database Cost Review: Cloud SQL, Spanner, Bigtable, Memorystore, and BigQuery

Cloud SQL, Spanner, Bigtable, Memorystore, and BigQuery each bill differently — cost overruns trace to applying the wrong model to the wrong workload.

#architecture #system-design #cloud

Apr 11, 2023 7 min read

L2 Deep Dive

Golden Paths: The Platform Contract Behind Self-Service Engineering

Golden paths work when the platform publishes a contract — opinionated defaults, SLO guarantees, and upgrade boundaries — not just a curated toolbox.

Apr 6, 2023 7 min read

L2 Deep Dive

GCP E-Commerce Inventory Architecture: Spanner, Pub/Sub, Dataflow, and BigQuery

Spanner prevents inventory oversells under concurrent checkouts; Pub/Sub and Dataflow push stock events to BigQuery without blocking reservation writes.

#architecture #databases #cloud

Mar 14, 2023 7 min read

L2 Deep Dive

What Belongs in a Service Catalog and What Does Not

Service catalogs work when they enforce ownership, runbooks, and deploy targets — not when they duplicate documentation already in code or wikis.

Feb 20, 2023 7 min read

L2 Deep Dive

GCP Reference Architecture: Cloud Run, Load Balancing, Cloud SQL, Memorystore, and Pub/Sub

Cloud Run autoscales compute, but Cloud SQL connection limits, Memorystore eviction, and Pub/Sub backpressure are where capacity planning actually lives.

#architecture #cloud #databases

Feb 14, 2023 7 min read

L2 Deep Dive

Multi-Account Terraform Architecture: State, IAM, Network, and Promotion Boundaries

Multi-account Terraform design: isolating state, IAM, and network boundaries per environment so a single misconfiguration cannot cross promotion gates.

Jan 10, 2023 7 min read

L2 Deep Dive

Terraform for Kubernetes Operators: Installing the Platform Without Owning Every App

Terraform boundary design for Kubernetes operators separates control-plane installation from application delivery to prevent ownership and state conflicts.

Jan 6, 2023 7 min read

L2 Deep Dive

Azure Landing Zone for Data Systems: Identity, Network, Key Vault, and Policy

Azure landing zone for data systems: the identity, network, Key Vault, and Policy decisions that prevent post-deployment security failures.

Dec 22, 2022 8 min read

L2 Deep Dive

Azure E-Commerce Order Pipeline: Service Bus, Functions, SQL, and Cosmos DB

Azure checkout fails when order acceptance, payment, inventory reservation, and fulfillment are treated as one clean transaction — how Service Bus, Functions, Azure SQL, and Cosmos DB handle the recoverable steps that follow commitment.

#architecture #system-design #cloud

Dec 13, 2022 7 min read

L2 Deep Dive

Terraform for RDS and Aurora: What Should Be Automated and What Should Stay Manual

Database automation should encode the repetitive safety controls and leave judgment-heavy decisions to humans — what to automate in RDS and Aurora Terraform modules and what must stay gated on human review.

Dec 7, 2022 7 min read

L2 Deep Dive

Azure Service Bus vs Event Hubs: Commands, Events, and Replay

Azure Service Bus and Event Hubs solve different problems — commands vs events, ordered queues vs partitioned streams, at-most-once delivery vs replay — and teams that choose the wrong one rebuild the integration under load.

Nov 8, 2022 7 min read

L2 Deep Dive

Testing Terraform Modules: Static Checks, Plan Tests, Local Emulators, and Sandboxes

Terraform modules fail because tests are placed at the wrong layer: too late to be cheap, too mocked to be truthful — how to combine static analysis, plan-level assertions, and sandbox environments for reliable module testing.

Oct 11, 2022 7 min read

L2 Deep Dive

Policy as Code for Terraform: OPA, Sentinel, Checkov, and Human Review

Terraform review fails when humans rediscover the same constraints in every PR — how OPA, Sentinel, and Checkov encode policy gates that catch public storage buckets, unencrypted databases, and missing tags at plan time.

Sep 13, 2022 8 min read

L2 Deep Dive

Terraform State Surgery: When to Move, Split, or Repair State

Terraform state surgery is a production change to the control plane that decides what infrastructure exists — when to move, split, import, or repair state, and how to do it without triggering unintended replacements.

#cloud #architecture #failures

Aug 9, 2022 6 min read

L2 Deep Dive

Terraform Import Workflow: Bringing Existing Cloud Resources Under Control

Terraform import's dangerous moment is not the command — it is when a team mistakes 'now in state' for 'now under control.' A safe import workflow covering targeted plans, drift checks, and state file validation before any apply.

Jul 12, 2022 8 min read

L2 Deep Dive

Terraform Drift Triage Workflow: Detect, Classify, Reconcile, Prevent

Terraform drift is a control-plane integrity problem — how to detect it, classify whether it is an emergency or acceptable deviation, reconcile state safely, and prevent future splits without blocking legitimate out-of-band changes.

Jul 10, 2022 8 min read

L2 Deep Dive

AWS Reference Architecture: ALB, ECS, RDS, ElastiCache, and SQS

The standard AWS web-tier stack works until the first dependency slows down, the cache goes cold, or a queue starts redriving poison messages — the failure modes hidden inside the ALB, ECS, RDS, ElastiCache, and SQS reference architecture.

Jun 14, 2022 7 min read

L2 Deep Dive

Terraform Module Design Checklist for Database Infrastructure

Database Terraform modules fail when they hide operational decisions behind convenient defaults — a checklist covering parameter groups, backup policies, encryption, and the boundaries that must never be automated away.

May 10, 2022 7 min read

L2 Deep Dive

Remote State, Locks, and Backends: The Hidden Database Behind IaC

Infrastructure as Code becomes operationally safe only when the state store has concurrency control, durability, auditability, and documented recovery procedures — treating Terraform backends as production databases, not build artifacts.

Apr 12, 2022 7 min read

L2 Deep Dive

Variables, Locals, and Outputs: The API Surface of Infrastructure Modules

Infrastructure modules fail as software interfaces before they fail as infrastructure — how Terraform variables, locals, and outputs define the API surface that determines whether a module is reusable or a maintenance burden.

Mar 8, 2022 7 min read

L2 Deep Dive

Terraform Plan Review: What Senior Engineers Look For

Terraform plan review is not a syntax check — it is the last cheap place to catch a production architecture mistake before an API turns intent into infrastructure. What senior engineers actually look for in a plan output.

Feb 8, 2022 6 min read

L2 Deep Dive

Terraform Workspaces vs Separate State: The Environment Isolation Decision

Most Terraform environment failures come from placing the wrong isolation boundary around state, credentials, approvals, and blast radius — when to use workspaces and when separate state files with separate backends is the correct choice.

Jan 11, 2022 7 min read

L2 Deep Dive

Terraform Modules: Reuse Boundary or Organizational Trap

The first Terraform module removes duplication. The fiftieth reveals the real architecture: who owns infrastructure decisions, who absorbs breaking changes, and whether the platform is a product or a shared pile of HCL.

Dec 14, 2021 7 min read

L2 Deep Dive

Automation Incident Review: When the Tool Worked and the System Failed

The hardest automation incidents are not broken tools — they happen when every tool executes exactly as asked while the surrounding system loses the ability to evaluate whether that action is still safe.

Nov 9, 2021 8 min read

L2 Deep Dive

Runbook to Pipeline: How to Convert Manual Operations Without Creating Risk

Converting a runbook into an automated pipeline is not a transcription exercise — a human operator can stop at bad preconditions, and a pipeline must explicitly encode every check that was previously implicit in that judgment.

Oct 12, 2021 7 min read

L2 Deep Dive

The Approval Boundary: What Should Humans Still Decide in Automated Delivery

Delivery automation fails not when machines make too many decisions, but when teams forget which decisions still require human judgment — how to draw and enforce the approval boundary without blocking delivery.

Sep 14, 2021 7 min read

L2 Deep Dive

Automation Readiness Review: Inputs, State, Permissions, Rollback, and Audit

A five-question checklist before running automation in production: are inputs bounded, is state understood, are permissions scoped, is rollback credible, and is the audit trail durable enough to reconstruct what happened.

Aug 10, 2021 7 min read

L2 Deep Dive

Drift Is Not a Terraform Problem. It Is an Ownership Problem

Terraform drift is not a tooling failure — it is an ownership failure. How to distinguish unauthorized changes from competing systems from legitimate out-of-band fixes, and why reconciliation requires policy before it requires automation.

Jul 13, 2021 7 min read

L2 Deep Dive

Why Self-Service Infrastructure Still Needs Guardrails

Self-service infrastructure fails when the platform distributes provisioning power without distributing policy, rollback paths, and cost controls — turning every service team into a production risk vector.

Jun 8, 2021 7 min read

L2 Deep Dive

Platform Engineering Starts With Golden Paths, Not Kubernetes

Platform engineering fails when teams start with Kubernetes, service mesh, and GitOps before building the paved path that makes repository creation, CI, secrets, and production deployment discoverable for every service team.

May 11, 2021 7 min read

L2 Deep Dive

CI/CD Pipelines Are Distributed Systems With Bad Observability

CI/CD pipelines fail as distributed coordination systems long before they fail as broken scripts — why build badges hide partial failures, flaky retries, and ordering gaps that only appear under real delivery load.

Apr 13, 2021 6 min read

L2 Deep Dive

Python Automation Scripts Become Products Faster Than Teams Admit

The moment a useful automation script gains dependents, it becomes an undocumented product — and most teams miss the transition until compatibility expectations, support load, and undocumented behavior have already accumulated.

Mar 9, 2021 7 min read

L2 Deep Dive

Service Catalogs Are Not Portals. They Are Control Planes

A service catalog that helps engineers find links is a directory. One that owns metadata, policy, workflow, and reconciliation is a platform control plane — and only the second one solves the real scaling problem.

#architecture #cloud

Feb 9, 2021 6 min read

L2 Deep Dive

Terraform State Is a Production Dependency

Terraform state is not a build artifact — it is the database your infrastructure control plane reads on every plan. How to treat it with the same backup, locking, and recovery discipline as production data.

Jan 12, 2021 7 min read

L2 Deep Dive

Automation Fails When It Only Replaces Typing

Why automation that encodes manual steps without changing ownership, feedback, and state management produces fragile scripts rather than reliable platform capabilities.