Telemetry Cost Control: Why Observability Data Itself Needs Governance

There is a terrifying inflection point in platform engineering where it becomes more expensive to monitor a database than it is to actually run the database.

Situation

As engineering teams scale, the default mandate is often “log everything.” Developers add INFO level logs for every incoming request, database engineers enable query auditing to track every SQL statement, and APM tools capture 100% of request traces. In a SaaS observability platform, pricing is usually driven by ingest volume and metric cardinality.

When a database handles 10,000 transactions per second, generating a 2 KB log for every transaction results in roughly 1.7 terabytes of log data per day, or about 52 TB over a 30-day month. By the end of the month, the team can be facing a six-figure invoice for log storage and metric ingestion. Telemetry, originally designed to protect the system, becomes a financial liability that requires its own governance, architecture, and optimization strategy.
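
The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch using the numbers from the text; the 30-day month and decimal units (1 KB = 1,000 bytes) are assumptions made for the estimate.

```python
# Back-of-the-envelope log volume, using the figures quoted in the text.
TPS = 10_000                 # transactions per second
LOG_BYTES_PER_TXN = 2_000    # 2 KB per log line (assumption: decimal kilobytes)
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30          # assumption: 30-day billing month

bytes_per_day = TPS * LOG_BYTES_PER_TXN * SECONDS_PER_DAY
tb_per_day = bytes_per_day / 1e12
tb_per_month = tb_per_day * DAYS_PER_MONTH

print(f"{tb_per_day:.2f} TB/day, {tb_per_month:.1f} TB/month")
```

Multiplying the monthly figure by your vendor's per-GB ingest and retention rates gives the bill you are trying to avoid.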

Symptoms

An ungoverned observability pipeline exhibits several clear financial and operational symptoms:

The Cardinality Explosion: A developer adds a user_id tag to a Datadog metric to track latency per user. Because every distinct tag value creates a separate time series, a single metric suddenly generates 500,000 unique time series, resulting in thousands of dollars in overage charges.
The Needle in the Haystack: During an incident, engineers cannot find the relevant ERROR log because it is buried under 40 million INFO and DEBUG logs generated in the same five-minute window.
The Trace Hoard: The APM system is storing 100% of traces for a high-throughput /healthcheck endpoint that never fails, wasting massive amounts of expensive hot storage.
The Retention Tax: Teams store raw, un-aggregated database audit logs in hot, searchable indexes for 13 months “just for compliance,” ignoring cheaper cold storage options.
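
To see why a single tag can cause the cardinality explosion described above, it helps to treat the series count as a product of distinct tag values. The sketch below is illustrative only: `series_count` is a hypothetical helper, the `region` and `status_class` counts are made up, and the result is a worst-case upper bound (real platforms bill on the combinations actually observed).

```python
from math import prod

def series_count(tag_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())

# Hypothetical bounded tags on a latency metric.
bounded = {"region": 4, "status_class": 5}

# The same metric after someone adds user_id (500,000 users, as in the text).
with_user_id = {**bounded, "user_id": 500_000}

print(series_count(bounded))        # small, predictable
print(series_count(with_user_id))   # multiplied by the number of users
```

The bounded tags stay cheap because their value sets are fixed; the unbounded tag multiplies everything it touches.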

First Five Checks

To regain control of your telemetry pipeline, you must audit the flow of data from your infrastructure to your observability vendor. Start with these five checks:

Audit Metric Cardinality: Query your metric platform’s internal usage statistics. Identify any custom metric tagged with an unbounded dimension, such as user_id, session_id, or query_hash. Unbounded tags must be removed or moved to logs/traces.
Check APM Trace Sampling Rates: Review your tracing configuration. If your head-based sampler is set to keep 100% of traces, you are paying to store every routine request. Most high-throughput systems only need to sample 1-5% of successful requests to produce stable latency percentiles.
Analyze Log Ingestion Volume by Service: Determine which service (or database) is producing the most log volume. Often, a single misconfigured service stuck in DEBUG mode drives 60% of the entire log bill.
Review Index Retention Rules: Check how long logs are kept in “hot” (instantly searchable) storage. Operational logs rarely need to be searched after 14 days.
Examine Noisy Log Patterns: Use your log aggregator’s pattern-finding tool. If 40% of your logs are identical "Successfully connected to DB" messages, that pattern should be dropped at the agent level before it crosses the network.
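
If your aggregator has no pattern-finding tool, the last check can be approximated offline on a sample of log lines. This is a rough sketch, not a vendor feature: `to_pattern` and `pattern_shares` are hypothetical helpers, and masking only digits is a simplifying assumption (real pattern finders also mask IDs, IPs, and timestamps).

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")

def to_pattern(line: str) -> str:
    """Collapse variable parts of a log line so identical messages group together."""
    return _NUMBER.sub("<n>", line.strip())

def pattern_shares(lines: list[str]) -> list[tuple[str, float]]:
    """Return (pattern, share of total lines), most frequent first."""
    counts = Counter(to_pattern(line) for line in lines)
    total = sum(counts.values())
    return [(pattern, count / total) for pattern, count in counts.most_common()]

# Made-up sample for illustration.
sample = (
    ["Successfully connected to DB"] * 4
    + ["query took 12 ms", "query took 340 ms", "query took 7 ms"]
    + ["user 17 logged in", "user 42 logged in"]
    + ["ERROR timeout after 30 s"]
)

for pattern, share in pattern_shares(sample):
    print(f"{share:5.0%}  {pattern}")
```

Any pattern that dominates the output and carries no diagnostic value is a candidate for an agent-level exclusion rule.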

Decision Tree

When implementing telemetry governance, use this flow to determine how to route and store observational data.

flowchart TD
    A[Telemetry Data Generated] --> B{Is it a Metric, Log, or Trace?}
    B -->|Metric| C{Does it have unbounded tags?}
    C -->|Yes| C1[Reject Metric at Agent]
    C -->|No| C2[Ingest to TSDB]
    
    B -->|Log| D{Is it INFO/DEBUG?}
    D -->|Yes| D1[Drop at Agent or Route to Cold Storage S3]
    D -->|No| D2[Ingest ERROR/WARN to Hot Index]
    
    B -->|Trace| E{Did the request fail or violate SLO?}
    E -->|Yes| E1[Keep 100% of Trace]
    E -->|No| E2[Sample at 1% for Baseline]
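
The same flow can be expressed as a small routing function, which is useful for unit-testing governance rules before encoding them in agent configuration. This is a sketch under assumptions: the `Telemetry` class and the action names are hypothetical, and the unbounded tag list is taken from the examples earlier in this post.

```python
from dataclasses import dataclass, field

UNBOUNDED_TAGS = {"user_id", "session_id", "query_hash"}
LOW_VALUE_LEVELS = {"INFO", "DEBUG"}

@dataclass
class Telemetry:
    kind: str                                   # "metric", "log", or "trace"
    tags: set[str] = field(default_factory=set)
    level: str = ""                             # log level, for logs only
    failed: bool = False                        # for traces only
    violated_slo: bool = False                  # for traces only

def route(item: Telemetry) -> str:
    """Mirror the decision tree: return the action for one telemetry item."""
    if item.kind == "metric":
        return "reject_at_agent" if item.tags & UNBOUNDED_TAGS else "ingest_to_tsdb"
    if item.kind == "log":
        if item.level.upper() in LOW_VALUE_LEVELS:
            return "drop_or_cold_storage"
        return "hot_index"
    if item.kind == "trace":
        if item.failed or item.violated_slo:
            return "keep_100_percent"
        return "sample_1_percent"
    raise ValueError(f"unknown telemetry kind: {item.kind!r}")

print(route(Telemetry(kind="metric", tags={"region", "user_id"})))
print(route(Telemetry(kind="log", level="error")))
print(route(Telemetry(kind="trace", violated_slo=True)))
```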

Remediation Options

Tail-Based Trace Sampling (High Impact, High Effort): Unlike head-based sampling, which decides at the start of a request (for example, keeping a random 1%) before the outcome is known, tail-based sampling analyzes the completed trace. It discards normal, fast requests but keeps 100% of traces that contain errors or violate latency SLOs.
- Tradeoff: Requires deploying collector infrastructure (like OpenTelemetry Collectors) to buffer traces in memory while waiting for the request to finish before making the keep/drop decision.
Log Exclusion Rules (Fast, High Reward): Configure your observability agent (e.g., Fluent Bit, Vector, Datadog Agent) to silently drop useless log patterns before they leave the host.
- Tradeoff: If an engineer needs those dropped logs for local debugging, they will have to SSH into the box or temporarily disable the exclusion rule.
Tiered Storage Routing (Medium Effort, High Value): Route compliance data (like database audit logs) directly to an S3 bucket (Cold Storage) where it costs pennies, and only route actionable operational logs to your expensive SaaS indexing platform (Hot Storage).
- Tradeoff: Searching cold storage requires rehydration or using tools like Amazon Athena, which is slower than querying a hot Elasticsearch cluster.
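
To make the buffering tradeoff of tail-based sampling concrete, here is a toy in-memory sampler. It is a sketch, not the OpenTelemetry Collector's implementation: the `Span` and `TailSampler` classes are hypothetical, and it assumes the root span arrives last and marks the trace as complete (real collectors typically wait for a timeout instead).

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False
    is_root: bool = False   # assumption: the root span is the last to arrive

class TailSampler:
    def __init__(self, slo_ms: float, baseline_rate: float = 0.01, rng=None):
        self.slo_ms = slo_ms
        self.baseline_rate = baseline_rate
        self.rng = rng or random.Random()
        self._buffer = defaultdict(list)   # trace_id -> spans held in memory

    def add(self, span: Span):
        """Buffer a span. Returns the kept trace once complete, otherwise None."""
        self._buffer[span.trace_id].append(span)
        if not span.is_root:
            return None                    # trace still in flight, keep buffering
        spans = self._buffer.pop(span.trace_id)
        has_error = any(s.is_error for s in spans)
        too_slow = span.duration_ms > self.slo_ms
        if has_error or too_slow or self.rng.random() < self.baseline_rate:
            return spans                   # interesting, or part of the baseline sample
        return None                        # normal, fast request: discard

sampler = TailSampler(slo_ms=500, baseline_rate=0.0)
sampler.add(Span("t1", 20, is_error=True))
print(sampler.add(Span("t1", 120, is_root=True)) is not None)  # error trace
print(sampler.add(Span("t2", 80, is_root=True)) is not None)   # fast, healthy trace
```

The `_buffer` dictionary is the cost named in the tradeoff: every in-flight trace sits in collector memory until the keep/drop decision can be made.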

Rollback Plan

If aggressive log filtering leaves an engineer unable to debug a critical issue because the necessary logs were dropped, roll back immediately: disable the agent-level exclusion rule through configuration management (Terraform/Ansible) and restart the telemetry agents. Logs dropped at the agent cannot be recovered afterwards, so rather than deleting noisy logs outright, temporarily route the full firehose to S3 where it can be queried asynchronously if needed.

Automation Opportunity

Deploy an OpenTelemetry Collector pipeline that acts as a central data governor. Automate the configuration so that whenever the system detects an anomalous spike in total log volume (e.g., a developer accidentally left TRACE logging on), the Collector automatically throttles ingestion from that specific service, protecting the overall observability budget.
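
The throttling logic itself is simple enough to prototype before committing to a Collector configuration. The sketch below is a hypothetical stand-in, not an OpenTelemetry API: `VolumeGovernor`, its per-window byte baselines, and the `spike_factor` multiplier are all assumptions chosen for illustration.

```python
from collections import defaultdict

class VolumeGovernor:
    """Per-service byte budget per time window; logs beyond the budget are rejected."""

    def __init__(self, baseline_bytes: dict[str, int],
                 spike_factor: float = 3.0, default_baseline: int = 1_000_000):
        self.baseline_bytes = baseline_bytes      # normal bytes per window, per service
        self.spike_factor = spike_factor          # how far above normal counts as a spike
        self.default_baseline = default_baseline  # used for services with no baseline
        self._used = defaultdict(int)

    def allow(self, service: str, size_bytes: int) -> bool:
        """Return True if this log fits within the service's budget for the window."""
        baseline = self.baseline_bytes.get(service, self.default_baseline)
        limit = baseline * self.spike_factor
        if self._used[service] + size_bytes > limit:
            return False                          # throttle only the noisy service
        self._used[service] += size_bytes
        return True

    def reset_window(self) -> None:
        """Call at the start of each window (for example, once a minute)."""
        self._used.clear()

governor = VolumeGovernor({"checkout": 1_000}, spike_factor=2.0)
print(governor.allow("checkout", 1_500))  # within 2x baseline
print(governor.allow("checkout", 600))    # would exceed the budget
print(governor.allow("billing", 600))     # other services are unaffected
```

Keeping the budget per service is the design choice that matters: one service stuck in TRACE mode gets throttled without starving telemetry from healthy services.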

Leadership Summary

Not All Data is Useful: The value of observational data decays quickly. A log message from 5 minutes ago is critical for triage; a log message from 5 months ago is mostly noise unless retention is mandated by compliance.
Move Intelligence to the Edge: Do not send all raw data to the cloud and filter it there (you still pay for ingestion). Use intelligent agents to drop noise and aggregate metrics at the host level.
Cost Allocation Forces Good Behavior: The fastest way to reduce an inflated observability bill is to show the bill directly to the engineering team generating the logs.

What to Do Next

Problem: “Log everything” becomes financially untenable at scale — a database processing 10,000 TPS generating a 2KB log per transaction produces 1.7 TB of log data per day, making the observability bill a larger line item than the database infrastructure it monitors.
Solution: Insert an OpenTelemetry Collector or Fluent Bit pipeline between your databases and your SaaS vendor to own the filtering rules: drop INFO/DEBUG logs at the agent, apply tail-based trace sampling, and route compliance data to S3 cold storage instead of hot indexes.
Proof: Query your metric platform’s internal cardinality report — any single metric family consuming more than 10% of total custom metric series is a cardinality explosion in progress and the fastest path to an unexpected billing overage.
Action: Identify your most voluminous, useless log pattern using your aggregator’s pattern-finder, write an agent-level exclusion rule to drop it before it crosses the network, and calculate the projected monthly savings — this is the fastest ROI of any observability optimization.
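
The 10% rule from the Proof step is easy to apply to an exported cardinality report. This is a minimal sketch: `dominant_metric_families` is a hypothetical helper, and the metric names and series counts are invented sample data, not output from any particular vendor.

```python
def dominant_metric_families(series_by_family: dict[str, int],
                             threshold: float = 0.10) -> list[tuple[str, float]]:
    """Return (family, share) for families above the threshold, largest first."""
    total = sum(series_by_family.values())
    if total == 0:
        return []
    flagged = [
        (name, count / total)
        for name, count in series_by_family.items()
        if count / total > threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Invented example: custom metric series counts per metric family.
report = {
    "http.request.latency": 500_000,
    "queue.depth": 70_000,
    "db.connections": 30_000,
}

for name, share in dominant_metric_families(report):
    print(f"{name}: {share:.0%} of custom metric series")
```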

Rajiv
