PostgreSQL Connection Storm Runbook

“Sorry, too many clients already” means PostgreSQL has rejected a connection before your application could run a single query. Every connection to PostgreSQL is a forked OS process consuming memory — typically 5–10 MB of RAM per connection — so max_connections is a hard ceiling that cannot be stretched without consequences. Once you hit it, the failure mode is not graceful degradation; it is hard rejection of new connections until existing ones close.

Situation

PostgreSQL’s process-per-connection architecture dates to a period when connection counts were measured in dozens, not thousands. Each connection forks a backend process, inherits a memory allocation, and holds that allocation for the duration of the connection regardless of whether a query is running. At 200 connections, this overhead is manageable. At 1,000 connections, PostgreSQL is spending more memory serving idle backends than it is serving active queries.

The default max_connections = 100 reflects this constraint — it is not a conservative setting that exists to be raised. The PostgreSQL documentation explicitly notes that increasing max_connections requires increasing shared_buffers proportionally, and that the memory overhead of idle connections is real and measurable.

The Problem

Connection storms occur in three patterns: application connection leaks (connections opened and never closed), pool exhaustion from too many services competing for the same pool, and deployments that spin up new application instances without shutting down old ones cleanly. The idle in transaction state is particularly damaging because those connections are holding transactions open, which blocks vacuum and prevents transaction ID advancement.

Without a centralized connection multiplexer, every new microservice or horizontal pod autoscaling event directly multiplies the active TCP connections to the database host. Eventually, the database runs out of available connection slots or OS memory, triggering catastrophic connection rejection. How do you scale application instances without proportionally scaling database connection overhead?

Core Concept

The structural solution is to decouple application connection counts from PostgreSQL process counts using connection pooling, specifically PgBouncer in transaction mode, while implementing aggressive server-side transaction timeouts to prevent zombie state accumulation.

Symptoms

Signal	Where to see it	What it means
Application errors: “sorry, too many clients already”	Application logs	`max_connections` ceiling hit — no new connections possible
`count(*)` near `max_connections` value	`pg_stat_activity`	Connection headroom nearly exhausted
High count of `idle in transaction` state	`pg_stat_activity`	Connections holding open transactions, blocking vacuum
One client IP with > 50 connections	`pg_stat_activity` grouped by `client_addr`	Connection leak on a specific application server
No PgBouncer or pgpool in the stack	Infrastructure review	Direct connection architecture that cannot scale safely
Memory pressure on the PostgreSQL host	OS metrics	Each idle connection consuming 5–10 MB RAM

First Five Checks

Count connections by state — get the distribution of active, idle, and idle-in-transaction connections:

SELECT
  state,
  count(*) AS connection_count,
  max(now() - state_change) AS oldest_in_state
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY state
ORDER BY connection_count DESC;

High idle counts mean connections are staying open without doing work — a pooling problem. High idle in transaction counts mean applications are opening transactions and not committing or rolling back — a connection leak or long-running operation pattern.

Check the connection ceiling — confirm max_connections and how close you are:

SHOW max_connections;

SELECT count(*) AS total_connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
       count(*) * 100 / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS pct_used
FROM pg_stat_activity;

Anything above 80% of max_connections is operational risk. At 90%, connection failures are likely during traffic spikes. PostgreSQL reserves a small number of connections for superusers via superuser_reserved_connections (default 3), so regular users lose access before the absolute ceiling.

Count idle-in-transaction connections — these are the most damaging:

SELECT
  count(*) AS idle_in_txn_count,
  max(now() - xact_start) AS oldest_open_txn
FROM pg_stat_activity
WHERE state = 'idle in transaction';

Any oldest_open_txn value above 5 minutes should be treated as an incident. These connections are holding their transaction’s snapshot, preventing vacuum from advancing the horizon, and consuming a process slot doing nothing.

Connection distribution by client address — identify connection hogs:

SELECT
  client_addr,
  usename,
  count(*) AS connections,
  sum(CASE WHEN state = 'idle in transaction' THEN 1 ELSE 0 END) AS idle_in_txn
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY client_addr, usename
ORDER BY connections DESC
LIMIT 10;

A single application server holding 80 connections to PostgreSQL while a second server holds 2 is a strong signal of either a connection leak or misconfigured pool sizing on the first server.

Check for a connection pooler — if there is no PgBouncer or pgpool in front of PostgreSQL, that is the fix:

# Check whether PgBouncer is running on the standard port
nc -z localhost 6432 && echo "PgBouncer present" || echo "No pooler on 6432"

# Or check from the PostgreSQL side — poolers identify themselves
SELECT client_addr, application_name, count(*)
FROM pg_stat_activity
WHERE application_name ILIKE '%pgbouncer%'
   OR application_name ILIKE '%pgpool%'
GROUP BY client_addr, application_name;

If no pooler is present and connection counts are near the ceiling, adding PgBouncer in transaction mode is the fastest structural fix available. Nothing else will prevent recurrence under load.

Decision Tree

flowchart TD
    A[Connections near max_connections] --> B{idle in transaction count high?}
    B -->|yes| C[Set idle_in_transaction_session_timeout]
    B -->|no| D{idle connection count high?}
    D -->|yes| E{Pooler in front of Postgres?}
    E -->|no| F[Add PgBouncer in transaction mode]
    E -->|yes| G{Pool sized correctly?}
    G -->|no| H[Reduce pool_size per service]
    G -->|yes| I{One client addr dominant?}
    I -->|yes| J[Investigate connection leak on that host]
    I -->|no| K[Too many services — reduce direct connections]
    D -->|no| L{Connection rate spiking?}
    L -->|yes| M[Check deploy — new instances not closing old]
    L -->|no| N[Increase max_connections as last resort]

Remediation Options

Option 1 — Add PgBouncer in transaction mode (fastest structural fix)

PgBouncer in transaction mode multiplexes many application connections onto a small number of PostgreSQL backend processes. A typical configuration allows 1,000 application connections to share 20 PostgreSQL connections if the average transaction is short.

Install and configure PgBouncer with a minimal pgbouncer.ini:

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0

Application changes: point connection strings to PgBouncer’s port (6432) instead of PostgreSQL’s port (5432). This is the only change required at the application layer.

Transaction mode has one constraint documented in the PgBouncer documentation: prepared statements tied to a specific backend do not survive across transactions in transaction mode. Applications using PREPARE statements must either use the statement cache inside PgBouncer or be moved to session mode.

Option 2 — Set idle_in_transaction_session_timeout

For immediate relief from accumulated idle in transaction connections, set a server-side timeout:

-- Immediate change, no restart required
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Verify it took effect
SHOW idle_in_transaction_session_timeout;

After reload, any session that stays in idle in transaction state for more than 5 minutes will be automatically terminated by PostgreSQL. The application will see a connection error and must handle reconnection.

This parameter was added in PostgreSQL 9.6. It does not affect sessions with actively running queries — only sessions that have an open transaction but are not executing SQL.

Option 3 — Increase max_connections (last resort)

Increasing max_connections requires a PostgreSQL restart and must be paired with a proportional increase in memory:

# Edit postgresql.conf
max_connections = 200
# shared_buffers should be at least 128MB per 100 connections as a starting point
shared_buffers = 2GB

# Restart required
pg_ctl restart -D /var/lib/postgresql/data

This is the last resort because it treats the symptom — not enough connection slots — without addressing the underlying cause, which is direct connections rather than pooled connections. Each additional connection slot adds OS process overhead. The PostgreSQL wiki notes that raising max_connections above 200 without a pooler in front rarely solves connection exhaustion; it only defers it.

Rollback Plan

idle_in_transaction_session_timeout: Revert immediately with ALTER SYSTEM SET idle_in_transaction_session_timeout = 0; SELECT pg_reload_conf(); — zero disables the timeout. No restart required.
PgBouncer addition: PgBouncer is a proxy; removing it means pointing application connection strings back to the direct PostgreSQL port. No PostgreSQL changes are needed. PgBouncer itself can be stopped or removed at any time.
max_connections increase: Decreasing max_connections requires a restart. Before decreasing, verify that active connections at the new lower limit will not be rejected. Query SELECT count(*) FROM pg_stat_activity first to confirm actual utilization.

Automation Opportunity

A Prometheus alert on pg_stat_activity_count by state is the standard monitoring approach. If you do not have Prometheus, this pg_cron query captures connection utilization hourly for capacity planning:

SELECT cron.schedule('connection-capacity-log', '0 * * * *', $$
  INSERT INTO ops.connection_log (ts, total, idle, idle_in_txn, active, max_conn)
  SELECT
    now(),
    count(*),
    count(*) FILTER (WHERE state = 'idle'),
    count(*) FILTER (WHERE state = 'idle in transaction'),
    count(*) FILTER (WHERE state = 'active'),
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections')
  FROM pg_stat_activity
  WHERE pid != pg_backend_pid();
$$);

Alert thresholds worth setting: total > 0.8 * max_connections for capacity warning, idle_in_txn > 10 for transaction hygiene alert, idle_in_txn with age > 5 minutes for immediate escalation.

In Practice

The PgBouncer documentation describes transaction mode as suitable for any application that does not use session-level PostgreSQL features across transactions: advisory locks, SET LOCAL, LISTEN/NOTIFY, prepared statements in session scope, and temporary tables. For applications that do use these features, session mode provides pooling with fewer constraints but with lower connection multiplexing ratios.

The documented pattern from the PostgreSQL documentation on max_connections is that each additional connection adds approximately 400 bytes of shared memory overhead, plus the per-process allocation (typically 5–10 MB). The PostgreSQL wiki explicitly recommends that databases serving more than a few hundred concurrent application connections place a pooler in front rather than raising max_connections beyond 200.

Where It Breaks

Failure mode	Trigger	Fix
PgBouncer transaction mode breaks application	Application uses prepared statements or `SET LOCAL` across transactions	Switch specific pools to session mode; or migrate to `pg_prepared_statements` cache
`idle_in_transaction_session_timeout` causes unexpected rollbacks	Application holds open transactions intentionally for long operations	Increase the timeout for those connections, or refactor to commit-per-batch
Increasing `max_connections` causes OOM	New connection ceiling consumes available RAM	Reduce `max_connections` and add PgBouncer instead
PgBouncer pool exhausted under burst load	`default_pool_size` too small for concurrent query volume	Increase `default_pool_size`; add read replicas for read traffic
Application does not retry on connection error	`idle_in_transaction_session_timeout` terminates and app crashes	Add connection retry logic with exponential backoff

What to Do Next

Problem: PostgreSQL rejects connections hard when max_connections is exhausted — no graceful degradation, just immediate errors for every new connection attempt.
Solution: Add PgBouncer in transaction mode between applications and PostgreSQL to multiplex application connections onto a small pool of PostgreSQL backends, and set idle_in_transaction_session_timeout = '5min' to prevent zombie transactions from consuming connection slots.
Proof: After adding PgBouncer, SELECT count(*) FROM pg_stat_activity on the PostgreSQL side should show a small stable number (equal to default_pool_size) regardless of how many application-side connections exist.
Action: Run the connection-by-state query from Check 1 against your production database today. If idle in transaction count exceeds 5, set idle_in_transaction_session_timeout immediately — it requires only a config reload, not a restart.

Checklist

Query pg_stat_activity grouped by state to see total, idle, idle-in-transaction, and active counts
Compare total connections to max_connections — flag if > 80% used
Check idle in transaction count and age of oldest open transaction
Group connections by client_addr to identify any single-host leak
Confirm whether PgBouncer or pgpool is present and accepting connections
If no pooler: install PgBouncer in transaction mode before the next traffic event
Set idle_in_transaction_session_timeout = '5min' and reload config
Verify pool_mode in PgBouncer config is transaction for OLTP workloads
Confirm application handles connection errors with retry logic
Review max_connections setting — resist raising it without adding a pooler
Add a monitoring alert at 80% of max_connections utilization
Log connection counts hourly to build a capacity baseline for the next 30 days