PostgreSQL Connection Storm Runbook
“Sorry, too many clients already” means PostgreSQL has rejected a connection before your application could run a single query. Every connection to PostgreSQL is a forked OS process consuming memory — typically 5–10 MB of RAM per connection — so max_connections is a hard ceiling that cannot be stretched without consequences. Once you hit it, the failure mode is not graceful degradation; it is hard rejection of new connections until existing ones close.
Situation
PostgreSQL’s process-per-connection architecture dates to a period when connection counts were measured in dozens, not thousands. Each connection forks a backend process, inherits a memory allocation, and holds that allocation for the duration of the connection regardless of whether a query is running. At 200 connections, this overhead is manageable. At 1,000 connections, PostgreSQL is spending more memory serving idle backends than it is serving active queries.
The default max_connections = 100 reflects this constraint — it is not a conservative setting that exists to be raised. The PostgreSQL documentation explicitly notes that increasing max_connections requires increasing shared_buffers proportionally, and that the memory overhead of idle connections is real and measurable.
The Problem
Connection storms occur in three patterns: application connection leaks (connections opened and never closed), pool exhaustion from too many services competing for the same pool, and deployments that spin up new application instances without shutting down old ones cleanly. The idle in transaction state is particularly damaging because those connections are holding transactions open, which blocks vacuum and prevents transaction ID advancement.
Without a centralized connection multiplexer, every new microservice or horizontal pod autoscaling event directly multiplies the active TCP connections to the database host. Eventually, the database runs out of available connection slots or OS memory, triggering catastrophic connection rejection. How do you scale application instances without proportionally scaling database connection overhead?
Core Concept
The structural solution is to decouple application connection counts from PostgreSQL process counts using connection pooling, specifically PgBouncer in transaction mode, while implementing aggressive server-side transaction timeouts to prevent zombie state accumulation.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
| Application errors: “sorry, too many clients already” | Application logs | max_connections ceiling hit — no new connections possible |
count(*) near max_connections value | pg_stat_activity | Connection headroom nearly exhausted |
High count of idle in transaction state | pg_stat_activity | Connections holding open transactions, blocking vacuum |
| One client IP with > 50 connections | pg_stat_activity grouped by client_addr | Connection leak on a specific application server |
| No PgBouncer or pgpool in the stack | Infrastructure review | Direct connection architecture that cannot scale safely |
| Memory pressure on the PostgreSQL host | OS metrics | Each idle connection consuming 5–10 MB RAM |
First Five Checks
- Count connections by state — get the distribution of active, idle, and idle-in-transaction connections:
SELECT
state,
count(*) AS connection_count,
max(now() - state_change) AS oldest_in_state
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY state
ORDER BY connection_count DESC;
High idle counts mean connections are staying open without doing work — a pooling problem. High idle in transaction counts mean applications are opening transactions and not committing or rolling back — a connection leak or long-running operation pattern.
- Check the connection ceiling — confirm max_connections and how close you are:
SHOW max_connections;
SELECT count(*) AS total_connections,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
count(*) * 100 / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS pct_used
FROM pg_stat_activity;
Anything above 80% of max_connections is operational risk. At 90%, connection failures are likely during traffic spikes. PostgreSQL reserves a small number of connections for superusers via superuser_reserved_connections (default 3), so regular users lose access before the absolute ceiling.
- Count idle-in-transaction connections — these are the most damaging:
SELECT
count(*) AS idle_in_txn_count,
max(now() - xact_start) AS oldest_open_txn
FROM pg_stat_activity
WHERE state = 'idle in transaction';
Any oldest_open_txn value above 5 minutes should be treated as an incident. These connections are holding their transaction’s snapshot, preventing vacuum from advancing the horizon, and consuming a process slot doing nothing.
- Connection distribution by client address — identify connection hogs:
SELECT
client_addr,
usename,
count(*) AS connections,
sum(CASE WHEN state = 'idle in transaction' THEN 1 ELSE 0 END) AS idle_in_txn
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
GROUP BY client_addr, usename
ORDER BY connections DESC
LIMIT 10;
A single application server holding 80 connections to PostgreSQL while a second server holds 2 is a strong signal of either a connection leak or misconfigured pool sizing on the first server.
- Check for a connection pooler — if there is no PgBouncer or pgpool in front of PostgreSQL, that is the fix:
# Check whether PgBouncer is running on the standard port
nc -z localhost 6432 && echo "PgBouncer present" || echo "No pooler on 6432"
# Or check from the PostgreSQL side — poolers identify themselves
SELECT client_addr, application_name, count(*)
FROM pg_stat_activity
WHERE application_name ILIKE '%pgbouncer%'
OR application_name ILIKE '%pgpool%'
GROUP BY client_addr, application_name;
If no pooler is present and connection counts are near the ceiling, adding PgBouncer in transaction mode is the fastest structural fix available. Nothing else will prevent recurrence under load.
Decision Tree
flowchart TD
A[Connections near max_connections] --> B{idle in transaction count high?}
B -->|yes| C[Set idle_in_transaction_session_timeout]
B -->|no| D{idle connection count high?}
D -->|yes| E{Pooler in front of Postgres?}
E -->|no| F[Add PgBouncer in transaction mode]
E -->|yes| G{Pool sized correctly?}
G -->|no| H[Reduce pool_size per service]
G -->|yes| I{One client addr dominant?}
I -->|yes| J[Investigate connection leak on that host]
I -->|no| K[Too many services — reduce direct connections]
D -->|no| L{Connection rate spiking?}
L -->|yes| M[Check deploy — new instances not closing old]
L -->|no| N[Increase max_connections as last resort]
Remediation Options
Option 1 — Add PgBouncer in transaction mode (fastest structural fix)
PgBouncer in transaction mode multiplexes many application connections onto a small number of PostgreSQL backend processes. A typical configuration allows 1,000 application connections to share 20 PostgreSQL connections if the average transaction is short.
Install and configure PgBouncer with a minimal pgbouncer.ini:
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0
Application changes: point connection strings to PgBouncer’s port (6432) instead of PostgreSQL’s port (5432). This is the only change required at the application layer.
Transaction mode has one constraint documented in the PgBouncer documentation: prepared statements tied to a specific backend do not survive across transactions in transaction mode. Applications using PREPARE statements must either use the statement cache inside PgBouncer or be moved to session mode.
Option 2 — Set idle_in_transaction_session_timeout
For immediate relief from accumulated idle in transaction connections, set a server-side timeout:
-- Immediate change, no restart required
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
-- Verify it took effect
SHOW idle_in_transaction_session_timeout;
After reload, any session that stays in idle in transaction state for more than 5 minutes will be automatically terminated by PostgreSQL. The application will see a connection error and must handle reconnection.
This parameter was added in PostgreSQL 9.6. It does not affect sessions with actively running queries — only sessions that have an open transaction but are not executing SQL.
Option 3 — Increase max_connections (last resort)
Increasing max_connections requires a PostgreSQL restart and must be paired with a proportional increase in memory:
# Edit postgresql.conf
max_connections = 200
# shared_buffers should be at least 128MB per 100 connections as a starting point
shared_buffers = 2GB
# Restart required
pg_ctl restart -D /var/lib/postgresql/data
This is the last resort because it treats the symptom — not enough connection slots — without addressing the underlying cause, which is direct connections rather than pooled connections. Each additional connection slot adds OS process overhead. The PostgreSQL wiki notes that raising max_connections above 200 without a pooler in front rarely solves connection exhaustion; it only defers it.
Rollback Plan
idle_in_transaction_session_timeout: Revert immediately withALTER SYSTEM SET idle_in_transaction_session_timeout = 0; SELECT pg_reload_conf();— zero disables the timeout. No restart required.- PgBouncer addition: PgBouncer is a proxy; removing it means pointing application connection strings back to the direct PostgreSQL port. No PostgreSQL changes are needed. PgBouncer itself can be stopped or removed at any time.
max_connectionsincrease: Decreasingmax_connectionsrequires a restart. Before decreasing, verify that active connections at the new lower limit will not be rejected. QuerySELECT count(*) FROM pg_stat_activityfirst to confirm actual utilization.
Automation Opportunity
A Prometheus alert on pg_stat_activity_count by state is the standard monitoring approach. If you do not have Prometheus, this pg_cron query captures connection utilization hourly for capacity planning:
SELECT cron.schedule('connection-capacity-log', '0 * * * *', $$
INSERT INTO ops.connection_log (ts, total, idle, idle_in_txn, active, max_conn)
SELECT
now(),
count(*),
count(*) FILTER (WHERE state = 'idle'),
count(*) FILTER (WHERE state = 'idle in transaction'),
count(*) FILTER (WHERE state = 'active'),
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections')
FROM pg_stat_activity
WHERE pid != pg_backend_pid();
$$);
Alert thresholds worth setting: total > 0.8 * max_connections for capacity warning, idle_in_txn > 10 for transaction hygiene alert, idle_in_txn with age > 5 minutes for immediate escalation.
In Practice
The PgBouncer documentation describes transaction mode as suitable for any application that does not use session-level PostgreSQL features across transactions: advisory locks, SET LOCAL, LISTEN/NOTIFY, prepared statements in session scope, and temporary tables. For applications that do use these features, session mode provides pooling with fewer constraints but with lower connection multiplexing ratios.
The documented pattern from the PostgreSQL documentation on max_connections is that each additional connection adds approximately 400 bytes of shared memory overhead, plus the per-process allocation (typically 5–10 MB). The PostgreSQL wiki explicitly recommends that databases serving more than a few hundred concurrent application connections place a pooler in front rather than raising max_connections beyond 200.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| PgBouncer transaction mode breaks application | Application uses prepared statements or SET LOCAL across transactions | Switch specific pools to session mode; or migrate to pg_prepared_statements cache |
idle_in_transaction_session_timeout causes unexpected rollbacks | Application holds open transactions intentionally for long operations | Increase the timeout for those connections, or refactor to commit-per-batch |
Increasing max_connections causes OOM | New connection ceiling consumes available RAM | Reduce max_connections and add PgBouncer instead |
| PgBouncer pool exhausted under burst load | default_pool_size too small for concurrent query volume | Increase default_pool_size; add read replicas for read traffic |
| Application does not retry on connection error | idle_in_transaction_session_timeout terminates and app crashes | Add connection retry logic with exponential backoff |
What to Do Next
- Problem: PostgreSQL rejects connections hard when
max_connectionsis exhausted — no graceful degradation, just immediate errors for every new connection attempt. - Solution: Add PgBouncer in transaction mode between applications and PostgreSQL to multiplex application connections onto a small pool of PostgreSQL backends, and set
idle_in_transaction_session_timeout = '5min'to prevent zombie transactions from consuming connection slots. - Proof: After adding PgBouncer,
SELECT count(*) FROM pg_stat_activityon the PostgreSQL side should show a small stable number (equal todefault_pool_size) regardless of how many application-side connections exist. - Action: Run the connection-by-state query from Check 1 against your production database today. If
idle in transactioncount exceeds 5, setidle_in_transaction_session_timeoutimmediately — it requires only a config reload, not a restart.
Checklist
- Query
pg_stat_activitygrouped by state to see total, idle, idle-in-transaction, and active counts - Compare total connections to
max_connections— flag if > 80% used - Check
idle in transactioncount and age of oldest open transaction - Group connections by
client_addrto identify any single-host leak - Confirm whether PgBouncer or pgpool is present and accepting connections
- If no pooler: install PgBouncer in transaction mode before the next traffic event
- Set
idle_in_transaction_session_timeout = '5min'and reload config - Verify
pool_modein PgBouncer config istransactionfor OLTP workloads - Confirm application handles connection errors with retry logic
- Review
max_connectionssetting — resist raising it without adding a pooler - Add a monitoring alert at 80% of
max_connectionsutilization - Log connection counts hourly to build a capacity baseline for the next 30 days