At 2am, I got an alert: database disk usage at 85%. Logged in, checked the usual suspects (tables, indexes, temp files), and everything looked normal. Then I checked /var/lib/postgresql/data/pg_wal/ and found 40GB of Write-Ahead Logs. That’s when I learned to monitor WAL properly.

What triggered WAL monitoring?

The incident that made me care about WAL was a replica that fell behind during a large data import. PostgreSQL kept accumulating WAL files because it couldn’t confirm the replica had received them. Disk filled up, writes stopped, production went down. Good times.

Since then, I monitor WAL growth proactively. An unexpected increase usually means something interesting is happening - often something bad.

Checking WAL size

Check current WAL size:

SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'));

This returns the current WAL position as a byte count from the start of the WAL stream - effectively how much WAL the cluster has ever generated. Not directly useful on its own, but good for comparing before and after a large operation.
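
For example, to see how much WAL a bulk operation generates, capture the position before and after and diff the two. This is a psql-flavored sketch - the \gset and :'before_lsn' bits are psql variable features, and the middle step is whatever operation you're measuring:

SELECT pg_current_wal_lsn() AS before_lsn \gset
-- ... run the big import / UPDATE / whatever here ...
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn')) AS wal_generated;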

Check WAL directory size (from shell):

du -sh /var/lib/postgresql/data/pg_wal/

This is the number I actually care about - raw disk usage. On a healthy system it stays relatively stable, roughly bounded by max_wal_size plus whatever wal_keep_size demands, so a few GB at most on a typical configuration.

List WAL files:

ls -la /var/lib/postgresql/data/pg_wal/

Counting files can be revealing. If you see hundreds of 16MB files piling up, you have a problem.
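
If you can't (or don't want to) shell into the box, the same information is available from SQL via pg_ls_waldir(). It normally requires superuser or membership in the pg_monitor role, so treat this as a sketch of what to run once you have the right grants:

SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_total
FROM pg_ls_waldir();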

When should you be worried?

Normal: WAL directory somewhere in the 1-5GB range, with files being recycled regularly. You’ll see files being created and deleted as checkpoints happen.

Concerning: WAL directory growing over 10GB, especially if it’s growing steadily. This means PostgreSQL can’t recycle old WAL files for some reason.

Critical: WAL directory over 50GB or using more than 10% of your disk. You’re at risk of running out of disk space, which will cause database writes to fail.

I set up alerts at 10GB and 25GB. The first gives me time to investigate, the second means I need to fix it now.
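
For a quick manual check against those thresholds, something like this works (same pg_ls_waldir() caveat as above, and the 10GB/25GB figures are just my alert levels, not anything PostgreSQL itself cares about):

SELECT pg_size_pretty(sum(size))         AS wal_total,
       sum(size) > pg_size_bytes('10GB') AS over_warning,
       sum(size) > pg_size_bytes('25GB') AS over_critical
FROM pg_ls_waldir();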

Diagnosing WAL growth

Check replication lag (if using replicas):

SELECT
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag,
    pg_wal_lsn_diff(sent_lsn, flush_lsn) as flush_lag
FROM pg_stat_replication;

If send_lag or flush_lag are in the megabytes or gigabytes, your replica is behind and WAL files are accumulating. Common causes:

  • Network issues between primary and replica
  • Replica under heavy read load and can’t keep up
  • Replica disk I/O bottleneck
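
A variant I find easier to read at 2am wraps the lags in pg_size_pretty() and adds replay lag, which is usually the number that actually matters (the column aliases are just my labels):

SELECT client_addr,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn))              AS flush_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;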

Check for long-running transactions:

SELECT
    pid,
    now() - xact_start AS duration,
    state,
    query
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start IS NOT NULL  -- skip backends with no open transaction
ORDER BY duration DESC;

Long-running transactions can keep WAL around - a logical replication slot, for example, can’t release WAL until the transactions it is decoding finish - and they block vacuum from cleaning up old row versions in the meantime. I once found a transaction that had been running for 6 hours because someone left a psql session open with an uncommitted transaction.
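
When the culprit is an abandoned session, I kill it and then add a guardrail so it can’t quietly happen again. The pid below is a placeholder - take the real one from the query above - and the 30-minute timeout is just a value I’m comfortable with, not a recommendation:

-- end the offending backend (12345 is a placeholder pid)
SELECT pg_terminate_backend(12345);

-- automatically kill sessions that sit idle inside an open transaction
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30min';
SELECT pg_reload_conf();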

Check WAL settings:

SHOW wal_keep_size;
SHOW max_wal_size;
SHOW min_wal_size;

  • wal_keep_size: Minimum amount of past WAL to keep for standbys (default 0, meaning no extra WAL is retained for them)
  • max_wal_size: Soft limit that triggers a checkpoint when WAL grows past it (default 1GB)
  • min_wal_size: Minimum WAL to keep around for reuse instead of removing it (default 80MB)

If wal_keep_size is set to something large (like 50GB), that’s your culprit. Someone probably set it high to prevent replica lag but didn’t consider disk space.
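
If that turns out to be the problem, the fix is usually just bringing the setting back down to something your disk can tolerate - wal_keep_size is reloadable, so no restart is needed. The 2GB here is an arbitrary example, not a recommendation:

ALTER SYSTEM SET wal_keep_size = '2GB';
SELECT pg_reload_conf();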

Cloud SQL specifics

For Cloud SQL, checking WAL settings is different:

gcloud sql instances describe INSTANCE --format="yaml(settings.backupConfiguration)"

Cloud SQL manages WAL retention for you based on backup settings. If you have point-in-time recovery enabled with a long retention period, you’ll have more WAL files kept around.
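
To see just the transaction log retention, you can narrow the same describe call down to that field (the field name comes from the Cloud SQL Admin API’s backupConfiguration - if it comes back empty, check the full describe output for the exact spelling on your instance):

gcloud sql instances describe INSTANCE --format="value(settings.backupConfiguration.transactionLogRetentionDays)"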

Replica lag itself lives in the Cloud Console’s monitoring charts (look at the replication metrics on the replica instance). From the CLI you can at least list recent backup operations, which matter because backups and point-in-time recovery drive how much WAL Cloud SQL keeps:

gcloud sql operations list --instance=INSTANCE --filter="operationType=BACKUP"

Opinion: Cloud SQL’s automatic WAL management is mostly good, but the lack of visibility into why WAL is growing can be frustrating. You can’t just SSH in and check /var/lib/postgresql/data/pg_wal/, which makes debugging harder.

My monitoring setup

I have Prometheus scraping these metrics every minute:

-- Total WAL bytes written so far (a counter; Prometheus derives the rate)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');

-- Replication lag (for each replica)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
FROM pg_stat_replication;

-- Oldest transaction age
SELECT max(now() - xact_start) FROM pg_stat_activity WHERE state != 'idle';

Alerting thresholds:

  • WAL generation > 100MB/s for more than 5 minutes (something big is happening)
  • Replication lag > 1GB (replica is falling behind)
  • Transaction age > 1 hour (someone left something open)

Further reading

If you want to understand WAL deeply, read the official PostgreSQL WAL documentation. It’s technical but explains why WAL exists and how it works.

For more practical guidance: