Why most pg_dump cron jobs fail when you actually need them
When I audit backup setups, the pattern is remarkably consistent: a cron line that has been sitting in /etc/cron.d/ for three years, tar.gz files piling up on a storage box nobody has mounted recently, and no restore run since the original engineer left. Then one morning a primary dies and the archive opens into something, but not the database the application expects.
Here are the five failure modes I keep finding, ranked by how badly they hurt on restore day.
1. The exit code lies, because cron does not care
The classic:
0 2 * * * postgres pg_dump mydb | gzip > /backups/mydb-$(date +\%F).sql.gz
What cron reports: success. What actually happened when pg_dump died mid-run at 02:41: its stdout was truncated, gzip compressed whatever it had received, exited 0, and wrote a .sql.gz that is 90% of a database. Because of the pipe, the exit code of the cron line is gzip's, not pg_dump's. Cron only mails on non-zero exit, which this line almost never produces, because set -o pipefail is not a thing in raw cron syntax.
A common pattern: a multi-gigabyte dump stops mid-COPY on a big table, the resulting file is 80–90% of the real size, and nobody notices for months, because nothing errors; the file just weighs less than it should.
The fix is boring: wrap the dump in a shell script, set set -euo pipefail, and fail loudly. Or skip the pipe entirely with pg_dump --compress=6 -Fc -f /backups/mydb.dump, which at least produces a single file whose exit code is pg_dump's own.
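You can see the pipeline exit-code problem without a database at all. In this sketch, `false` stands in for a failing pg_dump and `cat` for gzip finishing happily on truncated input:

```shell
#!/usr/bin/env bash
# Stand-ins for the real pipeline: `false` plays a failing pg_dump,
# `cat` plays gzip compressing whatever it received.

sh -c 'false | cat'
echo "without pipefail: $?"   # 0 -- the failure is invisible to cron

bash -c 'set -o pipefail; false | cat'
echo "with pipefail:    $?"   # 1 -- now the line fails loudly
```

This is why the wrapper script matters: plain `sh -c`, which is what cron effectively gives you, reports only the last command's status.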
2. --format=plain, because that's what the Stack Overflow answer from 2014 said
Plain-SQL dumps look friendly. You can open them in a text editor. You can grep them. They're also:
- Single-threaded on restore: your 8-core restore host does the work of one core.
- Unable to restore a subset of objects without hand editing.
- Impossible to run the usual integrity checks against: pg_restore -l reads a table of contents, and plain dumps don't have one.
- Larger on disk, unless you remember to bolt compression on yourself; custom format compresses by default.
Custom format (-Fc) is the default you should be choosing: parallel-restorable, has a table of contents, and pg_restore -l tells you whether the archive is even readable before you try to restore it. For multi-TB databases, directory format (-Fd) with --jobs=N is worth the learning cost.
3. statement_timeout bites on the server, not the client
Production databases routinely set statement_timeout = '5min' as a sane default to kill runaway queries. pg_dump runs long queries: a COPY huge_table TO STDOUT can take an hour on a 500 GB table. If the server-side timeout fires, the connection is gone and pg_dump errors, but how visible that error ends up being depends on version and on whether your exit-code handling is trustworthy in the first place.
Common fix in the audit: set statement_timeout = 0 in the backup role's session config, not globally. ALTER ROLE backup_user SET statement_timeout = 0. Then prove it:
psql -U backup_user -c "SHOW statement_timeout"
# 0
Do the same for idle_in_transaction_session_timeout, which bites when slow parallel dump workers sit idle in their transactions.
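Both overrides can live on the backup role so production queries keep the global guardrail (role name here is illustrative):

```sql
-- Scoped to the backup role only; everything else keeps the 5-minute limit.
ALTER ROLE backup_user SET statement_timeout = 0;
ALTER ROLE backup_user SET idle_in_transaction_session_timeout = 0;
```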
4. Retention by find -mtime +30 -delete
The cleanup line is almost always wrong in the same way:
find /backups -name "mydb-*.sql.gz" -mtime +30 -delete
What happens when disk fills and last night's dump didn't write? The newest file in /backups is now 31 days old. -mtime +30 deletes it too. Two nights later you have zero backups and a fragile primary.
Retention should be keep N most recent, not delete older than X. And it should refuse to run at all if today's dump doesn't exist. Every backup tool I respect follows this rule.
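A minimal sketch of both rules together; `prune_backups`, the directory layout, and the `mydb-YYYY-MM-DD.dump` naming are illustrative, not any tool's actual interface:

```shell
#!/usr/bin/env bash
# Sketch: count-based retention with a "today's dump exists" guard.
set -euo pipefail

prune_backups() {
    local dir="$1" keep="$2"
    local today="$dir/mydb-$(date +%F).dump"

    # Guard: refuse to delete anything if last night's dump is missing or empty.
    if [ ! -s "$today" ]; then
        echo "today's dump missing: $today -- not pruning" >&2
        return 1
    fi

    # Keep the $keep newest dumps, delete the rest. Date-stamped names
    # contain no spaces, so ls | tail | xargs is safe enough here.
    ls -1t "$dir"/mydb-*.dump | tail -n +"$((keep + 1))" | xargs -r rm --
}

# Demo against a throwaway directory: 36 daily dumps, keep the newest 30.
demo=$(mktemp -d)
for i in $(seq 0 35); do
    f="$demo/mydb-$(date -d "-$i days" +%F).dump"
    echo "fake dump" > "$f"
    touch -d "-$i days" "$f"
done
prune_backups "$demo" 30
ls "$demo" | wc -l   # 30
```

Note the ordering: the guard runs before any deletion, so a silently failing dump job freezes retention instead of eating the survivors.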
5. Nobody has ever restored
This is the one that cannot be fixed by better flags. Schemas change. Extensions get added and dropped. Roles get renamed. A dump that restored cleanly 18 months ago may need three hand-patches today, and nobody knows until somebody tries.
The only reliable fix is to restore. Not simulate. Not pg_restore -l. Actually restore, into a throwaway target, every so often, automatically. If you can't stand up a throwaway target, that by itself is the finding of the audit.
Cheapest version: a weekly cron job that runs docker run --rm postgres:16, restores the latest dump with pg_restore, runs SELECT count(*) FROM users, and logs the number. If the script errors or the count is zero, page somebody. 30 lines of bash.
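A sketch of that weekly check, assuming Docker on the host, dumps in /backups, and a users table that must have rows; the port, password, and database names are all illustrative:

```shell
#!/usr/bin/env bash
# Weekly restore smoke test -- a sketch, not a hardened script.
set -euo pipefail

latest=$(ls -1t /backups/mydb-*.dump | head -n 1)

# Throwaway Postgres; --rm means nothing survives the test.
cid=$(docker run --rm -d -e POSTGRES_PASSWORD=scratch -p 55432:5432 postgres:16)
trap 'docker stop "$cid" >/dev/null' EXIT
sleep 10   # crude; a pg_isready loop is better

export PGPASSWORD=scratch
createdb -h localhost -p 55432 -U postgres restore_check
pg_restore -h localhost -p 55432 -U postgres -d restore_check "$latest"

# The actual assertion: a table we know must have rows.
n=$(psql -h localhost -p 55432 -U postgres -d restore_check -Atc \
    'SELECT count(*) FROM users')
echo "restore_check: users=$n"
[ "$n" -gt 0 ] || { echo 'restore check FAILED' >&2; exit 1; }
```

With set -euo pipefail, any failing step (a corrupt archive, a missing extension, a renamed role) makes the script exit non-zero, which is exactly the signal your alerting should page on.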
Why resistro exists
The above list is roughly the design spec of resistro: one binary, exit code you can trust, custom-format by default, retention by count with a today's-dump-exists guard, Prometheus-scraped metrics so your monitoring catches missed runs before restore day does.
None of it is magic. It's the pattern five previous engagements told me had to exist somewhere. It didn't, so I wrote it down in Go.
Want a second pair of eyes on your backups?
A 1-day backup audit walks every pipeline you have today, rates each, and leaves you with a written report. Fixed price, no retainer. If everything's already sound, I'll tell you and we're done in half a day.
Book an audit