Why most pg_dump cron jobs fail when you actually need them
Five failure modes of pg_dump cron jobs, ranked by how badly they bite when you need to restore.
In small self-managed PostgreSQL setups, the same backup mistakes show up surprisingly often: a cron line sitting in /etc/cron.d/ for three years, tar.gz files piling up on a storage box nobody mounted recently, no restore attempted since the original engineer left. Then one morning a primary dies and the archive opens into something, but not the database the application expects.
These are five failure modes that show up repeatedly in restore reviews, ordered by likely impact on restore day.
1. The exit code lies, because cron does not care
The classic:
0 2 * * * postgres pg_dump mydb | gzip > /backups/mydb-$(date +\%F).sql.gz
What cron reports: success. What actually happened when the disk filled up at 02:41: gzip exited non-zero, pg_dump’s stdout was truncated, the resulting .sql.gz is 90% of a database. Because of the pipe, the exit code of the cron line is gzip’s, not pg_dump’s. And cron only mails on non-zero exit — which never happens, because set -o pipefail is not available in raw cron syntax.
A common result: a multi-gigabyte dump stops mid-
COPYon a big table, the file weighs 80–90% of what it should, and nobody notices for months — because nothing errors, the file just looks smaller than expected.
The fix is boring: wrap in a shell script, set set -euo pipefail, fail loudly. Or skip the pipe entirely with pg_dump --compress=6 -Fc -f /backups/mydb.dump, which produces a single file whose exit code is the dump’s own exit code.
2. --format=plain, because that’s what the Stack Overflow answer from 2014 said
Plain-SQL dumps look friendly. You can open them in a text editor, you can grep. They’re also:
- Single-threaded on restore — your 8-core restore host does 1/8th the work.
- Unable to restore a subset of objects without hand editing.
- Impossible to run the usual integrity checks against (
pg_restore -l, the TOC). - Typically 3–4× larger than custom format at the same
--compress=9.
Custom format (-Fc) is what you want: parallel-restorable, has a table of contents, and pg_restore -l tells you whether the archive is even readable before you attempt a restore. For multi-TB databases, directory format (-Fd) with --jobs=N is worth the learning cost.
3. statement_timeout bites on the server, not the client
Production databases routinely have statement_timeout = '5min' set as a sane default to kill runaway queries. pg_dump runs long queries — specifically COPY (SELECT * FROM huge_table) TO STDOUT, which can take an hour on a 500 GB table. If the server timeout fires, the connection drops. pg_dump errors, but depending on version may not produce a visible error message.
Fix: set statement_timeout = 0 in the backup role’s session config, not globally.
ALTER ROLE backup_user SET statement_timeout = 0;
Then verify it actually took:
psql -U backup_user -c "SHOW statement_timeout"
# 0
Also check idle_in_transaction_session_timeout, which bites on slow parallel workers.
4. Retention by find -mtime +30 -delete
The cleanup line is almost always wrong in the same way:
find /backups -name "mydb-*.sql.gz" -mtime +30 -delete
What happens when the disk fills and last night’s dump didn’t write? The newest file in /backups is now 31 days old. -mtime +30 deletes it too. Two nights later you have zero backups and a fragile primary.
Retention should be keep N most recent, not delete older than X. And it should refuse to run at all if today’s dump doesn’t exist. Every backup tool worth using follows this rule.
5. Nobody has ever restored
This is the one that can’t be fixed by better flags. Schemas change. Extensions get added and dropped. Roles get renamed. A dump that restored cleanly 18 months ago may need three hand-patches today — and nobody knows until they try.
The only reliable fix is to restore. Not simulate. Not pg_restore -l. Actually restore, into a throwaway target, regularly, automatically. If you can’t stand up a throwaway target, that by itself is the finding of the audit.
Cheapest version: weekly cron,
docker run --rm postgres:16,pg_restore,SELECT count(*) FROM users, log the number. If the script errors or the count is zero, page somebody. 30 lines of bash.
Why resistro exists
This list also explains several product decisions in Resistro: one binary, exit code you can trust, custom-format by default, retention by count with a today’s-dump-exists guard, Prometheus-scraped metrics so your monitoring catches missed runs before restore day does.
None of this is novel. It is mostly a productization of failure patterns that repeat across small self-managed backup setups.
If you want a second pair of eyes on your backup setup: a 1-day backup audit walks every pipeline you have today, rates each one, and leaves you with a written report. Fixed price, no retainer. If everything checks out, I’ll say so and we’re done in half a day.
