2026-04-22 · 7 min read · Field notes

RTO and RPO for the DACH mid-market, without the enterprise theatre

If you have ever sat in a BSI Grundschutz workshop, you have heard somebody say "our RTO is four hours" with complete conviction. Ask how they know, and the honest answer is: somebody wrote it down in a procurement questionnaire five years ago, and nobody has tested it since.

Let's put the definitions on the table, then throw out the template language, then rebuild something you can actually defend to your board.

The two numbers, short version

Term | Question it answers | Unit
RTO  | How long from "it's broken" to "it's running again"? | Minutes or hours
RPO  | How much data will we have lost when we restore? | Minutes of data

RTO is about downtime. RPO is about data freshness. They are different dials and they trade off against each other. A 1-minute RPO costs much more than a 4-hour RPO, because it usually means synchronous replication.

Why the enterprise templates don't fit

"Four hours / fifteen minutes" is the canonical big-bank number. It comes from environments with:

A 20-person DACH SaaS with 400 GB of PostgreSQL has none of those things, usually can't economically justify them, and shouldn't claim the numbers that go with them. You don't have a 4-hour RTO. You have whatever number drops out when somebody actually tries to restore under pressure.

How to measure your real RTO in one afternoon

  1. Pick a recent backup. Any one. The one you would grab if prod just died.
  2. Stand up a throwaway Postgres on a spare box or a docker run postgres:16.
  3. Start a stopwatch.
  4. Restore the backup into the throwaway. Actually run pg_restore. Not pg_restore -l.
  5. Run your application's smoke test against it. Log in. Fetch a page. Check a known record.
  6. Stop the stopwatch.
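The six steps above are worth scripting, so every drill runs the same way and produces a comparable number. A minimal sketch in Python — the restore and smoke-test commands shown are placeholders you would swap for your own:

```python
import subprocess
import time


def timed_drill(steps: list[list[str]]) -> float:
    """Run each shell step in order; return total elapsed seconds.

    check=True raises as soon as any step fails, so a broken restore
    can't silently produce a flattering number.
    """
    start = time.monotonic()
    for cmd in steps:
        subprocess.run(cmd, check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder commands -- substitute your real restore and smoke test.
    elapsed = timed_drill([
        ["echo", "pg_restore --no-owner -d drill_target backup.dump"],
        ["echo", "smoke test: log in, fetch a page, check a known record"],
    ])
    print(f"measured RTO: {elapsed:.0f}s")
```

Keeping the step list in a file is what turns a one-off drill into the "replay script for next time" — the next run is a diff, not a reinvention.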

That number is your RTO, with three caveats you should write next to it:

  - you restored calmly, without alarms going off and customers on the phone;
  - the target machine already existed — in a real incident, provisioning it is on the clock too;
  - you restored the database only, not DNS, secrets, application servers, or the rest of the stack.

A typical first drill for a mid-market stack lands somewhere between 45 minutes and 4 hours for the database alone. If you thought you were at 30 minutes and you're actually at 2 hours, that's the finding of the audit.

How to measure RPO

RPO is easier in principle:

RPO ≈ (backup frequency) + (time to stream one backup off-site)

Daily backup that finishes at 03:00 and takes 20 minutes to upload → worst-case RPO is 24h + 20 minutes. Hourly backups → worst-case RPO is 60 minutes + upload time. The "+ upload" matters more than people think — if the backup exists only on the production host and the production host is the thing that died, your RPO is infinity.
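The formula above is simple enough to encode, so the number on the board slide is derived rather than guessed. A sketch using the daily and hourly examples from this section:

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, upload_time: timedelta) -> timedelta:
    """Worst-case data loss: a failure just before the next backup finishes
    uploading off-site loses one full interval plus the upload window."""
    return backup_interval + upload_time


daily = worst_case_rpo(timedelta(hours=24), timedelta(minutes=20))
hourly = worst_case_rpo(timedelta(hours=1), timedelta(minutes=5))
print(daily)   # -> 1 day, 0:20:00
print(hourly)  # -> 1:05:00
```

Note that `upload_time` only counts once the backup actually leaves the production host — a backup sitting in /var/backups on the dead machine contributes nothing.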

There's a second RPO you should track separately: replication lag, if you have a replica. Losing the primary but still having the replica means your RPO is whatever the replication lag was at the moment of failure — often seconds, sometimes much worse under heavy write load. If you have one, measure it in Prometheus and put the 99th percentile in your runbook.
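If you scrape replication lag into Prometheus (the exact metric name depends on your exporter), the p99 for the runbook is one percentile computation over the samples. A sketch with made-up lag data — mostly sub-second, with a few spikes under write load:

```python
import math


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of replication-lag samples (seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]


# 100 synthetic lag samples: 97 quiet readings plus three write-load spikes.
lags = [0.2] * 97 + [4.0, 9.0, 31.0]
print(f"p99 replication lag: {p99(lags):.1f}s")  # -> 9.0s
```

The spike values are the whole point: the runbook number is the lag at the worst plausible moment of failure, not the average on a quiet Tuesday.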

What actually belongs in the board slide

Three numbers, all measured:

  1. Tested RTO — "last drill restored X into a clean target in Y minutes on Z date."
  2. RPO — "we would lose at most N minutes of data, bounded by backup frequency and upload time."
  3. When we last tested — the date of the most recent real restore. This is the most honest metric.

The board doesn't care about technology. It cares about the third number going stale. "Last tested restore: 47 days ago" is a stronger statement than any ISO 27001 paragraph, because it's falsifiable.

A pattern I see repeatedly: the deck says 4h / 15min, the last successful test restore is 18+ months old. When the drill finally runs, the actual RTO lands at 4–6× the claim — and that's with the backup already mounted on a spare host. The gap between claim and reality is the real business problem.

What I recommend for mid-market stacks

Stack size | RTO target | RPO target | Required investment
< 50 GB, 1 DB host | 2 h | 24 h + upload | Daily encrypted backups off-site, monthly test restore. Doable solo.
50–500 GB, 1 primary + replica | 1 h | 1 h + upload | Hourly backups, monitored replica, quarterly drill. Weekend project.
> 500 GB or regulated | 30 min | 15 min | PITR, tested failover, dedicated runbook. Month of setup, quarterly practice.

Notice what's missing: "enterprise HA across three availability zones" is not on this table. Most DACH mid-market companies don't need it and can't afford it, and it doesn't reduce business risk any more than a well-practised pg_restore drill does.

The test matters more than the target

Pick a number you can defend by drill. Write the date of the last drill on a dashboard. When that date is older than 90 days, block feature work until it gets refreshed. That one rule beats any amount of RTO/RPO documentation that nobody has verified.
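The 90-day rule is mechanical enough to enforce in CI or on a dashboard. A sketch, assuming the last drill date is recorded somewhere machine-readable:

```python
from datetime import date

MAX_DRILL_AGE_DAYS = 90


def drill_is_stale(last_drill: date, today: date) -> bool:
    """True when the last tested restore is older than the 90-day budget."""
    return (today - last_drill).days > MAX_DRILL_AGE_DAYS


# A drill from 47 days ago is still fresh; one from 120 days ago is not.
print(drill_is_stale(date(2026, 1, 1), date(2026, 2, 17)))  # 47 days  -> False
print(drill_is_stale(date(2026, 1, 1), date(2026, 5, 1)))   # 120 days -> True
```

Wiring this into the deploy pipeline is what gives the rule teeth: a stale drill date fails the check, and feature work stops until someone runs the restore.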

Want your RTO measured, not estimated?

A half-day disaster-recovery drill walks a real backup into a clean target with a stopwatch running. You walk away with numbers you can defend at the next board meeting, and a replay script for next time.

Book a drill