Runbook: region failover
What this is
Promote a standby Postgres and move the writer to a new region, with no split-brain and bounded data loss. The conceptual model — why there is exactly one writer, how the fence works — is multi-region.md; this is the step-by-step you follow during an actual failover.
Who runs it: a DB operator (does the Postgres failover) plus a platform operator (verifies probectl resumes writes).
Pre-requisites: streaming replication is healthy; the writer endpoint is a
DNS name or proxy you can re-point; the cluster_state table exists (it is
created by migration 0032_cluster_state.sql).
0. Decide: is a failover warranted?
A failover is for a primary-region loss — the primary is down or network-
isolated — not for transient replication lag. Check /readyz across regions and
the probectl_cluster_replica_lag_seconds metric. If the primary is already
unreachable, probectl will be fencing writes for you (a retryable 503 with the
code writer_unavailable), while reads and telemetry ingest keep flowing.
1. Confirm the standby is caught up (this sets your RPO)
- Synchronous replication: the synchronous standby has every committed row → RPO 0.
- Asynchronous: check the standby's replay lag at the moment of loss with
pg_last_xact_replay_timestamp(). Anything written inside that lag window may be lost — record it. If a fresher standby exists, do not promote a badly lagged one.
2. Promote the standby (DB operator)
Use your Postgres failover tooling: pg_ctl promote, Patroni
switchover/failover, or your managed-DB failover action. After promotion the
new primary is writable and on a new timeline.
3. Stamp the promotion epoch — this is the fence (do not skip)
On the newly promoted primary, run:
SELECT cluster_promote('<new-writer-region>', '<operator>');
This bumps cluster_state.writer_epoch by one and records the new writer region
and actor. The new epoch replicates out to the other standbys. This is the
split-brain fence: the old primary keeps the lower epoch, so probectl
refuses to write to it even if the endpoint briefly points back at it.
Skipping this step risks a split-brain write — do not skip it.
4. Re-point the writer endpoint
Move PROBECTL_DATABASE_URL's DNS name / proxy to the promoted primary:
- Managed DB: usually automatic.
- Patroni / HAProxy: tracks the leader for you.
- Manual DNS: update the record and wait out the TTL.
The per-region read replicas (PROBECTL_DATABASE_READ_URL) keep pointing at
their own local node — reads stay local.
5. Verify (platform operator) — the RTO check
probectl re-probes the database every 5 s and resumes writes automatically — no restart needed. Confirm on a replica in a surviving region:
GET /readyz → cluster.writes_usable: true
cluster.writer.writer_region: <new-writer-region>
cluster.highest_epoch: <bumped value>
(The cluster view is nested under the cluster key of the /readyz JSON; the
writer node's reported region is the writer_region field on cluster.writer.)
A mutating request — e.g. saving a config — should now succeed instead of
returning 503. The elapsed time from primary loss to writes_usable: true is
your realised RTO; compare it against PROBECTL_RTO_SECONDS (default and
provisional target: 60 s).
6. Rebuild the old region as a standby
Bring the former primary back as a standby of the new primary (re-clone, or
pg_rewind it onto the new timeline). It rejoins on the current epoch. Until it
does, probectl correctly fences it (it is either on a lower epoch or in
recovery).
Watch-outs
- Never promote two standbys for the same cluster — there is only ever one
primary.
cluster_promotemakes the winner unambiguous (highest epoch wins); a second promotion that does not also win the endpoint is fenced anyway. - Data residency: do not fail a residency-restricted tenant's data into a region its policy forbids. Strict tenants run siloed with region-pinned stores rather than global replication.
- A failover is not a backup. Keep the documented backup / PITR policy
(recorded in
PROBECTL_BACKUP_RETENTION_NOTE) regardless — see backup-restore.md.