Zero-downtime lifecycle
The goal of this page: upgrade probectl at scale without an outage, and be able to roll back at every step. Four things make that possible, and the sections below are each one of them — the control plane is stateless replicas behind a load balancer, migrations are additive (expand/contract), the agent fleet rolls out version by version, and nothing is irreversible until you choose to make it so.
If you remember one rule, remember this one: during any rolling upgrade, the old release (N) and the new release (N+1) are running side by side, so everything they share — the database schema, the agent protocol — must work for both at once. Every constraint below falls out of that rule.
Rolling control-plane upgrade
The control plane holds no durable state of its own — all state lives in Postgres, ClickHouse, and the TSDB. That's what makes replicas interchangeable, and it's why they upgrade by replace, not restart-in-place:
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
LB[load balancer] --> A[replica N]
LB --> B[replica N]
subgraph upgrade
direction LR
B -. SIGTERM .-> D[draining: /readyz 503]
D -. in-flight drains .-> X[exit]
N1[replica N+1] -- /readyz 200 --> LB
end
- Apply migrations first (they are additive, so they're safe for the still-running release N — see the next section).
- Roll replicas one at a time. On
SIGTERM/SIGINTa replica flips/readyzto503 drainingimmediately — so the load balancer stops sending it new requests — then drains its in-flight requests withinPROBECTL_SHUTDOWN_TIMEOUTbefore exiting. (/healthzstays200the whole time: the process is still serving, it's just no longer accepting new traffic. The two probes mean different things on purpose.) - Bring up the replacement; it starts serving once
/readyzis200— meaning the database is reachable and it's not draining.
Because N and N+1 run side by side during the roll, both must work against the same schema, which the migration policy below guarantees.
Migrations: expand/contract (the rollback contract)
This is the heart of zero-downtime. A safe upgrade and a safe rollback both require one thing: release N's schema has to work with both N's code and N−1's code. You get that by never making a destructive change in a single step. Instead you split it across two releases:
- Expand (release N): add the new column/table/index (nullable, or with a default), backfill it, and start writing both the old and new shapes.
- Contract (release N+1, once nothing reads the old shape anymore): drop the old column/constraint.
Anything that would break N−1 mid-roll is forbidden in a single migration, and the
migration gate rejects it automatically (make migration-gate, a standing CI
job: the checker in internal/store/migrate walks every embedded *.sql in
migrations/):
| Rejected | Why | Do instead |
|---|---|---|
DROP TABLE / DROP COLUMN |
N-1 code still uses it | drop in the next release |
ALTER COLUMN … TYPE |
table rewrite + lock; breaks N-1 | add a new column, backfill, switch |
RENAME COLUMN/TABLE |
N-1 code references the old name | add the new name, backfill, drop later |
ADD COLUMN … NOT NULL without DEFAULT |
rewrite fails on existing rows; N-1 inserts break | add nullable or give a DEFAULT |
ALTER COLUMN … SET NOT NULL |
locks/fails | add a NOT VALID check, VALIDATE later |
TRUNCATE |
destroys data | — |
Allowed (and used throughout): CREATE TABLE/INDEX IF NOT EXISTS, ADD COLUMN IF NOT EXISTS … (nullable or defaulted), DROP POLICY/DROP INDEX, ADD CONSTRAINT … NOT VALID, RLS enable/force. Migrations remain idempotent (safe re-run) and
each applies in its own transaction under an advisory lock (one applier wins; the
rest find the schema already current).
Rollback is then trivial: roll the replicas back to N. N's code already works against N+1's additive schema, because the schema only added — it didn't remove anything N depends on. (The contract migration that finally removes the old shape is deliberately deferred to a later release precisely so this stays true.)
Agent version skew
The control plane is the authority on whether an agent is compatible. At
registration it compares the agent's version to its own using an N/N−1 window
(internal/lifecycle):
- Same major version, with a minor skew ≤
PROBECTL_AGENT_SKEW_WINDOW(default 1) → compatible, in both directions: an N agent talking to an N+1 control plane works, and so does an N+1 agent talking to an N control plane. - A wider skew, a major-version mismatch, or an agent below
PROBECTL_AGENT_MIN_VERSION→ rejected with gRPCFailedPreconditionand an "upgrade required" message. (That status is deliberately distinct from a transient error, so the agent surfaces it to an operator instead of hot-looping forever trying to reconnect.) - A dev/unpinned build (
0.0.0-dev) on either side skips the check entirely.
The payoff: a fleet can run a mix of N and N+1 agents during a rollout, and upgrading the control plane never strands an agent that's one version behind.
| control \ agent | N−1 | N | N+1 | N±2 |
|---|---|---|---|---|
| N | yes | yes | yes | no (upgrade) |
Staged fleet rollout (cohorts + pace)
You don't push a new agent version to the whole fleet at once — you promote it
ring by ring, watching health between rings. internal/lifecycle assigns each
agent to a cohort by a stable hash of its id (stable so an agent never flaps
between rings from one evaluation to the next): a small canary ring first,
then early, then the main fleet. A Rollout carries the target version
and a Stage that the operator advances one ring at a time:
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
S0[Stage 0: none] --> S1[Stage 1: canary] --> S2[Stage 2: + early] --> S3[Stage 3: all]
DesiredVersion(agentID) returns the target version once that agent's cohort has
been released, and the agent's current version otherwise. The cohort split is
configurable (default 5% canary / 20% early, leaving the remaining ~75% as the
main fleet). Note the boundary: this is the cohort + pace model plus the
version-skew enforcement that makes a staged roll safe to attempt. Actually
delivering a self-update to the agent (advertising the desired version down to
it) is a separate follow-up.
Configuration
See configuration.md for the PROBECTL_AGENT_SKEW_WINDOW
and PROBECTL_AGENT_MIN_VERSION keys (agent-transport section) and
PROBECTL_SHUTDOWN_TIMEOUT (the drain window).
Out of scope
This page is about single-region rolling upgrades: migration safety, version
skew, and the staged-rollout model. Two related-but-separate concerns live
elsewhere: multi-region active-active HA and disaster-recovery failover (RPO/RTO)
are in multi-region.md, and per-tenant backup/restore plus
verifiable deletion are a separate compliance capability.