Staged fleet rollout
What this is
How a fleet of probectl agents moves to a new version: in waves, from verified, digest-pinned artifacts, with the agent registry confirming every wave, and any failure halting the train. The goal is that a bad version reaches a small canary first and stops there, instead of taking out the whole fleet at once.
The model — and what is deliberately absent
There is no agent self-update channel. An agent never fetches or executes new
code on its own. Update authority stays with the operator's orchestrator (Helm /
install.sh / your config management), exactly like any other workload — a
self-update channel would be a fleet-wide remote-code-execution primitive, which
is precisely what this design refuses. The control plane's job is to plan
waves from the agent registry and verify each wave back out of it — never to
push bits.
The engine is internal/agent/rollout.go — a pure, fully-tested state machine
(plan → advance → verify → resume) that operator tooling drives; the agent
registry is both its only input and its only evidence. It builds
deterministic waves from the lifecycle cohorts — canary (~5% of the fleet)
→ early (~20%) → main (the rest) — fixed at plan time by a stable hash of each
agent's id, with agents already on the target version excluded. Planning
fails closed on three things:
- an artifact with no recorded signature verification (you must verify it first);
- a target version outside the supported N/N-1 version-skew window against the
control plane (the skew gate,
internal/lifecycle.Policy.Check, accepts ±1 minor — so an agent one minor ahead of or behind the control plane is fine); - an empty or already-up-to-date fleet (nothing to do).
The same skew policy is also enforced live, independent of any rollout: an
agent outside the window is refused at registration with gRPC
FailedPrecondition ("upgrade required" — retrying without upgrading won't
help). The window is configurable (PROBECTL_AGENT_SKEW_WINDOW, default 1),
PROBECTL_AGENT_MIN_VERSION force-retires anything older than an explicit
floor, and development builds skip the check.
Operator flow
0. Verify the artifact — and record it
Per verify-artifacts.md, confirm the artifact was built by this repository's release workflow before you plan anything. The two artifact kinds verify differently:
Container images are published with SLSA provenance + SBOM attestations (not detached cosign signatures). Inspect the provenance and take the exact digest you will deploy:
docker buildx imagetools inspect ghcr.io/imfeelingtheagi/probectl-ebpf-agent:<version>VM binaries are cosign-keyless signed; run the
cosign verify-blobchecks from verify-artifacts.md against the binary and the signedchecksums.txt— the identity pin proves the signature chains to this repository's release workflow running on a release tag.
The plan requires the exact digest, the verification method, and who verified it — an unattested artifact refuses to plan.
1. Plan
Snapshot the fleet from the registry (GET /v1/agents) and plan. Waves render
like canary[3]=pending early[11]=pending main[46]=pending. The wave
membership — the exact agent ids in each wave — is the orchestrator's worklist.
2. Advance one wave
Advance releases exactly one wave (never two, never out of order) and
starts its verify window (default 15 m). Apply that wave with your orchestrator,
by digest:
- Kubernetes (the agent chart):
helm upgrade probectl-agent deploy/helm/probectl-agent --reuse-values --set image.tag="<version>@sha256:<digest>". Scope a wave to a set of nodes withnodeSelectoror a separate release per ring. - VMs:
sudo deploy/agent/install.sh ./probectl-ebpf-agent-<version>on the wave's hosts (aftercosign verify-blob).
3. Verify from the registry — the agents are the evidence
Every wave member must re-register on the target version with a fresh heartbeat (seen within the last 5 m). All good → the wave completes and you can advance the next one. Stragglers still inside the window: keep waiting, re-verify.
4. Halt-on-error
Once the verify window expires, any straggler — still on the old version, reporting nothing, or vanished from the registry entirely ("upgraded, then went dark") — halts the whole rollout and names the offending agents. A halted rollout exposes no current wave, refuses both Advance and Verify, and never resumes on its own.
5. Resume is explicit
After you remediate (roll the node back, replace it, or fix the artifact),
Resume takes a written remediation note and returns the failed wave to the
applying state with a fresh window. That note is the audit trail of what went
wrong mid-rollout.
Properties worth relying on
| Property | Where it is enforced |
|---|---|
| Verified artifacts only | the plan refuses without digest + method + verifier; deploys are by digest |
| No self-update | nothing in the agent fetches code — orchestrator-only |
| Skew gate stays green | the plan refuses any target outside N/N-1 vs the control plane |
| Deterministic waves | stable-hash cohorts, fixed at plan time, with sorted membership |
| No overlap / no skipping | Advance refuses while a wave is still unverified |
| Halt-on-error | registry-verified — stragglers or dark agents past the window freeze the train |
| Mid-rollout safety | N/N-1 means old and new agents coexist on the bus throughout |
Rollback is the same machine pointed backwards: plan a rollout to the previous (still-verified) version. The same skew window that lets waves coexist going forward lets them coexist coming back.