probectl /docs GitHub ↗

Running probectl in production

What it is

This page covers the work of keeping a probectl deployment alive and healthy over its whole life: upgrading it without taking it offline, rolling a new agent version across a fleet a slice at a time, running it in a hardened cryptographic mode for regulated environments, surviving the loss of a whole region, governing how long data lives and how it is masked or deleted, and handing a diagnostician a single file that explains a problem without leaking a single secret.

Think of it as the difference between building a car and operating a fleet of them: the engine is the same, but now you care about oil changes you can do without stopping the car, swapping a part across the whole fleet in waves, running on certified fuel, and a black-box recorder that captures the instruments but never the passengers' conversations.

Five capabilities live here:

  • Zero-downtime lifecycle — upgrade and roll back at every step, with no outage.
  • Fleet rollout — promote a new agent version ring by ring, watching health.
  • Hardened cryptographic mode — a build that operates a government-validated cryptographic module (the Federal Information Processing Standards mode, FIPS).
  • Multi-region high availability (HA) — survive a region failure with no data corruption.
  • Governance and supportability — control data retention, masking, erasure, and per-customer keys; produce a secret-free support bundle on demand.

Terms of art (glossary) such as TLS, mTLS, RLS, BYOK, FIPS, and KEK are defined there.

Why it exists

A monitoring platform is itself a service somebody has to operate at 3 a.m. The worst time to discover your observability tool needs a maintenance window is the moment you most need to watch the network. So probectl is built so that the boring operational tasks — upgrades, failovers, key rotations — happen without a gap in observability.

Each capability answers a concrete fear:

  • "Will the upgrade take us down?" No. The control plane runs as interchangeable copies (replicas) behind a traffic distributor (a load balancer), and you replace them one at a time while the rest keep serving.
  • "Will a bad agent release break the whole fleet at once?" No. A new agent version reaches a small canary slice first, so a regression hurts a few percent, not everyone.
  • "Can we satisfy a regulator who demands a validated crypto module?" Yes — there is a build that operates one.
  • "What happens if a region dies?" The control plane keeps serving reads and ingest everywhere; the database fails closed on writes during the flip so it can never corrupt state.
  • "Can we honour a deletion or residency obligation, and prove it?" Yes — erasure produces a proof document anyone can re-derive.
  • "Can we get a diagnostic file to support without leaking credentials?" Yes — the support bundle is structurally secret-free.

How it works

Zero-downtime upgrades (the one rule)

During a rolling upgrade the old release and the new release run side by side, so everything they share — the database schema, the agent protocol — must work for both at once. It is like resurfacing a bridge one lane at a time: traffic keeps flowing in both lanes, so every intermediate state must carry both.

Two mechanisms make that safe:

  • The control plane holds no durable state of its own. All state lives in the databases, so any replica can serve any request and killing one loses nothing. On a polite shutdown signal a replica immediately reports not ready to the load balancer (so new traffic stops), finishes the requests it already accepted, then exits. A replacement starts serving only once it reports ready.
  • Schema changes are additive (expand, then contract). A destructive change is never done in one step. First you expand: add the new column or table, backfill it, write both shapes. Later, once nothing reads the old shape, a separate release contracts: drops the old one. A standing check rejects any migration that would break the previous release mid-upgrade — drops, renames, type changes, or adding a non-null column with no default. Because the schema only ever added, rollback is trivial: the previous release still works against it.

Agent version skew and staged rollout

The control plane is the authority on whether an agent is compatible. It accepts an agent within a one-version window in both directions — an older agent talking to a newer control plane works, and vice versa — so a fleet can run a mix during a rollout. An agent too far behind is rejected with a clear "upgrade required" message, distinct from a transient error, so it surfaces to an operator instead of looping forever.

You promote a new agent version ring by ring. Each agent lands in a stable cohort by a hash of its identity (stable so it never flaps between rings): a small canary ring first, then early, then the main fleet. You advance one ring at a time, watching health between rings.

Hardened cryptographic (FIPS) mode

Every cryptographic operation in probectl flows through one internal choke point, and a build-time guard blocks any other code from calling a crypto primitive directly. That single choke point is what makes a validated-module build possible: a FIPS 140-3 validated cryptographic module is compiled in transparently, swapping the implementation underneath while every output stays byte-for-byte identical.

The honest, auditable claim: the FIPS build operates the FIPS 140-3 validated Go Cryptographic Module (you verify its certificate number with the validating body yourself). probectl as a product does not hold its own module certificate — the module does. Both the control plane and the agent run a power-on self-test before serving any traffic and fail closed (refuse to start) if it errors, and in the FIPS build the self-test also asserts the validated module is actually active.

Multi-region high availability

probectl runs active-active across regions for the parts that can: every region runs stateless control-plane and ingest replicas, all serving at once. Durable state is a single-writer database with streaming read replicas in the other regions. There is exactly one writable primary at any instant.

The safety core is split-brain fencing. "Split-brain" is the nightmare where two nodes both believe they are the primary and both accept writes, silently diverging. probectl refuses to write unless the target is provably the current primary: it probes the writer endpoint every few seconds and fails writes closed (returning a retry-after) whenever it cannot prove that — while reads keep serving and telemetry ingest never pauses. A monotonic promotion counter catches a stale ex-primary; a lower counter can never reclaim the writer role. The principle: degrade to read-only, never lose or corrupt data.

Be precise about one asymmetry: the durable state database carries over with a seconds-scale recovery point. The high-volume telemetry store does not replicate cross-region by default — its regional recovery point equals your off-region backup cadence unless you opt into telemetry-store replication yourself. Record both numbers in your disaster-recovery plan so the gap is visible to whoever is on call.

Governance: retention, masking, erasure, residency, keys

Governance is one coherent view per customer over five concerns:

  • Classification. Every sensitive data category gets a sensitivity class. The headline: an IP address defaults to personally identifiable information (PII), because under privacy law an IP address is personal data.
  • Redaction. When masking is active, every category at or above a configurable floor (PII by default) is masked. The default strategy keeps a coarse, non-identifying prefix — "blur the house number, keep the street" — so analytics still group by network while no value points at one host. Credentials always drop entirely. Masking is best-effort pseudonymisation, not anonymisation: for irreversible removal, use erasure.
  • Retention. Each store has its own deletion clock; you set how long flow, path, host-level, and trace/log telemetry live.
  • Erasure. Customer-erase and subject-erase are the wide brooms: they remove or project data across every live store and produce a recomputable attestation — a proof anyone can re-derive to confirm the deletion happened. Backups keep their own clock; a governed deletion records your backup-retention deadline rather than reaching into a backup.
  • Per-customer keys / BYOK. On the licensed tier, each customer's sensitive at-rest values can be sealed under that customer's own key, so offboarding becomes a key-destruction event. An unavailable or destroyed key is an error, never a silent fallback to a shared key.

Supportability

A support bundle is a single archive of diagnostic files — version, redacted config, deep-health report, self-metrics, an anonymized topology summary, and a runtime snapshot. The non-negotiable property: it never contains secrets, credentials, or PII, kept true by three independent layers: the config snapshot is built from an allowlist of known-non-secret keys (so a secret added in a later release cannot leak, because it simply is not on the list), the topology file is counts only, and a final scrub replaces this deployment's actual secret values anywhere in the bytes.

Use it

Roll the control plane. Apply migrations first (they are additive, safe for the still-running release), then replace replicas one at a time. Health and readiness are separate on purpose:

# liveness: is the process alive at all?
curl https://control.example/healthz
# readiness: should the load balancer send it traffic right now?
curl https://control.example/readyz

What you should observe: during a graceful shutdown, /healthz stays 200 (the process is still serving what it accepted) while /readyz flips to 503 (draining — stop sending new traffic). The replacement serves only once /readyz returns 200.

Build and verify the hardened crypto mode. A status field reports the live posture (build tag, module active, enforced, self-test passed):

{
  "fips": {
    "build_tag": true,
    "module_active": true,
    "enforced": false,
    "self_test_passed": true
  }
}

What you should observe: self_test_passed: true means the power-on self-test ran and the validated module is live. If the self-test had failed, the process would not have started at all.

Watch the cluster during a failover. The readiness endpoint carries the cluster view; a node stays ready for reads while writes pause:

{
  "status": "ready",
  "cluster": { "topology": { "region": "eu" }, "writer": { "role": "reader" }, "writes_usable": false }
}

What you should observe: writes_usable: false tells operators and automation that writes paused (fenced) during the flip; reads and ingest keep flowing. Writes resume automatically on the next probe once the writer endpoint resolves to the promoted primary — no restart needed.

Produce a support bundle. Either live from the API or offline straight from the binary with no running server:

probectl-control support-bundle -o /tmp/probectl-support.tar.gz

What you should observe: a gzipped archive of JSON files. Database URLs have their passwords stripped; the encryption key shows up only as a boolean envelope_key_configured, never the key itself.

Take a governed, masked export of a customer's data:

curl 'https://control.example/v1/lifecycle/export?redact=true' -o tenant-export.tar.gz

What you should observe: the export manifest carries "redacted": true; IP addresses appear as a coarse prefix (for example 203.0.113.0/24), emails as a***@domain, and credentials are gone entirely, while counts and protocol names survive.

Pitfalls & limits

  • The telemetry store does not replicate cross-region by default. The durable state database recovers in seconds; the high-volume telemetry store recovers to your last off-region backup unless you run telemetry-store replication yourself. This is a deliberate trade — replicating high-cardinality telemetry globally is expensive and often residency-restricted — but never paper over it: record the telemetry recovery point next to the state recovery point.
  • Recovery-point and recovery-time targets are provisional until you validate them. The shipped numbers are engineering estimates; they become committed numbers only once your own failover drills back them on real infrastructure.
  • The hardened crypto mode is a build, not a runtime toggle to a feature. The binary you run is the gate — there is no license switch that turns FIPS on. It is a hardening posture, not a feature surface.
  • Some at-rest encryption is the operator's job, by design. probectl seals sensitive values itself, but it does not re-encrypt the bulk telemetry stores' data files at scale — that is your storage layer's responsibility (encrypted volumes). A bundled preflight check reports which of your data paths are detectably encrypted so this duty is never silently skipped.
  • Lose the encryption key and sealed values become unreadable. Back up the key material like you back up the data. A keyless production startup is a fatal error, never a silent fallthrough to plaintext.
  • Masking is not anonymisation. The default strategy keeps a network prefix and the hash strategy is correlatable. You cannot promise a masked value identifies nobody. For that, erase.
  • Erasure clears live stores, not backups. A governed deletion attests the live-store removal and records your backup-erasure deadline; the backup destination owns its own clock.
  • The live cross-region disaster-recovery exercise is yours to run. The runbook and an automated drill exist, but the real exercise needs real infrastructure and is an operator action.

Reference

  • Upgrade: apply additive migrations first, then replace control-plane replicas one at a time; roll back by replacing them with the prior release.
  • Health probes: GET /healthz (liveness), GET /readyz (readiness, carrying the cluster view: region, writer role, writes_usable, replica lag). Deep per-component health is at GET /v1/diagnostics (admin only); the aggregate equals the worst component.
  • Agent rollout: one-version compatibility window in both directions; staged cohorts (canary → early → main), advanced one ring at a time.
  • Hardened crypto posture: reported under fips on GET /v1/editions (build_tag, module_active, enforced, self_test_passed); a power-on self-test gates startup and fails closed.
  • Multi-region: active-active stateless control plane; single-writer database with streaming replicas; write fencing during failover; telemetry-store cross-region replication is an operator opt-in, not a built-in.
  • Governance: masked export via GET /v1/lifecycle/export?redact=true; subject export/erase via POST /v1/lifecycle/subjects/export and POST /v1/lifecycle/subjects/erase; erasure produces a recomputable attestation; per-customer keys / BYOK on the licensed tier.
  • Support: GET /v1/diagnostics/bundle (live, admin only) or probectl-control support-bundle (offline); the bundle is structurally secret-free.
  • Related capabilities (separate pages): Tenancy & isolation; the Provider / MSP plane (break-glass, metering, white-label, residency, per-tenant keys).

Covers: F28, F32, F33, F34, F35

Rendered live from github.com/imfeelingtheagi/probectl — found a mistake? edit this page.