Journey J6 — Operate in production
A monitoring platform is itself a service somebody has to keep alive at 3 a.m. This page walks the operational path end to end: upgrade with no outage, roll a new agent version across the fleet a slice at a time, prove the hardened cryptographic mode is live, survive the loss of a whole region in a drill, hand a diagnostician a secret-free support bundle, and run a resilience self-test. Terms of art (TLS, mTLS, FIPS, HA) are in the glossary.
Who this is for
A site-reliability engineer or operator running probectl in production. You own upgrades, failover, and the on-call runbook. You are not changing what probectl observes — you are keeping the observer healthy while it observes.
Before you start
- You run the control plane as interchangeable replicas behind a load balancer, and you can apply database migrations. With a single replica there is no zero-downtime path — there is nothing to roll to.
- You can read the API over HTTPS and hold admin rights (the support bundle and
deep diagnostics are admin-only). The examples pass
--cacert ca.crtsocurltrusts a self-signed certificate; drop it against a publicly trusted one. - For the FIPS step you run the hardened-crypto build. FIPS (the Federal Information Processing Standards mode) is a build, not a runtime toggle — the binary you run is the gate.
- For the failover drill you run multi-region high availability (HA) on real infrastructure. The live disaster-recovery exercise is yours to run; probectl ships the runbook and an automated drill, not the infrastructure.
The path
Upgrade with zero downtime and a staged fleet rollout. Apply the additive migrations first — they are safe for the still-running release — then replace control-plane replicas one at a time. Liveness and readiness are separate on purpose: a draining replica stays alive while it stops taking new traffic.
curl https://control.example/healthz # liveness: is the process alive? curl https://control.example/readyz # readiness: send it traffic now?You observe
/healthzstay200during a graceful shutdown while/readyzflips to503(draining); the replacement serves only once/readyzreturns200. For agents, promote a new version ring by ring — a small canary cohort first, then early, then the main fleet — advancing one ring at a time while you watch health, so a regression hurts a few percent, not everyone. Powered by running probectl in production.Run a multi-region high-availability failover drill. probectl runs active-active across regions for the stateless parts; durable state is a single-writer database with streaming read replicas. The safety core is split-brain fencing — it refuses to write unless the target is provably the current primary. Trigger a failover and watch the readiness endpoint, which carries the cluster view.
{ "status": "ready", "cluster": { "topology": { "region": "eu" }, "writer": { "role": "reader" }, "writes_usable": false } }You observe
writes_usable: falseduring the flip: writes pause (fenced) so state can never corrupt, while reads and telemetry ingest keep flowing. Writes resume automatically on the next probe once the writer endpoint resolves to the promoted primary. Be precise in your runbook: the durable state database recovers in seconds, but the high-volume telemetry store does not replicate cross-region by default — its recovery point equals your off-region backup cadence unless you opt into telemetry-store replication. This is backup-cadence recovery, not live replication; record both numbers. Powered by running probectl in production.Confirm FIPS-mode crypto is active. Every cryptographic operation flows through one internal choke point, which is what lets a FIPS 140-3 validated module compile in transparently. Both the control plane and the agent run a power-on self-test before serving traffic and fail closed if it errors. Read the live posture.
curl --cacert ca.crt "https://control.example/v1/editions"You observe a
fipsblock withbuild_tag,module_active,enforced, andself_test_passed.self_test_passed: truemeans the self-test ran and the validated module is live — if it had failed, the process would not have started at all. The honest claim: the build operates the validated Go Cryptographic Module; verify its certificate number with the validating body yourself. Powered by running probectl in production.Generate a tenant-scoped, secret-stripped support bundle. A support bundle is one archive of diagnostic files — version, redacted config, deep-health report, self-metrics, an anonymized topology summary, and a runtime snapshot. Its non-negotiable property is that it never contains secrets, credentials, or personally identifiable information, kept true by an allowlist of known-non-secret config keys, counts-only topology, and a final scrub of this deployment's actual secret values.
curl --cacert ca.crt "https://control.example/v1/diagnostics/bundle" -o probectl-support.tar.gzYou observe a gzipped archive of JSON files: database URLs have their passwords stripped, and the encryption key appears only as a boolean
envelope_key_configured, never the key itself. Powered by running probectl in production.Chaos-test resilience (a self-test). The chaos injector deliberately injects a known network fault — added delay, packet loss, a full outage — to prove your monitoring and SLO alerts actually fire when the network breaks. Be honest about what it is: a self-test, not an API. It perturbs only traffic addressed to its own listener, cannot be triggered remotely, and never mutates a live cluster. A run has a fixed shape — healthy baseline, inject, observe, heal, observe.
healthy baseline → SLO quiet, probes pass, latency normal inject a partition → probes fail for real, the multi-window burn alert fires heal → attainment recovers, the alert clearsYou observe the burn alert fire under the injected fault and clear after the heal. If a known fault does not make the alert fire, that is a failure of the platform's core promise — which is exactly what the self-test catches. Powered by cost, reliability, chaos and carbon.
You're done when
- You have rolled the control plane with
/readyzflipping to503on the draining replica and back to200on its replacement — no outage. - A failover drill shows
writes_usable: falseduring the flip and automatic write resumption after, with both recovery points recorded. GET /v1/editionsreportsself_test_passed: trueunderfips.- A support bundle is produced and confirmed secret-free.
- A chaos self-test fires the burn alert under fault and clears on heal.
Next
Bring the first tenant and its agents online — the path that produces the data everything here keeps alive — in stand up and isolate a tenant.
Journey: J6 · Visits: F28, F32, F33, F35, F47