probectl /docs GitHub ↗

Configuration

This is the full reference for every knob probectl reads at startup. The short version: the control plane and every server-side feature read environment variables (all prefixed PROBECTL_); the agents read a YAML file (with the same env vars as overrides). This page lists each variable, its default, and what it does — and it is the contract, so every row here is checked against the code.

How to read this page:

  • A variable's default is what you get if you set nothing. The defaults are chosen so a fresh install boots in a safe, sovereign posture (no outbound calls, fail-closed on missing TLS/secrets) — probectl's standing security guardrails (see security/threat-model.md).
  • A default of (none) means the value is empty/unset; the feature usually stays off until you give it one.
  • Where it's read: the control plane resolves its config in internal/config/config.go (one Load function that reports every bad value at once and exits non-zero — you never chase config errors one at a time). Each agent has its own loader (internal/agent, internal/ebpf, internal/flow, internal/device, internal/endpoint).

Conventions

  • Control plane (probectl-control): environment variables, PROBECTL_ prefix. Listed in the next section.
  • Agents: a YAML config file is the source of truth; the matching PROBECTL_* env vars override individual fields (handy for containers). Each agent's keys are in its own section below.
  • Secrets are never hardcoded, logged, or placed in URLs/query strings. Sensitive values at rest are sealed with envelope encryption, and any credential in this document may be a secret reference (e.g. vault:…) instead of the raw value — see Secrets integration.

Control plane (probectl-control)

The control plane is the brain: it serves the API/UI, accepts agent connections, runs the alerting/incident/correlation engines, and talks to the datastores. It is stateless — all durable state lives in Postgres/ClickHouse/the TSDB — so every behavioral choice it makes comes from these environment variables, read once at boot. The table below is the base set every deployment uses; the feature-specific sections that follow add more.

Subcommands: probectl-control [serve] (default), probectl-control migrate (apply database migrations and exit), probectl-control version, and probectl-control gen-cert [dir] — a convenience that writes a self-signed tls.crt/tls.key/ca.crt for an HTTPS quickstart (PROBECTL_CERT_HOSTS, default localhost,127.0.0.1, sets the certificate's host names; production brings its own CA-issued cert). The other subcommands are covered with their features: agent-ca init|export, enroll-token, and revoke-agent (agent transport and enrollment — agent/enrollment.md), scim-token (SCIM, below), mcp-stdio and mcp-token (MCP server, below), preflight (the storage-encryption preflight — hardening.md), support-bundle (supportability, below), and backup-seal / backup-open (sealed backups — Tenant lifecycle, below).

A note on the defaults: the listen address is :8080, the database DSN points at a local Postgres with sslmode=require (TLS to the database is the default, not an afterthought), and HSTS is on. These defaults assume you front the process with a TLS-terminating ingress (the shipped Helm/compose posture); set the TLS cert/key pair below to have the process serve HTTPS itself instead.

Variable Default Description
PROBECTL_HTTP_ADDR :8080 API listen address
PROBECTL_HTTP_READ_TIMEOUT 15s HTTP read timeout
PROBECTL_HTTP_WRITE_TIMEOUT 15s HTTP write timeout
PROBECTL_HTTP_IDLE_TIMEOUT 60s HTTP idle (keep-alive) timeout
PROBECTL_SHUTDOWN_TIMEOUT 15s graceful-shutdown drain timeout
PROBECTL_DATABASE_URL postgres://probectl:probectl@localhost:5432/probectl?sslmode=require PostgreSQL DSN; sslmode=require is the default (TLS to the DB out of the box). Dev-only: a local source-dev stack without TLS may explicitly append sslmode=disable to its own DSN
PROBECTL_DATABASE_MAX_CONNS 10 max pool connections (1–1000)
PROBECTL_DATABASE_MIN_CONNS 0 min pool connections
PROBECTL_DATABASE_CONNECT_TIMEOUT 5s per-connection connect timeout
PROBECTL_MIGRATE_ON_BOOT false apply migrations during serve startup
PROBECTL_LOG_LEVEL info debug | info | warn | error
PROBECTL_LOG_FORMAT json json | text
PROBECTL_HSTS_ENABLED true send Strict-Transport-Security
PROBECTL_HSTS_MAX_AGE 8760h HSTS max-age
PROBECTL_TLS_CERT_FILE (none) PEM server certificate; the process serves HTTPS directly when set together with the key
PROBECTL_TLS_KEY_FILE (none) PEM server private key (set together with the cert)
PROBECTL_PUBLIC_TLS false tells the app that TLS terminates at the edge (an ingress in front) even though the app itself serves plaintext. Browsers only see the edge, so this is what flips cookies to Secure when you run behind a TLS ingress
PROBECTL_ALLOW_PLAINTEXT_HTTP false explicit, loud opt-in for a non-loopback plaintext control listener — only valid behind a TLS-terminating ingress (the Helm chart sets it). Without it, plaintext + a non-loopback bind = refuse to start (fail closed)
PROBECTL_SECURITY_CONTACT (none) your vulnerability-disclosure mailbox; published in the served /.well-known/security.txt (left as a template comment when unset)
PROBECTL_ENVELOPE_KEY (none) base64-encoded 32-byte key-encryption key (KEK) for at-rest envelope encryption. The single root secret behind sealed credentials and backups — back it up
PROBECTL_ENVELOPE_KEY_FILE (none) path to the KEK file — loaded, or GENERATED+persisted (0600) on first boot if absent; an explicit PROBECTL_ENVELOPE_KEY wins over it. Shipped compose mounts it on the controldata volume
PROBECTL_ENVELOPE_KEY_ID dev identifier recorded alongside each sealed value (so a future key rotation can tell which key sealed what)
PROBECTL_REQUIRE_AT_REST_ENCRYPTION false when true, the control plane refuses to start if no envelope key resolves — a hard guarantee against accidentally running with plaintext-at-rest
PROBECTL_STORAGE_ENCRYPTION_ATTESTED false operator attestation that the bulk-store volumes are encrypted below the host (e.g. encrypted cloud volumes the startup preflight can't see); logged, and downgrades the preflight warning
PROBECTL_AGENT_GRPC_ADDR (none) agent gRPC listen address; enables the transport when set together with the agent mTLS files below
PROBECTL_AGENT_TLS_CERT_FILE (none) agent-transport server certificate (PEM)
PROBECTL_AGENT_TLS_KEY_FILE (none) agent-transport server private key (PEM)
PROBECTL_AGENT_TLS_CA_FILE (none) CA bundle that signs agent client certificates (PEM)
PROBECTL_BUS_MODE memory result bus: memory (lightweight, in-process) | kafka
PROBECTL_BUS_BROKERS (none) comma-separated host:port Kafka brokers (required for kafka)
PROBECTL_BUS_MEMORY_BUFFER 1024 in-memory bus: per-subscriber channel depth (lightweight mode)
PROBECTL_BUS_MEMORY_OVERFLOW block in-memory bus overflow policy: block (back-pressure publisher) | drop (drop + count, no deadlock)
PROBECTL_BUS_TLS_ENABLED false TLS to the Kafka brokers. Required in kafka mode unless the explicit dev flag below is set
PROBECTL_BUS_TLS_CA_FILE (none) private CA bundle for the brokers
PROBECTL_BUS_TLS_CERT_FILE (none) client certificate (broker mTLS; with _KEY_FILE)
PROBECTL_BUS_TLS_KEY_FILE (none) client key (broker mTLS)
PROBECTL_BUS_SASL_MECHANISM (none) plain | scram-sha-256 | scram-sha-512
PROBECTL_BUS_SASL_USER (none) SASL username
PROBECTL_BUS_SASL_PASSWORD (none) SASL password (secret references supported; never logged)
PROBECTL_BUS_ALLOW_PLAINTEXT false dev only: allow a plaintext broker (the dev compose stack). Production never sets this
PROBECTL_BUS_MAX_BUFFERED 0 (= built-in bound 65536) bound on the async Kafka producer's in-flight records; a full buffer SHEDS new records (counted, never blocking ingest). 0/unset keeps the built-in 65536-record bound — there is deliberately no unbounded mode
PROBECTL_BUS_WORKERS 4 per-subscription consume parallelism — each Kafka poll batch is fanned out across this many key-sharded workers (per-key ordering preserved). 0/1 = serial
PROBECTL_INGEST_MAX_SERIES_PER_AGENT 0 (= built-in cap 1000) cap on active metric-series identities one agent may mint; a NEW identity past the cap is rejected per-series and counted (known series keep flowing), and an identity idle for 1h frees its slot. 0/unset keeps the built-in 1000 cap — the wall always exists (there is no unlimited setting)
PROBECTL_INGEST_MAX_SERIES_PER_TENANT 0 (= built-in cap 50000) tenant-wide active-series wall, so one tenant's cardinality explosion never bleeds into others. 0/unset keeps the built-in 50000 cap
PROBECTL_TSDB_MEMORY_RETENTION 0 (= built-in window 1h) lightweight-mode (in-memory) TSDB retention window, aged by ARRIVAL time (backfilled or clock-skewed sample timestamps are never swept early). 0/unset keeps the built-in 1h window — the buffer never grows forever
PROBECTL_TSDB_MEMORY_MAX_BYTES 0 (= built-in wall 256 MiB) byte ceiling for the in-memory TSDB; oldest-first eviction once exceeded, with usage + eviction counters exposed. 0/unset keeps the built-in 256 MiB wall
PROBECTL_AUDIT_WORM_DIR (none) enable write-once audit export — the provider audit chain is exported as Ed25519-signed segments into this directory (mount an S3/MinIO object-lock bucket for true write-once-read-many) and chain-verified each cycle
PROBECTL_AUDIT_WORM_INTERVAL 1h export + chain-verify cadence
PROBECTL_WORM_SIGNING_KEY_FILE (none) path to the Ed25519 audit-export signing key (PKCS#8 PEM) — loaded, or GENERATED+persisted (0600) on first boot, so the key is stable across restarts (an ephemeral per-boot key would break cross-restart chain verification). Required when PROBECTL_AUDIT_WORM_DIR is set unless PROBECTL_WORM_SIGNING_KEY is. Back it up like the envelope key
PROBECTL_WORM_SIGNING_KEY (none) base64-encoded Ed25519 private-key PEM (KMS/secret-manager injection) — wins over PROBECTL_WORM_SIGNING_KEY_FILE. Enabling audit export with neither set fails closed (no silent ephemeral key)
PROBECTL_TSDB_MODE memory time-series writer: memory (in-process) | prometheus
PROBECTL_TSDB_URL (none) Prometheus/VictoriaMetrics base URL for remote-write (required for prometheus)
PROBECTL_ALERT_EVAL_INTERVAL 30s how often the alerting engine evaluates rules over the TSDB
PROBECTL_INCIDENT_WINDOW 10m time window within which related signals correlate into one incident
PROBECTL_AUTH_MODE session identity mode: session (real OIDC SSO + session cookies) | dev (LOCAL EVALUATION ONLY — exists only in -tags devauth builds; release binaries refuse it at boot)
PROBECTL_DEV_AUTH_ACK (none) must be i-understand to start in dev auth mode (tagged builds only, loopback bind required)
PROBECTL_SESSION_TTL 12h server-side session lifetime
PROBECTL_AUTH_RATE_MAX_FAILURES 5 auth brute-force guard: failures per window before lockout
PROBECTL_AUTH_RATE_WINDOW 1m failure-counting window for the auth throttle
PROBECTL_AUTH_RATE_LOCKOUT 1m base lockout; doubles per consecutive lockout, capped at 1h; lockouts are audited
PROBECTL_OIDC_ISSUER (none) OIDC issuer URL; SSO discovery is performed against it
PROBECTL_OIDC_CLIENT_ID (none) OIDC client ID registered with the IdP
PROBECTL_OIDC_CLIENT_SECRET (none) OIDC client secret (kept out of logs/URLs)
PROBECTL_OIDC_REDIRECT_URL (none) the control plane's /auth/callback URL registered with the IdP
PROBECTL_REQUIRE_MFA false require multi-factor auth. The session's MFA state comes from the ID token's amr/acr claims (a second factor like otp/hwk/mfa, or acr aal2+/loa2+). When true, every authenticated /v1 request from a single-factor session gets 403 (enforced at request time). Off by default

Invalid values fail fast: probectl-control reports all configuration problems at once and exits non-zero. The database password is redacted from logs.

Tenant-owned tables are protected by Postgres Row-Level Security. The PROBECTL_DATABASE_URL role must be able to assume the least-privilege probectl_app role (a superuser always can; otherwise run GRANT probectl_app TO <login_role>), which internal/tenancy assumes per transaction so isolation holds regardless of how the control plane authenticated. See architecture.md.

HTTP endpoints

Method & path Purpose
GET /healthz Liveness — 200 while the process is serving
GET /readyz Readiness — 200 when the database is reachable, else 503
GET /version Build and runtime metadata
GET /openapi.json The OpenAPI 3.1 document

Every response carries an X-Request-Id (honoring an inbound one) and the security headers Strict-Transport-Security (when enabled) and X-Content-Type-Options: nosniff. The versioned resource routes under /v1 are documented in Resource API & CLI below.

Error envelope

All errors share one JSON shape and a stable domain-error → HTTP mapping:

{ "error": { "code": "not_found", "message": "…", "request_id": "…" } }
Domain kind Code HTTP
BadRequest bad_request 400
Unauthorized unauthorized 401
Forbidden forbidden 403
NotFound not_found 404
Conflict conflict 409
Validation validation 422
Internal internal 500
Unavailable unavailable 503

Transport security

probectl never wants a plaintext channel exposed to the network. There are two correct ways to get TLS in front of the API, and the config lets you pick:

The API listens over TLS in two interchangeable ways:

  • App-terminated TLS — set PROBECTL_TLS_CERT_FILE + PROBECTL_TLS_KEY_FILE, and the control plane serves HTTPS only (TLS 1.2+, prefer 1.3; plaintext is refused).
  • Ingress-terminated TLS — leave them unset and serve HTTP behind a TLS-terminating ingress (the shipped Helm/compose default). HSTS is set either way, so the posture is correct end to end.

All TLS and crypto policy lives in internal/crypto; a CI guard (scripts/check_crypto_imports.sh) forbids crypto-primitive imports elsewhere so a FIPS 140-3 validated module can be swapped in. At-rest secrets use the envelope helper (a per-record data key wrapped by a KMS/HSM-pluggable KEK; the dev StaticKeyProvider reads PROBECTL_ENVELOPE_KEY).

Agent transport

This is how agents talk to the control plane, and it is locked down by design. The agent gRPC transport (probectl.agent.v1.AgentService) runs only when PROBECTL_AGENT_GRPC_ADDR and all three PROBECTL_AGENT_TLS_* files are set (address + server cert + server key + the CA that signs client certs). It is mutual-TLS only (RequireAndVerifyClientCert): the agent must present a client certificate, and its tenant and id are read out of that certificate's identity (spiffe://probectl/tenant/<t>/agent/<a>), never from the request body. So even a misbehaving or malicious agent can only ever write to its own tenant — the identity is cryptographic, not self-asserted. Populate PROBECTL_AGENT_TLS_CA_FILE (the client-cert CA pool) with probectl-control agent-ca export <file>, which writes the public agent-CA bundle (root + intermediate, no key). Generate dev mTLS material with the internal/crypto CA helpers. The .proto lives under proto/probectl/agent/v1/; regenerate Go with make proto (tools via make proto-tools).

Version-skew policy. At registration the control plane rejects agents outside the supported version window, so a rolling upgrade never admits an incompatible agent. See lifecycle.md.

Variable Default Description
PROBECTL_AGENT_SKEW_WINDOW 1 allowed minor-version skew on either side (N/N-1); the control plane at minor N accepts agents at N-1…N+1. 0 requires an exact minor match
PROBECTL_AGENT_MIN_VERSION (none) an explicit floor — agents older than this are rejected regardless of the window (force-retire a known-bad version)

A rejected agent gets a gRPC FailedPrecondition ("upgrade required"); a dev/unpinned build (0.0.0-dev) on either side skips the check.

probectl-agent

The canary agent is the worker that actually runs the probes (ping, TCP, DNS, HTTP, …). Unlike the control plane, its primary config is a YAML file (-config, or the path in PROBECTL_AGENT_CONFIG) — see deploy/agent/probectl-agent.example.yml. Crucially, the agent does not configure its own tenant or id: those come from its mTLS client certificate (above), so you can't accidentally point an agent at the wrong tenant by editing a file.

A handful of env vars override individual YAML fields — useful in containers where mounting a full file is awkward:

Variable Overrides (YAML) Meaning
PROBECTL_AGENT_CONFIG path to the YAML config (the -config flag wins over it)
PROBECTL_AGENT_GRPC_ADDR control_plane.grpc_addr the control plane's agent-gRPC endpoint to dial
PROBECTL_AGENT_TLS_CERT_FILE tls.cert_file the agent's mTLS client certificate (PEM)
PROBECTL_AGENT_TLS_KEY_FILE tls.key_file the agent's mTLS client key (PEM)
PROBECTL_AGENT_TLS_CA_FILE tls.ca_file the CA that signed the control plane's server cert (PEM)
PROBECTL_AGENT_BUFFER_DIR buffer.dir on-disk store-and-forward directory (see below)
PROBECTL_AGENT_IDENTITY_SERVER identity.server control-plane HTTPS base URL enabling automatic certificate rotation — the agent rotates its mTLS identity at ~2/3 of its lifetime via /enroll/agent/rotate. See agent/enrollment.md
PROBECTL_AGENT_JOIN_TOKEN a one-time join token for first-boot enrollment: with no identity present yet, the agent redeems it, writes its identity, then runs. Idempotent (a present identity is never overwritten) and fail-closed. See agent/enrollment.md
PROBECTL_AGENT_ENROLL_TOKEN_FILE enroll.token_file a file holding the join token (a mounted secret, read once); PROBECTL_AGENT_JOIN_TOKEN takes precedence
PROBECTL_AGENT_ENROLL_SERVER enroll.server enrollment target for first-boot enrollment; defaults to identity.server
PROBECTL_AGENT_ENROLL_CA_PIN enroll.ca_pin optional hex sha256 pin of the server cert for first contact; otherwise tls.ca_file verifies the server
PROBECTL_AGENT_CANARY_CA_DIR tls.canary_ca_dir the one directory that probe ca_file: parameters may reference (a trust-anchor allowlist for HTTP/DNS-over-TLS probes); empty = the ca_file parameter is refused
PROBECTL_AGENT_LOG_LEVEL debug | info (default) | warn | error
PROBECTL_AGENT_LOG_FORMAT json (default) | text

Results buffer to disk (buffer.dir, bounded by max_records, default 10000) while the control plane is unreachable and drain on reconnect (at-least-once delivery). Probing keeps running regardless of connectivity, so a control-plane outage never blocks measurement — the agent just queues and catches up.

Result pipeline

This is the path every measurement takes from an agent to a queryable metric, and two env vars decide how heavy that pipeline is: PROBECTL_BUS_MODE (the message bus) and PROBECTL_TSDB_MODE (the time-series writer). The memory defaults make a single binary work with zero external dependencies; switch them to kafka / prometheus when you outgrow that.

A streamed result flows agent → gRPC StreamResults → control-plane ingest → result bus (probectl.network.results, Protobuf) → consumer → time-series writer. The agent sends the canonical OTel-aligned result (proto/probectl/result/v1); the control plane re-stamps the tenant and agent id from the verified mTLS certificate before publishing, so a result is always attributed to the sending agent's tenant regardless of payload contents — the tenant boundary is cryptographic, never self-asserted. The bus key is the tenant_id.

PROBECTL_BUS_MODE selects the bus: memory (default; in-process, for the lightweight <5-agent deployment and single-binary runs) or kafka (set PROBECTL_BUS_BROKERS). PROBECTL_TSDB_MODE selects the writer: memory (default; in-process) or prometheus remote-write to PROBECTL_TSDB_URL (Prometheus with --web.enable-remote-write-receiver, or VictoriaMetrics; use an https:// URL for TLS in transit). Each probe emits probectl_probe_success, probectl_probe_duration_seconds, and one probectl_probe_<metric> per custom metric, labeled tenant_id, agent_id, canary_type, and server_address. The canonical signal→OTel mapping is in otel-mapping.md.

ICMP test

The icmp canary measures echo loss, latency, and jitter to a target (IPv4 or IPv6). Configure it per-canary under canaries: (see probectl-agent.example.yml). The schedule interval and reply timeout are canary fields; the rest are params:

Param Default Meaning
count 5 echo requests per probe (continuous mode defaults to the interval in s)
payload_bytes 56 ICMP data bytes (minimum 8)
dscp 0 DSCP marking 0–63 on outgoing packets (best-effort by platform)
mode batch batch (back-to-back) or continuous (1 packet/sec)
privileged false true prefers raw sockets; default is unprivileged datagram ICMP

It emits probectl_probe_loss_ratio, probectl_probe_rtt_{min,avg,max,stddev}_ms, probectl_probe_jitter_ms, and probectl_probe_packets_{sent,received}. A probe with 100% loss reports success=false (target unreachable); partial loss is a success with a non-zero loss ratio. Continuous mode records a per-second drop-timing record as result attributes (icmp.dropped_seqs, icmp.drop_send_offsets_ms) — carried as OTel attributes, not TSDB labels, so they don't widen cardinality.

Privileges. By default the agent uses unprivileged datagram ICMP (IPPROTO_ICMP), which on Linux requires the agent's group to be within net.ipv4.ping_group_range (e.g. sysctl -w net.ipv4.ping_group_range="0 2147483647"). Alternatively grant raw-socket capability (setcap cap_net_raw+ep /usr/bin/probectl-agent, or run with CAP_NET_RAW) and set privileged: "true". The canary tries the preferred socket and falls back to the other; if neither can be opened it returns an internal error (the probe is not silently reported as loss).

TCP & UDP tests

The tcp and udp canaries are agent-to-server probes. Configure a target of host:port (or a host with params.port). Both accept count and dscp.

The tcp canary measures connect latency + reachability (a connect-based, unprivileged equivalent of a TCP-SYN test): it establishes a connection and times the handshake, emitting probectl_probe_connect_{min,avg,max,stddev}_ms, probectl_probe_jitter_ms, and probectl_probe_loss_ratio (failed connects = loss; all-failed = success=false).

The udp canary is an echo round-trip probe: it sends token-tagged datagrams and matches the echoes, emitting probectl_probe_rtt_* + loss. It needs a target that echoes (a UDP echo service, or a probectl agent-to-agent responder); a non-echoing target reports as 100% loss. params.payload_bytes (≥10) sets the datagram size.

Voice/RTP tests

The voice canary streams real RTP packets at codec cadence to an echoing target and scores the path: MOS + R-factor (simplified ITU-T G.107 E-model), RFC 3550 jitter, loss, and a one-way delay estimate. target is host:port. Parameters: codec (g711 default, g729), duration_seconds (1–10, default 3), dscp (default 46/EF). The model variant and the one-way-estimate method ride the result attributes — a computed MOS is never presented as a measured listening score. See docs/voice.md.

DNS tests

The dns canary queries DNS and reports resolution time, the answer, and an optional DNSSEC verdict. The target is the query name. Parameters:

Param Values Default Meaning
type A, AAAA, MX, TXT, NS, … A record type to query
transport udp | tcp | dot | doh udp how the query is sent
server host[:port] or a DoH URL per-transport resolver to query
mode resolver | trace resolver single query vs. delegation walk
dnssec true | false false validate the zone signature

server defaults by transport: the first nameserver in /etc/resolv.conf (or 1.1.1.1:53) for udp/tcp, 1.1.1.1:853 for DoT, and https://cloudflare-dns.com/dns-query for DoH. DoT verifies the resolver's TLS certificate (TLS 1.2+); DoH posts an RFC 8484 application/dns-message query over HTTPS.

In resolver mode the canary emits probectl_probe_dns_query_ms (round-trip) and probectl_probe_dns_answers (answer count), with dns.rcode and a compact dns.answer summary as attributes. The probe is success=false on a non-NOERROR rcode or an empty answer.

With dnssec: "true" the canary requests DNSSEC records (the DO bit) and validates the zone's RRSIG over the answer against the zone DNSKEY — it does not trust the resolver's AD bit. The verdict lands in the dns.dnssec attribute (secure, insecure for an unsigned zone, or bogus) and probectl_probe_dns_dnssec_secure (1/0); a bogus result (tampered, expired, or wrong-key signature) fails the probe. Validation verifies the signature on the answer RRset; full chain-to-root anchoring is a later refinement.

In trace mode the canary performs an iterative delegation walk from the root hints, following NS/glue referrals down to the authoritative server (UDP, capped iterations, with a recursive fallback when a referral ships no glue). It emits probectl_probe_dns_query_ms (total walk time) and probectl_probe_dns_trace_hops, with the delegation chain in the dns.trace attribute. DNS-exfiltration detection and open-data baselines are out of scope for this probe (they live in the NDR and open-data features).

HTTP server tests

The http canary measures HTTP(S) availability with a full response-time breakdown and captures TLS handshake details for the TLS-posture plane (see TLS / certificate observability below). The target is the URL. Parameters:

Param Values Default Meaning
method GET, HEAD, POST, … GET request method
expect_status codes / classes / ranges 2xx,3xx which statuses count as available
follow_redirects true | false true follow 3xx redirects
insecure_skip_verify true | false false capture TLS but don't fail on an invalid cert. Deny-by-default: requires the admin-only test.insecure_tls permission and is flagged in the test.create/test.update audit entry
ca_file path to a PEM bundle extra trust anchor (private/internal CA); must live under PROBECTL_AGENT_CANARY_CA_DIR
body string request body (e.g. for POST)
max_body_bytes integer 10485760 cap bytes read per probe (10 MiB)
allow_private_targets true | false false SSRF-guard override. Every canary (http/tcp/udp/icmp/dns/voice) denies loopback, RFC1918/ULA, link-local (incl. 169.254.169.254 cloud metadata), CGNAT, multicast and numeric-encoding bypasses by default, enforcing the check on the resolved address at dial time (rebind-proof). Setting true lifts the guard for that one test — requires the admin-only test.allow_private permission and is written to the tenant audit trail

expect_status is a comma list of exact codes (200), classes (2xx), and inclusive ranges (200-204); a response outside the set is success=false (the status is still reported). The probe emits the timing breakdown as metrics — probectl_probe_http_dns_ms (resolution), probectl_probe_http_connect_ms (TCP connect), probectl_probe_http_tls_ms (TLS handshake), probectl_probe_http_ttfb_ms (time to first byte), and probectl_probe_http_total_ms — plus probectl_probe_http_status, probectl_probe_http_content_bytes, and probectl_probe_http_throughput_kbps. A phase that does not occur (no DNS for an IP target, no TLS for http://) is omitted rather than reported as zero. The resolved server IP is captured as the network.peer.address attribute, which correlates the result to path/traceroute data for the same destination.

TLS capture. On HTTPS the canary records the negotiated tls.protocol.version and tls.cipher, the leaf certificate's tls.server.{subject,issuer,not_before,not_after,san}, the chain shape (tls.server.chain), and a probectl_probe_http_tls_cert_expiry_days metric (negative once expired). It verifies the chain itself (hostname + trust, honoring ca_file) after capturing the certificate, so the handshake details are recorded even when the certificate is invalid or expired — an invalid cert fails the probe but its details are still attached. Set insecure_skip_verify: "true" to capture posture without failing the availability check. probectl performs no TLS posture analysis here (issuer trust, weak-cipher/expiry policy, CT) — that is the TLS / certificate observability feature below, which consumes these captured fields.

Agent-to-agent tests

An agent-to-agent (A2A) test measures between two registered agents, brokered by the control plane. The control plane assigns roles (one agent responds, opening a short-lived listener; the other initiates), rendezvouses the responder's endpoint to the initiator, and hands each agent its task when it polls (PollCoordination / ReportEndpoint). The measurement is TWAMP-lite: the initiator timestamps each probe (T1), the responder stamps receive/send (T2/T3) and echoes, and the initiator stamps receive (T4), yielding round-trip (probectl_probe_rtt_*) plus forward and reverse one-way delay (probectl_probe_forward_avg_ms, probectl_probe_reverse_avg_ms). The responder also reports forward-direction delivery (probectl_probe_packets_received, probectl_probe_loss_ratio), so both agents and both directions are observed.

Enable participation in the agent's a2a block: enabled: true, advertise_host (the address peers use to reach this agent's responder), poll_interval (default 2s), and responder_ttl (default 15s). Caveats (document for production):

  • NAT/firewall. The responder advertises advertise_host; behind NAT this must be a reachable address and the responder's ephemeral port must be reachable from the initiator. Auto-detection picks a non-loopback IPv4 — set advertise_host explicitly when that is wrong.
  • Clocks. Forward/reverse one-way delays assume the two agents' clocks are synchronized (exact within one host; use NTP across hosts). Round-trip is clock-independent.

Sessions are brokered in-memory; triggering them from the test API is a later addition.

Path discovery

The path engine (internal/path) is the traceroute brain — it runs Paris-style traceroutes (ICMP and TCP), which handle equal-cost multipath (ECMP) and MPLS, and merges per-flow traces into one multi-path picture; see architecture.md. A full per-hop trace needs raw sockets: grant CAP_NET_RAW (setcap cap_net_raw+ep, or run privileged) to capture the intermediate hops + MPLS labels. Without it, only the destination is discovered.

Where the discovered hops/links are stored is a control-plane choice:

Variable Default Description
PROBECTL_PATHSTORE_MODE memory memory (in-process, for the lightweight/single-binary case and tests) | clickhouse (durable hop/link rows)
PROBECTL_PATHSTORE_URL (none) ClickHouse HTTP(S) endpoint (e.g. http://localhost:8123), partitioned by tenant; required when mode is clickhouse
PROBECTL_PATH_RETENTION_DAYS 90 delete-after-N-days TTL on the path/traceroute ClickHouse tables (applied at boot); 0 disables the TTL

BGP routing intelligence

The BGP plane is a Python analyzer (analyzer/) plus a Go bridge (internal/bgp); see architecture.md. The analyzer ingests public collector data and emits probectl.bgp.events:

python -m probectl_analyzer --config config.json --mrt rib.mrt        # RouteViews/RIS dump
python -m probectl_analyzer --config config.json --replay cap.jsonl   # recorded RIS Live
python -m probectl_analyzer --config config.json --ris-live           # live RIS Live websocket

The JSON config is per tenant (tenant_id is required — every event carries it, and the bridge rejects any event without one):

Key Meaning
tenant_id the owning tenant (outermost scope)
monitored_prefixes[].prefix a prefix to watch (a more-specific announcement is matched too)
monitored_prefixes[].expected_origins allowed origin ASNs — an origin outside this set raises possible_hijack
monitored_prefixes[].no_transit ASNs that must not transit this prefix — mid-path appearance raises possible_leak
collector collector label recorded on events (e.g. rrc00)
rpki_vrp_file / rpki_vrp_url a rpki-client/Routinator VRP JSON export for RFC 6811 validation (absent → unknown)

The analyzer emits probectl.bgp.events as JSON Lines; the Go bridge tails that stream, validates the tenant, and republishes each as the canonical probectl.bgp.v1.BGPEvent protobuf onto the bus (topic probectl.bgp.events, keyed by tenant). Event types: origin_change (old/new origin + AS path), possible_hijack, possible_leak, rpki_invalid; each carries an RPKI status (valid / invalid / not_found / unknown), a severity, and a confidence — they are signals, not actions — probectl never acts on routing. MRT dumps are stream-processed (no full RIB in memory); a down RPKI/collector source degrades gracefully rather than breaking the plane. RouteViews/RIS are open data — their AUP/provenance matters for MSP/commercial resale, not for private development or single-tenant OSS use.

Open-data enrichment

internal/opendata annotates IPs with ASN / geo / IXP / allocation context from public datasets; see architecture.md and the source provenance/AUP matrix in opendata-aup.md. The framework is a library (the flow and test pipelines consume it where enrichment is enabled); each source is pluggable and individually enable-able:

Source Kind Input it needs Notes
Team Cymru asn a DNS resolver IP→ASN/prefix/registry/AS-name via the Cymru IP-to-ASN DNS service
MaxMind GeoLite2 geo a .mmdb path (OpenMMDB) country/city/lat-lon; operator-supplied DB (not shipped)
PeeringDB ixp the ASN (from Cymru) IXP/facility presence via the PeeringDB REST API; cached per ASN
RIR delegated-stats allocation a delegated-extended stats file RIR/country/status/date; parsed once into a sorted index
RIPE Atlas (optional) measurement an API key + credits active ping/traceroute scheduling hook; off (fail-closed) by default

The Enricher runs every enabled source over an IP and merges the results, caching per IP and degrading gracefully: a disabled / failing / slow / panicking source is logged, marked degraded or disabled in Enricher.Status(), and skipped — a partial enrichment is returned and a down dataset never breaks a core path. Sources run in registration order (register the ASN source before PeeringDB). Each contribution records Provenance (source + license + attribution

  • fields); a source's AUP (license, commercial-use permission, attribution) is on its Descriptor — the matrix that gates MSP/commercial resale (not private or single-tenant OSS use). All fetches are over TLS with certificate validation and treated as untrusted — external content never gets implicit trust. Open data is ingested once and shared; enrichment is scoped per tenant by the consuming record.

Alerting

The alerting engine (internal/alert) evaluates rules over the TSDB and notifies channels; see architecture.md. Rules are CRUD'd via /v1/alerts (tenant-scoped) and the engine runs in the control plane, ticking every PROBECTL_ALERT_EVAL_INTERVAL (default 30s).

A rule targets a metric series and is either a threshold or a baseline rule:

Field Applies Meaning
metric + match both the TSDB metric (e.g. probectl_probe_loss_ratio) and label matchers
type both threshold | baseline
comparison + threshold threshold gt/lt/gte/lte/eq/neq vs a bound
window + sensitivity baseline rolling-history size and deviation (in std-devs); warms up until the window fills
for_n both consecutive breaching evals before firing (debounce)
renotify_seconds both re-notify cadence while firing (0 = notify once)
severity both info | warning | critical
channels both webhook / email destinations

A channels entry is {"type":"webhook","url":...,"secret":...} or {"type":"email","recipients":[...]}. The webhook secret is the HMAC key; it is redacted (***) from API responses and never returned. SMTP for email is configured at the deployment level (a follow-up exposes it as config).

Webhook payload (probectl.alert.v1). On fire/resolve the webhook channel POSTs:

{
  "version": "probectl.alert.v1",
  "state": "firing",
  "rule": { "id": "…", "name": "loss-high" },
  "tenant_id": "…",
  "severity": "critical",
  "metric": "probectl_probe_loss_ratio",
  "labels": { "server_address": "1.1.1.1" },
  "value": 0.9,
  "threshold": 0.5,
  "comparison": "gt",
  "reason": "probectl_probe_loss_ratio=0.9 gt 0.5",
  "fired_at": "2026-01-02T15:04:05Z"
}

When the channel has a secret, the request carries X-Probectl-Signature: sha256=<hex> — the HMAC-SHA256 of the exact body — so the receiver can verify the sender. Each channel delivers independently: a failing channel is logged and skipped, never blocking the others. Alerts are signals; probectl notifies and does not act on the network (on-call/ITSM routing and detection-as-code are their own features below).

Incidents

The incident correlator (internal/incident) groups related signals across planes into one Incident with a unified timeline; see architecture.md. It runs in the control plane, fed by the alert engine (network plane) and a probectl.bgp.events consumer (BGP plane), and is exposed at /v1/incidents (tenant-scoped):

  • GET /v1/incidents — the tenant's incidents, most-recently-active first.
  • GET /v1/incidents/{id} — an incident with its time-ordered signal timeline.
  • PATCH /v1/incidents/{id} with {"status":"resolved"} — resolve an incident.

Signals correlate into one incident when they are close in time (within PROBECTL_INCIDENT_WINDOW, default 10m) and related in target — the same target, an IP inside the other's prefix (either direction), or overlapping prefixes (so a network alert on 192.0.2.10 and a BGP event on 192.0.2.0/24 land together). An incident's severity is the max of its signals; a signal without a tenant is rejected (fail closed).

The model is extensible without schema churn: a Signal carries a free-form plane/kind and an arbitrary attributes map, so the change, threat, cost, and SLO planes attach as additional signal types onto the same Incident/timeline without schema changes. AI root-cause analysis runs over the timeline.

SSO & RBAC

probectl authenticates users with OIDC SSO and authorizes them with role-based access control (RBAC). The security order is the two-level boundary: a request resolves to exactly one tenant first, then RBAC decides whether the caller may perform the route's action within that tenant.

Login flow. GET /auth/login (optionally ?tenant=<uuid>) starts the OIDC authorization-code flow: it sets a short-lived, HttpOnly CSRF state cookie and redirects to the tenant's identity provider. The IdP redirects back to GET /auth/callback, which verifies the state, exchanges the code, verifies the ID token, just-in-time provisions the user within the tenant (a brand-new user gets no roles — a secure default; an admin grants access), mints a server-side session, and sets the session cookie. POST /auth/logout revokes the session. GET /v1/me returns the caller's tenant, identity, and effective permissions.

Sessions. A session is a random, high-entropy opaque token. Only its hash is stored (table sessions), so a database read cannot mint a session. The session cookie is HttpOnly + SameSite=Lax, and Secure whenever the API serves HTTPS. PROBECTL_SESSION_TTL (default 12h) bounds its lifetime.

Per-tenant IdP. Providers are resolved per tenant through a provider factory — the seam for a tenant bringing its own SSO. The shipped default is the env-configured one (PROBECTL_OIDC_*); database-backed per-tenant IdP config is a later addition. A login always resolves to a single tenant. Provider/MSP operators authenticate into the provider domain (the management plane), not into tenant data.

RBAC. Every /v1 route declares a required permission key; the wrapped handler returns 401 when unauthenticated and 403 when the principal lacks the permission — checked before the handler runs. Effective permissions are loaded per request from the user's role bindings (RLS-scoped to the tenant), so a role grant or revoke takes effect immediately. The permission catalog:

Permission Granted to (seeded roles) Guards
test.read viewer, editor, admin GET /v1/tests*, GET /v1/tests/{id}/path
test.write editor, admin POST/PUT/DELETE /v1/tests*, POST .../path
agent.read viewer, editor, admin GET /v1/agents*
agent.write admin PATCH/DELETE /v1/agents/{id}
alert.read viewer, editor, admin GET /v1/alerts*
alert.write editor, admin POST/PUT/DELETE /v1/alerts*
incident.read viewer, editor, admin GET /v1/incidents*
incident.write editor, admin PATCH /v1/incidents/{id}

The seeded system roles for the default tenant are admin (all permissions), editor (read everything + manage tests/alerts/incidents), and viewer (read-only). GET /v1/me requires only authentication (no specific permission).

Dev mode. PROBECTL_AUTH_MODE=dev bypasses SSO and synthesizes an all-permissions principal for the default tenant, with the X-Probectl-Tenant: <uuid> override for multi-tenant dev. It is triple-gated: (1) the code path exists only in binaries built with -tags devauth (make build-devauth) — a release binary refuses to start in this mode; (2) PROBECTL_DEV_AUTH_ACK=i-understand must be set; (3) the listener must bind loopback (PROBECTL_HTTP_ADDR=127.0.0.1:…). When active it logs at error level and writes an auth.dev_mode_active audit event. The CI gate no-devauth-in-release proves release binaries contain neither the symbols nor the dev-principal literal. The test suite installs its own hook in _test.go files, which never ship.

Resource API & CLI

The versioned resource API lives under /v1 (full schema at /openapi.json):

  • GET/POST /v1/tests, GET/PUT/DELETE /v1/tests/{id} — synthetic-test CRUD.
  • GET /v1/agents, GET/PATCH/DELETE /v1/agents/{id} — agents register over mTLS; the API lists, renames, and deregisters them.
  • GET/POST /v1/tests/{id}/path — the latest discovered network path for a test, and a trigger to discover it now. The Path & Topology UI consumes this.

Every /v1 route is tenant-scoped through internal/tenancy + Postgres RLS, so a request can never read or write across tenants. Authentication and RBAC are real (see SSO & RBAC below): the caller's tenant and effective permissions come from an authenticated session, and each route requires a permission. The "no undocumented routes" rule is enforced by a test that matches the route table against openapi.json.

The probectl CLI is the web-parity client. Configure it with flags or environment: PROBECTL_API_URL (default http://localhost:8080), PROBECTL_API_TOKEN (sent as Bearer), PROBECTL_TENANT (sent as X-Probectl-Tenant).

probectl test list
probectl test create --name edge-dns --type icmp --target 1.1.1.1 --interval 30
probectl test delete <id>
probectl agent list
probectl --json test list      # machine-readable output

eBPF host agent

The eBPF agent watches a host's network from inside the Linux kernel — it sees which processes talk to which services without you instrumenting anything. It is observe-only: it never blocks or modifies traffic. Like the canary agent, its real config is a YAML file (-config / PROBECTL_EBPF_CONFIG); see deploy/agent/probectl-ebpf-agent.example.yml and ebpf-agent.md, with PROBECTL_EBPF_* env vars overriding individual fields. The in-kernel loader is compiled in only with the ebpf build tag; without it (or for tests), point fixture_path at a recording to replay.

The big idea in the keys below: layer-7 plaintext capture is off, and stays off until you prove three separate intents — turn it on (L7_CAPTURE), name the tenant that consents (L7_CONSENT_TENANT), and list the exact workloads (L7_SCOPE). Miss any one and the kernel copies no payload. That is the fail-closed posture for the most sensitive thing this agent can do.

Variable Default Description
PROBECTL_EBPF_CONFIG (none) path to the YAML config (-config flag overrides)
PROBECTL_EBPF_TENANT_ID (required) the tenant every flow is stamped with — the agent refuses to start without it
PROBECTL_EBPF_HOST OS hostname observing host name
PROBECTL_EBPF_BUS_MODE memory memory | kafka
PROBECTL_EBPF_BUS_BROKERS (none) comma-separated Kafka brokers (kafka mode)
PROBECTL_EBPF_BUS_NAMESPACE (none) publish on this tenant's siloed bus lane (probectl.<ns>.ebpf.flows) instead of the shared topic; for per-tenant-namespaced (siloed) deployments
PROBECTL_EBPF_FIXTURE_PATH (none) replay recorded flows instead of loading eBPF (no-kernel path)
PROBECTL_EBPF_L7_FIXTURE_PATH (none) replay recorded layer-7 events (no-kernel L7 path)
PROBECTL_EBPF_RING_BUFFER_BYTES 16777216 size of the kernel→userspace ring buffer (16 MiB; live loader only). Bigger absorbs bigger traffic bursts at the cost of memory
PROBECTL_EBPF_LIBSSL (auto) explicit libssl path for TLS-plaintext (uprobe) L7 capture; auto-discovered when unset (ebpf build)
PROBECTL_EBPF_L7_CAPTURE false master switch — live TLS-plaintext capture is OFF by default. true alone is not enough; consent AND scope below are also required
PROBECTL_EBPF_L7_CONSENT_TENANT (none) the explicit per-tenant consent: must equal this agent's bound tenant id exactly, else capture stays off
PROBECTL_EBPF_L7_SCOPE (none) the explicit workload opt-in — comma-separated pid:<n>, exe:/abs/path, cgroup:/abs/cgroup-dir entries. The kernel program drops every other process BEFORE copying a byte; empty = capture refuses to start. Host-wide capture is deliberately not expressible. Container/pod scoping is the cgroup: form (a container IS a cgroup); exe: entries are re-resolved every 10s so restarts stay in scope
PROBECTL_EBPF_L7_REDACTION headers how much of a payload may survive capture: headers zeroes the bodies in place before anything is retained (protocol metadata survives); length captures NO payload bytes (traffic shape only, no parsed calls); full (consented debugging) disables masking
PROBECTL_EBPF_L7_KERNEL_WINDOW 1024 max plaintext bytes per chunk that may cross from kernel into userspace under headers redaction (128–4095); bytes past the window never leave the kernel. length forces 0, full forces 4095. An unprogrammed kernel defaults to length-only, so it ships no plaintext
PROBECTL_EBPF_PROC_ROOT /proc procfs root for process/cgroup enrichment
PROBECTL_EBPF_FLUSH_INTERVAL 10s how often flows + the service map are emitted
PROBECTL_EBPF_HEALTH_ADDR (none) bind a liveness/readiness probe server (e.g. :9090; /healthz = process up, /readyz = flow source attached). Empty disables it. The Helm DaemonSet sets it from health.port
PROBECTL_EBPF_LOG_LEVEL info debug | info | warn | error
PROBECTL_EBPF_LOG_FORMAT json json | text

Flows + service edges are published to probectl.ebpf.flows (ebpfv1.FlowBatch, tenant-keyed). The live loader needs a BTF Linux kernel (≥5.8) and CAP_BPF/CAP_PERFMON; see ebpf-agent.md.

Agent→bus TLS/SASL (eBPF, endpoint, flow, and device agents)

When a telemetry agent publishes straight to Kafka, its broker connection takes the same hardening keys as the control plane's PROBECTL_BUS_* set, under the agent's own prefix: PROBECTL_EBPF_BUS_* here, and likewise PROBECTL_ENDPOINT_BUS_*, PROBECTL_FLOW_BUS_*, and PROBECTL_DEVICE_BUS_* for the agents below. The policy is the same fail-closed one: kafka mode without TLS refuses to start unless the explicit dev-only plaintext flag is set. (The canary agent has no bus keys — it talks gRPC/mTLS to the control plane, which publishes on its behalf.)

Suffix (append to the agent's prefix) Default Meaning
_BUS_TLS_ENABLED false TLS to the brokers — required in kafka mode unless _BUS_ALLOW_PLAINTEXT is set
_BUS_TLS_CA_FILE (none) private CA bundle for the brokers
_BUS_TLS_CERT_FILE / _BUS_TLS_KEY_FILE (none) client certificate + key (broker mTLS)
_BUS_SASL_MECHANISM (none) plain | scram-sha-256 | scram-sha-512
_BUS_SASL_USER / _BUS_SASL_PASSWORD (none) SASL credentials (the agents read these as literal env values — the secret-reference schemes are a control-plane feature)
_BUS_ALLOW_PLAINTEXT false dev only: allow a plaintext broker (the dev compose stack). Production never sets this
_BUS_MAX_BUFFERED 0 (= built-in bound 65536) async-producer in-flight bound; a full buffer sheds + counts, never blocks

Endpoint / DEM agent (probectl-endpoint)

"DEM" is digital experience monitoring: this agent runs on an end-user's laptop (Linux/macOS/Windows), measures their actual last-mile experience, and figures out whether a slowdown is the WiFi, the ISP, or the network. Because it sits on a personal device, its defaults are privacy-first — it collects the WiFi name and gateway (useful, low-risk) but not the AP MAC or public hop IPs (which can geolocate a person), and it discloses exactly what it collects on startup. It reads a YAML config (default path PROBECTL_ENDPOINT_CONFIG); PROBECTL_ENDPOINT_* env vars override it. See endpoint-dem.md.

Variable Default Meaning
PROBECTL_ENDPOINT_CONFIG (none) path to the YAML config (-config flag overrides)
PROBECTL_ENDPOINT_TENANT_ID (required) the tenant every result is stamped with — refuses to start without it
PROBECTL_ENDPOINT_AGENT_ID OS hostname device identifier in the fleet
PROBECTL_ENDPOINT_BUS_MODE memory memory | kafka
PROBECTL_ENDPOINT_BUS_BROKERS (none) comma-separated Kafka brokers (kafka mode)
PROBECTL_ENDPOINT_BUS_NAMESPACE (none) publish on this tenant's siloed bus lane instead of the shared topic (siloed deployments)
PROBECTL_ENDPOINT_INTERVAL 60s how often a sample is collected
PROBECTL_ENDPOINT_TARGETS https://1.1.1.1,https://www.google.com comma-separated targets (first = last-mile trace; all = session probes)
PROBECTL_ENDPOINT_MAX_HOPS 20 last-mile trace hop cap
PROBECTL_ENDPOINT_COLLECT_SSID true retain the WiFi network name (SSID)
PROBECTL_ENDPOINT_COLLECT_BSSID false retain the access-point MAC (BSSID) — geolocatable PII, off by default
PROBECTL_ENDPOINT_COLLECT_GATEWAY_IP true retain the (private) default-gateway address
PROBECTL_ENDPOINT_COLLECT_PUBLIC_HOPS false retain PUBLIC last-mile hop IPs (which reveal ISP/geo), off by default
PROBECTL_ENDPOINT_LOG_LEVEL info debug | info | warn | error
PROBECTL_ENDPOINT_LOG_FORMAT json json | text

Results (WiFi / gateway / last-mile / session signals + the attribution verdict) are published to probectl.endpoint.results (resultv1.Result, tenant-keyed), flowing through the same pipeline as every other canary. The agent discloses exactly what it collects at startup and never phones home.

Flow collector (probectl-flow-agent)

The flow collector listens for NetFlow v5/v9, IPFIX, and sFlow v5 datagrams from network devices, decodes them (template + sampling handling), and publishes normalized batches to probectl.flow.events (flowv1.FlowBatch, tenant-keyed). It reads a YAML config (default path PROBECTL_FLOW_CONFIG); PROBECTL_FLOW_* env vars override the file. The defaults serve all three protocols on their standard ports (NetFlow :2055, IPFIX :4739, sFlow :6343). See flow.md for the security posture: flow export is plaintext UDP by design, so every datagram is treated as untrusted and the collector should sit adjacent to its exporters (not exposed to the wider network).

Variable Default Meaning
PROBECTL_FLOW_CONFIG (none) path to the YAML config (-config flag overrides)
PROBECTL_FLOW_TENANT (required) the tenant every flow record is stamped with — refuses to start without it
PROBECTL_FLOW_BUS_NAMESPACE (none) publish this agent's batches on its tenant's siloed bus lane (probectl.<ns>.flow.events) instead of the shared topic; a malformed value refuses start. The same key exists for the other agents: PROBECTL_DEVICE_BUS_NAMESPACE, PROBECTL_EBPF_BUS_NAMESPACE, PROBECTL_ENDPOINT_BUS_NAMESPACE
PROBECTL_FLOW_AGENT_ID OS hostname collector identifier
PROBECTL_FLOW_BUS_MODE memory memory | kafka
PROBECTL_FLOW_BUS_BROKERS (none) comma-separated Kafka brokers (kafka mode)
PROBECTL_FLOW_NETFLOW_ENABLED true serve NetFlow v5 and v9 (version-sniffed) on one socket
PROBECTL_FLOW_NETFLOW_LISTEN :2055 NetFlow UDP listen address
PROBECTL_FLOW_IPFIX_ENABLED true serve IPFIX
PROBECTL_FLOW_IPFIX_LISTEN :4739 IPFIX UDP listen address
PROBECTL_FLOW_SFLOW_ENABLED true serve sFlow v5
PROBECTL_FLOW_SFLOW_LISTEN :6343 sFlow UDP listen address
PROBECTL_FLOW_BATCH_SIZE 1000 records per emitted batch
PROBECTL_FLOW_FLUSH_INTERVAL 2s max time a record waits before emission
PROBECTL_FLOW_TEMPLATE_TTL 30m v9/IPFIX template expiry
PROBECTL_FLOW_MAX_TEMPLATES 4096 template-cache size cap (untrusted-input bound)
PROBECTL_FLOW_READ_BUFFER_BYTES 4194304 kernel UDP receive buffer (burst absorption)
PROBECTL_FLOW_QUEUE_SIZE 65536 decode→flush channel depth (overflow drops are counted)
PROBECTL_FLOW_WORKERS 2 reader goroutines per socket
PROBECTL_FLOW_LOG_LEVEL info debug | info | warn | error
PROBECTL_FLOW_LOG_FORMAT json json | text

The control plane consumes that flow topic, optionally enriches each record with ASN/geo, and persists to the flow store behind /v1/flows/* (top-talkers / capacity / anomalies). These are control-plane keys (not flow-agent keys):

Variable Default Meaning
PROBECTL_FLOWSTORE_MODE memory where flow records live: memory (lightweight/single-binary) | clickhouse (durable, high-cardinality)
PROBECTL_FLOWSTORE_URL (none) ClickHouse HTTP(S) endpoint; required in clickhouse mode
PROBECTL_FLOWSTORE_TENANT_SCOPING false defense-in-depth: also constrain flow reads at the database by attaching a per-request tenant setting that a ClickHouse row policy enforces (needs server-side custom_settings_prefixes=SQL_ + a reader user). Tenant scoping already happens above this; this pushes it down one more layer
PROBECTL_FLOWSTORE_READER_USER (none) the ClickHouse reader user the setting-scoped row policy is installed on at boot (pairs with the toggle above)
PROBECTL_FLOW_RETENTION_DAYS 0 (keep) when > 0, applies a delete-after-N-days TTL to the probectl_flows ClickHouse table; 0 keeps flows indefinitely
PROBECTL_FLOW_ENRICH_ASN false opt-in Team Cymru ASN enrichment. Off by default because it makes outbound DNS lookups (the no-phone-home guardrail); AS numbers the device itself exported always pass through regardless

Device telemetry agent (probectl-device-agent)

This agent reads metrics straight off network gear (routers, switches). It polls the old way (SNMP v2c/v3) and subscribes the modern streaming way (gNMI/OpenConfig), normalizes both into one DeviceMetric shape, and publishes to probectl.device.metrics (tenant-keyed); the control plane lands them in the TSDB as probectl_device_* series. The full device list lives in a YAML config (see deploy/agent/probectl-device-agent.example.yml); the env vars below override it and give a single-device quick start for trying one device fast. See device-telemetry.md.

Variable Default Meaning
PROBECTL_DEVICE_CONFIG (none) path to the YAML config (-config flag overrides)
PROBECTL_DEVICE_TENANT (required) the tenant every device metric is stamped with — refuses to start without it
PROBECTL_DEVICE_AGENT_ID OS hostname agent identifier
PROBECTL_DEVICE_BUS_MODE memory memory | kafka
PROBECTL_DEVICE_BUS_BROKERS (none) comma-separated Kafka brokers (kafka mode)
PROBECTL_DEVICE_BUS_NAMESPACE (none) publish on this tenant's siloed bus lane instead of the shared topic (siloed deployments)
PROBECTL_DEVICE_TARGET (none) quick start: add one device by address
PROBECTL_DEVICE_TRANSPORT snmpv2c quick-start transport: snmpv2c | snmpv3 | gnmi
PROBECTL_DEVICE_CREDENTIAL (none) quick start: credential NAME for the device (see below)
PROBECTL_DEVICE_PORT 161 (SNMP) / 9339 (gNMI) quick start: port override (defaults to the transport's standard port)
PROBECTL_DEVICE_INTERVAL 60s quick start: poll/sample interval
PROBECTL_DEVICE_LOG_LEVEL info debug | info | warn | error
PROBECTL_DEVICE_LOG_FORMAT json json | text

Credentials are referenced by NAME, never inlined — no secrets in the device list. The default credential source resolves those names from the environment (the PROBECTL_DEVICE_CRED_<NAME>_* vars below); the secrets backends plug Vault/CyberArk into the same seam. An unresolvable name fails closed at startup. <NAME> is the upper-cased credential name with -/._:

Variable Used by Meaning
PROBECTL_DEVICE_CRED_<NAME>_COMMUNITY snmpv2c community string
PROBECTL_DEVICE_CRED_<NAME>_USERNAME snmpv3, gnmi USM user / gNMI metadata user
PROBECTL_DEVICE_CRED_<NAME>_AUTH_PROTO snmpv3 sha (default) | sha256 | sha512 | md5
PROBECTL_DEVICE_CRED_<NAME>_AUTH_PASS snmpv3 auth passphrase (empty → NoAuthNoPriv)
PROBECTL_DEVICE_CRED_<NAME>_PRIV_PROTO snmpv3 aes (default) | aes256 | des
PROBECTL_DEVICE_CRED_<NAME>_PRIV_PASS snmpv3 privacy passphrase (empty → AuthNoPriv)
PROBECTL_DEVICE_CRED_<NAME>_PASSWORD gnmi gNMI metadata password

gNMI connections are TLS with certificate verification (system roots or a per-device ca_file); there is no skip-verify option. plaintext: true is an explicit lab-only YAML opt-in and is loudly logged — never a silent plaintext default.

OTLP receiver

This lets other systems push their OpenTelemetry data (metrics, traces, logs) into probectl. It is off by default and, when on, is locked to the same posture as everything else: TLS-only, token-authenticated, tenant-scoped, on its own listeners separate from the /v1 REST API. There is no anonymous-plaintext mode — setting a listen address without both a TLS cert/key pair and at least one bearer token fails config validation. See otlp.md.

Variable Default Description
PROBECTL_OTLP_GRPC_ADDR (none) OTLP/gRPC listen address (e.g. :4317)
PROBECTL_OTLP_HTTP_ADDR (none) OTLP/HTTP listen address (e.g. :4318); accepts all three signals — POST /v1/metrics, /v1/traces, /v1/logs
PROBECTL_OTELSTORE_MODE memory where ingested OTLP traces+logs live: memory (lightweight) | clickhouse (production; (tenant_id, day) partition)
PROBECTL_OTELSTORE_URL (none) ClickHouse HTTP URL for clickhouse mode (https = TLS in transit)
PROBECTL_OTEL_RETENTION_DAYS 30 delete-TTL for stored OTLP traces+logs (0 disables)
PROBECTL_OTLP_TLS_CERT_FILE (none) PEM server certificate (required to enable)
PROBECTL_OTLP_TLS_KEY_FILE (none) PEM server private key (required to enable)
PROBECTL_OTLP_TOKENS (none) bearer-token→tenant map: token1=tenant1,token2=tenant2

Setting an address without the TLS files and at least one token fails config validation — the receiver is never anonymous plaintext. Ingested metrics are tenant-tagged and published to the probectl.otlp.metrics bus topic.

Ecosystem integrations

The Grafana datasource API (/v1/grafana/api/v1/*), the federation endpoint (/v1/prometheus/federate), and the remote-write receiver (/v1/prometheus/write) ride the existing TSDB config (PROBECTL_TSDB_MODE / PROBECTL_TSDB_URL) and the /v1 API listener — no extra keys. Reads need metrics.read, remote-write metrics.write (migration 0022). See ecosystem-integrations.md.

The ServiceNow CMDB correlation is off unless configured:

Variable Default Meaning
PROBECTL_CMDB_PROVIDER (none) servicenow enables CI correlation (/v1/cmdb/*, incident/agent CIs)
PROBECTL_CMDB_URL (none) instance URL, e.g. https://acme.service-now.com (https; http only for loopback test doubles)
PROBECTL_CMDB_SECRET (none) user:password for the read-only integration user (env only — never in files/logs)
PROBECTL_CMDB_TABLE cmdb_ci CI table queried via the Table API
PROBECTL_CMDB_CACHE_TTL 10m CI lookup cache TTL (a down CMDB serves stale entries)

AI assistant

Worked per-provider setups (Ollama, vLLM, OpenAI, Anthropic, Azure) are in ai-rca.mdCopy-paste recipes; the remote-egress enablement chain (operator ack + per-tenant consent) is in ai-egress.md.

The assistant (root-cause analysis + natural-language query) works out of the box with zero network access — the default builtin provider is an in-process synthesizer that writes its answers locally. You only point it at a real language model if you want nicer prose, and doing so is treated as data egress: a remote endpoint must be https, and you have to explicitly acknowledge that tenant data will leave (PROBECTL_AI_EGRESS_ACK). A loopback endpoint may be http (for a local model on the same box). The redaction keys below mask sensitive values before anything reaches an external model. See ai-rca.md.

Variable Default Description
PROBECTL_AI_MODEL_PROVIDER builtin builtin (air-gapped, the default) | ollama | openai | anthropic
PROBECTL_AI_EGRESS_ACK (none) required to use a REMOTE model: must equal yes-send-tenant-data-to-the-remote-model, or the server refuses to start. This is a deliberate "yes, I know data leaves" gate, on top of per-tenant consent + audit — see docs/ai-egress.md
PROBECTL_AI_REDACT_IPS true mask IP addresses in anything sent to an external model (stable per-value tokens, so correlation survives; local file paths are never redacted)
PROBECTL_AI_REDACT_HOSTNAMES false also mask hostnames (secrets are masked unconditionally regardless of this)
PROBECTL_AI_REDACT_PII true mask free-text PII — emails, phone numbers, MAC addresses — in anything sent to an external model (RCA prompts, MCP tool results, authoring prompts)
PROBECTL_AI_REDACT_PATTERNS (none) your own regexes (;;-separated), masked as [custom:xxxx] — for org-specific identifiers (employee IDs, ticket refs). A bad pattern refuses start (fail closed)
PROBECTL_AI_MODEL_ENDPOINT (none) base URL of the model (required for a non-builtin provider)
PROBECTL_AI_MODEL_NAME (none) model name (e.g. llama3.1, gpt-4o-mini)
PROBECTL_AI_MODEL_TOKEN (none) API key / bearer token (optional for a local Ollama)
PROBECTL_AI_MODEL_TIMEOUT 60s per-request timeout for the model endpoint
PROBECTL_AI_MAX_EVIDENCE 50 cost guard: the most signals one answer may gather
PROBECTL_AI_MAX_CONCURRENT 8 process-wide cap on concurrent analyses (HTTP 429 when exceeded); a backstop beneath the per-tenant fairness gate
PROBECTL_AI_PERSIST_ANSWERS false persist every answer (the cited JSON + model + config hash) for reproducibility/disputes
PROBECTL_AI_ANSWER_RETENTION 2160h (90 days) prune persisted answers older than this (enforced opportunistically on write)

A non-builtin provider without an endpoint fails config validation. Whatever the backend, every answer is tenant- and RBAC-scoped by the query layer and every claim is citation-checked before it reaches the user — a model can never see out-of-scope data or inject an ungrounded claim.

MCP server

The MCP server exposes read-only, tenant- + RBAC-scoped tools to AI clients. The HTTP transport is off by default and is TLS-only + bearer-authenticated; the stdio transport is local (probectl-control mcp-stdio, token from PROBECTL_MCP_TOKEN). Mint a token with probectl-control mcp-token --user <user-uuid> [--tenant <uuid>] [--name <label>] — the token prints once and only its hash is stored, so a database read can never recover it. See mcp.md.

Variable Default Description
PROBECTL_MCP_HTTP_ADDR (none) MCP HTTP listen address (e.g. :8090) — enables the transport
PROBECTL_MCP_TLS_CERT_FILE (none) PEM server certificate (required to enable HTTP)
PROBECTL_MCP_TLS_KEY_FILE (none) PEM server private key (required to enable HTTP)
PROBECTL_MCP_RATE_PER_MIN 120 per-tenant tool-call rate limit (0 disables)

Setting PROBECTL_MCP_HTTP_ADDR without the TLS files fails config validation — the MCP endpoint is never anonymous plaintext.

TLS / certificate observability

The control plane analyzes TLS/cert posture from TLS handshakes the HTTP and eBPF-L7 probes already captured — it never re-handshakes a target itself — and correlates the findings into threat-plane incidents. See tls-observability.md.

Variable Default Description
PROBECTL_TRUSTCTL_URL (none) trustctl base URL; enables a one-click renewal deep-link on findings
PROBECTL_TLS_EXPIRY_WARNING 504h (21d) expiring-soon window
PROBECTL_CT_ENABLED false opt in to Certificate Transparency correlation (external fetch)
PROBECTL_CT_ENDPOINT https://crt.sh CT log API endpoint

CT correlation is off by default (an external fetch — sovereignty / AUP / rate limits) and degrades gracefully when the CT source is down.

Threat-intel enrichment

The control plane can match peer IPs / hostnames / certs / JA3 against public threat-intel feeds, surfacing confidence-scored, source-attributed threat-plane signals (a signal, not an IPS — never blocks). See threat-intel.md for the feed/AUP matrix and caveats.

Variable Default Description
PROBECTL_THREATINTEL_ENABLED false master switch (outbound feed fetches); off ⇒ no IOC code runs
PROBECTL_THREATINTEL_REFRESH 6h feed refresh cadence
PROBECTL_THREATINTEL_FEEDS (all) comma-separated feed names (spamhaus_drop, feodo_tracker, sslbl, sslbl_ja3, urlhaus, tor_exit, firehol_level1); empty ⇒ all

Off by default (an outbound fetch — sovereignty / no-phone-home). The refresher keeps each source's last-good indicators, so a feed outage degrades gracefully and never breaks a core path.

Enterprise identity: SCIM + ABAC

SCIM 2.0 provisioning and ABAC have no environment keys — the SCIM bearer token an IdP presents is minted with the control-plane CLI, and ABAC policies are managed over the API. See scim-abac.md.

# mint a per-tenant SCIM token for an IdP (shown once)
probectl-control scim-token --tenant <tenant-uuid> --name okta

The /scim/v2/* surface is gated by a valid SCIM token (no token ⇒ 401), and the directory-admin API (/v1/abac/policies) requires directory.read/directory.write.

Change intelligence

Ingest per-provider-signed change webhooks (deploys/config/route/IaC/commits) into a change timeline + change-to-incident correlation, feeding the AI RCA. See change-intel.md for the webhook contract + provider/signature table.

Variable Default Description
PROBECTL_CHANGE_WEBHOOKS (none) comma-separated id:tenant:provider:secret webhook credentials (providergeneric/github/gitlab). The secret is the last field, so it may contain : but not , — use URL-safe (hex/base64) secrets.
PROBECTL_CHANGE_CORRELATION_WINDOW 24h how far before an incident a change is treated as a candidate cause

Each inbound delivery is TLS + signature-verified (HMAC/token, constant-time) + tenant-bound to the credential; an unsigned or forged event is rejected before storage, and one tenant cannot inject another's changes. Webhook secrets are runtime config — inject them from a secret manager, never commit them.

SIEM export

Forward the audit stream and threat-plane signals to a SOC's SIEM over hardened TLS. probectl is the forwarder, not a SIEM — events are rendered into a standard format and pushed; nothing is auto-blocked. See siem.md for formats, delivery guarantees, and per-SIEM setup.

Variable Default Description
PROBECTL_SIEM_ENABLED false master switch (an outbound connection to your SIEM); off ⇒ no SIEM code runs
PROBECTL_SIEM_PRESET generic SIEM adapter: generic, splunk, sentinel, elastic, chronicle (sets the auth scheme + default format)
PROBECTL_SIEM_FORMAT (preset) wire format: syslog (RFC 5424), cef, ecs, otlp; empty ⇒ the preset's native default (Elastic⇒ecs, Chronicle⇒otlp, else cef)
PROBECTL_SIEM_ENDPOINT (none) HTTPS ingest URL (e.g. the Splunk HEC / Sentinel / Chronicle / Elasticsearch endpoint). Required when enabled
PROBECTL_SIEM_TOKEN (none) ingest credential (Splunk ⇒ Splunk <tok>, Elastic ⇒ ApiKey <tok>, others ⇒ Bearer <tok>). Inject from a secret manager
PROBECTL_SIEM_POLL_INTERVAL 30s audit-stream drain cadence
PROBECTL_SIEM_BUFFER 1024 threat-signal buffer; full ⇒ producers block (backpressure, never drop)
PROBECTL_SIEM_REDACT_KEYS (none) extra audit data keys to scrub (on top of the built-in secret/PII denylist)

Off by default (an outbound connection — sovereignty / no-phone-home). Audit forwarding resumes from a durable per-tenant cursor, and delivery retries without dropping under a SIEM outage. Exported audit events are PII/secret redacted (built-in denylist + PROBECTL_SIEM_REDACT_KEYS).

On-call + ITSM integration

Mirror incidents into operational tooling: page on-call (PagerDuty/Opsgenie), post to chat (Slack/Teams), and open + bidirectionally sync tickets (ServiceNow/Jira). probectl is the forwarder, not the system of record — it never auto-blocks anything. See oncall-itsm.md for the connector matrix, mapping, and the inbound webhook contract.

Variable Default Description
PROBECTL_NOTIFY_CONNECTORS (none) outbound connectors, comma-separated, each tenant|provider|endpoint|secret (pipe-delimited because the endpoint is a URL). providerpagerduty/opsgenie/slack/teams/servicenow/jira. secret is the provider credential (PagerDuty routing key, Opsgenie API key, ServiceNow user:password, Jira email:token; unused for chat).
PROBECTL_NOTIFY_INBOUND (none) inbound status-sync credentials, comma-separated, each id:tenant:provider:secret (the id is the URL selector for POST /ingest/itsm/{provider}/{id}; secret verifies the delivery).

Off by default (each connector is an outbound connection to the operator's tooling). Paging + ticket creation are idempotent (an incident opens at most once per connector — a retry/restart never double-pages), status sync is bidirectional with loop protection (an inbound resolve from one system is never echoed back to it), and routing is per-tenant (a connector only fires for its own tenant). Endpoint specifics: a Slack/Teams endpoint is the incoming-webhook URL; a Jira endpoint carries the project (and optional resolve transition) as query params, e.g. …/rest/api/2/issue?project=OPS&resolve_transition=31; a ServiceNow endpoint is the …/api/now/table/incident URL. Inbound deliveries must include X-Probectl-Signature: sha256=<hmac> or X-Probectl-Token: <secret> over TLS; an unsigned or forged delivery is rejected (401). Secrets are runtime config — inject them from a secret manager, never commit them.

Topology graph + what-if

Variable Default Purpose
PROBECTL_TOPOLOGY_ENGINE indexed graph engine: indexed (adjacency-indexed, for large/extra-large graphs) or memory (the simpler reference implementation). Both sit behind the same query API

The graph feeds from eBPF/BGP/device streams + path discoveries; served at GET /v1/topology with what-if simulation at POST /v1/topology/whatif. See docs/topology.md.

FinOps / egress cost

Variable Default Purpose
PROBECTL_COST_ENABLED true cost engine over the local flow stream (volume × public pricing; no billing-API calls)
PROBECTL_COST_ZONES (none) CIDR→zone rules, e.g. 10.0.1.0/24=us-east-1a,… (locality classification)
PROBECTL_COST_SERVICES (none) CIDR→service:team attribution rules (showback)
PROBECTL_COST_BUDGETS (none) monthly USD budgets, e.g. team:payments=500 (breach = one cost-plane signal per month)
PROBECTL_COST_PRICES_FILE (none) JSON price-table override; embedded public list rates otherwise (provenance + as-of surfaced)
PROBECTL_COST_PRICED true false = volume-only mode (bytes attributed, dollars never invented)

Summary at GET /v1/cost/summary and the Cost page; deep dashboards are federated to Grafana (see Ecosystem integrations above). See docs/finops.md.

SLO engine

Variable Default Purpose
PROBECTL_SLO_ENABLED true OpenSLO SLI/SLO engine over the synthetic-result stream (error budgets + multi-window burn-rate signals)
PROBECTL_SLO_DIR (none) directory of OpenSLO v1 YAML definitions (strictly validated; malformed/duplicate definitions fail startup)

Statuses at GET /v1/slos, OpenSLO export at GET /v1/slos/openslo, and the SLOs page. See docs/slo.md.

Compliance / segmentation validation

Variable Default Purpose
PROBECTL_COMPLIANCE_ENABLED true segmentation validator over observed flow/eBPF traffic (validation only — never enforcement)
PROBECTL_COMPLIANCE_POLICY_DIR (none) segmentation policy YAML directory (strictly validated; malformed files fail startup)

Verdicts at GET /v1/compliance, hash-chained audit evidence at GET /v1/compliance/evidence, and the Compliance page. See docs/compliance.md.

Collective internet-outage view

Variable Default Purpose
PROBECTL_OUTAGE_ENABLED true the local engine: vantage detection over your own results + correlation with external events (no outbound calls)
PROBECTL_OUTAGE_FEEDS_ENABLED false opt-in public outage feeds (IODA, Cloudflare Radar) — enabling makes outbound fetches (sovereignty / no-phone-home)
PROBECTL_OUTAGE_FEEDS (all) feeds to load: ioda, cloudflare_radar
PROBECTL_OUTAGE_REFRESH 10m feed refresh cadence (last-good kept on failure)
PROBECTL_OUTAGE_RETENTION 48h event window kept/queried
PROBECTL_OUTAGE_RADAR_TOKEN (none) Cloudflare API token the radar feed requires (a secret reference is accepted); the feed is omitted without it

The collective view at GET /v1/outages (events + the caller-tenant's affected tests + vantage detections + feed AUP/health + coverage notes) and the Internet outages page. Scope resolution (IP→ASN/country) rides the open-data enricher (PROBECTL_FLOW_ENRICH_ASN); without it the response reports the degradation honestly. See docs/outage.md.

RUM convergence

Variable Default Purpose
PROBECTL_RUM_ENABLED false the browser-beacon ingest + synthetic↔RUM convergence engine (an inbound surface — opt-in)
PROBECTL_RUM_APPS (none) app-key registry pk_key=tenant/app,... — each beacon binds to its KEY's tenant; enabled-but-empty fails startup
PROBECTL_RUM_RATE_PER_MIN 300 per-key beacon rate limit (429 + Retry-After above it; 0 = unlimited)

Beacons ingest at POST /ingest/rum (app-key authenticated, consent-gated, URL-redacted, no IP stored — privacy is enforced server-side, fail closed); the convergence view serves at GET /v1/rum and folds into the Endpoints surface; rum.* vitals flow to the TSDB for dashboards. The SDK is web/public/probectl-rum.js. See docs/rum.md.

Carbon / power observability

Variable Default Purpose
PROBECTL_CARBON_ENABLED true coefficient-based energy/carbon ESTIMATES over the local flow stream (local-only; methodology served with every response)
PROBECTL_CARBON_GRID_GCO2E 436 your grid's carbon intensity in gCO2e/kWh (defaults to the world average — set yours)

Attribution reuses PROBECTL_COST_ZONES / PROBECTL_COST_SERVICES. The estimate serves at GET /v1/carbon and folds into the Cost page. See docs/carbon.md. The chaos injector and the large/extra-large scale gate are test-harness tools — see docs/chaos.md and docs/scale-gate.md.

Editions / license

Variable Default Purpose
PROBECTL_LICENSE_FILE (none) path to the Ed25519-signed license file. Unset = Community (the full core, default-open). Set-but-missing/invalid = startup error (fail closed on configuration)

Verification is offline — local signature math against public keys baked into the binary at build time (never an env var; never phone-home). Expiry runs the 30-day-grace → read-only ladder and never breaks running telemetry. License state + the feature→tier map serve at GET /v1/editions and render on Admin → Editions — the one place tiers appear when unlicensed. See docs/editions.md for the file format, the signing CLI (probectl-license), and the gating pattern.

Provider / management plane (ee/)

Active only when the license grants provider_plane; otherwise /provider/* is a plain 404 (hidden, not locked).

Variable Default Purpose
PROBECTL_PROVIDER_BOOTSTRAP_TOKEN (none) creates the FIRST operator via POST /provider/v1/auth/bootstrap; single-use — inert once any operator exists
PROBECTL_PROVIDER_BREAKGLASS_MAX_TTL_MINUTES 240 cap on break-glass grant lifetimes (5–1440)

The provider plane additionally requires PROBECTL_ENVELOPE_KEY (operator TOTP secrets are envelope-sealed at rest) and a database. Operator MFA is mandatory; operators are a privilege domain distinct from tenant users with no implicit access to tenant telemetry — see docs/provider-plane.md for the model, the break-glass consent flow, and the storage-layer confinement (probectl_provider role). Suspending a tenant rejects its users at the API (tenant_suspended) without touching data or ingestion.

Siloed / hybrid isolation (ee/)

Pooled isolation stays the default and needs no configuration. Siloed and hybrid tenants (per-tenant Postgres schema / ClickHouse database / bus topic namespace / object key namespace) require a license granting siloed_isolation and are selected per tenant at provisioning (isolation_model + optional residency).

Variable Default Purpose
PROBECTL_DATAPLANES (none) named residency data planes — name=clickhouseURL[;name=clickhouseURL...] (e.g. eu=https://ch-eu:8123;us=https://ch-us:8123). A tenant's residency pins its ClickHouse database to that plane

Residency pins the tenant's ClickHouse flow data in this release; Postgres control state, the TSDB, object storage, and bus brokers are NOT region-pinned yet — docs/isolation.md states the exact contract, the catch-up/migration story for silo schemas, and the offboard-teardown semantics.

White-label branding (ee/)

No configuration keys: branding activates with a license granting white_label and is configured per tenant (or as the provider master) from the provider console. The public GET /branding endpoint serves the resolved brand pre-auth (Host-resolved for custom domains; the probectl default when unlicensed); custom-domain login resolves the tenant from the serving host. Custom domains need a certificate at the TLS-terminating ingress (or via trustctl) — see docs/white-label.md for the token-override contract, the no-bleed rules, and the email-template contract.

Advanced data governance (governance, ee/)

Per-tenant data classification + redaction, composed with retention, residency, and BYOK. No new config keys: the classification + redaction MECHANISM is core (the ?redact=true export toggle works anywhere, masking PII with a partial strategy); the governance feature adds per-tenant POLICY (stored in tenant_governance, migration 0033) set from the provider plane (GET/PUT /provider/v1/tenants/{id}/governance). IPs are PII by default. Full model: docs/governance.md. Redacted export: GET /v1/lifecycle/export?redact=true.

Tenant lifecycle: export, retention, erasure (core)

Export + verifiable deletion are a compliance right — core in every edition. GET /v1/lifecycle/export (permission lifecycle.export) streams the portability bundle; GET/PUT /v1/lifecycle/retention + POST /v1/lifecycle/erase (permission lifecycle.erase, slug-confirmed, irreversible) manage retention and run the attested cross-store erasure. The provider console adds the operator-side erase trigger. See docs/runbooks/tenant-offboarding.md for the full procedure and the per-store verification table.

Variable Default Purpose
PROBECTL_BACKUP_RETENTION_NOTE (empty → a generic fallback statement) your backup-TTL statement, included VERBATIM in every deletion attestation — be explicit about when snapshots expire. When unset, a generic placeholder sentence is recorded instead
PROBECTL_BACKUP_RETENTION_DAYS 0 concrete backup TTL in days. When > 0, the tenant-erasure attestation quantifies a bounded backup-coverage window (backup_erasure_deadline = erased_at + this many days); 0 = note-only
PROBECTL_ENVELOPE_KEY / PROBECTL_ENVELOPE_KEY_FILE (none) the at-rest KEK (see the control-plane table) — also used by probectl-control backup-seal/backup-open to encrypt/restore backups. The chart's Postgres backup CronJob mounts it to seal dumps in the pipeline

The daily retention sweeper enforces per-tenant flow_retention_days (tighter than the deployment TTL). Prometheus-mode TSDB series deletion is a documented manual step (the attestation says so honestly).

Per-tenant metering & quotas (ee/)

No configuration keys: metering activates with a license granting metering (provider/MSP tier). Counters flush every minute; gauge snapshots run every 15 minutes; usage and quotas live in Postgres (migration 0026). The usage API, the CSV/JSONL billing-export feed, per-tenant quotas (creation-gating only — telemetry is never quota-dropped), and the console showback card are documented in docs/metering.md.

Per-tenant key isolation / BYOK (ee/)

Unlocked by the byok feature (Enterprise). No new config keys: the keyring wraps managed tenant KEKs under PROBECTL_ENVELOPE_KEY (required when byok is licensed — startup fails loudly without it) and resolves BYOK references through the secret backends. Surfaces: GET/POST /v1/security/keys[...] (permission security.keys) + the Admin → Encryption keys card. The full model — sealing formats, rotation, the BYOK lockout warning, crypto-offboarding — is in docs/byok.md.

Tenant fairness (core)

These are the per-tenant bounds that protect a pooled (shared) deployment, so one noisy tenant can't starve the others — and they are core in every edition. The ingest-rate bounds are on by default with conservative numbers; you opt out of a bound by setting it to an explicit 0 (unlimited). Unset keeps the default, and a negative value is a startup error — config validation rejects it. The two query bounds already default to 0, i.e. unlimited until you set them. Per-tenant overrides are set from the provider console into tenant_fairness. Full model: docs/fairness.md.

These are token-bucket rate limits: the steady rate is the value below, and the bucket can hold a burst of rate × PROBECTL_FAIRNESS_BURST_SECONDS. Telemetry over a bound is admission-controlled (shed + counted), never silently corrupted.

Key Default Description
PROBECTL_FAIRNESS_RESULTS_PER_SEC 1000 per-tenant result-message admission rate. Explicit 0 = unlimited
PROBECTL_FAIRNESS_FLOW_EVENTS_PER_SEC 10000 per-tenant flow-record admission rate. Explicit 0 = unlimited
PROBECTL_FAIRNESS_INGEST_BYTES_PER_SEC 2097152 per-tenant ingest byte rate (2 MiB/s). Explicit 0 = unlimited
PROBECTL_FAIRNESS_DEVICE_METRICS_PER_SEC 2000 per-tenant SNMP/gNMI device-sample admission rate. Explicit 0 = unlimited
PROBECTL_FAIRNESS_BURST_SECONDS 10 burst window: bucket capacity = rate × this. 0 falls back to 10 — an enforced bucket always has a burst
PROBECTL_FAIRNESS_QUERY_CONCURRENCY 0 (unlimited) per-tenant in-flight query cap (HTTP 429 over it)
PROBECTL_FAIRNESS_QUERIES_PER_MIN 0 (unlimited) per-tenant query budget per minute (HTTP 429 over it)

Multi-region / active-active HA (core)

Inert unless PROBECTL_REGION is set (single-region deployments need none of these). The control plane stays stateless and active in every region; the split-brain fence pauses API writes during a failover while reads + telemetry keep flowing. Full model + the failover runbook: docs/multi-region.md, docs/runbooks/region-failover.md.

Key Default Description
PROBECTL_REGION (empty) this replica's region; empty = single-region (fence inert)
PROBECTL_REGIONS (empty) comma list of all regions in the deployment
PROBECTL_DATABASE_URL the WRITER endpoint (DNS/proxy that resolves to the current primary)
PROBECTL_DATABASE_READ_URL (empty) optional local read-replica endpoint; empty = reads use the writer
PROBECTL_REPLICATION_MODE async sync (RPO 0) or async (RPO ≈ lag) — descriptive; configure Postgres to match
PROBECTL_RESIDENCY (empty) default data-residency region (governance)
PROBECTL_RPO_SECONDS 0 provisional RPO target (human sign-off)
PROBECTL_RTO_SECONDS 60 provisional RTO target (human sign-off)

The writer must be reachable for API writes; cluster_state (migration 0032) holds the promotion epoch the fence reads. Promotion is cluster_promote() in the failover runbook.

Supportability (core)

Deep health + a secret-stripped support bundle for triage (CORE; the support org/SLA is contract). No new config keys; diagnostics.read (migration 0034, admin-seeded) gates GET /v1/diagnostics and GET /v1/diagnostics/bundle. An offline bundle: probectl-control support-bundle [-o file]. Self-monitoring series probectl_self_* + probectl_build_info feed deploy/grafana/dashboards/probectl-self.json. The bundle NEVER contains secrets/credentials/PII (allowlist config + anonymized topology + a final scrub). Full model: docs/supportability.md.

Guarded agentic remediation (remediation, ee/)

The assistant PROPOSES remediations; a human APPROVES; probectl NEVER executes — there is no executor in the codebase (remediation is human-gated by design). Approve is a recorded, audited, blast-radius-limited, human-only sign-off that an operator carries out in their own change process; ingested data (e.g. a prompt-injection routed through the propose_remediation MCP tool) can at most create a proposed proposal a human must approve via the authenticated UI. The feature is hidden (404) when the remediation Enterprise feature is unlicensed.

Variable Default Notes
PROBECTL_REMEDIATION_APPROVALS_ENABLED false advisory-only master switch — until an operator turns this on, Approve is unavailable and proposals are review-only
PROBECTL_REMEDIATION_MAX_BLAST_RADIUS 50 a proposal whose simulated (topology what-if) blast radius exceeds this cannot be approved; an unknown radius (no topology available) is also blocked — fail closed

Permissions remediation.propose and remediation.approve (migration 0035, admin-seeded) gate the /v1/remediation/* routes; the dry-run blast radius is a read-only topology simulation. Full policy + architecture: docs/remediation.md.

NDR-lite detection

Variable Default Purpose
PROBECTL_NDR_ENABLED true behavioral detection engine (DGA/exfil/beaconing/egress/lateral) over local DNS/flow/eBPF streams; signals only — never blocks
PROBECTL_NDR_RULES_DIR (none) detection-as-code overlay directory; rules merge by id over the embedded defaults (a malformed dir fails startup)

Detections are confidence-scored threat-plane signals (ndr.*) exported to incidents, the Security triage surface, and the SIEM (see SIEM export above). See docs/ndr.md for the detector and tuning reference.

Secrets integration

This is the feature that lets you keep raw passwords out of your config entirely. Anywhere this document asks for a credential, you can instead hand it a pointer to where the real secret lives — a Vault path, a CyberArk query, an AWS/Azure/GCP secret id — and the control plane fetches it at boot (or per poll, for device creds). The settings below just tell probectl how to reach each backend; the references themselves go in the credential keys documented throughout this page.

Any credential value in this document may be a secret reference instead of the literal material — env:NAME, vault:<mount>/<path>#<field>, cyberark:<query>, aws:<id>[#<json-field>], azure:<vault>/<name>, gcp:<project>/<secret>[/<version>], or literal:<value> as the escape hatch. The control plane resolves PROBECTL_OIDC_CLIENT_SECRET, PROBECTL_CMDB_SECRET, PROBECTL_AI_MODEL_TOKEN, PROBECTL_SIEM_TOKEN, PROBECTL_BUS_SASL_PASSWORD, PROBECTL_OUTAGE_RADAR_TOKEN, and the secret parts of PROBECTL_CHANGE_WEBHOOKS / PROBECTL_NOTIFY_CONNECTORS / PROBECTL_NOTIFY_INBOUND at startup (fail closed); the device agent resolves every PROBECTL_DEVICE_CRED_<NAME>_* value per poll cycle. Resolved values are cached only encrypted, for a short lease (5 m). See docs/secrets.md.

Backend access settings (environment only; all over verified TLS):

Variable Default Purpose
PROBECTL_SECRETS_VAULT_ADDR (none) Vault base URL; enables vault: references
PROBECTL_SECRETS_VAULT_TOKEN (none) static Vault token (alternative to AppRole)
PROBECTL_SECRETS_VAULT_ROLE_ID / _SECRET_ID (none) AppRole login; the lease-aware client token is renewed at ⅔ TTL
PROBECTL_SECRETS_VAULT_NAMESPACE (none) X-Vault-Namespace (Vault Enterprise)
PROBECTL_SECRETS_CYBERARK_URL (none) CyberArk CCP base URL; enables cyberark:
PROBECTL_SECRETS_CYBERARK_APP_ID (none) CCP AppID
PROBECTL_SECRETS_CYBERARK_CERT_FILE / _KEY_FILE / _CA_FILE (none) optional CCP client-certificate auth
AWS_REGION (or AWS_DEFAULT_REGION), AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN (none) enables aws: (Secrets Manager, SigV4)
AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET (none) enables azure: (Key Vault)
GOOGLE_APPLICATION_CREDENTIALS (none) service-account key file; enables gcp: (Secret Manager)

Backend health (counters + redacted last error, never secret material) is served at GET /v1/secrets/health and on the Admin page.

Local dev stack (deploy/compose/dev.yml)

Started with make compose-up. Local, non-production defaults — plaintext listeners and dev credentials for convenience. Production deploys are TLS/HTTPS-by-default — TLS on every listener.

Service Compose name Host port(s) Purpose Dev credentials
PostgreSQL postgres 5432 Durable state, tenants, RBAC, audit, SLOs user/pass/db = probectl
Kafka kafka 9092 Result/event bus (KRaft, no ZooKeeper) none (PLAINTEXT)
ClickHouse clickhouse 8123 (HTTP), 9000 (native) High-cardinality events/flows user/pass/db = probectl
Prometheus prometheus 9090 Metrics TSDB (remote-write enabled) none

Kafka listeners: host clients use localhost:9092; in-network containers use kafka:19092; the KRaft controller uses 9093 (internal). Prometheus runs with --web.enable-remote-write-receiver so the result pipeline can remote-write into it.

These names and ports are a contract — the integration test harness depends on them, so don't rename them casually.

Tear-down

make compose-down removes the containers and volumes (pgdata, chdata, promdata).

Rendered live from github.com/imfeelingtheagi/probectl — found a mistake? edit this page.