Configuration
This is the full reference for every knob probectl reads at startup. The short
version: the control plane and every server-side feature read environment
variables (all prefixed PROBECTL_); the agents read a YAML file (with the
same env vars as overrides). This page lists each variable, its default, and what
it does — and it is the contract, so every row here is checked against the code.
How to read this page:
- A variable's default is what you get if you set nothing. The defaults are
  chosen so a fresh install boots in a safe, sovereign posture (no outbound calls,
  fail-closed on missing TLS/secrets) — probectl's standing security guardrails
  (see security/threat-model.md).
- A default of `(none)` means the value is empty/unset; the feature usually stays off until you give it one.
- Where it's read: the control plane resolves its config in `internal/config/config.go` (one `Load` function that reports every bad value at once and exits non-zero — you never chase config errors one at a time). Each agent has its own loader (`internal/agent`, `internal/ebpf`, `internal/flow`, `internal/device`, `internal/endpoint`).
Conventions
- Control plane (`probectl-control`): environment variables, `PROBECTL_` prefix. Listed in the next section.
- Agents: a YAML config file is the source of truth; the matching `PROBECTL_*` env vars override individual fields (handy for containers). Each agent's keys are in its own section below.
- Secrets are never hardcoded, logged, or placed in URLs/query strings.
  Sensitive values at rest are sealed with envelope encryption, and any credential
  in this document may be a secret reference (e.g. `vault:…`) instead of the raw
  value — see Secrets integration.
Control plane (probectl-control)
The control plane is the brain: it serves the API/UI, accepts agent connections, runs the alerting/incident/correlation engines, and talks to the datastores. It is stateless — all durable state lives in Postgres/ClickHouse/the TSDB — so every behavioral choice it makes comes from these environment variables, read once at boot. The table below is the base set every deployment uses; the feature-specific sections that follow add more.
Subcommands: probectl-control [serve] (default), probectl-control migrate (apply
database migrations and exit), probectl-control version, and
probectl-control gen-cert [dir] — a convenience that writes a self-signed
tls.crt/tls.key/ca.crt for an HTTPS quickstart (PROBECTL_CERT_HOSTS, default
localhost,127.0.0.1, sets the certificate's host names; production brings its own
CA-issued cert). The other subcommands are covered with their features:
agent-ca init|export, enroll-token, and revoke-agent (agent transport and
enrollment — agent/enrollment.md), scim-token
(SCIM, below), mcp-stdio and mcp-token (MCP server, below), preflight
(the storage-encryption preflight — hardening.md),
support-bundle (supportability, below), and backup-seal / backup-open
(sealed backups — Tenant lifecycle, below).
A note on the defaults: the listen address is :8080, the database DSN points at a
local Postgres with sslmode=require (TLS to the database is the default, not an
afterthought), and HSTS is on. These defaults assume you front the process with a
TLS-terminating ingress (the shipped Helm/compose posture); set the TLS cert/key
pair below to have the process serve HTTPS itself instead.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_HTTP_ADDR` | `:8080` | API listen address |
| `PROBECTL_HTTP_READ_TIMEOUT` | `15s` | HTTP read timeout |
| `PROBECTL_HTTP_WRITE_TIMEOUT` | `15s` | HTTP write timeout |
| `PROBECTL_HTTP_IDLE_TIMEOUT` | `60s` | HTTP idle (keep-alive) timeout |
| `PROBECTL_SHUTDOWN_TIMEOUT` | `15s` | graceful-shutdown drain timeout |
| `PROBECTL_DATABASE_URL` | `postgres://probectl:probectl@localhost:5432/probectl?sslmode=require` | PostgreSQL DSN; `sslmode=require` is the default (TLS to the DB out of the box). Dev-only: a local source-dev stack without TLS may explicitly append `sslmode=disable` to its own DSN |
| `PROBECTL_DATABASE_MAX_CONNS` | `10` | max pool connections (1–1000) |
| `PROBECTL_DATABASE_MIN_CONNS` | `0` | min pool connections |
| `PROBECTL_DATABASE_CONNECT_TIMEOUT` | `5s` | per-connection connect timeout |
| `PROBECTL_MIGRATE_ON_BOOT` | `false` | apply migrations during serve startup |
| `PROBECTL_LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `PROBECTL_LOG_FORMAT` | `json` | `json` \| `text` |
| `PROBECTL_HSTS_ENABLED` | `true` | send Strict-Transport-Security |
| `PROBECTL_HSTS_MAX_AGE` | `8760h` | HSTS max-age |
| `PROBECTL_TLS_CERT_FILE` | (none) | PEM server certificate; the process serves HTTPS directly when set together with the key |
| `PROBECTL_TLS_KEY_FILE` | (none) | PEM server private key (set together with the cert) |
| `PROBECTL_PUBLIC_TLS` | `false` | tells the app that TLS terminates at the edge (an ingress in front) even though the app itself serves plaintext. Browsers only see the edge, so this is what flips cookies to Secure when you run behind a TLS ingress |
| `PROBECTL_ALLOW_PLAINTEXT_HTTP` | `false` | explicit, loud opt-in for a non-loopback plaintext control listener — only valid behind a TLS-terminating ingress (the Helm chart sets it). Without it, plaintext + a non-loopback bind = refuse to start (fail closed) |
| `PROBECTL_SECURITY_CONTACT` | (none) | your vulnerability-disclosure mailbox; published in the served `/.well-known/security.txt` (left as a template comment when unset) |
| `PROBECTL_ENVELOPE_KEY` | (none) | base64-encoded 32-byte key-encryption key (KEK) for at-rest envelope encryption. The single root secret behind sealed credentials and backups — back it up |
| `PROBECTL_ENVELOPE_KEY_FILE` | (none) | path to the KEK file — loaded, or GENERATED+persisted (0600) on first boot if absent; an explicit `PROBECTL_ENVELOPE_KEY` wins over it. Shipped compose mounts it on the controldata volume |
| `PROBECTL_ENVELOPE_KEY_ID` | `dev` | identifier recorded alongside each sealed value (so a future key rotation can tell which key sealed what) |
| `PROBECTL_REQUIRE_AT_REST_ENCRYPTION` | `false` | when true, the control plane refuses to start if no envelope key resolves — a hard guarantee against accidentally running with plaintext-at-rest |
| `PROBECTL_STORAGE_ENCRYPTION_ATTESTED` | `false` | operator attestation that the bulk-store volumes are encrypted below the host (e.g. encrypted cloud volumes the startup preflight can't see); logged, and downgrades the preflight warning |
| `PROBECTL_AGENT_GRPC_ADDR` | (none) | agent gRPC listen address; enables the transport when set together with the agent mTLS files below |
| `PROBECTL_AGENT_TLS_CERT_FILE` | (none) | agent-transport server certificate (PEM) |
| `PROBECTL_AGENT_TLS_KEY_FILE` | (none) | agent-transport server private key (PEM) |
| `PROBECTL_AGENT_TLS_CA_FILE` | (none) | CA bundle that signs agent client certificates (PEM) |
| `PROBECTL_BUS_MODE` | `memory` | result bus: `memory` (lightweight, in-process) \| `kafka` |
| `PROBECTL_BUS_BROKERS` | (none) | comma-separated host:port Kafka brokers (required for `kafka`) |
| `PROBECTL_BUS_MEMORY_BUFFER` | `1024` | in-memory bus: per-subscriber channel depth (lightweight mode) |
| `PROBECTL_BUS_MEMORY_OVERFLOW` | `block` | in-memory bus overflow policy: `block` (back-pressure publisher) \| `drop` (drop + count, no deadlock) |
| `PROBECTL_BUS_TLS_ENABLED` | `false` | TLS to the Kafka brokers. Required in kafka mode unless the explicit dev flag below is set |
| `PROBECTL_BUS_TLS_CA_FILE` | (none) | private CA bundle for the brokers |
| `PROBECTL_BUS_TLS_CERT_FILE` | (none) | client certificate (broker mTLS; with `_KEY_FILE`) |
| `PROBECTL_BUS_TLS_KEY_FILE` | (none) | client key (broker mTLS) |
| `PROBECTL_BUS_SASL_MECHANISM` | (none) | `plain` \| `scram-sha-256` \| `scram-sha-512` |
| `PROBECTL_BUS_SASL_USER` | (none) | SASL username |
| `PROBECTL_BUS_SASL_PASSWORD` | (none) | SASL password (secret references supported; never logged) |
| `PROBECTL_BUS_ALLOW_PLAINTEXT` | `false` | dev only: allow a plaintext broker (the dev compose stack). Production never sets this |
| `PROBECTL_BUS_MAX_BUFFERED` | `0` (= built-in bound 65536) | bound on the async Kafka producer's in-flight records; a full buffer SHEDS new records (counted, never blocking ingest). 0/unset keeps the built-in 65536-record bound — there is deliberately no unbounded mode |
| `PROBECTL_BUS_WORKERS` | `4` | per-subscription consume parallelism — each Kafka poll batch is fanned out across this many key-sharded workers (per-key ordering preserved). 0/1 = serial |
| `PROBECTL_INGEST_MAX_SERIES_PER_AGENT` | `0` (= built-in cap 1000) | cap on active metric-series identities one agent may mint; a NEW identity past the cap is rejected per-series and counted (known series keep flowing), and an identity idle for 1h frees its slot. 0/unset keeps the built-in 1000 cap — the wall always exists (there is no unlimited setting) |
| `PROBECTL_INGEST_MAX_SERIES_PER_TENANT` | `0` (= built-in cap 50000) | tenant-wide active-series wall, so one tenant's cardinality explosion never bleeds into others. 0/unset keeps the built-in 50000 cap |
| `PROBECTL_TSDB_MEMORY_RETENTION` | `0` (= built-in window 1h) | lightweight-mode (in-memory) TSDB retention window, aged by ARRIVAL time (backfilled or clock-skewed sample timestamps are never swept early). 0/unset keeps the built-in 1h window — the buffer never grows forever |
| `PROBECTL_TSDB_MEMORY_MAX_BYTES` | `0` (= built-in wall 256 MiB) | byte ceiling for the in-memory TSDB; oldest-first eviction once exceeded, with usage + eviction counters exposed. 0/unset keeps the built-in 256 MiB wall |
| `PROBECTL_AUDIT_WORM_DIR` | (none) | enable write-once audit export — the provider audit chain is exported as Ed25519-signed segments into this directory (mount an S3/MinIO object-lock bucket for true write-once-read-many) and chain-verified each cycle |
| `PROBECTL_AUDIT_WORM_INTERVAL` | `1h` | export + chain-verify cadence |
| `PROBECTL_WORM_SIGNING_KEY_FILE` | (none) | path to the Ed25519 audit-export signing key (PKCS#8 PEM) — loaded, or GENERATED+persisted (0600) on first boot, so the key is stable across restarts (an ephemeral per-boot key would break cross-restart chain verification). Required when `PROBECTL_AUDIT_WORM_DIR` is set unless `PROBECTL_WORM_SIGNING_KEY` is. Back it up like the envelope key |
| `PROBECTL_WORM_SIGNING_KEY` | (none) | base64-encoded Ed25519 private-key PEM (KMS/secret-manager injection) — wins over `PROBECTL_WORM_SIGNING_KEY_FILE`. Enabling audit export with neither set fails closed (no silent ephemeral key) |
| `PROBECTL_TSDB_MODE` | `memory` | time-series writer: `memory` (in-process) \| `prometheus` |
| `PROBECTL_TSDB_URL` | (none) | Prometheus/VictoriaMetrics base URL for remote-write (required for `prometheus`) |
| `PROBECTL_ALERT_EVAL_INTERVAL` | `30s` | how often the alerting engine evaluates rules over the TSDB |
| `PROBECTL_INCIDENT_WINDOW` | `10m` | time window within which related signals correlate into one incident |
| `PROBECTL_AUTH_MODE` | `session` | identity mode: `session` (real OIDC SSO + session cookies) \| `dev` (LOCAL EVALUATION ONLY — exists only in `-tags devauth` builds; release binaries refuse it at boot) |
| `PROBECTL_DEV_AUTH_ACK` | (none) | must be `i-understand` to start in dev auth mode (tagged builds only, loopback bind required) |
| `PROBECTL_SESSION_TTL` | `12h` | server-side session lifetime |
| `PROBECTL_AUTH_RATE_MAX_FAILURES` | `5` | auth brute-force guard: failures per window before lockout |
| `PROBECTL_AUTH_RATE_WINDOW` | `1m` | failure-counting window for the auth throttle |
| `PROBECTL_AUTH_RATE_LOCKOUT` | `1m` | base lockout; doubles per consecutive lockout, capped at 1h; lockouts are audited |
| `PROBECTL_OIDC_ISSUER` | (none) | OIDC issuer URL; SSO discovery is performed against it |
| `PROBECTL_OIDC_CLIENT_ID` | (none) | OIDC client ID registered with the IdP |
| `PROBECTL_OIDC_CLIENT_SECRET` | (none) | OIDC client secret (kept out of logs/URLs) |
| `PROBECTL_OIDC_REDIRECT_URL` | (none) | the control plane's `/auth/callback` URL registered with the IdP |
| `PROBECTL_REQUIRE_MFA` | `false` | require multi-factor auth. The session's MFA state comes from the ID token's amr/acr claims (a second factor like otp/hwk/mfa, or acr aal2+/loa2+). When true, every authenticated /v1 request from a single-factor session gets 403 (enforced at request time). Off by default |
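
The escalating lockout described for PROBECTL_AUTH_RATE_LOCKOUT (a base duration that doubles per consecutive lockout, capped at 1h) can be sketched as follows. This is a minimal illustration of the arithmetic only: the function name and the 1-based `consecutive` counter are hypothetical, and it assumes the first lockout uses the base duration unchanged.

```python
from datetime import timedelta

CAP = timedelta(hours=1)  # the 1h cap stated in the table


def lockout_duration(base: timedelta, consecutive: int) -> timedelta:
    """Lockout length for the Nth consecutive lockout (1-based, assumed)."""
    if consecutive < 1:
        raise ValueError("consecutive must be >= 1")
    # Bound the exponent so a very long streak cannot overflow timedelta.
    doubled = base * (2 ** min(consecutive - 1, 20))
    return min(doubled, CAP)
```

With the default base of 1m, the sequence under these assumptions is 1m, 2m, 4m, ... 32m, then 1h for every later lockout.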
Invalid values fail fast: probectl-control reports all configuration problems
at once and exits non-zero. The database password is redacted from logs.
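
To illustrate the "report everything at once" behavior, here is a small sketch of a loader that collects every problem before failing. It is not probectl's Go `Load` function: the `load` and `ConfigError` names are hypothetical, and only three variables from the table are checked, using the defaults and allowed values listed there.

```python
class ConfigError(Exception):
    """Carries every configuration problem found, not just the first."""

    def __init__(self, problems):
        super().__init__("; ".join(problems))
        self.problems = problems


def load(env):
    """Validate a few control-plane settings, collecting all problems."""
    problems = []
    cfg = {}

    raw = env.get("PROBECTL_DATABASE_MAX_CONNS", "10")
    try:
        cfg["max_conns"] = int(raw)
        if not 1 <= cfg["max_conns"] <= 1000:
            problems.append(f"PROBECTL_DATABASE_MAX_CONNS: {raw!r} not in 1..1000")
    except ValueError:
        problems.append(f"PROBECTL_DATABASE_MAX_CONNS: {raw!r} is not an integer")

    enums = (
        ("log_level", "PROBECTL_LOG_LEVEL", "info", {"debug", "info", "warn", "error"}),
        ("log_format", "PROBECTL_LOG_FORMAT", "json", {"json", "text"}),
    )
    for key, name, default, allowed in enums:
        value = env.get(name, default)
        if value not in allowed:
            problems.append(f"{name}: {value!r} not one of {sorted(allowed)}")
        cfg[key] = value

    # Fail only after every check has run, so all bad values surface together.
    if problems:
        raise ConfigError(problems)
    return cfg
```

A caller would pass `os.environ`, print each entry of `problems` on failure, and exit with a non-zero status.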
Tenant-owned tables are protected by Postgres Row-Level Security. The
PROBECTL_DATABASE_URL role must be able to assume the least-privilege probectl_app
role (a superuser always can; otherwise run GRANT probectl_app TO <login_role>),
which internal/tenancy assumes per transaction so isolation holds regardless of
how the control plane authenticated. See architecture.md.
HTTP endpoints
| Method & path | Purpose |
|---|---|
| `GET /healthz` | Liveness — 200 while the process is serving |
| `GET /readyz` | Readiness — 200 when the database is reachable, else 503 |
| `GET /version` | Build and runtime metadata |
| `GET /openapi.json` | The OpenAPI 3.1 document |
Every response carries an X-Request-Id (honoring an inbound one) and the
security headers Strict-Transport-Security (when enabled) and
X-Content-Type-Options: nosniff. The versioned resource routes under /v1 are
documented in Resource API & CLI below.
Error envelope
All errors share one JSON shape and a stable domain-error → HTTP mapping:
{ "error": { "code": "not_found", "message": "…", "request_id": "…" } }
| Domain kind | Code | HTTP |
|---|---|---|
| BadRequest | `bad_request` | 400 |
| Unauthorized | `unauthorized` | 401 |
| Forbidden | `forbidden` | 403 |
| NotFound | `not_found` | 404 |
| Conflict | `conflict` | 409 |
| Validation | `validation` | 422 |
| Internal | `internal` | 500 |
| Unavailable | `unavailable` | 503 |
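
As an illustration of the mapping, the sketch below builds the error envelope from a domain kind. The `error_response` helper is hypothetical, and falling back to `Internal` for an unknown kind is an assumption of this example, not documented behavior.

```python
import json

# Domain kind -> (code, HTTP status), copied from the table above.
ERROR_MAP = {
    "BadRequest": ("bad_request", 400),
    "Unauthorized": ("unauthorized", 401),
    "Forbidden": ("forbidden", 403),
    "NotFound": ("not_found", 404),
    "Conflict": ("conflict", 409),
    "Validation": ("validation", 422),
    "Internal": ("internal", 500),
    "Unavailable": ("unavailable", 503),
}


def error_response(kind, message, request_id):
    """Return (http_status, json_body) in the shared error envelope shape."""
    # Assumption: an unrecognized kind is treated as an internal error.
    code, status = ERROR_MAP.get(kind, ERROR_MAP["Internal"])
    body = {"error": {"code": code, "message": message, "request_id": request_id}}
    return status, json.dumps(body)
```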
Transport security
probectl never wants a plaintext channel exposed to the network. There are two correct, interchangeable ways to put TLS in front of the API, and the config lets you pick:
- App-terminated TLS — set `PROBECTL_TLS_CERT_FILE` + `PROBECTL_TLS_KEY_FILE`, and the control plane serves HTTPS only (TLS 1.2+, prefer 1.3; plaintext is refused).
- Ingress-terminated TLS — leave them unset and serve HTTP behind a TLS-terminating ingress (the shipped Helm/compose default). HSTS is set either way, so the posture is correct end to end.
All TLS and crypto policy lives in internal/crypto; a CI guard
(scripts/check_crypto_imports.sh) forbids crypto-primitive imports elsewhere so
a FIPS 140-3 validated module can be swapped in. At-rest secrets use the envelope
helper (a per-record data key wrapped by a KMS/HSM-pluggable KEK; the dev
StaticKeyProvider reads PROBECTL_ENVELOPE_KEY).
Agent transport
This is how agents talk to the control plane, and it is locked down by design.
The agent gRPC transport (probectl.agent.v1.AgentService) runs only when
PROBECTL_AGENT_GRPC_ADDR and all three PROBECTL_AGENT_TLS_* files are set
(address + server cert + server key + the CA that signs client certs). It is
mutual-TLS only (RequireAndVerifyClientCert): the agent must present a client
certificate, and its tenant and id are read out of that certificate's identity
(spiffe://probectl/tenant/<t>/agent/<a>), never from the request body. So even a
misbehaving or malicious agent can only ever write to its own tenant — the identity
is cryptographic, not self-asserted. Populate PROBECTL_AGENT_TLS_CA_FILE (the
client-cert CA pool) with probectl-control agent-ca export <file>, which writes the
public agent-CA bundle (root + intermediate, no key). Generate dev mTLS material with the
internal/crypto CA helpers. The .proto lives under proto/probectl/agent/v1/;
regenerate Go with make proto (tools via make proto-tools).
Version-skew policy. At registration the control plane rejects agents outside
the supported version window, so a rolling upgrade never admits an incompatible
agent. See lifecycle.md.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_AGENT_SKEW_WINDOW` | `1` | allowed minor-version skew on either side (N/N-1); the control plane at minor N accepts agents at N-1…N+1. 0 requires an exact minor match |
| `PROBECTL_AGENT_MIN_VERSION` | (none) | an explicit floor — agents older than this are rejected regardless of the window (force-retire a known-bad version) |
A rejected agent gets a gRPC FailedPrecondition ("upgrade required"); a dev/unpinned
build (0.0.0-dev) on either side skips the check.
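
The admission rule above can be sketched as a small predicate. This is an illustration, not the control plane's code: `admit_agent` and `parse_version` are hypothetical names, versions are assumed to be `major.minor.patch` strings, and requiring the major versions to match is an assumption the text does not state.

```python
DEV_VERSION = "0.0.0-dev"


def parse_version(v):
    """Parse 'major.minor.patch' (any '-suffix' ignored) into an int tuple."""
    major, minor, patch = v.split("-")[0].split(".")
    return int(major), int(minor), int(patch)


def admit_agent(control, agent, skew_window=1, min_version=None):
    """True if the agent version is inside the supported window."""
    # A dev/unpinned build on either side skips the check entirely.
    if DEV_VERSION in (control, agent):
        return True
    c, a = parse_version(control), parse_version(agent)
    # The explicit floor applies regardless of the skew window.
    if min_version is not None and a < parse_version(min_version):
        return False
    # Assumption: a different major version is never compatible.
    if a[0] != c[0]:
        return False
    return abs(a[1] - c[1]) <= skew_window
```

A `False` result corresponds to the gRPC FailedPrecondition ("upgrade required") rejection.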
probectl-agent
The canary agent is the worker that actually runs the probes (ping, TCP, DNS,
HTTP, …). Unlike the control plane, its primary config is a YAML file
(-config, or the path in PROBECTL_AGENT_CONFIG) — see
deploy/agent/probectl-agent.example.yml.
Crucially, the agent does not configure its own tenant or id: those come from
its mTLS client certificate (above), so you can't accidentally point an agent at
the wrong tenant by editing a file.
A handful of env vars override individual YAML fields — useful in containers where mounting a full file is awkward:
| Variable | Overrides (YAML) | Meaning |
|---|---|---|
| `PROBECTL_AGENT_CONFIG` | — | path to the YAML config (the `-config` flag wins over it) |
| `PROBECTL_AGENT_GRPC_ADDR` | `control_plane.grpc_addr` | the control plane's agent-gRPC endpoint to dial |
| `PROBECTL_AGENT_TLS_CERT_FILE` | `tls.cert_file` | the agent's mTLS client certificate (PEM) |
| `PROBECTL_AGENT_TLS_KEY_FILE` | `tls.key_file` | the agent's mTLS client key (PEM) |
| `PROBECTL_AGENT_TLS_CA_FILE` | `tls.ca_file` | the CA that signed the control plane's server cert (PEM) |
| `PROBECTL_AGENT_BUFFER_DIR` | `buffer.dir` | on-disk store-and-forward directory (see below) |
| `PROBECTL_AGENT_IDENTITY_SERVER` | `identity.server` | control-plane HTTPS base URL enabling automatic certificate rotation — the agent rotates its mTLS identity at ~2/3 of its lifetime via `/enroll/agent/rotate`. See agent/enrollment.md |
| `PROBECTL_AGENT_JOIN_TOKEN` | — | a one-time join token for first-boot enrollment: with no identity present yet, the agent redeems it, writes its identity, then runs. Idempotent (a present identity is never overwritten) and fail-closed. See agent/enrollment.md |
| `PROBECTL_AGENT_ENROLL_TOKEN_FILE` | `enroll.token_file` | a file holding the join token (a mounted secret, read once); `PROBECTL_AGENT_JOIN_TOKEN` takes precedence |
| `PROBECTL_AGENT_ENROLL_SERVER` | `enroll.server` | enrollment target for first-boot enrollment; defaults to `identity.server` |
| `PROBECTL_AGENT_ENROLL_CA_PIN` | `enroll.ca_pin` | optional hex sha256 pin of the server cert for first contact; otherwise `tls.ca_file` verifies the server |
| `PROBECTL_AGENT_CANARY_CA_DIR` | `tls.canary_ca_dir` | the one directory that probe `ca_file:` parameters may reference (a trust-anchor allowlist for HTTP/DNS-over-TLS probes); empty = the `ca_file` parameter is refused |
| `PROBECTL_AGENT_LOG_LEVEL` | — | `debug` \| `info` (default) \| `warn` \| `error` |
| `PROBECTL_AGENT_LOG_FORMAT` | — | `json` (default) \| `text` |
Results buffer to disk (buffer.dir, bounded by max_records, default 10000)
while the control plane is unreachable and drain on reconnect (at-least-once
delivery). Probing keeps running regardless of connectivity, so a control-plane
outage never blocks measurement — the agent just queues and catches up.
Result pipeline
This is the path every measurement takes from an agent to a queryable metric, and
two env vars decide how heavy that pipeline is: PROBECTL_BUS_MODE (the message
bus) and PROBECTL_TSDB_MODE (the time-series writer). The memory defaults make
a single binary work with zero external dependencies; switch them to kafka /
prometheus when you outgrow that.
A streamed result flows agent → gRPC StreamResults → control-plane ingest →
result bus (probectl.network.results, Protobuf) → consumer → time-series writer.
The agent sends the canonical OTel-aligned result (proto/probectl/result/v1); the
control plane re-stamps the tenant and agent id from the verified mTLS
certificate before publishing, so a result is always attributed to the sending
agent's tenant regardless of payload contents — the tenant boundary is
cryptographic, never self-asserted. The bus key is the tenant_id.
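
The re-stamping step can be pictured with the sketch below, which derives tenant and agent id from the certificate's SPIFFE URI and overwrites whatever the payload claimed. It is a simplified illustration: results are modeled as plain dicts rather than Protobuf messages, the URI is assumed to have already been verified by the mTLS layer, and `identity_from_spiffe` and `restamp` are hypothetical names.

```python
from urllib.parse import urlparse


def identity_from_spiffe(uri):
    """Extract (tenant_id, agent_id) from spiffe://probectl/tenant/<t>/agent/<a>."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    well_formed = (
        parsed.scheme == "spiffe"
        and parsed.netloc == "probectl"
        and len(parts) == 4
        and parts[0] == "tenant"
        and parts[2] == "agent"
        and parts[1] != ""
        and parts[3] != ""
    )
    if not well_formed:
        raise ValueError(f"unexpected agent identity: {uri!r}")
    return parts[1], parts[3]


def restamp(result, cert_uri):
    """Return a copy of the result attributed to the certificate's identity."""
    tenant_id, agent_id = identity_from_spiffe(cert_uri)
    stamped = dict(result)
    # Payload-supplied identity is discarded; the certificate always wins.
    stamped["tenant_id"] = tenant_id
    stamped["agent_id"] = agent_id
    return stamped
```

In this sketch the stamped `tenant_id` is also the value that would be used as the bus key.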
PROBECTL_BUS_MODE selects the bus: memory (default; in-process, for the
lightweight <5-agent deployment and single-binary runs) or kafka (set
PROBECTL_BUS_BROKERS). PROBECTL_TSDB_MODE selects the writer: memory (default;
in-process) or prometheus remote-write to PROBECTL_TSDB_URL (Prometheus with
--web.enable-remote-write-receiver, or VictoriaMetrics; use an https:// URL
for TLS in transit). Each probe emits probectl_probe_success,
probectl_probe_duration_seconds, and one probectl_probe_<metric> per custom
metric, labeled tenant_id, agent_id, canary_type, and server_address. The
canonical signal→OTel mapping is in otel-mapping.md.
ICMP test
The icmp canary measures echo loss, latency, and jitter to a target
(IPv4 or IPv6). Configure it per-canary under canaries: (see
probectl-agent.example.yml). The
schedule interval and reply timeout are canary fields; the rest are params:
| Param | Default | Meaning |
|---|---|---|
| `count` | `5` | echo requests per probe (continuous mode defaults to the interval in s) |
| `payload_bytes` | `56` | ICMP data bytes (minimum 8) |
| `dscp` | `0` | DSCP marking 0–63 on outgoing packets (best-effort by platform) |
| `mode` | `batch` | `batch` (back-to-back) or `continuous` (1 packet/sec) |
| `privileged` | `false` | `true` prefers raw sockets; default is unprivileged datagram ICMP |
It emits probectl_probe_loss_ratio, probectl_probe_rtt_{min,avg,max,stddev}_ms,
probectl_probe_jitter_ms, and probectl_probe_packets_{sent,received}. A probe with
100% loss reports success=false (target unreachable); partial loss is a success
with a non-zero loss ratio. Continuous mode records a per-second drop-timing
record as result attributes (icmp.dropped_seqs, icmp.drop_send_offsets_ms) —
carried as OTel attributes, not TSDB labels, so they don't widen cardinality.
Privileges. By default the agent uses unprivileged datagram ICMP
(IPPROTO_ICMP), which on Linux requires the agent's group to be within
net.ipv4.ping_group_range (e.g. sysctl -w net.ipv4.ping_group_range="0 2147483647"). Alternatively grant raw-socket capability
(setcap cap_net_raw+ep /usr/bin/probectl-agent, or run with CAP_NET_RAW) and set
privileged: "true". The canary tries the preferred socket and falls back to the
other; if neither can be opened it returns an internal error (the probe is not
silently reported as loss).
TCP & UDP tests
The tcp and udp canaries are agent-to-server probes. Configure a target of
host:port (or a host with params.port). Both accept count and dscp.
The tcp canary measures connect latency + reachability (a connect-based,
unprivileged equivalent of a TCP-SYN test): it establishes a connection and times
the handshake, emitting probectl_probe_connect_{min,avg,max,stddev}_ms,
probectl_probe_jitter_ms, and probectl_probe_loss_ratio (failed connects = loss;
all-failed = success=false).
The udp canary is an echo round-trip probe: it sends token-tagged
datagrams and matches the echoes, emitting probectl_probe_rtt_* + loss. It needs a
target that echoes (a UDP echo service, or a probectl agent-to-agent responder); a
non-echoing target reports as 100% loss. params.payload_bytes (≥10) sets the
datagram size.
Voice/RTP tests
The voice canary streams real RTP packets at codec cadence to an echoing
target and scores the path: MOS + R-factor (simplified ITU-T G.107
E-model), RFC 3550 jitter, loss, and a one-way delay estimate. target is
host:port. Parameters: codec (g711 default, g729),
duration_seconds (1–10, default 3), dscp (default 46/EF). The model
variant and the one-way-estimate method ride the result attributes — a
computed MOS is never presented as a measured listening score. See
docs/voice.md.
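
Two of the standard building blocks named above can be sketched directly: the ITU-T G.107 conversion from an R-factor to an estimated MOS, and the RFC 3550 interarrival jitter update. This is only an illustration of the published formulas; how probectl's simplified E-model derives R from loss and delay is not shown here, and the function names are hypothetical.

```python
def r_to_mos(r):
    """ITU-T G.107 mapping from R-factor to an estimated (computed) MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


def update_jitter(jitter, transit_prev, transit_now):
    """RFC 3550 running jitter estimate: J += (|D| - J) / 16.

    transit_* are (arrival time - RTP timestamp) for consecutive packets,
    expressed in the same units as the jitter value.
    """
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16
```

For example, `r_to_mos(93.2)` evaluates to roughly 4.41, the figure usually quoted for the default G.107 parameter set.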
DNS tests
The dns canary queries DNS and reports resolution time, the answer, and an
optional DNSSEC verdict. The target is the query name. Parameters:
| Param | Values | Default | Meaning |
|---|---|---|---|
| `type` | `A`, `AAAA`, `MX`, `TXT`, `NS`, … | `A` | record type to query |
| `transport` | `udp` \| `tcp` \| `dot` \| `doh` | `udp` | how the query is sent |
| `server` | host[:port] or a DoH URL | per-transport | resolver to query |
| `mode` | `resolver` \| `trace` | `resolver` | single query vs. delegation walk |
| `dnssec` | `true` \| `false` | `false` | validate the zone signature |
server defaults by transport: the first nameserver in /etc/resolv.conf (or
1.1.1.1:53) for udp/tcp, 1.1.1.1:853 for DoT, and
https://cloudflare-dns.com/dns-query for DoH. DoT verifies the resolver's
TLS certificate (TLS 1.2+); DoH posts an RFC 8484 application/dns-message query
over HTTPS.
In resolver mode the canary emits probectl_probe_dns_query_ms (round-trip) and
probectl_probe_dns_answers (answer count), with dns.rcode and a compact
dns.answer summary as attributes. The probe is success=false on a non-NOERROR
rcode or an empty answer.
With dnssec: "true" the canary requests DNSSEC records (the DO bit) and
validates the zone's RRSIG over the answer against the zone DNSKEY — it
does not trust the resolver's AD bit. The verdict lands in the dns.dnssec
attribute (secure, insecure for an unsigned zone, or bogus) and
probectl_probe_dns_dnssec_secure (1/0); a bogus result (tampered, expired, or
wrong-key signature) fails the probe. Validation verifies the signature on the
answer RRset; full chain-to-root anchoring is a later refinement.
In trace mode the canary performs an iterative delegation walk from the
root hints, following NS/glue referrals down to the authoritative server (UDP,
capped iterations, with a recursive fallback when a referral ships no glue). It
emits probectl_probe_dns_query_ms (total walk time) and
probectl_probe_dns_trace_hops, with the delegation chain in the dns.trace
attribute. DNS-exfiltration detection and open-data baselines are out of scope for
this probe (they live in the NDR and open-data features).
HTTP server tests
The http canary measures HTTP(S) availability with a full response-time
breakdown and captures TLS handshake details for the TLS-posture plane (see
TLS / certificate observability below). The target is the URL. Parameters:
| Param | Values | Default | Meaning |
|---|---|---|---|
| `method` | `GET`, `HEAD`, `POST`, … | `GET` | request method |
| `expect_status` | codes / classes / ranges | `2xx,3xx` | which statuses count as available |
| `follow_redirects` | `true` \| `false` | `true` | follow 3xx redirects |
| `insecure_skip_verify` | `true` \| `false` | `false` | capture TLS but don't fail on an invalid cert. Deny-by-default: requires the admin-only `test.insecure_tls` permission and is flagged in the `test.create`/`test.update` audit entry |
| `ca_file` | path to a PEM bundle | — | extra trust anchor (private/internal CA); must live under `PROBECTL_AGENT_CANARY_CA_DIR` |
| `body` | string | — | request body (e.g. for POST) |
| `max_body_bytes` | integer | `10485760` | cap bytes read per probe (10 MiB) |
| `allow_private_targets` | `true` \| `false` | `false` | SSRF-guard override. Every canary (http/tcp/udp/icmp/dns/voice) denies loopback, RFC1918/ULA, link-local (incl. 169.254.169.254 cloud metadata), CGNAT, multicast and numeric-encoding bypasses by default, enforcing the check on the resolved address at dial time (rebind-proof). Setting true lifts the guard for that one test — requires the admin-only `test.allow_private` permission and is written to the tenant audit trail |
expect_status is a comma list of exact codes (200), classes (2xx), and
inclusive ranges (200-204); a response outside the set is success=false (the
status is still reported). The probe emits the timing breakdown as metrics —
probectl_probe_http_dns_ms (resolution), probectl_probe_http_connect_ms (TCP
connect), probectl_probe_http_tls_ms (TLS handshake), probectl_probe_http_ttfb_ms
(time to first byte), and probectl_probe_http_total_ms — plus
probectl_probe_http_status, probectl_probe_http_content_bytes, and
probectl_probe_http_throughput_kbps. A phase that does not occur (no DNS for an IP
target, no TLS for http://) is omitted rather than reported as zero. The resolved
server IP is captured as the network.peer.address attribute, which correlates
the result to path/traceroute data for the same destination.
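
The `expect_status` grammar described above (exact codes, classes, and inclusive ranges in a comma list) can be sketched as a small matcher. The `status_matches` name is hypothetical, and this example simply lets a malformed token raise `ValueError`; how the agent reports a bad expression is not covered here.

```python
def status_matches(status, expect="2xx,3xx"):
    """True if an HTTP status is within the expect_status set."""
    for token in (t.strip() for t in expect.split(",")):
        if not token:
            continue
        if len(token) == 3 and token[0].isdigit() and token[1:].lower() == "xx":
            # Class such as "2xx": compare the hundreds digit.
            if status // 100 == int(token[0]):
                return True
        elif "-" in token:
            # Inclusive range such as "200-204".
            low, high = token.split("-", 1)
            if int(low) <= status <= int(high):
                return True
        elif status == int(token):
            # Exact code such as "200".
            return True
    return False
```

A response for which this returns `False` corresponds to `success=false`, with the status still reported.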
TLS capture. On HTTPS the canary records the negotiated
tls.protocol.version and tls.cipher, the leaf certificate's
tls.server.{subject,issuer,not_before,not_after,san}, the chain shape
(tls.server.chain), and a probectl_probe_http_tls_cert_expiry_days metric
(negative once expired). It verifies the chain itself (hostname + trust, honoring
ca_file) after capturing the certificate, so the handshake details are
recorded even when the certificate is invalid or expired — an invalid cert
fails the probe but its details are still attached. Set insecure_skip_verify: "true" to capture posture without failing the availability check. probectl performs
no TLS posture analysis here (issuer trust, weak-cipher/expiry policy, CT) — that
is the TLS / certificate observability feature below, which consumes these
captured fields.
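
The default-deny address classes listed for `allow_private_targets` in the parameter table can be approximated with Python's standard `ipaddress` module. This is a sketch of the classification only, under stated assumptions: it inspects an already-resolved address (as the text says the real check does at dial time), the `is_denied` name is hypothetical, and the exact set of ranges probectl blocks may differ.

```python
import ipaddress

# Carrier-grade NAT range (RFC 6598), checked explicitly.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_denied(resolved_ip, allow_private_targets=False):
    """True if the resolved address falls in a default-denied class."""
    if allow_private_targets:
        return False
    ip = ipaddress.ip_address(resolved_ip)
    # Unwrap IPv4-mapped IPv6 so ::ffff:10.0.0.1 cannot smuggle a private v4.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_unspecified
        or (ip.version == 4 and ip in CGNAT)
    )
```

Because the input is the resolved address rather than the hostname, a DNS answer that changes between validation and connect is still classified at the point of use.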
Agent-to-agent tests
An agent-to-agent (A2A) test measures between two registered agents, brokered
by the control plane. The control plane assigns roles (one agent responds,
opening a short-lived listener; the other initiates), rendezvouses the
responder's endpoint to the initiator, and hands each agent its task when it
polls (PollCoordination / ReportEndpoint). The measurement is TWAMP-lite: the
initiator timestamps each probe (T1), the responder stamps receive/send (T2/T3)
and echoes, and the initiator stamps receive (T4), yielding round-trip
(probectl_probe_rtt_*) plus forward and reverse one-way delay
(probectl_probe_forward_avg_ms, probectl_probe_reverse_avg_ms). The responder also
reports forward-direction delivery (probectl_probe_packets_received,
probectl_probe_loss_ratio), so both agents and both directions are observed.
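
The four-timestamp arithmetic can be sketched as follows. The formulas are the usual TWAMP-style ones; whether probectl subtracts the responder's processing time (T3 - T2) from the round trip is an assumption of this example, and `twamp_lite` is a hypothetical name.

```python
def twamp_lite(t1, t2, t3, t4):
    """Derive delays from one probe, all timestamps in milliseconds.

    t1, t4 are read on the initiator's clock; t2, t3 on the responder's.
    """
    return {
        # Responder processing time cancels out, so clock offset does too.
        "rtt_ms": (t4 - t1) - (t3 - t2),
        # One-way values mix both clocks and so absorb any offset between them.
        "forward_ms": t2 - t1,
        "reverse_ms": t4 - t3,
    }
```

Shifting the responder's clock by 100 ms leaves `rtt_ms` unchanged but moves `forward_ms` and `reverse_ms` by +100 and -100, which is why the one-way figures depend on clock synchronization.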
Enable participation in the agent's a2a block: enabled: true,
advertise_host (the address peers use to reach this agent's responder),
poll_interval (default 2s), and responder_ttl (default 15s).
Caveats (document for production):
- NAT/firewall. The responder advertises `advertise_host`; behind NAT this must be a reachable address and the responder's ephemeral port must be reachable from the initiator. Auto-detection picks a non-loopback IPv4 — set `advertise_host` explicitly when that is wrong.
- Clocks. Forward/reverse one-way delays assume the two agents' clocks are synchronized (exact within one host; use NTP across hosts). Round-trip is clock-independent.
Sessions are brokered in-memory; triggering them from the test API is a later addition.
Path discovery
The path engine (internal/path) is the traceroute brain — it runs Paris-style
traceroutes (ICMP and TCP), which handle equal-cost multipath (ECMP) and MPLS, and
merges per-flow traces into one multi-path picture; see
architecture.md. A full per-hop trace needs raw sockets:
grant CAP_NET_RAW (setcap cap_net_raw+ep, or run privileged) to capture the
intermediate hops + MPLS labels. Without it, only the destination is discovered.
Where the discovered hops/links are stored is a control-plane choice:
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_PATHSTORE_MODE` | `memory` | `memory` (in-process, for the lightweight/single-binary case and tests) \| `clickhouse` (durable hop/link rows) |
| `PROBECTL_PATHSTORE_URL` | (none) | ClickHouse HTTP(S) endpoint (e.g. `http://localhost:8123`), partitioned by tenant; required when mode is `clickhouse` |
| `PROBECTL_PATH_RETENTION_DAYS` | `90` | delete-after-N-days TTL on the path/traceroute ClickHouse tables (applied at boot); 0 disables the TTL |
BGP routing intelligence
The BGP plane is a Python analyzer (analyzer/) plus a Go bridge (internal/bgp);
see architecture.md. The analyzer ingests public collector
data and emits probectl.bgp.events:
python -m probectl_analyzer --config config.json --mrt rib.mrt # RouteViews/RIS dump
python -m probectl_analyzer --config config.json --replay cap.jsonl # recorded RIS Live
python -m probectl_analyzer --config config.json --ris-live # live RIS Live websocket
The JSON config is per tenant (tenant_id is required — every event carries
it, and the bridge rejects any event without one):
| Key | Meaning |
|---|---|
| `tenant_id` | the owning tenant (outermost scope) |
| `monitored_prefixes[].prefix` | a prefix to watch (a more-specific announcement is matched too) |
| `monitored_prefixes[].expected_origins` | allowed origin ASNs — an origin outside this set raises `possible_hijack` |
| `monitored_prefixes[].no_transit` | ASNs that must not transit this prefix — mid-path appearance raises `possible_leak` |
| `collector` | collector label recorded on events (e.g. `rrc00`) |
| `rpki_vrp_file` / `rpki_vrp_url` | a rpki-client/Routinator VRP JSON export for RFC 6811 validation (absent → `unknown`) |
The analyzer emits probectl.bgp.events as JSON Lines; the Go bridge tails that
stream, validates the tenant, and republishes each as the canonical
probectl.bgp.v1.BGPEvent protobuf onto the bus (topic probectl.bgp.events, keyed by
tenant). Event types: origin_change (old/new origin + AS path), possible_hijack,
possible_leak, rpki_invalid; each carries an RPKI status (valid / invalid /
not_found / unknown), a severity, and a confidence — they are signals, not
actions — probectl never acts on routing. MRT dumps are stream-processed (no
full RIB in memory); a down RPKI/collector source degrades gracefully rather
than breaking the plane.
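
The matching rules in the key table can be illustrated with a minimal Python sketch. It is not the analyzer's code: the `classify` helper, the sample `CONFIG`, and the choice to treat every ASN before the origin as "mid-path" are assumptions made for illustration.

```python
import ipaddress

# Hypothetical per-tenant config, shaped like the key table above.
CONFIG = {
    "tenant_id": "tenant-a",
    "collector": "rrc00",
    "monitored_prefixes": [
        {"prefix": "192.0.2.0/24", "expected_origins": [64500], "no_transit": [64666]},
    ],
}

def classify(config, announced_prefix, as_path):
    """Return the event types one announcement would raise (illustrative only)."""
    announced = ipaddress.ip_network(announced_prefix)
    origin = as_path[-1]     # the origin AS is the last ASN in the path
    transit = as_path[:-1]   # assumption: everything before the origin counts as mid-path
    events = []
    for watched in config["monitored_prefixes"]:
        prefix = ipaddress.ip_network(watched["prefix"])
        # Match the watched prefix itself or any more-specific announcement inside it.
        if announced.version != prefix.version or not announced.subnet_of(prefix):
            continue
        if origin not in watched["expected_origins"]:
            events.append("possible_hijack")
        if any(asn in watched.get("no_transit", []) for asn in transit):
            events.append("possible_leak")
    return events
```

For example, `classify(CONFIG, "192.0.2.0/25", [64496, 64501])` flags the more-specific announcement because its origin is outside `expected_origins`.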
RouteViews/RIS are open data — their AUP/provenance matters for MSP/commercial
resale, not for private development or single-tenant OSS use.
Open-data enrichment
internal/opendata annotates IPs with ASN / geo / IXP / allocation context from
public datasets; see architecture.md and the source
provenance/AUP matrix in opendata-aup.md. The framework is a
library (the flow and test pipelines consume it where enrichment is enabled);
each source is pluggable and can be enabled individually:
| Source | Kind | Input it needs | Notes |
|---|---|---|---|
| Team Cymru | `asn` | a DNS resolver | IP→ASN/prefix/registry/AS-name via the Cymru IP-to-ASN DNS service |
| MaxMind GeoLite2 | `geo` | a `.mmdb` path (`OpenMMDB`) | country/city/lat-lon; operator-supplied DB (not shipped) |
| PeeringDB | `ixp` | the ASN (from Cymru) | IXP/facility presence via the PeeringDB REST API; cached per ASN |
| RIR delegated-stats | `allocation` | a delegated-extended stats file | RIR/country/status/date; parsed once into a sorted index |
| RIPE Atlas (optional) | `measurement` | an API key + credits | active ping/traceroute scheduling hook; off (fail-closed) by default |
The Enricher runs every enabled source over an IP and merges the results,
caching per IP and degrading gracefully: a disabled / failing / slow /
panicking source is logged, marked degraded or disabled in Enricher.Status(),
and skipped — a partial enrichment is returned and a down dataset never breaks a
core path. Sources run in registration order (register the ASN source before
PeeringDB). Each contribution records Provenance (source + license +
attribution + fields); a source's AUP (license, commercial-use permission,
attribution) is on its Descriptor, the matrix that gates MSP/commercial resale
(not private or single-tenant OSS use). All fetches are over TLS with
certificate validation and treated as untrusted: external content never gets
implicit trust. Open data is ingested once and shared; enrichment is scoped per
tenant by the consuming record.
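
The "run in registration order, skip what fails, return what you have" behaviour can be sketched in a few lines of Python. This is an illustration under assumptions, not the `internal/opendata` API: the `(name, enabled, lookup)` tuple shape and the status strings are hypothetical, and caching and slow-source timeouts are left out.

```python
def enrich(ip, sources):
    """Run each enabled source in registration order and merge partial results."""
    merged, status = {}, {}
    for name, enabled, lookup in sources:
        if not enabled:
            status[name] = "disabled"
            continue
        try:
            # Later sources may read earlier results (PeeringDB needs the ASN).
            merged.update(lookup(ip, merged))
            status[name] = "ok"
        except Exception:
            # A failing source is marked degraded and skipped, never fatal.
            status[name] = "degraded"
    return merged, status

def broken_geo(ip, so_far):
    raise RuntimeError("mmdb missing")

SOURCES = [
    ("cymru", True, lambda ip, so_far: {"asn": 64500}),
    ("geolite2", True, broken_geo),
    ("peeringdb", True, lambda ip, so_far: {"ixp": ["IX-%d" % so_far["asn"]]}),
    ("ripe_atlas", False, lambda ip, so_far: {}),
]
```

Calling `enrich("192.0.2.10", SOURCES)` returns the ASN and IXP data even though the geo source raised, which is the partial-enrichment behaviour the paragraph describes.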
Alerting
The alerting engine (internal/alert) evaluates rules over the TSDB and notifies
channels; see architecture.md. Rules are CRUD'd via
/v1/alerts (tenant-scoped) and the engine runs in the control plane, ticking
every PROBECTL_ALERT_EVAL_INTERVAL (default 30s).
A rule targets a metric series and is either a threshold or a baseline rule:
| Field | Applies | Meaning |
|---|---|---|
| `metric` + `match` | both | the TSDB metric (e.g. `probectl_probe_loss_ratio`) and label matchers |
| `type` | both | `threshold` \| `baseline` |
| `comparison` + `threshold` | threshold | `gt`/`lt`/`gte`/`lte`/`eq`/`neq` vs a bound |
| `window` + `sensitivity` | baseline | rolling-history size and deviation (in std-devs); warms up until the window fills |
| `for_n` | both | consecutive breaching evals before firing (debounce) |
| `renotify_seconds` | both | re-notify cadence while firing (`0` = notify once) |
| `severity` | both | `info` \| `warning` \| `critical` |
| `channels` | both | webhook / email destinations |
A channels entry is {"type":"webhook","url":...,"secret":...} or
{"type":"email","recipients":[...]}. The webhook secret is the HMAC key; it
is redacted (***) from API responses and never returned. SMTP for email is
configured at the deployment level (a follow-up exposes it as config).
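
To make the rule fields concrete, here is a minimal Python sketch of how a threshold or baseline rule with a `for_n` debounce could be evaluated tick by tick. It is not the engine in `internal/alert`: the `RuleState` class, the use of population standard deviation, and the choice to add every sample to the baseline history are assumptions.

```python
import operator
import statistics
from collections import deque

COMPARISONS = {"gt": operator.gt, "lt": operator.lt, "gte": operator.ge,
               "lte": operator.le, "eq": operator.eq, "neq": operator.ne}

class RuleState:
    """Per-series evaluation state for one rule (hypothetical helper)."""

    def __init__(self, rule):
        self.rule = rule
        self.breaches = 0
        self.history = deque(maxlen=rule.get("window", 0) or None)

    def _breaching(self, value):
        rule = self.rule
        if rule["type"] == "threshold":
            return COMPARISONS[rule["comparison"]](value, rule["threshold"])
        # Baseline: compare against the rolling window once it has filled.
        breach = False
        if len(self.history) == self.history.maxlen:
            mean = statistics.mean(self.history)
            spread = statistics.pstdev(self.history)
            breach = abs(value - mean) > rule["sensitivity"] * spread
        self.history.append(value)
        return breach

    def evaluate(self, value):
        """Return True when the rule is firing after this sample."""
        self.breaches = self.breaches + 1 if self._breaching(value) else 0
        return self.breaches >= self.rule.get("for_n", 1)
```

With `for_n` set to 2, a single breaching sample does not fire; the second consecutive one does, and any non-breaching sample resets the count.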
Webhook payload (probectl.alert.v1). On fire/resolve the webhook channel POSTs:
{
"version": "probectl.alert.v1",
"state": "firing",
"rule": { "id": "…", "name": "loss-high" },
"tenant_id": "…",
"severity": "critical",
"metric": "probectl_probe_loss_ratio",
"labels": { "server_address": "1.1.1.1" },
"value": 0.9,
"threshold": 0.5,
"comparison": "gt",
"reason": "probectl_probe_loss_ratio=0.9 gt 0.5",
"fired_at": "2026-01-02T15:04:05Z"
}
When the channel has a secret, the request carries
X-Probectl-Signature: sha256=<hex> — the HMAC-SHA256 of the exact body — so the
receiver can verify the sender. Each channel delivers independently: a failing
channel is logged and skipped, never blocking the others. Alerts are signals;
probectl notifies and does not act on the network (on-call/ITSM routing and
detection-as-code are their own features below).
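
On the receiving side, verifying the signature takes only the standard library. The following Python sketch assumes the receiver has the raw request body and the `X-Probectl-Signature` header value in hand; the function names are illustrative.

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    """Compute the header value for a body: 'sha256=' plus the hex HMAC-SHA256."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Check a received header against the exact bytes of the request body."""
    expected = sign(secret, body).encode()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, header.encode())
```

Verify against the raw bytes as received, before any JSON parsing or re-serialization, because the HMAC covers the exact body.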
Incidents
The incident correlator (internal/incident) groups related signals across planes
into one Incident with a unified timeline; see architecture.md.
It runs in the control plane, fed by the alert engine (network plane) and a
probectl.bgp.events consumer (BGP plane), and is exposed at /v1/incidents
(tenant-scoped):
- `GET /v1/incidents`: the tenant's incidents, most-recently-active first.
- `GET /v1/incidents/{id}`: an incident with its time-ordered signal timeline.
- `PATCH /v1/incidents/{id}` with `{"status":"resolved"}`: resolve an incident.
Signals correlate into one incident when they are close in time
(within PROBECTL_INCIDENT_WINDOW, default 10m) and related in target — the
same target, an IP inside the other's prefix (either direction), or overlapping
prefixes (so a network alert on 192.0.2.10 and a BGP event on 192.0.2.0/24
land together). An incident's severity is the max of its signals; a signal
without a tenant is rejected (fail closed).
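
The two correlation tests (close in time, related in target) can be sketched in Python with the standard `ipaddress` module. This is an illustration, not the correlator in `internal/incident`: the signal dict shape (`tenant_id`, `target`, `at`) and the same-tenant check are assumptions.

```python
import ipaddress
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # mirrors the PROBECTL_INCIDENT_WINDOW default

def related_targets(a, b):
    """Same target, an IP inside the other's prefix, or overlapping prefixes."""
    if a == b:
        return True
    try:
        net_a = ipaddress.ip_network(a, strict=False)  # a bare IP becomes a /32 or /128
        net_b = ipaddress.ip_network(b, strict=False)
    except ValueError:
        return False  # non-IP targets (hostnames) only match by equality
    return net_a.version == net_b.version and net_a.overlaps(net_b)

def correlates(sig_a, sig_b, window=WINDOW):
    """Decide whether two signals belong in one incident."""
    if not sig_a.get("tenant_id") or not sig_b.get("tenant_id"):
        raise ValueError("signal without a tenant is rejected")
    if sig_a["tenant_id"] != sig_b["tenant_id"]:
        return False
    close_in_time = abs(sig_a["at"] - sig_b["at"]) <= window
    return close_in_time and related_targets(sig_a["target"], sig_b["target"])
```

Treating a single address as a host-length prefix lets one `overlaps` call cover all three "related in target" cases.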
The model is extensible without schema churn: a Signal carries a free-form
plane/kind and an arbitrary attributes map, so the change, threat, cost, and
SLO planes attach as additional signal types onto the same Incident/timeline
without schema changes. AI root-cause analysis runs over the timeline.
SSO & RBAC
probectl authenticates users with OIDC SSO and authorizes them with role-based access control (RBAC). The security order is the two-level boundary: a request resolves to exactly one tenant first, then RBAC decides whether the caller may perform the route's action within that tenant.
Login flow. GET /auth/login (optionally ?tenant=<uuid>) starts the OIDC
authorization-code flow: it sets a short-lived, HttpOnly CSRF state cookie and
redirects to the tenant's identity provider. The IdP redirects back to
GET /auth/callback, which verifies the state, exchanges the code, verifies the
ID token, just-in-time provisions the user within the tenant (a brand-new user
gets no roles — a secure default; an admin grants access), mints a server-side
session, and sets the session cookie. POST /auth/logout revokes the session.
GET /v1/me returns the caller's tenant, identity, and effective permissions.
Sessions. A session is a random, high-entropy opaque token. Only its hash
is stored (table sessions), so a database read cannot mint a session. The
session cookie is HttpOnly + SameSite=Lax, and Secure whenever the
API serves HTTPS. PROBECTL_SESSION_TTL (default 12h) bounds its lifetime.
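
Why storing only a hash helps can be shown with a short Python sketch. The token length and the use of SHA-256 are assumptions for illustration; the source only states that the token is random and that its hash is what gets stored.

```python
import hashlib
import secrets

def mint_session():
    """Return (token for the cookie, digest for the sessions table)."""
    token = secrets.token_urlsafe(32)  # assumed size: 32 random bytes
    return token, hashlib.sha256(token.encode()).hexdigest()

def lookup(stored_digests, presented_token):
    """A request authenticates only if the hash of its cookie value is on record."""
    return hashlib.sha256(presented_token.encode()).hexdigest() in stored_digests
```

Someone who reads the table learns only digests; presenting a digest as a cookie fails, because the server hashes whatever it receives before looking it up.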
Per-tenant IdP. Providers are resolved per tenant through a provider factory —
the seam for a tenant bringing its own SSO. The shipped default is the
env-configured one (PROBECTL_OIDC_*); database-backed per-tenant IdP config is a
later addition. A login always resolves to a single tenant. Provider/MSP operators
authenticate into the provider domain (the management plane), not into tenant
data.
RBAC. Every /v1 route declares a required permission key; the wrapped
handler returns 401 when unauthenticated and 403 when the principal lacks
the permission — checked before the handler runs. Effective permissions are
loaded per request from the user's role bindings (RLS-scoped to the tenant), so a
role grant or revoke takes effect immediately. The permission catalog:
| Permission | Granted to (seeded roles) | Guards |
|---|---|---|
| `test.read` | viewer, editor, admin | `GET /v1/tests*`, `GET /v1/tests/{id}/path` |
| `test.write` | editor, admin | `POST/PUT/DELETE /v1/tests*`, `POST .../path` |
| `agent.read` | viewer, editor, admin | `GET /v1/agents*` |
| `agent.write` | admin | `PATCH/DELETE /v1/agents/{id}` |
| `alert.read` | viewer, editor, admin | `GET /v1/alerts*` |
| `alert.write` | editor, admin | `POST/PUT/DELETE /v1/alerts*` |
| `incident.read` | viewer, editor, admin | `GET /v1/incidents*` |
| `incident.write` | editor, admin | `PATCH /v1/incidents/{id}` |
The seeded system roles for the default tenant are admin (all permissions),
editor (read everything + manage tests/alerts/incidents), and viewer
(read-only). GET /v1/me requires only authentication (no specific permission).
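
The 401/403 decision and the seeded role grants from the catalog can be summarized in a small Python sketch. The `authorize` helper and the principal shape are hypothetical; the real check lives in the route wrapper and loads bindings from the database on every request.

```python
READ = {"test.read", "agent.read", "alert.read", "incident.read"}
EDIT = {"test.write", "alert.write", "incident.write"}

# Seeded roles, transcribed from the permission catalog above.
ROLE_PERMISSIONS = {
    "viewer": READ,
    "editor": READ | EDIT,
    "admin": READ | EDIT | {"agent.write"},
}

def authorize(principal, required=None):
    """Return the HTTP status the wrapper would produce before the handler runs."""
    if principal is None:
        return 401  # unauthenticated
    granted = set()
    for role in principal["roles"]:
        granted |= ROLE_PERMISSIONS[role]
    if required is not None and required not in granted:
        return 403  # authenticated but lacking the permission
    return 200      # the handler may run
```

A just-provisioned user with no roles passes `authorize(user)` (the `/v1/me` case) but gets 403 on every route that names a permission.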
Dev mode. PROBECTL_AUTH_MODE=dev bypasses SSO and synthesizes an
all-permissions principal for the default tenant, with the
X-Probectl-Tenant: <uuid> override for multi-tenant dev. It is
triple-gated: (1) the code path exists only in binaries built with
-tags devauth (make build-devauth) — a release binary refuses to start
in this mode; (2) PROBECTL_DEV_AUTH_ACK=i-understand must be set; (3) the
listener must bind loopback (PROBECTL_HTTP_ADDR=127.0.0.1:…). When active it
logs at error level and writes an auth.dev_mode_active audit event. The CI
gate no-devauth-in-release proves release binaries contain neither the
symbols nor the dev-principal literal. The test suite installs its own hook
in _test.go files, which never ship.
Resource API & CLI
The versioned resource API lives under /v1 (full schema at /openapi.json):
- `GET/POST /v1/tests`, `GET/PUT/DELETE /v1/tests/{id}`: synthetic-test CRUD.
- `GET /v1/agents`, `GET/PATCH/DELETE /v1/agents/{id}`: agents register over mTLS; the API lists, renames, and deregisters them.
- `GET/POST /v1/tests/{id}/path`: the latest discovered network path for a test, and a trigger to discover it now. The Path & Topology UI consumes this.
Every /v1 route is tenant-scoped through internal/tenancy + Postgres RLS,
so a request can never read or write across tenants. Authentication and RBAC are
real (see SSO & RBAC above): the caller's tenant and effective permissions come
from an authenticated session, and each route requires a permission. The "no
undocumented routes" rule is enforced by a test that matches the route table
against openapi.json.
The probectl CLI is the web-parity client. Configure it with flags or
environment: PROBECTL_API_URL (default http://localhost:8080),
PROBECTL_API_TOKEN (sent as Bearer), PROBECTL_TENANT (sent as X-Probectl-Tenant).
probectl test list
probectl test create --name edge-dns --type icmp --target 1.1.1.1 --interval 30
probectl test delete <id>
probectl agent list
probectl --json test list # machine-readable output
eBPF host agent
The eBPF agent watches a host's network from inside the Linux kernel — it sees
which processes talk to which services without you instrumenting anything. It is
observe-only: it never blocks or modifies traffic. Like the canary agent, its
real config is a YAML file (-config / PROBECTL_EBPF_CONFIG); see
deploy/agent/probectl-ebpf-agent.example.yml
and ebpf-agent.md, with PROBECTL_EBPF_* env vars overriding
individual fields. The in-kernel loader is compiled in only with the ebpf build
tag; without it (or for tests), point fixture_path at a recording to replay.
The big idea in the keys below: layer-7 plaintext capture is off, and stays off
until you prove three separate intents — turn it on (L7_CAPTURE), name the
tenant that consents (L7_CONSENT_TENANT), and list the exact workloads
(L7_SCOPE). Miss any one and the kernel copies no payload. That is the
fail-closed posture for the most sensitive thing this agent can do.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_EBPF_CONFIG` | (none) | path to the YAML config (`-config` flag overrides) |
| `PROBECTL_EBPF_TENANT_ID` | (required) | the tenant every flow is stamped with; the agent refuses to start without it |
| `PROBECTL_EBPF_HOST` | OS hostname | observing host name |
| `PROBECTL_EBPF_BUS_MODE` | `memory` | `memory` \| `kafka` |
| `PROBECTL_EBPF_BUS_BROKERS` | (none) | comma-separated Kafka brokers (kafka mode) |
| `PROBECTL_EBPF_BUS_NAMESPACE` | (none) | publish on this tenant's siloed bus lane (`probectl.<ns>.ebpf.flows`) instead of the shared topic; for per-tenant-namespaced (siloed) deployments |
| `PROBECTL_EBPF_FIXTURE_PATH` | (none) | replay recorded flows instead of loading eBPF (no-kernel path) |
| `PROBECTL_EBPF_L7_FIXTURE_PATH` | (none) | replay recorded layer-7 events (no-kernel L7 path) |
| `PROBECTL_EBPF_RING_BUFFER_BYTES` | `16777216` | size of the kernel→userspace ring buffer (16 MiB; live loader only). Bigger absorbs bigger traffic bursts at the cost of memory |
| `PROBECTL_EBPF_LIBSSL` | (auto) | explicit libssl path for TLS-plaintext (uprobe) L7 capture; auto-discovered when unset (ebpf build) |
| `PROBECTL_EBPF_L7_CAPTURE` | `false` | master switch: live TLS-plaintext capture is OFF by default. `true` alone is not enough; consent AND scope below are also required |
| `PROBECTL_EBPF_L7_CONSENT_TENANT` | (none) | the explicit per-tenant consent: must equal this agent's bound tenant id exactly, else capture stays off |
| `PROBECTL_EBPF_L7_SCOPE` | (none) | the explicit workload opt-in: comma-separated `pid:<n>`, `exe:/abs/path`, `cgroup:/abs/cgroup-dir` entries. The kernel program drops every other process BEFORE copying a byte; empty = capture refuses to start. Host-wide capture is deliberately not expressible. Container/pod scoping is the `cgroup:` form (a container IS a cgroup); `exe:` entries are re-resolved every 10s so restarts stay in scope |
| `PROBECTL_EBPF_L7_REDACTION` | `headers` | how much of a payload may survive capture: `headers` zeroes the bodies in place before anything is retained (protocol metadata survives); `length` captures NO payload bytes (traffic shape only, no parsed calls); `full` (consented debugging) disables masking |
| `PROBECTL_EBPF_L7_KERNEL_WINDOW` | `1024` | max plaintext bytes per chunk that may cross from kernel into userspace under `headers` redaction (128–4095); bytes past the window never leave the kernel. `length` forces 0, `full` forces 4095. An unprogrammed kernel defaults to length-only, so it ships no plaintext |
| `PROBECTL_EBPF_PROC_ROOT` | `/proc` | procfs root for process/cgroup enrichment |
| `PROBECTL_EBPF_FLUSH_INTERVAL` | `10s` | how often flows + the service map are emitted |
| `PROBECTL_EBPF_HEALTH_ADDR` | (none) | bind a liveness/readiness probe server (e.g. `:9090`; `/healthz` = process up, `/readyz` = flow source attached). Empty disables it. The Helm DaemonSet sets it from `health.port` |
| `PROBECTL_EBPF_LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `PROBECTL_EBPF_LOG_FORMAT` | `json` | `json` \| `text` |
Flows + service edges are published to probectl.ebpf.flows (ebpfv1.FlowBatch,
tenant-keyed). The live loader needs a BTF Linux kernel (≥5.8) and
CAP_BPF/CAP_PERFMON; see ebpf-agent.md.
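
The three-intent gate on layer-7 capture (switch, consent, scope) can be sketched as a pure function over the environment. This Python sketch is illustrative, not the agent's loader: the helper names, the literal `"true"` boolean parsing, and returning `False` rather than refusing to start on an empty scope are assumptions.

```python
def parse_scope(raw):
    """Parse PROBECTL_EBPF_L7_SCOPE into (kind, value) entries, rejecting anything else."""
    entries = []
    for item in (part.strip() for part in raw.split(",")):
        if not item:
            continue
        kind, _, value = item.partition(":")
        if kind == "pid" and value.isdecimal():
            entries.append((kind, int(value)))
        elif kind in ("exe", "cgroup") and value.startswith("/"):
            entries.append((kind, value))  # absolute paths only
        else:
            raise ValueError("bad L7 scope entry: %r" % item)
    return entries

def l7_capture_allowed(env, tenant_id):
    """All three intents must hold; missing any one keeps capture off."""
    enabled = env.get("PROBECTL_EBPF_L7_CAPTURE", "false") == "true"
    consent = env.get("PROBECTL_EBPF_L7_CONSENT_TENANT", "") == tenant_id
    scope = parse_scope(env.get("PROBECTL_EBPF_L7_SCOPE", ""))
    return enabled and consent and bool(scope)
```

There is no wildcard branch in `parse_scope`, which mirrors the point that host-wide capture is not expressible.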
Agent→bus TLS/SASL (eBPF, endpoint, flow, and device agents)
When a telemetry agent publishes straight to Kafka, its broker connection takes
the same hardening keys as the control plane's PROBECTL_BUS_* set, under the
agent's own prefix: PROBECTL_EBPF_BUS_* here, and likewise
PROBECTL_ENDPOINT_BUS_*, PROBECTL_FLOW_BUS_*, and PROBECTL_DEVICE_BUS_*
for the agents below. The policy is the same fail-closed one: kafka mode
without TLS refuses to start unless the explicit dev-only plaintext flag is
set. (The canary agent has no bus keys — it talks gRPC/mTLS to the control
plane, which publishes on its behalf.)
| Suffix (append to the agent's prefix) | Default | Meaning |
|---|---|---|
| `_BUS_TLS_ENABLED` | `false` | TLS to the brokers; required in kafka mode unless `_BUS_ALLOW_PLAINTEXT` is set |
| `_BUS_TLS_CA_FILE` | (none) | private CA bundle for the brokers |
| `_BUS_TLS_CERT_FILE` / `_BUS_TLS_KEY_FILE` | (none) | client certificate + key (broker mTLS) |
| `_BUS_SASL_MECHANISM` | (none) | `plain` \| `scram-sha-256` \| `scram-sha-512` |
| `_BUS_SASL_USER` / `_BUS_SASL_PASSWORD` | (none) | SASL credentials (the agents read these as literal env values; the secret-reference schemes are a control-plane feature) |
| `_BUS_ALLOW_PLAINTEXT` | `false` | dev only: allow a plaintext broker (the dev compose stack). Production never sets this |
| `_BUS_MAX_BUFFERED` | `0` (= built-in bound 65536) | async-producer in-flight bound; a full buffer sheds + counts, never blocks |
Endpoint / DEM agent (probectl-endpoint)
"DEM" is digital experience monitoring: this agent runs on an end-user's laptop
(Linux/macOS/Windows), measures their actual last-mile experience, and figures out
whether a slowdown is the WiFi, the ISP, or the network. Because it sits on a
personal device, its defaults are privacy-first — it collects the WiFi name
and gateway (useful, low-risk) but not the AP MAC or public hop IPs (which can
geolocate a person), and it discloses exactly what it collects on startup. It reads
a YAML config (path from -config or PROBECTL_ENDPOINT_CONFIG); PROBECTL_ENDPOINT_* env
vars override it. See endpoint-dem.md.
| Variable | Default | Meaning |
|---|---|---|
| `PROBECTL_ENDPOINT_CONFIG` | (none) | path to the YAML config (`-config` flag overrides) |
| `PROBECTL_ENDPOINT_TENANT_ID` | (required) | the tenant every result is stamped with; refuses to start without it |
| `PROBECTL_ENDPOINT_AGENT_ID` | OS hostname | device identifier in the fleet |
| `PROBECTL_ENDPOINT_BUS_MODE` | `memory` | `memory` \| `kafka` |
| `PROBECTL_ENDPOINT_BUS_BROKERS` | (none) | comma-separated Kafka brokers (kafka mode) |
| `PROBECTL_ENDPOINT_BUS_NAMESPACE` | (none) | publish on this tenant's siloed bus lane instead of the shared topic (siloed deployments) |
| `PROBECTL_ENDPOINT_INTERVAL` | `60s` | how often a sample is collected |
| `PROBECTL_ENDPOINT_TARGETS` | `https://1.1.1.1,https://www.google.com` | comma-separated targets (first = last-mile trace; all = session probes) |
| `PROBECTL_ENDPOINT_MAX_HOPS` | `20` | last-mile trace hop cap |
| `PROBECTL_ENDPOINT_COLLECT_SSID` | `true` | retain the WiFi network name (SSID) |
| `PROBECTL_ENDPOINT_COLLECT_BSSID` | `false` | retain the access-point MAC (BSSID); geolocatable PII, off by default |
| `PROBECTL_ENDPOINT_COLLECT_GATEWAY_IP` | `true` | retain the (private) default-gateway address |
| `PROBECTL_ENDPOINT_COLLECT_PUBLIC_HOPS` | `false` | retain PUBLIC last-mile hop IPs (which reveal ISP/geo), off by default |
| `PROBECTL_ENDPOINT_LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `PROBECTL_ENDPOINT_LOG_FORMAT` | `json` | `json` \| `text` |
Results (WiFi / gateway / last-mile / session signals + the attribution verdict)
are published to probectl.endpoint.results (resultv1.Result, tenant-keyed),
flowing through the same pipeline as every other canary. The agent discloses
exactly what it collects at startup and never phones home.
Flow collector (probectl-flow-agent)
The flow collector listens for NetFlow v5/v9, IPFIX, and sFlow v5 datagrams from
network devices, decodes them (template + sampling handling), and publishes
normalized batches to probectl.flow.events (flowv1.FlowBatch, tenant-keyed).
It reads a YAML config (path from -config or PROBECTL_FLOW_CONFIG); PROBECTL_FLOW_*
env vars override the file. The defaults serve all three protocols on their
standard ports (NetFlow :2055, IPFIX :4739, sFlow :6343). See
flow.md for the security posture: flow export is plaintext UDP by
design, so every datagram is treated as untrusted and the collector should sit
adjacent to its exporters (not exposed to the wider network).
| Variable | Default | Meaning |
|---|---|---|
| `PROBECTL_FLOW_CONFIG` | (none) | path to the YAML config (`-config` flag overrides) |
| `PROBECTL_FLOW_TENANT` | (required) | the tenant every flow record is stamped with; refuses to start without it |
| `PROBECTL_FLOW_BUS_NAMESPACE` | (none) | publish this agent's batches on its tenant's siloed bus lane (`probectl.<ns>.flow.events`) instead of the shared topic; a malformed value refuses start. The same key exists for the other agents: `PROBECTL_DEVICE_BUS_NAMESPACE`, `PROBECTL_EBPF_BUS_NAMESPACE`, `PROBECTL_ENDPOINT_BUS_NAMESPACE` |
| `PROBECTL_FLOW_AGENT_ID` | OS hostname | collector identifier |
| `PROBECTL_FLOW_BUS_MODE` | `memory` | `memory` \| `kafka` |
| `PROBECTL_FLOW_BUS_BROKERS` | (none) | comma-separated Kafka brokers (kafka mode) |
| `PROBECTL_FLOW_NETFLOW_ENABLED` | `true` | serve NetFlow v5 and v9 (version-sniffed) on one socket |
| `PROBECTL_FLOW_NETFLOW_LISTEN` | `:2055` | NetFlow UDP listen address |
| `PROBECTL_FLOW_IPFIX_ENABLED` | `true` | serve IPFIX |
| `PROBECTL_FLOW_IPFIX_LISTEN` | `:4739` | IPFIX UDP listen address |
| `PROBECTL_FLOW_SFLOW_ENABLED` | `true` | serve sFlow v5 |
| `PROBECTL_FLOW_SFLOW_LISTEN` | `:6343` | sFlow UDP listen address |
| `PROBECTL_FLOW_BATCH_SIZE` | `1000` | records per emitted batch |
| `PROBECTL_FLOW_FLUSH_INTERVAL` | `2s` | max time a record waits before emission |
| `PROBECTL_FLOW_TEMPLATE_TTL` | `30m` | v9/IPFIX template expiry |
| `PROBECTL_FLOW_MAX_TEMPLATES` | `4096` | template-cache size cap (untrusted-input bound) |
| `PROBECTL_FLOW_READ_BUFFER_BYTES` | `4194304` | kernel UDP receive buffer (burst absorption) |
| `PROBECTL_FLOW_QUEUE_SIZE` | `65536` | decode→flush channel depth (overflow drops are counted) |
| `PROBECTL_FLOW_WORKERS` | `2` | reader goroutines per socket |
| `PROBECTL_FLOW_LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `PROBECTL_FLOW_LOG_FORMAT` | `json` | `json` \| `text` |
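
Version sniffing on the shared NetFlow socket works because both NetFlow v5 and v9 headers begin with a 16-bit version field in network byte order. A minimal Python sketch of that check follows; the function name is hypothetical and it says nothing about how the collector's decoder is actually structured.

```python
import struct

def sniff_netflow(datagram: bytes):
    """Return 'v5' or 'v9' from the header's version field, or None to drop the datagram."""
    if len(datagram) < 2:
        return None  # too short to carry a version; treat as untrusted noise
    (version,) = struct.unpack_from("!H", datagram)  # first two bytes, big-endian
    return {5: "v5", 9: "v9"}.get(version)
```

Anything that is neither version is rejected before any further parsing, in keeping with treating every datagram as untrusted input.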
The control plane consumes that flow topic, optionally enriches each record
with ASN/geo, and persists to the flow store behind /v1/flows/* (top-talkers /
capacity / anomalies). These are control-plane keys (not flow-agent keys):
| Variable | Default | Meaning |
|---|---|---|
| `PROBECTL_FLOWSTORE_MODE` | `memory` | where flow records live: `memory` (lightweight/single-binary) \| `clickhouse` (durable, high-cardinality) |
| `PROBECTL_FLOWSTORE_URL` | (none) | ClickHouse HTTP(S) endpoint; required in `clickhouse` mode |
| `PROBECTL_FLOWSTORE_TENANT_SCOPING` | `false` | defense-in-depth: also constrain flow reads at the database by attaching a per-request tenant setting that a ClickHouse row policy enforces (needs server-side `custom_settings_prefixes=SQL_` + a reader user). Tenant scoping already happens above this; this pushes it down one more layer |
| `PROBECTL_FLOWSTORE_READER_USER` | (none) | the ClickHouse reader user the setting-scoped row policy is installed on at boot (pairs with the toggle above) |
| `PROBECTL_FLOW_RETENTION_DAYS` | `0` (keep) | when > 0, applies a delete-after-N-days TTL to the `probectl_flows` ClickHouse table; `0` keeps flows indefinitely |
| `PROBECTL_FLOW_ENRICH_ASN` | `false` | opt-in Team Cymru ASN enrichment. Off by default because it makes outbound DNS lookups (the no-phone-home guardrail); AS numbers the device itself exported always pass through regardless |
Device telemetry agent (probectl-device-agent)
This agent reads metrics straight off network gear (routers, switches). It polls
the old way (SNMP v2c/v3) and subscribes the modern streaming way
(gNMI/OpenConfig), normalizes both into one DeviceMetric shape, and publishes
to probectl.device.metrics (tenant-keyed); the control plane lands them in the
TSDB as probectl_device_* series. The full device list lives in a YAML config
(see deploy/agent/probectl-device-agent.example.yml); the env vars below override
it and give a single-device quick start for trying one device fast. See
device-telemetry.md.
| Variable | Default | Meaning |
|---|---|---|
| `PROBECTL_DEVICE_CONFIG` | (none) | path to the YAML config (`-config` flag overrides) |
| `PROBECTL_DEVICE_TENANT` | (required) | the tenant every device metric is stamped with; refuses to start without it |
| `PROBECTL_DEVICE_AGENT_ID` | OS hostname | agent identifier |
| `PROBECTL_DEVICE_BUS_MODE` | `memory` | `memory` \| `kafka` |
| `PROBECTL_DEVICE_BUS_BROKERS` | (none) | comma-separated Kafka brokers (kafka mode) |
| `PROBECTL_DEVICE_BUS_NAMESPACE` | (none) | publish on this tenant's siloed bus lane instead of the shared topic (siloed deployments) |
| `PROBECTL_DEVICE_TARGET` | (none) | quick start: add one device by address |
| `PROBECTL_DEVICE_TRANSPORT` | `snmpv2c` | quick-start transport: `snmpv2c` \| `snmpv3` \| `gnmi` |
| `PROBECTL_DEVICE_CREDENTIAL` | (none) | quick start: credential NAME for the device (see below) |
| `PROBECTL_DEVICE_PORT` | `161` (SNMP) / `9339` (gNMI) | quick start: port override (defaults to the transport's standard port) |
| `PROBECTL_DEVICE_INTERVAL` | `60s` | quick start: poll/sample interval |
| `PROBECTL_DEVICE_LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `PROBECTL_DEVICE_LOG_FORMAT` | `json` | `json` \| `text` |
Credentials are referenced by NAME, never inlined — no secrets in
the device list. The default credential source resolves those names from the
environment (the PROBECTL_DEVICE_CRED_<NAME>_* vars below); the secrets backends
plug Vault/CyberArk into the same seam. An unresolvable name fails closed at
startup. <NAME> is the upper-cased credential name with -/. → _:
| Variable | Used by | Meaning |
|---|---|---|
| `PROBECTL_DEVICE_CRED_<NAME>_COMMUNITY` | snmpv2c | community string |
| `PROBECTL_DEVICE_CRED_<NAME>_USERNAME` | snmpv3, gnmi | USM user / gNMI metadata user |
| `PROBECTL_DEVICE_CRED_<NAME>_AUTH_PROTO` | snmpv3 | `sha` (default) \| `sha256` \| `sha512` \| `md5` |
| `PROBECTL_DEVICE_CRED_<NAME>_AUTH_PASS` | snmpv3 | auth passphrase (empty → NoAuthNoPriv) |
| `PROBECTL_DEVICE_CRED_<NAME>_PRIV_PROTO` | snmpv3 | `aes` (default) \| `aes256` \| `des` |
| `PROBECTL_DEVICE_CRED_<NAME>_PRIV_PASS` | snmpv3 | privacy passphrase (empty → AuthNoPriv) |
| `PROBECTL_DEVICE_CRED_<NAME>_PASSWORD` | gnmi | gNMI metadata password |
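
The name-to-variable mapping for environment-backed credentials can be illustrated with a short Python sketch. The helper names are hypothetical, and raising `LookupError` stands in for the agent's fail-closed startup check.

```python
def cred_env_prefix(name):
    """Upper-case the credential name and map '-' and '.' to '_'."""
    normalized = name.upper().replace("-", "_").replace(".", "_")
    return "PROBECTL_DEVICE_CRED_" + normalized + "_"

def resolve_community(name, env):
    """Look up an SNMP v2c community string by credential name; fail closed if absent."""
    key = cred_env_prefix(name) + "COMMUNITY"
    value = env.get(key)
    if not value:
        # Report the variable name only, never a secret value.
        raise LookupError("unresolvable credential: %s is not set" % key)
    return value
```

A device list entry that names the credential `core-sw.lab` would therefore read `PROBECTL_DEVICE_CRED_CORE_SW_LAB_COMMUNITY`.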
gNMI connections are TLS with certificate verification (system roots or a
per-device ca_file); there is no skip-verify option. plaintext: true is an
explicit lab-only YAML opt-in and is loudly logged — never a silent plaintext default.
OTLP receiver
This lets other systems push their OpenTelemetry data (metrics, traces, logs)
into probectl. It is off by default and, when on, is locked to the same posture
as everything else: TLS-only, token-authenticated, tenant-scoped, on its own
listeners separate from the /v1 REST API. There is no anonymous-plaintext mode —
setting a listen address without both a TLS cert/key pair and at least one bearer
token fails config validation. See otlp.md.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_OTLP_GRPC_ADDR` | (none) | OTLP/gRPC listen address (e.g. `:4317`) |
| `PROBECTL_OTLP_HTTP_ADDR` | (none) | OTLP/HTTP listen address (e.g. `:4318`); accepts all three signals: `POST /v1/metrics`, `/v1/traces`, `/v1/logs` |
| `PROBECTL_OTELSTORE_MODE` | `memory` | where ingested OTLP traces+logs live: `memory` (lightweight) \| `clickhouse` (production; `(tenant_id, day)` partition) |
| `PROBECTL_OTELSTORE_URL` | (none) | ClickHouse HTTP URL for `clickhouse` mode (https = TLS in transit) |
| `PROBECTL_OTEL_RETENTION_DAYS` | `30` | delete-TTL for stored OTLP traces+logs (`0` disables) |
| `PROBECTL_OTLP_TLS_CERT_FILE` | (none) | PEM server certificate (required to enable) |
| `PROBECTL_OTLP_TLS_KEY_FILE` | (none) | PEM server private key (required to enable) |
| `PROBECTL_OTLP_TOKENS` | (none) | bearer-token→tenant map: `token1=tenant1,token2=tenant2` |
Setting an address without the TLS files and at least one token fails config
validation — the receiver is never anonymous plaintext. Ingested metrics are
tenant-tagged and published to the probectl.otlp.metrics bus topic.
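
The validation rule (a listen address needs both TLS files and at least one token) and the `token=tenant` map format can be sketched in Python. This is an illustration of the stated contract rather than the loader in `internal/config`; the function names and error text are assumptions.

```python
def parse_tokens(raw):
    """Parse 'token1=tenant1,token2=tenant2' into a token-to-tenant dict."""
    mapping = {}
    for pair in (part.strip() for part in raw.split(",")):
        if not pair:
            continue
        token, sep, tenant = pair.partition("=")
        if not (sep and token and tenant):
            raise ValueError("expected token=tenant")  # never echo the token itself
        mapping[token] = tenant
    return mapping

def otlp_enabled(env):
    """False when no listener is configured; raise when one is set without TLS + token."""
    listening = env.get("PROBECTL_OTLP_GRPC_ADDR") or env.get("PROBECTL_OTLP_HTTP_ADDR")
    if not listening:
        return False
    has_tls = env.get("PROBECTL_OTLP_TLS_CERT_FILE") and env.get("PROBECTL_OTLP_TLS_KEY_FILE")
    tokens = parse_tokens(env.get("PROBECTL_OTLP_TOKENS", ""))
    if not has_tls or not tokens:
        raise ValueError("OTLP listener needs a TLS cert/key pair and at least one token")
    return True
```

Because each token maps to exactly one tenant, the bearer token alone decides which tenant the ingested data is tagged with.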
Ecosystem integrations
The Grafana datasource API (/v1/grafana/api/v1/*), the federation endpoint
(/v1/prometheus/federate), and the remote-write receiver
(/v1/prometheus/write) ride the existing TSDB config (PROBECTL_TSDB_MODE /
PROBECTL_TSDB_URL) and the /v1 API listener — no extra keys. Reads need
metrics.read, remote-write metrics.write (migration 0022). See
ecosystem-integrations.md.
The ServiceNow CMDB correlation is off unless configured:
| Variable | Default | Meaning |
|---|---|---|
| `PROBECTL_CMDB_PROVIDER` | (none) | `servicenow` enables CI correlation (`/v1/cmdb/*`, incident/agent CIs) |
| `PROBECTL_CMDB_URL` | (none) | instance URL, e.g. `https://acme.service-now.com` (https; http only for loopback test doubles) |
| `PROBECTL_CMDB_SECRET` | (none) | `user:password` for the read-only integration user (env only; never in files/logs) |
| `PROBECTL_CMDB_TABLE` | `cmdb_ci` | CI table queried via the Table API |
| `PROBECTL_CMDB_CACHE_TTL` | `10m` | CI lookup cache TTL (a down CMDB serves stale entries) |
AI assistant
Worked per-provider setups (Ollama, vLLM, OpenAI, Anthropic, Azure) are in
ai-rca.md → Copy-paste recipes; the remote-egress enablement
chain (operator ack + per-tenant consent) is in ai-egress.md.
The assistant (root-cause analysis + natural-language query) works out of the
box with zero network access — the default builtin provider is an in-process
synthesizer that writes its answers locally. You only point it at a real language
model if you want nicer prose, and doing so is treated as data egress: a remote
endpoint must be https, and you have to explicitly acknowledge that tenant data
will leave (PROBECTL_AI_EGRESS_ACK). A loopback endpoint may be http (for a
local model on the same box). The redaction keys below mask sensitive values
before anything reaches an external model. See ai-rca.md.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_AI_MODEL_PROVIDER` | `builtin` | `builtin` (air-gapped, the default) \| `ollama` \| `openai` \| `anthropic` |
| `PROBECTL_AI_EGRESS_ACK` | (none) | required to use a REMOTE model: must equal `yes-send-tenant-data-to-the-remote-model`, or the server refuses to start. This is a deliberate "yes, I know data leaves" gate, on top of per-tenant consent + audit; see docs/ai-egress.md |
| `PROBECTL_AI_REDACT_IPS` | `true` | mask IP addresses in anything sent to an external model (stable per-value tokens, so correlation survives; local file paths are never redacted) |
| `PROBECTL_AI_REDACT_HOSTNAMES` | `false` | also mask hostnames (secrets are masked unconditionally regardless of this) |
| `PROBECTL_AI_REDACT_PII` | `true` | mask free-text PII (emails, phone numbers, MAC addresses) in anything sent to an external model (RCA prompts, MCP tool results, authoring prompts) |
| `PROBECTL_AI_REDACT_PATTERNS` | (none) | your own regexes (`;;`-separated), masked as `[custom:xxxx]`; for org-specific identifiers (employee IDs, ticket refs). A bad pattern refuses start (fail closed) |
| `PROBECTL_AI_MODEL_ENDPOINT` | (none) | base URL of the model (required for a non-builtin provider) |
| `PROBECTL_AI_MODEL_NAME` | (none) | model name (e.g. `llama3.1`, `gpt-4o-mini`) |
| `PROBECTL_AI_MODEL_TOKEN` | (none) | API key / bearer token (optional for a local Ollama) |
| `PROBECTL_AI_MODEL_TIMEOUT` | `60s` | per-request timeout for the model endpoint |
| `PROBECTL_AI_MAX_EVIDENCE` | `50` | cost guard: the most signals one answer may gather |
| `PROBECTL_AI_MAX_CONCURRENT` | `8` | process-wide cap on concurrent analyses (HTTP 429 when exceeded); a backstop beneath the per-tenant fairness gate |
| `PROBECTL_AI_PERSIST_ANSWERS` | `false` | persist every answer (the cited JSON + model + config hash) for reproducibility/disputes |
| `PROBECTL_AI_ANSWER_RETENTION` | `2160h` (90 days) | prune persisted answers older than this (enforced opportunistically on write) |
A non-builtin provider without an endpoint fails config validation. Whatever the
backend, every answer is tenant- and RBAC-scoped by the query layer and every claim
is citation-checked before it reaches the user — a model can never see out-of-scope
data or inject an ungrounded claim.
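
The idea behind "stable per-value tokens" for IP redaction is that the same address always maps to the same placeholder, so a model can still correlate events without seeing the address. A minimal Python sketch follows; the `[ip:xxxxxxxx]` token shape, the salted SHA-256, and the IPv4-only pattern are assumptions made for illustration, not the product's actual redactor.

```python
import hashlib
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # simplified: IPv4 only

def redact_ips(text, salt=b"per-deployment-salt"):
    """Replace each IPv4 address with a token derived from its value."""
    def token(match):
        digest = hashlib.sha256(salt + match.group().encode()).hexdigest()[:8]
        return "[ip:%s]" % digest  # hypothetical token format
    return IPV4.sub(token, text)
```

For instance, in `redact_ips("10.0.0.1 -> 10.0.0.2, retry 10.0.0.1")` the first and third tokens are identical and the second differs, so the "retry to the same peer" relationship survives masking.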
MCP server
The MCP server exposes read-only, tenant- + RBAC-scoped tools to AI clients. The
HTTP transport is off by default and is TLS-only + bearer-authenticated;
the stdio transport is local (probectl-control mcp-stdio,
token from PROBECTL_MCP_TOKEN). Mint a token with
probectl-control mcp-token --user <user-uuid> [--tenant <uuid>] [--name <label>] —
the token prints once and only its hash is stored, so a database read can never
recover it. See mcp.md.
| Variable | Default | Description |
|---|---|---|
| `PROBECTL_MCP_HTTP_ADDR` | (none) | MCP HTTP listen address (e.g. `:8090`); enables the transport |
| `PROBECTL_MCP_TLS_CERT_FILE` | (none) | PEM server certificate (required to enable HTTP) |
| `PROBECTL_MCP_TLS_KEY_FILE` | (none) | PEM server private key (required to enable HTTP) |
| `PROBECTL_MCP_RATE_PER_MIN` | `120` | per-tenant tool-call rate limit (`0` disables) |
Setting PROBECTL_MCP_HTTP_ADDR without the TLS files fails config validation — the
MCP endpoint is never anonymous plaintext.
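
The validation rule above can be pictured as a check that collects every problem instead of stopping at the first one. The sketch below is illustrative only: `validate_mcp` is a hypothetical name, the error wording is invented, and the real loader may check more than this.

```python
def validate_mcp(env: dict) -> list:
    """Return a list of config errors for the MCP HTTP transport (empty = valid)."""
    errors = []
    if env.get("PROBECTL_MCP_HTTP_ADDR", ""):
        # The transport is enabled, so both TLS files become mandatory.
        for key in ("PROBECTL_MCP_TLS_CERT_FILE", "PROBECTL_MCP_TLS_KEY_FILE"):
            if not env.get(key):
                errors.append(f"{key} is required when PROBECTL_MCP_HTTP_ADDR is set")
    return errors
```

With only `PROBECTL_MCP_HTTP_ADDR=:8090` set, the function reports both missing TLS files in one pass.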
TLS / certificate observability
The control plane analyzes TLS/cert posture from TLS handshakes the HTTP and
eBPF-L7 probes already captured — it never re-handshakes a target itself — and
correlates the findings into threat-plane incidents. See
tls-observability.md.
| Variable | Default | Description |
|---|---|---|
| PROBECTL_TRUSTCTL_URL | (none) | trustctl base URL; enables a one-click renewal deep-link on findings |
| PROBECTL_TLS_EXPIRY_WARNING | 504h (21d) | expiring-soon window |
| PROBECTL_CT_ENABLED | false | opt in to Certificate Transparency correlation (external fetch) |
| PROBECTL_CT_ENDPOINT | https://crt.sh | CT log API endpoint |
CT correlation is off by default (an external fetch — sovereignty / AUP / rate limits) and degrades gracefully when the CT source is down.
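
To make the expiring-soon window concrete, here is a small sketch of how a 504h window could classify a certificate's notAfter time. The status labels and the inclusive boundary are assumptions for illustration; the document only defines the window length.

```python
from datetime import datetime, timedelta, timezone

EXPIRY_WARNING = timedelta(hours=504)  # the documented default (21 days)


def classify_expiry(not_after: datetime, now: datetime,
                    window: timedelta = EXPIRY_WARNING) -> str:
    """Return a hypothetical status label for a certificate's notAfter time."""
    if not_after <= now:
        return "expired"
    if not_after - now <= window:  # boundary treated as inclusive (assumption)
        return "expiring_soon"
    return "ok"
```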
Threat-intel enrichment
The control plane can match peer IPs / hostnames / certs / JA3 against public
threat-intel feeds, surfacing confidence-scored, source-attributed threat-plane
signals (a signal, not an IPS — never blocks). See
threat-intel.md for the feed/AUP matrix and caveats.
| Variable | Default | Description |
|---|---|---|
| PROBECTL_THREATINTEL_ENABLED | false | master switch (outbound feed fetches); off ⇒ no IOC code runs |
| PROBECTL_THREATINTEL_REFRESH | 6h | feed refresh cadence |
| PROBECTL_THREATINTEL_FEEDS | (all) | comma-separated feed names (spamhaus_drop, feodo_tracker, sslbl, sslbl_ja3, urlhaus, tor_exit, firehol_level1); empty ⇒ all |
Off by default (an outbound fetch — sovereignty / no-phone-home). The refresher keeps each source's last-good indicators, so a feed outage degrades gracefully and never breaks a core path.
Enterprise identity: SCIM + ABAC
SCIM 2.0 provisioning and ABAC have no environment keys — the SCIM bearer token
an IdP presents is minted with the control-plane CLI, and ABAC policies are managed
over the API. See scim-abac.md.

    # mint a per-tenant SCIM token for an IdP (shown once)
    probectl-control scim-token --tenant <tenant-uuid> --name okta

The /scim/v2/* surface is gated by a valid SCIM token (no token ⇒ 401), and the
directory-admin API (/v1/abac/policies) requires directory.read/directory.write.
Change intelligence
Ingest per-provider-signed change webhooks (deploys/config/route/IaC/commits) into
a change timeline + change-to-incident correlation, feeding the AI RCA. See
change-intel.md for the webhook contract + provider/signature
table.
| Variable | Default | Description |
|---|---|---|
| PROBECTL_CHANGE_WEBHOOKS | (none) | comma-separated id:tenant:provider:secret webhook credentials (provider ∈ generic/github/gitlab). The secret is the last field, so it may contain : but not , — use URL-safe (hex/base64) secrets. |
| PROBECTL_CHANGE_CORRELATION_WINDOW | 24h | how far before an incident a change is treated as a candidate cause |
Each inbound delivery is TLS + signature-verified (HMAC/token, constant-time) + tenant-bound to the credential; an unsigned or forged event is rejected before storage, and one tenant cannot inject another's changes. Webhook secrets are runtime config — inject them from a secret manager, never commit them.
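
The id:tenant:provider:secret format and the constant-time signature check can be sketched as below. This is a hedged illustration, not the actual parser: the class and function names are hypothetical, and HMAC-SHA256 over the raw body with a hex digest is assumed because the per-provider signature schemes are defined in change-intel.md rather than here.

```python
import hashlib
import hmac
from dataclasses import dataclass

PROVIDERS = {"generic", "github", "gitlab"}


@dataclass(frozen=True)
class WebhookCredential:
    id: str
    tenant: str
    provider: str
    secret: str


def parse_change_webhooks(raw: str) -> list:
    """Parse comma-separated id:tenant:provider:secret entries."""
    creds = []
    for entry in filter(None, (e.strip() for e in raw.split(","))):
        parts = entry.split(":", 3)  # the secret is last, so it may contain ':'
        if len(parts) != 4 or not all(parts):
            raise ValueError("each entry must be id:tenant:provider:secret")
        if parts[2] not in PROVIDERS:
            raise ValueError(f"unknown provider {parts[2]!r}")
        creds.append(WebhookCredential(*parts))
    return creds


def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 hex digest (scheme assumed)."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

Splitting with a maximum of three separators is what lets a secret such as `ab:cd` survive intact, while a comma inside a secret would still break the outer list, which is why the table recommends hex or base64 secrets.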
SIEM export
Forward the audit stream and threat-plane signals to a SOC's SIEM over
hardened TLS. probectl is the forwarder, not a SIEM — events are rendered into a
standard format and pushed; nothing is auto-blocked. See siem.md for
formats, delivery guarantees, and per-SIEM setup.
| Variable | Default | Description |
|---|---|---|
| PROBECTL_SIEM_ENABLED | false | master switch (an outbound connection to your SIEM); off ⇒ no SIEM code runs |
| PROBECTL_SIEM_PRESET | generic | SIEM adapter: generic, splunk, sentinel, elastic, chronicle (sets the auth scheme + default format) |
| PROBECTL_SIEM_FORMAT | (preset) | wire format: syslog (RFC 5424), cef, ecs, otlp; empty ⇒ the preset's native default (Elastic⇒ecs, Chronicle⇒otlp, else cef) |
| PROBECTL_SIEM_ENDPOINT | (none) | HTTPS ingest URL (e.g. the Splunk HEC / Sentinel / Chronicle / Elasticsearch endpoint). Required when enabled |
| PROBECTL_SIEM_TOKEN | (none) | ingest credential (Splunk ⇒ Splunk <tok>, Elastic ⇒ ApiKey <tok>, others ⇒ Bearer <tok>). Inject from a secret manager |
| PROBECTL_SIEM_POLL_INTERVAL | 30s | audit-stream drain cadence |
| PROBECTL_SIEM_BUFFER | 1024 | threat-signal buffer; full ⇒ producers block (backpressure, never drop) |
| PROBECTL_SIEM_REDACT_KEYS | (none) | extra audit data keys to scrub (on top of the built-in secret/PII denylist) |
Off by default (an outbound connection — sovereignty / no-phone-home). Audit
forwarding resumes from a durable per-tenant cursor, and delivery retries
without dropping under a SIEM outage. Exported audit events are PII/secret
redacted (built-in denylist + PROBECTL_SIEM_REDACT_KEYS).
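
How the preset drives the wire format and the credential scheme can be summarized in a few lines. The sketch below follows the defaults listed in the table; `resolve_siem` is a hypothetical helper, and returning the credential as a single scheme-plus-token string is an assumption about how it is ultimately sent.

```python
PRESETS = {"generic", "splunk", "sentinel", "elastic", "chronicle"}
FORMATS = {"syslog", "cef", "ecs", "otlp"}
NATIVE_FORMAT = {"elastic": "ecs", "chronicle": "otlp"}  # every other preset: cef
AUTH_SCHEME = {"splunk": "Splunk", "elastic": "ApiKey"}  # every other preset: Bearer


def resolve_siem(preset: str = "generic", fmt: str = "", token: str = "") -> tuple:
    """Return (wire_format, credential_string) for a preset/format/token triple."""
    if preset not in PRESETS:
        raise ValueError(f"unknown SIEM preset {preset!r}")
    if fmt and fmt not in FORMATS:
        raise ValueError(f"unknown SIEM format {fmt!r}")
    wire_format = fmt or NATIVE_FORMAT.get(preset, "cef")  # empty => preset default
    credential = f"{AUTH_SCHEME.get(preset, 'Bearer')} {token}" if token else ""
    return wire_format, credential
```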
On-call + ITSM integration
Mirror incidents into operational tooling: page on-call (PagerDuty/Opsgenie), post
to chat (Slack/Teams), and open + bidirectionally sync tickets (ServiceNow/Jira).
probectl is the forwarder, not the system of record — it never auto-blocks anything.
See oncall-itsm.md for the connector matrix, mapping, and the
inbound webhook contract.
| Variable | Default | Description |
|---|---|---|
| PROBECTL_NOTIFY_CONNECTORS | (none) | outbound connectors, comma-separated, each tenant\|provider\|endpoint\|secret (pipe-delimited because the endpoint is a URL). provider ∈ pagerduty/opsgenie/slack/teams/servicenow/jira. secret is the provider credential (PagerDuty routing key, Opsgenie API key, ServiceNow user:password, Jira email:token; unused for chat). |
| PROBECTL_NOTIFY_INBOUND | (none) | inbound status-sync credentials, comma-separated, each id:tenant:provider:secret (the id is the URL selector for POST /ingest/itsm/{provider}/{id}; secret verifies the delivery). |
Off by default (each connector is an outbound connection to the operator's
tooling). Paging + ticket creation are idempotent (an incident opens at most
once per connector — a retry/restart never double-pages), status sync is
bidirectional with loop protection (an inbound resolve from one system is
never echoed back to it), and routing is per-tenant (a connector only fires for
its own tenant). Endpoint specifics: a Slack/Teams endpoint is the incoming-webhook
URL; a Jira endpoint carries the project (and optional resolve transition) as query
params, e.g. …/rest/api/2/issue?project=OPS&resolve_transition=31; a ServiceNow
endpoint is the …/api/now/table/incident URL. Inbound deliveries must include
X-Probectl-Signature: sha256=<hmac> or X-Probectl-Token: <secret> over TLS; an
unsigned or forged delivery is rejected (401). Secrets are runtime config —
inject them from a secret manager, never commit them.
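
A pipe-delimited connector entry and the Jira query parameters mentioned above might be unpacked as in the sketch below. The names are hypothetical and the validation is deliberately light; the hostnames in the usage note are placeholders, and allowing an empty secret reflects the "unused for chat" note in the table.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlsplit

PROVIDERS = {"pagerduty", "opsgenie", "slack", "teams", "servicenow", "jira"}


@dataclass(frozen=True)
class Connector:
    tenant: str
    provider: str
    endpoint: str
    secret: str  # may be empty for chat providers


def parse_connectors(raw: str) -> list:
    """Parse comma-separated tenant|provider|endpoint|secret entries."""
    connectors = []
    for entry in filter(None, (e.strip() for e in raw.split(","))):
        parts = entry.split("|", 3)
        if len(parts) != 4:
            raise ValueError("each connector must be tenant|provider|endpoint|secret")
        tenant, provider, endpoint, secret = parts
        if not tenant or not endpoint:
            raise ValueError("tenant and endpoint are required")
        if provider not in PROVIDERS:
            raise ValueError(f"unknown provider {provider!r}")
        connectors.append(Connector(tenant, provider, endpoint, secret))
    return connectors


def jira_options(endpoint: str) -> tuple:
    """Extract (project, resolve_transition) from a Jira endpoint's query string."""
    query = parse_qs(urlsplit(endpoint).query)
    project: Optional[str] = query.get("project", [None])[0]
    transition: Optional[str] = query.get("resolve_transition", [None])[0]
    return project, transition
```

For an endpoint such as `https://jira.example.com/rest/api/2/issue?project=OPS&resolve_transition=31`, `jira_options` returns the project key and the transition id as strings.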
Topology graph + what-if
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_TOPOLOGY_ENGINE | indexed | graph engine: indexed (adjacency-indexed, for large/extra-large graphs) or memory (the simpler reference implementation). Both sit behind the same query API |
The graph feeds from eBPF/BGP/device streams + path discoveries; served at
GET /v1/topology with what-if simulation at POST /v1/topology/whatif.
See docs/topology.md.
FinOps / egress cost
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_COST_ENABLED | true | cost engine over the local flow stream (volume × public pricing; no billing-API calls) |
| PROBECTL_COST_ZONES | (none) | CIDR→zone rules, e.g. 10.0.1.0/24=us-east-1a,… (locality classification) |
| PROBECTL_COST_SERVICES | (none) | CIDR→service:team attribution rules (showback) |
| PROBECTL_COST_BUDGETS | (none) | monthly USD budgets, e.g. team:payments=500 (breach = one cost-plane signal per month) |
| PROBECTL_COST_PRICES_FILE | (none) | JSON price-table override; embedded public list rates otherwise (provenance + as-of surfaced) |
| PROBECTL_COST_PRICED | true | false = volume-only mode (bytes attributed, dollars never invented) |
Summary at GET /v1/cost/summary and the Cost page; deep dashboards are federated
to Grafana (see Ecosystem integrations above). See docs/finops.md.
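
The CIDR→zone rules in PROBECTL_COST_ZONES lend themselves to a short worked example. The sketch below parses comma-separated CIDR=zone pairs and classifies an address; preferring the most specific (longest) prefix when rules overlap is an assumption, since the document does not state a precedence, and the function names are hypothetical.

```python
import ipaddress


def parse_zone_rules(raw: str) -> list:
    """Parse comma-separated CIDR=zone rules into (network, zone) pairs."""
    rules = []
    for entry in filter(None, (e.strip() for e in raw.split(","))):
        cidr, sep, zone = entry.partition("=")
        if not sep or not zone.strip():
            raise ValueError(f"rule {entry!r} must be CIDR=zone")
        rules.append((ipaddress.ip_network(cidr.strip()), zone.strip()))
    # Assumption: the most specific (longest) prefix wins on overlap.
    rules.sort(key=lambda rule: rule[0].prefixlen, reverse=True)
    return rules


def zone_for(ip: str, rules: list):
    """Return the zone of the first matching rule, or None if unclassified."""
    addr = ipaddress.ip_address(ip)
    for network, zone in rules:
        if addr.version == network.version and addr in network:
            return zone
    return None
```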
SLO engine
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_SLO_ENABLED | true | OpenSLO SLI/SLO engine over the synthetic-result stream (error budgets + multi-window burn-rate signals) |
| PROBECTL_SLO_DIR | (none) | directory of OpenSLO v1 YAML definitions (strictly validated; malformed/duplicate definitions fail startup) |
Statuses at GET /v1/slos, OpenSLO export at GET /v1/slos/openslo, and the
SLOs page. See docs/slo.md.
Compliance / segmentation validation
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_COMPLIANCE_ENABLED | true | segmentation validator over observed flow/eBPF traffic (validation only — never enforcement) |
| PROBECTL_COMPLIANCE_POLICY_DIR | (none) | segmentation policy YAML directory (strictly validated; malformed files fail startup) |
Verdicts at GET /v1/compliance, hash-chained audit evidence at
GET /v1/compliance/evidence, and the Compliance page. See
docs/compliance.md.
Collective internet-outage view
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_OUTAGE_ENABLED | true | the local engine: vantage detection over your own results + correlation with external events (no outbound calls) |
| PROBECTL_OUTAGE_FEEDS_ENABLED | false | opt-in public outage feeds (IODA, Cloudflare Radar) — enabling makes outbound fetches (sovereignty / no-phone-home) |
| PROBECTL_OUTAGE_FEEDS | (all) | feeds to load: ioda, cloudflare_radar |
| PROBECTL_OUTAGE_REFRESH | 10m | feed refresh cadence (last-good kept on failure) |
| PROBECTL_OUTAGE_RETENTION | 48h | event window kept/queried |
| PROBECTL_OUTAGE_RADAR_TOKEN | (none) | Cloudflare API token the radar feed requires (a secret reference is accepted); the feed is omitted without it |
The collective view is served at GET /v1/outages (events + the caller-tenant's
affected tests + vantage detections + feed AUP/health + coverage notes) and on
the Internet outages page. Scope resolution (IP→ASN/country) rides the open-data
enricher (PROBECTL_FLOW_ENRICH_ASN); without it the response reports the
degradation honestly. See docs/outage.md.
RUM convergence
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_RUM_ENABLED | false | the browser-beacon ingest + synthetic↔RUM convergence engine (an inbound surface — opt-in) |
| PROBECTL_RUM_APPS | (none) | app-key registry pk_key=tenant/app,... — each beacon binds to its KEY's tenant; enabled-but-empty fails startup |
| PROBECTL_RUM_RATE_PER_MIN | 300 | per-key beacon rate limit (429 + Retry-After above it; 0 = unlimited) |
Beacons ingest at POST /ingest/rum (app-key authenticated, consent-gated,
URL-redacted, no IP stored — privacy is enforced server-side, fail closed);
the convergence view serves at GET /v1/rum and folds into the Endpoints
surface; rum.* vitals flow to the TSDB for dashboards. The SDK is
web/public/probectl-rum.js. See docs/rum.md.
Carbon / power observability
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_CARBON_ENABLED | true | coefficient-based energy/carbon ESTIMATES over the local flow stream (local-only; methodology served with every response) |
| PROBECTL_CARBON_GRID_GCO2E | 436 | your grid's carbon intensity in gCO2e/kWh (defaults to the world average — set yours) |
Attribution reuses PROBECTL_COST_ZONES / PROBECTL_COST_SERVICES. The
estimate serves at GET /v1/carbon and folds into the Cost page. See
docs/carbon.md. The chaos injector and the large/extra-large scale gate are
test-harness tools — see docs/chaos.md and docs/scale-gate.md.
Editions / license
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_LICENSE_FILE | (none) | path to the Ed25519-signed license file. Unset = Community (the full core, default-open). Set-but-missing/invalid = startup error (fail closed on configuration) |
Verification is offline — local signature math against public keys baked
into the binary at build time (never an env var; never phone-home). Expiry
runs the 30-day-grace → read-only ladder and never breaks running
telemetry. License state + the feature→tier map serve at
GET /v1/editions and render on Admin → Editions — the one place tiers
appear when unlicensed. See docs/editions.md for the file format, the
signing CLI (probectl-license), and the gating pattern.
Provider / management plane (ee/)
Active only when the license grants provider_plane; otherwise /provider/*
is a plain 404 (hidden, not locked).
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_PROVIDER_BOOTSTRAP_TOKEN | (none) | creates the FIRST operator via POST /provider/v1/auth/bootstrap; single-use — inert once any operator exists |
| PROBECTL_PROVIDER_BREAKGLASS_MAX_TTL_MINUTES | 240 | cap on break-glass grant lifetimes (5–1440) |
The provider plane additionally requires PROBECTL_ENVELOPE_KEY (operator
TOTP secrets are envelope-sealed at rest) and a database. Operator MFA is
mandatory; operators are a privilege domain distinct from tenant users with
no implicit access to tenant telemetry — see docs/provider-plane.md for
the model, the break-glass consent flow, and the storage-layer confinement
(probectl_provider role). Suspending a tenant rejects its users at the API
(tenant_suspended) without touching data or ingestion.
Siloed / hybrid isolation (ee/)
Pooled isolation stays the default and needs no configuration. Siloed and
hybrid tenants (per-tenant Postgres schema / ClickHouse database / bus topic
namespace / object key namespace) require a license granting
siloed_isolation and are selected per tenant at provisioning
(isolation_model + optional residency).
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_DATAPLANES | (none) | named residency data planes — name=clickhouseURL[;name=clickhouseURL...] (e.g. eu=https://ch-eu:8123;us=https://ch-us:8123). A tenant's residency pins its ClickHouse database to that plane |
Residency pins the tenant's ClickHouse flow data in this release;
Postgres control state, the TSDB, object storage, and bus brokers are NOT
region-pinned yet — docs/isolation.md states the exact contract, the
catch-up/migration story for silo schemas, and the offboard-teardown
semantics.
White-label branding (ee/)
No configuration keys: branding activates with a license granting
white_label and is configured per tenant (or as the provider master) from
the provider console. The public GET /branding endpoint serves the resolved
brand pre-auth (Host-resolved for custom domains; the probectl default when
unlicensed); custom-domain login resolves the tenant from the serving host.
Custom domains need a certificate at the TLS-terminating ingress (or via
trustctl) — see docs/white-label.md for the token-override contract, the
no-bleed rules, and the email-template contract.
Advanced data governance (governance, ee/)
Per-tenant data classification + redaction, composed with retention, residency,
and BYOK. No new config keys: the classification + redaction MECHANISM is core (the
?redact=true export toggle works anywhere,
masking PII with a partial strategy); the governance feature adds per-tenant
POLICY (stored in tenant_governance, migration 0033) set from the provider
plane (GET/PUT /provider/v1/tenants/{id}/governance). IPs are PII by default.
Full model: docs/governance.md. Redacted export: GET /v1/lifecycle/export?redact=true.
Tenant lifecycle: export, retention, erasure (core)
Export + verifiable deletion are a compliance right — core in every edition.
GET /v1/lifecycle/export (permission lifecycle.export) streams the
portability bundle; GET/PUT /v1/lifecycle/retention + POST /v1/lifecycle/erase (permission lifecycle.erase, slug-confirmed,
irreversible) manage retention and run the attested cross-store erasure. The
provider console adds the operator-side erase trigger. See
docs/runbooks/tenant-offboarding.md for the full procedure and the
per-store verification table.
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_BACKUP_RETENTION_NOTE | (empty → a generic fallback statement) | your backup-TTL statement, included VERBATIM in every deletion attestation — be explicit about when snapshots expire. When unset, a generic placeholder sentence is recorded instead |
| PROBECTL_BACKUP_RETENTION_DAYS | 0 | concrete backup TTL in days. When > 0, the tenant-erasure attestation quantifies a bounded backup-coverage window (backup_erasure_deadline = erased_at + this many days); 0 = note-only |
| PROBECTL_ENVELOPE_KEY / PROBECTL_ENVELOPE_KEY_FILE | (none) | the at-rest KEK (see the control-plane table) — also used by probectl-control backup-seal/backup-open to encrypt/restore backups. The chart's Postgres backup CronJob mounts it to seal dumps in the pipeline |
The daily retention sweeper enforces per-tenant flow_retention_days
(tighter than the deployment TTL). Prometheus-mode TSDB series deletion is a
documented manual step (the attestation says so honestly).
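
The backup_erasure_deadline arithmetic from the table is simple enough to show directly. The helper below is a hypothetical sketch: returning None for the note-only case and rejecting negative values are assumptions about behavior the document does not spell out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_erasure_deadline(erased_at: datetime,
                            retention_days: int) -> Optional[datetime]:
    """Return erased_at + retention_days, or None when the value is 0 (note-only)."""
    if retention_days < 0:
        raise ValueError("retention days must not be negative")  # assumption
    if retention_days == 0:
        return None  # note-only: no bounded window is quantified
    return erased_at + timedelta(days=retention_days)
```

With PROBECTL_BACKUP_RETENTION_DAYS=30, a tenant erased at 2025-01-01T00:00Z would get a deadline of 2025-01-31T00:00Z.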
Per-tenant metering & quotas (ee/)
No configuration keys: metering activates with a license granting metering
(provider/MSP tier). Counters flush every minute; gauge snapshots run every
15 minutes; usage and quotas live in Postgres (migration 0026). The usage
API, the CSV/JSONL billing-export feed, per-tenant quotas (creation-gating
only — telemetry is never quota-dropped), and the console showback card are
documented in docs/metering.md.
Per-tenant key isolation / BYOK (ee/)
Unlocked by the byok feature (Enterprise). No new config keys: the keyring
wraps managed tenant KEKs under PROBECTL_ENVELOPE_KEY (required when
byok is licensed — startup fails loudly without it) and resolves BYOK
references through the secret backends. Surfaces: GET/POST /v1/security/keys[...] (permission security.keys) + the Admin →
Encryption keys card. The full model — sealing formats, rotation, the BYOK
lockout warning, crypto-offboarding — is in docs/byok.md.
Tenant fairness (core)
These are the per-tenant bounds that protect a pooled (shared) deployment, so
one noisy tenant can't starve the others — and they are core in every edition. The
ingest-rate bounds are on by default with conservative numbers; you opt out
of a bound by setting it to an explicit 0 (unlimited). Unset keeps the
default, and a negative value is a startup error — config validation rejects it.
The two query bounds already default to 0, i.e. unlimited until you set them.
Per-tenant overrides are set from the provider console into tenant_fairness.
Full model: docs/fairness.md.
These are token-bucket rate limits: the steady rate is the value below, and the
bucket can hold a burst of rate × PROBECTL_FAIRNESS_BURST_SECONDS. Telemetry over
a bound is admission-controlled (shed + counted), never silently corrupted.
| Key | Default | Description |
|---|---|---|
| PROBECTL_FAIRNESS_RESULTS_PER_SEC | 1000 | per-tenant result-message admission rate. Explicit 0 = unlimited |
| PROBECTL_FAIRNESS_FLOW_EVENTS_PER_SEC | 10000 | per-tenant flow-record admission rate. Explicit 0 = unlimited |
| PROBECTL_FAIRNESS_INGEST_BYTES_PER_SEC | 2097152 | per-tenant ingest byte rate (2 MiB/s). Explicit 0 = unlimited |
| PROBECTL_FAIRNESS_DEVICE_METRICS_PER_SEC | 2000 | per-tenant SNMP/gNMI device-sample admission rate. Explicit 0 = unlimited |
| PROBECTL_FAIRNESS_BURST_SECONDS | 10 | burst window: bucket capacity = rate × this. 0 falls back to 10 — an enforced bucket always has a burst |
| PROBECTL_FAIRNESS_QUERY_CONCURRENCY | 0 (unlimited) | per-tenant in-flight query cap (HTTP 429 over it) |
| PROBECTL_FAIRNESS_QUERIES_PER_MIN | 0 (unlimited) | per-tenant query budget per minute (HTTP 429 over it) |
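
The token-bucket semantics described in this section (steady rate, burst capacity of rate × burst seconds, explicit 0 meaning unlimited, shed and counted over the bound) can be sketched as follows. This is a generic token bucket written for illustration, not probectl's limiter; the class name, the caller-supplied clock, and starting with a full bucket are assumptions.

```python
class TokenBucket:
    """Per-tenant admission control: rate tokens/sec, capacity = rate x burst."""

    def __init__(self, rate: float, burst_seconds: float = 10.0, now: float = 0.0):
        if rate < 0:
            raise ValueError("a negative rate is a configuration error")
        self.rate = rate
        # A burst of 0 falls back to 10, so an enforced bucket always has a burst.
        self.capacity = rate * (burst_seconds or 10.0)
        self.tokens = self.capacity  # assumption: the bucket starts full
        self.last = now
        self.shed = 0  # count of rejected (shed) items

    def admit(self, now: float, cost: float = 1.0) -> bool:
        if self.rate == 0:
            return True  # explicit 0 = unlimited
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        self.shed += 1  # shed + counted, never silently dropped
        return False
```

At the default PROBECTL_FAIRNESS_RESULTS_PER_SEC of 1000 and a 10-second burst, such a bucket would hold 10000 tokens, so a tenant can absorb a short spike before the steady rate applies.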
Multi-region / active-active HA (core)
Inert unless PROBECTL_REGION is set (single-region deployments need none of
these). The control plane stays stateless and active in every region; the
split-brain fence pauses API writes during a failover while reads + telemetry
keep flowing. Full model + the failover runbook: docs/multi-region.md,
docs/runbooks/region-failover.md.
| Key | Default | Description |
|---|---|---|
| PROBECTL_REGION | (empty) | this replica's region; empty = single-region (fence inert) |
| PROBECTL_REGIONS | (empty) | comma list of all regions in the deployment |
| PROBECTL_DATABASE_URL | … | the WRITER endpoint (DNS/proxy that resolves to the current primary) |
| PROBECTL_DATABASE_READ_URL | (empty) | optional local read-replica endpoint; empty = reads use the writer |
| PROBECTL_REPLICATION_MODE | async | sync (RPO 0) or async (RPO ≈ lag) — descriptive; configure Postgres to match |
| PROBECTL_RESIDENCY | (empty) | default data-residency region (governance) |
| PROBECTL_RPO_SECONDS | 0 | provisional RPO target (human sign-off) |
| PROBECTL_RTO_SECONDS | 60 | provisional RTO target (human sign-off) |
The writer must be reachable for API writes; cluster_state (migration 0032)
holds the promotion epoch the fence reads. Promotion is cluster_promote() in
the failover runbook.
Supportability (core)
Deep health + a secret-stripped support bundle for triage (core; the support
org/SLA is contractual). No new config keys; diagnostics.read (migration 0034,
admin-seeded) gates GET /v1/diagnostics and GET /v1/diagnostics/bundle. An
offline bundle: probectl-control support-bundle [-o file]. Self-monitoring
series probectl_self_* + probectl_build_info feed
deploy/grafana/dashboards/probectl-self.json. The bundle NEVER contains
secrets/credentials/PII (allowlist config + anonymized topology + a final
scrub). Full model: docs/supportability.md.
Guarded agentic remediation (remediation, ee/)
The assistant PROPOSES remediations; a human APPROVES; probectl NEVER executes —
there is no executor in the codebase (remediation is human-gated by design). Approve is a recorded,
audited, blast-radius-limited, human-only sign-off that an operator carries out
in their own change process; ingested data (e.g. a prompt-injection routed
through the propose_remediation MCP tool) can at most create a proposal in the
proposed state, which a human must approve via the authenticated UI. The feature is hidden
(404) when the remediation Enterprise feature is unlicensed.
| Variable | Default | Notes |
|---|---|---|
| PROBECTL_REMEDIATION_APPROVALS_ENABLED | false | advisory-only master switch — until an operator turns this on, Approve is unavailable and proposals are review-only |
| PROBECTL_REMEDIATION_MAX_BLAST_RADIUS | 50 | a proposal whose simulated (topology what-if) blast radius exceeds this cannot be approved; an unknown radius (no topology available) is also blocked — fail closed |
Permissions remediation.propose and remediation.approve (migration 0035,
admin-seeded) gate the /v1/remediation/* routes; the dry-run blast radius is a
read-only topology simulation. Full policy + architecture: docs/remediation.md.
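
The two settings above combine into a small fail-closed gate, which the following sketch tries to capture. `can_approve` and its reason strings are hypothetical; the actual checks also involve the remediation.approve permission and the license, which are left out here.

```python
from typing import Optional


def can_approve(approvals_enabled: bool, blast_radius: Optional[int],
                max_blast_radius: int = 50) -> tuple:
    """Return (allowed, reason) for an approval attempt; unknown inputs fail closed."""
    if not approvals_enabled:
        return False, "approvals disabled (advisory-only)"
    if blast_radius is None:
        return False, "blast radius unknown"  # no topology available => blocked
    if blast_radius > max_blast_radius:
        return False, "blast radius exceeds cap"
    return True, "ok"
```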
NDR-lite detection
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_NDR_ENABLED | true | behavioral detection engine (DGA/exfil/beaconing/egress/lateral) over local DNS/flow/eBPF streams; signals only — never blocks |
| PROBECTL_NDR_RULES_DIR | (none) | detection-as-code overlay directory; rules merge by id over the embedded defaults (a malformed dir fails startup) |
Detections are confidence-scored threat-plane signals (ndr.*) exported to
incidents, the Security triage surface, and the SIEM (see SIEM export above).
See docs/ndr.md for the detector and tuning reference.
Secrets integration
This is the feature that lets you keep raw passwords out of your config entirely. Anywhere this document asks for a credential, you can instead hand it a pointer to where the real secret lives — a Vault path, a CyberArk query, an AWS/Azure/GCP secret id — and the control plane fetches it at boot (or per poll, for device creds). The settings below just tell probectl how to reach each backend; the references themselves go in the credential keys documented throughout this page.
Any credential value in this document may be a secret reference instead of
the literal material — env:NAME, vault:<mount>/<path>#<field>,
cyberark:<query>, aws:<id>[#<json-field>], azure:<vault>/<name>,
gcp:<project>/<secret>[/<version>], or literal:<value> as the escape
hatch. The control plane resolves PROBECTL_OIDC_CLIENT_SECRET,
PROBECTL_CMDB_SECRET, PROBECTL_AI_MODEL_TOKEN, PROBECTL_SIEM_TOKEN,
PROBECTL_BUS_SASL_PASSWORD, PROBECTL_OUTAGE_RADAR_TOKEN, and the secret
parts of PROBECTL_CHANGE_WEBHOOKS / PROBECTL_NOTIFY_CONNECTORS /
PROBECTL_NOTIFY_INBOUND at startup (fail closed); the device agent resolves
every PROBECTL_DEVICE_CRED_<NAME>_* value per poll cycle. Resolved values are
cached only encrypted, for a short lease (5 m). See docs/secrets.md.
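
The reference syntax listed above can be illustrated with a small classifier. This sketch only splits a value into a scheme and a remainder and unpacks the vault: form; it does not contact any backend. Treating an unprefixed value as literal material is an assumption drawn from the sentence above, and both function names are hypothetical.

```python
SCHEMES = ("env", "vault", "cyberark", "aws", "azure", "gcp", "literal")


def parse_secret_ref(value: str) -> tuple:
    """Split a credential value into (scheme, remainder)."""
    scheme, sep, rest = value.partition(":")
    if sep and scheme in SCHEMES:
        return scheme, rest
    return "literal", value  # assumption: no known prefix means raw material


def split_vault_ref(rest: str) -> tuple:
    """Split the remainder of a vault: reference into (mount, path, field)."""
    location, hash_sep, field = rest.partition("#")
    mount, slash, path = location.partition("/")
    if not (hash_sep and field and slash and mount and path):
        raise ValueError("vault reference must be <mount>/<path>#<field>")
    return mount, path, field
```

The `literal:` prefix matters for raw values that happen to start with a scheme name: `literal:env:NOT_A_REF` resolves to the text `env:NOT_A_REF` rather than an environment lookup.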
Backend access settings (environment only; all over verified TLS):
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_SECRETS_VAULT_ADDR | (none) | Vault base URL; enables vault: references |
| PROBECTL_SECRETS_VAULT_TOKEN | (none) | static Vault token (alternative to AppRole) |
| PROBECTL_SECRETS_VAULT_ROLE_ID / _SECRET_ID | (none) | AppRole login; the lease-aware client token is renewed at ⅔ TTL |
| PROBECTL_SECRETS_VAULT_NAMESPACE | (none) | X-Vault-Namespace (Vault Enterprise) |
| PROBECTL_SECRETS_CYBERARK_URL | (none) | CyberArk CCP base URL; enables cyberark: |
| PROBECTL_SECRETS_CYBERARK_APP_ID | (none) | CCP AppID |
| PROBECTL_SECRETS_CYBERARK_CERT_FILE / _KEY_FILE / _CA_FILE | (none) | optional CCP client-certificate auth |
| AWS_REGION (or AWS_DEFAULT_REGION), AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN | (none) | enables aws: (Secrets Manager, SigV4) |
| AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET | (none) | enables azure: (Key Vault) |
| GOOGLE_APPLICATION_CREDENTIALS | (none) | service-account key file; enables gcp: (Secret Manager) |
Backend health (counters + redacted last error, never secret material) is
served at GET /v1/secrets/health and on the Admin page.
Local dev stack (deploy/compose/dev.yml)
Started with make compose-up. Local, non-production defaults — plaintext
listeners and dev credentials for convenience. Production deploys are
TLS/HTTPS-by-default — TLS on every listener.
| Service | Compose name | Host port(s) | Purpose | Dev credentials |
|---|---|---|---|---|
| PostgreSQL | postgres | 5432 | Durable state, tenants, RBAC, audit, SLOs | user/pass/db = probectl |
| Kafka | kafka | 9092 | Result/event bus (KRaft, no ZooKeeper) | none (PLAINTEXT) |
| ClickHouse | clickhouse | 8123 (HTTP), 9000 (native) | High-cardinality events/flows | user/pass/db = probectl |
| Prometheus | prometheus | 9090 | Metrics TSDB (remote-write enabled) | none |
Kafka listeners: host clients use localhost:9092; in-network containers use
kafka:19092; the KRaft controller uses 9093 (internal). Prometheus runs with
--web.enable-remote-write-receiver so the result pipeline can remote-write into
it.
These names and ports are a contract — the integration test harness depends on them, so don't rename them casually.
Tear-down
make compose-down removes the containers and volumes (pgdata, chdata,
promdata).