Change intelligence + change-to-incident correlation
What this is. Most outages have a human cause: someone deployed, changed a config, flipped a route, or ran a Terraform apply. probectl pulls those change events in from the systems that already record them (GitHub, GitLab, CI, IaC tools), normalizes them into one shape, keeps a per-tenant change timeline, and correlates recent changes to incidents — so the AI root-cause analysis (RCA) can answer the one question that resolves most outages: "what changed?"
A change event is context, not an alarm. A deploy is a candidate cause — surfaced and ranked when an incident opens nearby in time and topology. It is never raised as an alert on its own. (That keeps this on the right side of probectl's standing rule that a detection is a signal, never an action.)
It lives in internal/change (pure: no datastore/bus/HTTP-server dependency, so
the normalizers and the correlator are independently testable), with the inbound
HTTP receiver in internal/control (change.go).
Security model (the webhook is an inbound attack surface)
A webhook is an unauthenticated POST from the public internet that ends up feeding the RCA. So every inbound delivery is treated as untrusted and must clear all of these before anything is stored — the same verified, authenticated, untrusted-ingestion discipline every probectl inbound surface follows (see the Non-negotiables):
- TLS. The API (and the shipped compose/Helm) are HTTPS-only.
- Per-provider signature verification. The sender's HMAC (GitHub / probectl's
generic scheme) or shared token (GitLab) is verified in constant time
against that webhook's configured secret. An unsigned or forged delivery is
rejected with
401before the body is parsed, so a forged change can never reach the timeline or the RCA. (change.go:p.Verify(...)→apierror.Unauthorized.) - Tenant binding. The tenant is taken from the verified credential, never
from the payload. The generic wire schema deliberately has no
tenant_idfield (provider.go:genericEvent), so there is nothing to spoof; injecting another tenant's change would require that tenant's webhook secret, which fails the HMAC. One tenant therefore cannot inject another's change events. - Size-limited, validated parse. Bodies are capped at 1 MiB
(
changeWebhookMaxBody = 1 << 20) and malformed entries are dropped, not stored.
All cryptographic checks route through internal/crypto (crypto.Verify,
crypto.ConstantTimeEqual) — never a direct crypto/hmac call — so a FIPS
module swaps in cleanly (the crypto-abstraction rule: handlers never touch
primitives).
Pipeline
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
subgraph sources["Change sources (sign each delivery)"]
GH["GitHub\nX-Hub-Signature-256 (HMAC)"]
GL["GitLab\nX-Gitlab-Token (token)"]
CI["CI / IaC / automation\nX-Probectl-Signature (HMAC)"]
end
sources -->|"POST /ingest/changes/{provider}/{id}\n(TLS)"| WH["Webhook handler\n1. lookup credential by {id}\n2. verify signature (const-time)\n3. normalize (untrusted)\n4. stamp tenant from credential"]
WH -->|"reject 401 if unsigned/forged"| X["✗ never stored"]
WH -->|"verified"| STORE[("change_events\n(tenant-scoped, RLS)")]
STORE --> TL["GET /v1/changes\n(change timeline)"]
STORE --> COR["change.Candidates\n(time + topology window)"]
INC["Incident\n(target/prefix, started_at)"] --> COR
COR --> IC["GET /v1/incidents/{id}/changes\n(ranked candidate causes)"]
STORE --> RCA["AI RCA evidence source\n(events) → cites the change"]
The ChangeEvent model
Every source is normalized onto one record (internal/change.Event): an id,
source, kind (deploy / config / route / iac / commit / release /
other), title / summary, a correlation target (host / service / IP)
and/or prefix (CIDR), actor, ref (commit / deploy id), url, free-form
attributes, and occurred_at.
target and prefix are the anchors that tie a change to an incident — they are
what the correlator matches on. A change with neither anchor can still appear on
the timeline, but it can only correlate to an untargeted incident (see below).
Providers & signatures
| Provider | provider |
Signature header | Scheme | Events normalized |
|---|---|---|---|---|
| probectl / CI / IaC | generic |
X-Probectl-Signature: sha256=<hmac> |
HMAC-SHA256 | probectl change schema (a single object, an array, or {"events":[…]}) |
| GitHub | github |
X-Hub-Signature-256: sha256=<hmac> |
HMAC-SHA256 | push → commit; deployment / deployment_status → deploy |
| GitLab | gitlab |
X-Gitlab-Token: <token> |
shared token (constant-time) | Push Hook → commit; Deployment Hook → deploy |
The generic provider is the path for network-automation / CI / Terraform /
Atlantis: it accepts probectl's own schema and carries an explicit correlation
target or prefix, so a deploy can be tied to the host or netblock it touched.
GitHub and GitLab demonstrate heterogeneous normalization — for example, a
deploy's environment_url host becomes the correlation target.
Sending a generic change
POST /ingest/changes/generic/<webhook-id>
X-Probectl-Signature: sha256=<hmac-sha256(secret, body)>
{"kind":"deploy","title":"deploy payments-api to prod",
"target":"api.example.com","actor":"ci","ref":"abc123"}
A verified delivery returns 202 Accepted with the count ingested
({"accepted": <n>}). A verified-but-empty delivery (e.g. a GitHub ping) is not
an error — it stores zero events.
Correlation (avoid drowning in changes)
The hard part of "what changed?" is not collecting changes — it is not
burying the operator under every deploy that happened that day. change.Candidates
(in correlate.go) scores each recent change as a candidate cause of an incident,
combining two signals:
- Topology proximity. Exact target match scores highest (
1.0), then IP-inside-the-incident's-prefix (0.8), then overlapping prefixes (0.7). - Time proximity. A change at the incident's start time scores
1.0; one a full window earlier scores ~`0. The blended score is0.6 * topology + 0.4 * recency`.
Only changes within the configured window before the incident are considered —
plus a 5-minute skew grace (correlationSkew) so a near-simultaneous deploy
logged a moment after the incident still correlates (clocks differ across
sources). A change that neither matches a targeted incident nor falls in the
window is dropped. The result: the RCA is fed the few likely causes, not the
firehose.
GET /v1/incidents/{id}/changes returns the ranked candidates for one incident
(permission incident.read). The handler loads the incident, fetches changes
since started_at - window, and runs change.Candidates.
Feeding the RCA
Change events are wired in as the AI engine's events evidence source
(changeEventsSource in change.go). The planner already routes
deploy/config/routing questions ("what changed?", "any recent deploy?") to the
events domain, so the RCA retrieves the relevant change for the question's
subject and cites it — within the caller's tenant and authorization scope.
That scope is enforced twice: the source opens an RLS transaction for the
principal's tenant (passed by the engine, never read from the query), and the
RCA path requires the events.read + ai.query permissions. The plain timeline
read (GET /v1/changes) requires change.read.
Configuration
Webhooks are configured by the operator (mirroring the OTLP token model). Each entry maps a public webhook id (the URL selector) to a tenant + provider + secret.
| Variable | Default | Description |
|---|---|---|
PROBECTL_CHANGE_WEBHOOKS |
(none) | comma-separated id:tenant:provider:secret credentials. The secret is the last field (so it may contain :, just not ,) — use URL-safe (hex / base64) secrets. |
PROBECTL_CHANGE_CORRELATION_WINDOW |
24h |
how far before an incident a change is considered a candidate cause |
The webhook id is a non-secret URL selector; the secret is the HMAC key / shared token. Provision a distinct id + secret per tenant. Secrets are runtime config (inject from a secret manager) — never commit them.
Security properties
- Untrusted + signature-verified + TLS + tenant-scoped ingestion; a forged or unsigned event is rejected before normalization.
- Tenant binding to the verified credential — cross-tenant injection is structurally impossible.
- Signal, not action. Changes are context fed to the RCA, not alarms, and trigger no remediation.
- FIPS crypto abstraction. HMAC + constant-time compare route through
internal/crypto. - Audited. Each ingest appends a tenant audit event (
change.ingest).
Out of scope (deferred)
Self-service webhook registration (DB-backed, envelope-encrypted secrets) for the
multi-tenant / MSP case; bus publication of probectl.change.events for
cross-plane replay; non-webhook collectors (a BGP-derived route-change collector,
a network config-diff collector) — the Provider interface is the extension seam
for these. Topology what-if (the impact-preview the RCA can reach for) is covered
in docs/topology.md.