On-call + ITSM integration
What this is. probectl correlates faults into incidents — but the team's workflow lives in the tools they already run: PagerDuty/Opsgenie for paging, Slack/Teams for chat, ServiceNow/Jira for tickets. This feature mirrors a probectl incident into those tools: it pages on-call, posts to chat, and opens + bidirectionally syncs tickets.
Two boundaries to keep in mind. First, probectl stays the system of record for
the incident; these connectors are a thin, best-effort mirror. Second, a
connector only ever pages, posts, or opens a ticket; it never auto-blocks or
auto-remediates. (A notification expresses confidence in the incident, not control
over the network.) The code is in internal/notify, wired from internal/control/notify.go.
Off by default. A connector is an outbound connection to the operator's
tooling, so the whole feature stays off unless PROBECTL_NOTIFY_CONNECTORS is set
(sovereignty / no surprise egress).
Connectors
| Provider | Capability | On open | On resolve | Inbound (status-sync back) |
|---|---|---|---|---|
| PagerDuty | page | Events API v2 trigger (dedup key probectl-<id>) | resolve (same dedup key) | resolve/ack via the portable contract |
| Opsgenie | page | Alerts API create (alias probectl-<id>) | close-by-alias | resolve/ack via the portable contract |
| Slack | chat | post "incident opened" | post "incident resolved" | — |
| Teams | chat | post "incident opened" | post "incident resolved" | — |
| ServiceNow | ticket | create incident (Table API) | set state Resolved (state 6) | native Business-Rule POST, or the portable contract |
| Jira | ticket | create issue (REST v2) | transition to Done (default transition 31) | native issue webhook, or the portable contract |
Routing is per-tenant: a connector is registered against one tenant id and only ever fires for incidents of that tenant.
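To make the PagerDuty row concrete, here is a minimal sketch of the two Events API v2 request bodies that share one dedup key. The field names (routing_key, event_action, dedup_key, payload) are the public Events API v2 fields; the Go type and function names, the source label, and the fixed severity are assumptions for illustration, not probectl's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pagerDutyPayload holds the fields a trigger event must carry.
type pagerDutyPayload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"` // critical, error, warning, or info
}

// pagerDutyEvent is the request body POSTed to the Events API v2 endpoint.
type pagerDutyEvent struct {
	RoutingKey  string            `json:"routing_key"`
	EventAction string            `json:"event_action"` // "trigger" or "resolve"
	DedupKey    string            `json:"dedup_key"`
	Payload     *pagerDutyPayload `json:"payload,omitempty"` // only sent on trigger
}

// dedupKey derives the stable key from the incident id (probectl-<id>).
func dedupKey(incidentID string) string { return "probectl-" + incidentID }

// triggerEvent builds the "incident opened" event.
func triggerEvent(routingKey, incidentID, summary string) pagerDutyEvent {
	return pagerDutyEvent{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		DedupKey:    dedupKey(incidentID),
		Payload: &pagerDutyPayload{
			Summary:  summary,
			Source:   "probectl", // assumption: illustrative source label
			Severity: "critical", // assumption: fixed severity for the sketch
		},
	}
}

// resolveEvent builds the "incident resolved" event with the same dedup key.
func resolveEvent(routingKey, incidentID string) pagerDutyEvent {
	return pagerDutyEvent{
		RoutingKey:  routingKey,
		EventAction: "resolve",
		DedupKey:    dedupKey(incidentID),
	}
}

func main() {
	events := []pagerDutyEvent{
		triggerEvent("ROUTING_KEY", "42", "link down"),
		resolveEvent("ROUTING_KEY", "42"),
	}
	for _, ev := range events {
		body, err := json.Marshal(ev)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```

Because both events carry the same dedup_key, PagerDuty matches the resolve to the alert that the trigger opened.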
Lifecycle + mapping
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
S[cross-plane signal] --> C[incident correlator]
C -- opened --> O[observer]
O --> D[dispatcher]
D -- page --> PD[PagerDuty/Opsgenie]
D -- post --> CH[Slack/Teams]
D -- open ticket --> IT[ServiceNow/Jira]
D -. persist external ref .-> L[(incident_integrations)]
IT -- "ticket resolved (inbound webhook)" --> W["/ingest/itsm/{provider}/{id}"]
W -- verify + reverse-lookup --> R[resolve incident]
R -- sync OTHERS, skip origin --> D
- Open: when a signal opens a new incident, the correlator's observer calls the dispatcher, which pages / posts / opens-a-ticket on each of the tenant's connectors and records the external reference (ticket id / page dedup key) in incident_integrations. A correlated follow-up signal does not re-page.
- Resolve (outbound): resolving an incident (via the API, or an inbound webhook) syncs the resolution to every linked connector.
- Resolve (inbound): an ITSM / on-call system posts to POST /ingest/itsm/{provider}/{id}; probectl verifies the delivery, maps the external ref back to the incident, resolves it, and syncs the other systems.
Idempotency
Ticket creation and paging are idempotent. A UNIQUE (tenant, incident, connector) link row means an incident is opened at most once per connector, so a
delivery retry or a control-plane restart never double-pages or duplicates a
ticket — the dispatcher checks for an existing link before opening. Pager
connectors additionally pass a stable dedup_key / alias derived from the
incident id (probectl-<id>), so even a duplicate trigger coalesces server-side.
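The check-before-open behaviour can be sketched as follows. This is an in-memory stand-in under stated assumptions: a map keyed by (tenant, incident, connector) plays the role of the UNIQUE link row, and the names linkStore and openOnce are hypothetical, not the dispatcher's real API.

```go
package main

import (
	"fmt"
	"sync"
)

// linkKey mirrors the UNIQUE (tenant, incident, connector) constraint.
type linkKey struct{ tenant, incident, connector string }

// linkStore is an in-memory stand-in for the incident_integrations table.
type linkStore struct {
	mu    sync.Mutex
	links map[linkKey]string // value: external reference (ticket id / dedup key)
}

func newLinkStore() *linkStore {
	return &linkStore{links: make(map[linkKey]string)}
}

// openOnce calls open only when no link exists for k. It returns the external
// reference and whether this call created it. The lock is held across open
// for simplicity, so concurrent retries for the same key cannot both open.
func (s *linkStore) openOnce(k linkKey, open func() (string, error)) (string, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ref, ok := s.links[k]; ok {
		return ref, false, nil // already opened: reuse the recorded reference
	}
	ref, err := open()
	if err != nil {
		return "", false, err // nothing recorded, so a later retry may open
	}
	s.links[k] = ref
	return ref, true, nil
}

func main() {
	store := newLinkStore()
	key := linkKey{tenant: "tenant-1", incident: "42", connector: "jira"}
	calls := 0
	open := func() (string, error) {
		calls++
		return "OPS-101", nil
	}
	// Simulate a delivery retry: three attempts for the same incident.
	for i := 0; i < 3; i++ {
		ref, created, err := store.openOnce(key, open)
		fmt.Println(ref, created, err)
	}
	fmt.Println("provider calls:", calls)
}
```

In a database-backed version the uniqueness constraint, rather than a mutex, would be the final arbiter; the sketch only illustrates the "at most once per connector" outcome.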
Bidirectional sync + loop protection
The tricky case: an on-call engineer closes the ServiceNow ticket. probectl
resolves the incident and syncs that resolution to the other connectors — but
never echoes it back to its origin (the ServiceNow ticket is already closed;
re-closing it could ping-pong forever between two systems). The dispatch carries
the origin as its source, and the dispatcher skips that connector when fanning
out — while still marking the origin's link resolved so the mirror stays
accurate (Dispatcher.Resolved(..., source)). A duplicate inbound webhook for an
already-resolved incident is a no-op.
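A minimal sketch of the skip-origin fan-out, assuming a simple in-memory list of links. The link type and fanOutResolve function are hypothetical illustrations; only the behaviour (skip the source, still mark it resolved, treat repeats as a no-op) comes from the description above.

```go
package main

import "fmt"

// link is one connector's mirror of an incident.
type link struct {
	connector string
	resolved  bool
}

// fanOutResolve syncs a resolution to every linked connector except source.
// The source's own link is marked resolved without calling it, and links that
// are already resolved are skipped, so a duplicate delivery does nothing.
// It returns the connectors that were actually notified.
func fanOutResolve(links []link, source string, resolve func(connector string) error) []string {
	var notified []string
	for i := range links {
		l := &links[i]
		if l.resolved {
			continue // duplicate delivery: nothing left to do
		}
		if l.connector == source {
			l.resolved = true // origin already closed it; never echo back
			continue
		}
		if err := resolve(l.connector); err != nil {
			continue // best-effort mirror: this link stays unresolved
		}
		l.resolved = true
		notified = append(notified, l.connector)
	}
	return notified
}

func main() {
	links := []link{{connector: "servicenow"}, {connector: "pagerduty"}, {connector: "slack"}}
	send := func(connector string) error {
		fmt.Println("resolve ->", connector)
		return nil
	}
	// The ServiceNow ticket was closed by an engineer; sync the others.
	fmt.Println(fanOutResolve(links, "servicenow", send))
	// A duplicate inbound webhook for the same incident notifies nobody.
	fmt.Println(fanOutResolve(links, "servicenow", send))
}
```

Passing an empty source models a resolution that originated inside probectl (for example via the API), in which case every linked connector is notified.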
Inbound contract + security
POST /ingest/itsm/{provider}/{id} is an ingest surface (mounted off /v1, like
the change webhook). It authenticates each delivery, not a session:
- Include X-Probectl-Signature: sha256=<hmac-of-body-under-secret> or X-Probectl-Token: <secret> (constant-time compared). An unsigned, forged, or wrong-token delivery is rejected with 401 before any state change (fail closed). Verification routes through internal/crypto (crypto.Verify / crypto.ConstantTimeEqual).
- The delivery is bound to the credential's tenant (id → tenant), never a value from the payload, so one tenant can never resolve another's incident, even with the same external ref (RLS + a tenant-scoped reverse lookup).
- The body is treated as untrusted and is size-limited (1 MiB).
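The per-delivery check could look roughly like the sketch below. It uses the Go standard library (crypto/hmac, crypto/subtle) as a stand-in because the signatures of internal/crypto are not shown here. Hex encoding of the HMAC and the function names sign and verifyDelivery are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

const sigPrefix = "sha256="

// sign produces the header value a sender would attach to body.
// Assumption: the HMAC-SHA256 digest is hex-encoded after the prefix.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return sigPrefix + hex.EncodeToString(mac.Sum(nil))
}

// verifyDelivery accepts a delivery only when the signature header or the
// token header matches the secret. Both comparisons are constant-time, and a
// delivery that carries neither header is rejected (fail closed).
func verifyDelivery(secret, body []byte, signature, token string) bool {
	if signature != "" {
		if !strings.HasPrefix(signature, sigPrefix) {
			return false
		}
		got, err := hex.DecodeString(signature[len(sigPrefix):])
		if err != nil {
			return false
		}
		mac := hmac.New(sha256.New, secret)
		mac.Write(body)
		return hmac.Equal(got, mac.Sum(nil))
	}
	if token != "" {
		return subtle.ConstantTimeCompare([]byte(token), secret) == 1
	}
	return false
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"external_ref":"probectl-42","status":"resolved"}`)

	fmt.Println(verifyDelivery(secret, body, sign(secret, body), "")) // signed
	fmt.Println(verifyDelivery(secret, body, "", "webhook-secret"))   // token
	fmt.Println(verifyDelivery(secret, body, "", ""))                 // unsigned
	fmt.Println(verifyDelivery(secret, []byte("tampered"), sign(secret, body), ""))
}
```

A handler built on this would read the body through a size limit (for example http.MaxBytesReader with 1 << 20 bytes), run the check, and return 401 before touching any incident state when it fails.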
probectl understands ServiceNow ({"sys_id","state"}, state 6/7 = resolved)
and Jira (statusCategory.key == "done") shapes natively; every provider
(including PagerDuty / Opsgenie) also supports the portable contract:
{ "external_ref": "probectl-<incident-id>", "status": "resolved" }
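As an illustration of how the three shapes might be mapped to "which external ref, and is it resolved", consider the sketch below. The function and type names are hypothetical. The Jira nesting (issue.fields.status.statusCategory.key, with issue.key as the reference) follows Jira's usual issue webhook layout but is an assumption here, as is accepting the ServiceNow state as either a JSON string or a number.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inboundBody is a union of the fields read from the three payload shapes.
type inboundBody struct {
	// Portable contract.
	ExternalRef string `json:"external_ref"`
	Status      string `json:"status"`
	// ServiceNow shape.
	SysID string      `json:"sys_id"`
	State interface{} `json:"state"` // may arrive as "6" or 6
	// Jira issue webhook shape (assumed nesting).
	Issue struct {
		Key    string `json:"key"`
		Fields struct {
			Status struct {
				StatusCategory struct {
					Key string `json:"key"`
				} `json:"statusCategory"`
			} `json:"status"`
		} `json:"fields"`
	} `json:"issue"`
}

// mapInbound returns the external reference carried by the payload and
// whether the payload reports a resolved state.
func mapInbound(provider string, raw []byte) (ref string, resolved bool, err error) {
	var b inboundBody
	if err := json.Unmarshal(raw, &b); err != nil {
		return "", false, err
	}
	// Every provider may use the portable contract, so check it first.
	if b.ExternalRef != "" {
		return b.ExternalRef, b.Status == "resolved", nil
	}
	switch provider {
	case "servicenow":
		if b.SysID != "" {
			state := fmt.Sprint(b.State) // normalizes "6" and 6 to "6"
			return b.SysID, state == "6" || state == "7", nil
		}
	case "jira":
		if b.Issue.Key != "" {
			return b.Issue.Key, b.Issue.Fields.Status.StatusCategory.Key == "done", nil
		}
	}
	return "", false, fmt.Errorf("unrecognized %s payload", provider)
}

func main() {
	samples := map[string]string{
		"pagerduty":  `{"external_ref":"probectl-42","status":"resolved"}`,
		"servicenow": `{"sys_id":"abc123","state":"6"}`,
		"jira":       `{"issue":{"key":"OPS-101","fields":{"status":{"statusCategory":{"key":"done"}}}}}`,
	}
	for provider, body := range samples {
		ref, resolved, err := mapInbound(provider, []byte(body))
		fmt.Println(provider, ref, resolved, err)
	}
}
```

The returned reference would then feed the tenant-scoped reverse lookup; the tenant itself still comes from the credential, never from these payload fields.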
Outbound delivery uses the hardened, certificate-validating HTTP client (TLS is never disabled); the provider credential is sent only as an auth header and is never logged.
Configuration
See configuration.md for the
key reference. Example (a tenant paging PagerDuty + ticketing Jira, with inbound
sync from Jira):
PROBECTL_NOTIFY_CONNECTORS=00000000-0000-0000-0000-000000000001|pagerduty|https://events.pagerduty.com/v2/enqueue|<routing-key>,00000000-0000-0000-0000-000000000001|jira|https://acme.atlassian.net/rest/api/2/issue?project=OPS&resolve_transition=31|alice@acme.com:<api-token>
PROBECTL_NOTIFY_INBOUND=jira1:00000000-0000-0000-0000-000000000001:jira:<webhook-secret>
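Read from the example alone, each PROBECTL_NOTIFY_CONNECTORS entry appears to be tenant|provider|url|credential, with entries separated by commas. The sketch below parses that inferred layout; configuration.md remains the authority on the real format, and the names connector, parseConnectors, and option are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// connector is one parsed entry of the connector list.
type connector struct {
	Tenant, Provider, Endpoint, Credential string
}

// option returns a query parameter from the endpoint URL, such as
// resolve_transition in the Jira example, or "" when it is absent.
func (c connector) option(name string) string {
	u, err := url.Parse(c.Endpoint)
	if err != nil {
		return ""
	}
	return u.Query().Get(name)
}

// parseConnectors splits a comma-separated list of
// tenant|provider|url|credential entries.
// Assumption: neither URLs nor credentials contain a comma.
func parseConnectors(spec string) ([]connector, error) {
	var out []connector
	for _, entry := range strings.Split(spec, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		parts := strings.SplitN(entry, "|", 4)
		if len(parts) != 4 {
			return nil, fmt.Errorf("connector %q: want tenant|provider|url|credential", entry)
		}
		if _, err := url.ParseRequestURI(parts[2]); err != nil {
			return nil, fmt.Errorf("connector %q: bad url: %w", entry, err)
		}
		out = append(out, connector{
			Tenant:     parts[0],
			Provider:   parts[1],
			Endpoint:   parts[2],
			Credential: parts[3],
		})
	}
	return out, nil
}

func main() {
	spec := "00000000-0000-0000-0000-000000000001|pagerduty|https://events.pagerduty.com/v2/enqueue|ROUTING_KEY," +
		"00000000-0000-0000-0000-000000000001|jira|https://acme.atlassian.net/rest/api/2/issue?project=OPS&resolve_transition=31|alice@acme.com:API_TOKEN"
	conns, err := parseConnectors(spec)
	if err != nil {
		panic(err)
	}
	for _, c := range conns {
		// The credential is deliberately not printed.
		fmt.Println(c.Tenant, c.Provider, c.option("project"), c.option("resolve_transition"))
	}
}
```

An empty or unset variable yields no connectors, which matches the off-by-default behaviour described earlier.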
Out of scope
probectl is not a SIEM (see siem.md) and not a CMDB; it does
not own on-call schedules or escalation policies (those stay in PagerDuty /
Opsgenie). Connectors mirror the incident outward — there is no auto-remediation
here.