Collective internet-outage view
What this is. The feature that answers "is it us, or is the internet broken?"
When a test starts failing, the operator needs to know whether the fault is inside
their network or out in some ISP / region they don't control. probectl answers
that by joining two things it can honestly know — and being explicit about the
limits of both. It lives in internal/outage, surfaced at GET /v1/outages.
The two inputs:
- Public outage signals — IODA (Georgia Tech) and the Cloudflare Radar outage annotations. These are fetched opt-in, ingested once (shared across all tenants, never tenant-owned data), and cached so a feed failure keeps the last-good events instead of going blank.
- Your own vantage points — the synthetic-result stream your agents already produce. When several distinct targets inside one external scope (an ISP's ASN, a country) start failing together, that is a vantage-detected outage. When one of your failing tests sits inside the scope of an active external event, that is a correlation ("your checkout test is failing because AS64500 is melting").
The honesty contract — the thing that keeps this feature truthful:
Coverage = your vantage points + public open data. probectl does not operate a global probe fleet and never pretends to. Every response carries coverage notes; degraded modes (feeds off, enrichment off) are stated plainly, not papered over.
That is why the API always returns scope_resolution and emits notes like "IP→ASN
enrichment is off — vantage detection and impact correlation are unavailable": a
view built on partial data must say so.
The outage-signal model (the contract)
One normalized Event for every source:
| Field | Meaning |
|---|---|
source |
ioda | cloudflare_radar | vantage |
scope |
the join key: {kind: asn|country|region, code, name} (e.g. AS15169, BR) |
severity / confidence |
documented heuristics derived from source scores — not vendor-calibrated probabilities |
start / end |
event window; empty end = ongoing |
evidence_url |
deep link into IODA / the Radar outage center |
External events are shared (public data). Everything derived from customer
telemetry — vantage events, affected_tests — is tenant-scoped and computed
per caller: tenant isolation is the outermost boundary (a project
non-negotiable). The engine partitions all
customer-derived state by tenant, and a tenant's Snapshot can reach nothing
from any other tenant.
Detection + correlation semantics
The thresholds are deliberately conservative — internet weather is situational
awareness, not a pager. (All values live in engine.go.)
- Vantage detection (per tenant, per scope, 15-minute window): fires when ≥2 distinct targets in the scope are failing (each ≥50% failure rate over ≥2 samples) and those failing targets are ≥50% of the scope's observed targets. The episode is latched (it won't re-fire while live), clears on recovery (failing ratio drops below 25%), and then re-arms. One target failing alone is never an outage — that case is what ordinary alerting is for.
- Correlation: a failing result whose peer IP resolves into the scope of an
active external event raises
outage.external_correlated— once per(tenant, event), carrying the affected test as evidence.
Both are signals into the incident pipeline (plane outage, severity
warning) — situational awareness, never a pager storm, and never an automated
action: detection is a signal, not an IPS, so probectl never blocks traffic on
the back of one.
Scope resolution (peer IP → ASN / country) rides the open-data enricher, gated by
PROBECTL_FLOW_ENRICH_ASN. Without it the external view still renders, and the
response says plainly that vantage detection and correlation are off.
Feeds, AUP, sovereignty
PROBECTL_OUTAGE_FEEDS_ENABLED=false by default: turning it on is what makes the
outbound fetches, so the no-phone-home default holds. Fetches use a hardened,
certificate-validating TLS client; fetched bodies are untrusted input with size
caps; and a down or rate-limited feed keeps its last-good events rather than
breaking the view. Per-feed AUP / provenance is tracked and served, which matters
for MSP resale:
| Feed | License / terms | Commercial use |
|---|---|---|
ioda |
IODA data-usage terms (academic project; attribution: "IODA, Georgia Institute of Technology") | unknown — confirm before resale |
cloudflare_radar |
CC BY-NC 4.0 (attribution: "Cloudflare Radar"); API token required | restricted (non-commercial) |
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
IODA[IODA v2 events] -->|opt-in, TLS, last-good| REF[outage.Refresher]
RADAR[Cloudflare Radar annotations] -->|opt-in, token| REF
REF --> ST[outage.Store<br/>SHARED external events]
RES[synthetic results] --> OC[OutageConsumer]
EN[open-data enricher<br/>IP→ASN/country] --> OC
OC --> ENG[outage.Engine<br/>per-tenant vantage + correlation]
ST --> ENG
ENG -->|tenant-scoped signals| INC[incidents]
ENG -->|Snapshot per caller| API[GET /v1/outages]
API --> UI[Internet outages page]
Configuration
| Variable | Default | Purpose |
|---|---|---|
PROBECTL_OUTAGE_ENABLED |
true |
the local engine (vantage detection + correlation; no outbound calls) |
PROBECTL_OUTAGE_FEEDS_ENABLED |
false |
opt-in public feeds (outbound fetches) |
PROBECTL_OUTAGE_FEEDS |
(all) | which feeds to load: ioda, cloudflare_radar |
PROBECTL_OUTAGE_REFRESH |
10m |
feed refresh cadence |
PROBECTL_OUTAGE_RETENTION |
48h |
event window kept / queried |
PROBECTL_OUTAGE_RADAR_TOKEN |
(none) | Cloudflare API token (secret-ref resolvable); Radar is omitted without it |
GET /v1/outages requires the metrics.read permission (it reads the synthetic /
metrics plane).
Out of scope by design: owning a global probe / BGP fleet. The view leans on your vantages plus open data — and says so.