probectl /docs GitHub ↗
                 _               _   _
 _ __  _ __ ___ | |__   ___  ___| |_| |
| '_ \| '__/ _ \| '_ \ / _ \/ __| __| |
| |_) | | | (_) | |_) |  __/ (__| |_| |
| .__/|_|  \___/|_.__/ \___|\___|\__|_|
|_|        see everything · send nothing

Self-hosted, multi-tenant network observability — five planes, one OpenTelemetry-native control plane,
and an AI assistant that explains root cause across them. Telemetry never leaves your network.

CI tag Go Report Card Go status license

Why · What it answers · Capabilities · How it works · Quickstart · Docs · License

probectl unifies five observability planes — active/synthetic testing, BGP/routing intelligence, flow analytics, device telemetry, and eBPF host/L7 — into one OpenTelemetry-native control plane, then layers cross-plane AI root-cause analysis, a security/threat signal layer, change-aware topology with what-if simulation, and cost/SLO intelligence on top.

One codebase serves two operating modes: sovereign single-tenant (a regulated or air-gapped org self-hosts; the deployment is the tenant) and multi-tenant / provider (an MSP self-hosts once and serves many hard-isolated, white-labeled tenants). The single-tenant install is just the one-tenant case — there is no separate code path, no enterprise fork to drift out of sync. Tenant is the outermost scope and security boundary on every record, agent, query, metric, event, and object.

Status: the platform is built — all five planes plus the intelligence, security, and provider/MSP layers in the tables below are shipped; the work now is hardening toward GA. Compose + Helm are HTTPS-by-default. The license is intentionally TBDsource-available, not open source (yet) (details).

Try it in ~60 seconds (Docker only, no Go toolchain — full walkthrough in the Quickstart):

docker compose -f deploy/compose/eval.yml up --build -d
docker compose -f deploy/compose/eval.yml --profile tools run --rm viewer   # → your first data

Why probectl

When something on the network breaks, the symptom and the cause usually live in different places. A slow checkout page might be a BGP route flap three networks away, a saturated uplink, a DNS timeout, a misbehaving host, or a config change that shipped ten minutes ago — and the tools that each see one of those are typically five separate products with five separate dashboards. You find out at 2 a.m., by stitching them together by hand.

probectl collapses that. Every plane lands in one tenant-scoped control plane, gets correlated into a single incident, and an AI assistant explains the root cause across planes instead of leaving you to guess which dashboard to open first.

Three choices set it apart:

  • It stays yours. Self-hosted and never phones home — no telemetry beacons, no "call home," nothing. Hosted observability SaaS works by shipping your network data to a vendor's cloud; probectl keeps every byte inside your own infrastructure. For regulated, air-gapped, or sovereignty-conscious operators that isn't a nice-to-have — it's the requirement.
  • It's unified and standard. One OpenTelemetry-native model spans all five planes, so a flow record, a probe result, and a BGP event share the same schema and the same query layer. The receiver ingests all three OTLP signals — metrics, traces, and logs — bounded for correlation, and re-exports probectl's own signals as OTLP metrics; the schemas follow OTel resource + network semantic conventions everywhere (docs/otlp.md).
  • It's multi-tenant to the core. The same binary runs as a single sovereign tenant for one org, or as a hard-isolated, white-labeled, individually-metered platform an MSP resells — one codebase, one security boundary.

What it answers

probectl is organized around the questions operators actually ask at 2 a.m.:

  • "The Berlin office says the app is slow — is it the network, the path in between, or the server?" — synthetic probes, ECMP/MPLS-aware path discovery, and flow show you where the latency is, not just that it exists.
  • "Did the 14:03 change cause this?" — change-aware topology correlates config/deploy events with the symptoms that followed them.
  • "Why did this prefix go dark — is it us, or the internet?" — BGP/routing intelligence from RouteViews/RIPE RIS, RPKI validity, and a collective outage view separate a you-problem from an everyone-problem.
  • "What breaks if I drain this node?" — the topology what-if simulates the blast radius before you touch production.
  • "Who's saturating this link, and what's it costing?" — flow analytics plus per-tenant FinOps egress attribution.

Or just ask the built-in assistant "why is checkout slow for tenant X?" — it runs the cross-plane correlation and answers with cited evidence, scoped to exactly what the caller is allowed to see.

Who it's for

  • Regulated & sovereignty-conscious orgs (finance, healthcare, public sector, defense, critical infrastructure) that need deep network observability but cannot send telemetry to a third-party cloud.
  • MSPs & internal platform teams serving many customers or business units — self-host once, serve hard-isolated, white-labeled, individually-metered tenants from one control plane.
  • Network & platform engineers tired of hand-correlating five dashboards who want a single OTel-native source of truth they actually own.

Capabilities

The five observability planes:

Plane What it covers
Active / synthetic canaries (ICMP/TCP/UDP/HTTP/DNS/…), ECMP/MPLS-aware path discovery, browser-synthetic checks, endpoint digital-experience monitoring
BGP / routing RouteViews + RIPE RIS ingestion, route/path analysis, RPKI validity, a collective internet-outage view
Flow analytics NetFlow / sFlow / IPFIX into ClickHouse, with per-tenant anomaly detection
Device telemetry SNMP polling + gNMI streaming, folded into the topology graph
eBPF host / L7 service map + L7 visibility, observation-only (the Retina model). Default builds replay recorded fixtures (no kernel access needed — CI/macOS/demo path); live kernel capture is the separate -tags ebpf build on a BTF kernel (build matrix)

Intelligence, security, and platform layers built across the planes:

Layer What it does
AI assistant cross-plane RCA grounded in correlated incidents, natural-language semantic query, AI test authoring, and an MCP server (read-only tools + a proposal-only remediation tool) — all tenant- then RBAC-scoped. Default engine: a deterministic in-process heuristic — no LLM is involved or contacted unless you explicitly connect one (local Ollama/vLLM for full air-gap, or a cloud provider as explicit opt-in; start with docs/ai-quickstart.md)
Topology a versioned, change-aware dependency graph with what-if impact simulation
Security / threat TLS/cert posture + NDR-lite, confidence-scored detections (a signal exported to your SIEM — never an inline IPS)
Cost / SLO FinOps egress-cost attribution, an OpenSLO engine, and segmentation/compliance validation with evidence
Guarded remediation the AI proposes a fix grounded in RCA + a dry-run; a human approves; probectl never executes — proposal-only, blast-radius-limited, fully audited
Multi-tenancy pooled / siloed / hybrid isolation, selectable per deployment and per tenant
Provider / MSP plane tenant lifecycle, fleet-across-tenants, per-tenant metering + quotas, white-label branding, and audited break-glass (no implicit access to tenant telemetry)
Sovereignty & crypto mTLS/SPIFFE agent identity, envelope encryption, per-tenant BYOK, per-tenant export + verifiable erasure, and an optional build against the FIPS 140-3-validated Go Cryptographic Module (CMVP cert #5247; probectl itself holds no product-level certificate — see docs/hardening.md)

How it works

Lightweight agents — a single Go binary, each bound to one tenant — run the probes and watch the wire, then push results onto a bus. The stateless control plane consumes that stream, persists each signal to the store that fits it (Postgres for state, ClickHouse for high-cardinality events, Prometheus/VictoriaMetrics for metrics), and continuously builds incidents and a versioned topology graph. Every record, query, metric, and message is scoped by tenant_id first, then by your RBAC — the API, web UI, AI assistant, and MCP server all read through that same boundary, so a query cannot cross a tenant line even by mistake.

External intelligence (RouteViews, RIPE RIS/Atlas, RPKI, threat-intel, cloud pricing) is fetched once, cached, and enriched per tenant; if a feed is rate-limited or down, that view degrades gracefully instead of taking the platform with it.

%%{init: {'theme':'base','themeVariables':{'background':'transparent','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#768390','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','titleColor':'#e6edf3','edgeLabelBackground':'#161b22','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart TB
    Provider["Provider / Management Plane — MSP operators (distinct privilege domain)<br/>tenant lifecycle · fleet-across-tenants · metering/billing · white-label<br/>audited break-glass (no implicit tenant-data access)"]

    subgraph CP["Control Plane — Go, stateless, TENANT-AWARE"]
        Edge["REST (OpenAPI 3.1) · gRPC (agents, mTLS) · MCP · Webhooks/OTLP<br/>Auth (SSO/RBAC/ABAC) · Audit · Tenant → Org → Team → Project"]
        Subsys["subsystems: tenancy · path · bgp · opendata · threat · change ·<br/>topology · cost · slo · compliance · ai · remediation · …"]
    end

    Agents["Agents — Go, single binary, tenant-bound<br/>canary plugins · path engine · eBPF host/L7"]
    Analyzer["BGP analyzer (Python)<br/>RouteViews/RIS MRT + RIS Live"]
    Bus["Bus — Kafka / in-process<br/>(tenant-tagged)"]
    Stores["Postgres · ClickHouse · Prometheus/VM<br/>topology graph · object store"]
    External["External (read-only, cached, degrade gracefully)<br/>RouteViews · RIPE RIS/Atlas · RPKI · PeeringDB · MaxMind/Cymru · CT logs · threat-intel · cloud pricing"]

    Provider -->|tenant-scoped, isolated| CP
    Agents -->|gRPC mTLS| Edge
    Analyzer -->|probectl.bgp.events| Bus
    Agents -->|results, tenant-tagged| Bus
    Bus --> Subsys
    Subsys -->|queries, tenant-first| Stores
    External -.->|ingest once, scope per tenant| Analyzer
    External -.->|cached| Subsys

    classDef prov fill:#26215C,stroke:#7F77DD,color:#CECBF6
    classDef agent fill:#042C53,stroke:#378ADD,color:#B5D4F4
    classDef analyzer fill:#04342C,stroke:#1D9E75,color:#9FE1CB
    classDef bus fill:#412402,stroke:#EF9F27,color:#FAC775
    classDef store fill:#173404,stroke:#639922,color:#C0DD97
    classDef ext fill:#2C2C2A,stroke:#888780,color:#D3D1C7
    class Provider prov
    class Agents agent
    class Analyzer analyzer
    class Bus bus
    class Stores store
    class External ext

The provider/management plane spans tenants for operations only — never for silent data access; any access is explicit, time-bounded, tenant-consented, and separately audited. Full data-flow and per-subsystem diagrams live in docs/architecture.md.

What probectl is not

It's a signal layer, not an enforcement layer. Threat detections are confidence-scored and exported to your SIEM — probectl does not inline-block traffic or act as an IPS. The AI proposes remediations; a human approves and an operator acts — there is no autonomous execution. And it complements, rather than replaces, a full APM/distributed-tracing stack or a SIEM/log-analytics platform. probectl is honest about its edges by design.

Editions

The full five-plane platform — all observability, the AI assistant, security/threat, topology, cost/SLO, and single-tenant self-hosting — is core, and free. Commercial code lives in a publicly-readable ee/ tree (the fence is the license + trademark, not source secrecy) and is gated at runtime by an offline-verifiable, signed license that never phones home. Enterprise adds the validated-module (FIPS) build, BYOK/governance, multi-region HA, and guarded remediation; Provider/MSP adds the management plane, hard tenant isolation, metering/billing, and white-label. Unlicensed commercial features are simply hidden (no lockware). See docs/editions.md.

Quickstart (run it)

One idea first: the control plane is a consumer, not a producer. It ingests, correlates, stores, and serves — but it never observes the network itself. The things that watch the wire and run probes are the producers: the agents and collectors. So a control plane with no producers attached collects nothing/readyz goes green and the dashboards stay empty. That's expected, not a bug. To see data you have to attach at least one producer.

Fastest path to first data — the evaluation stack

This brings up the control plane plus an eBPF agent in fixture mode (replaying a recorded, clearly-labelled file of SAMPLE flows — no kernel, works on macOS/Windows/Linux), so you watch a real signal flow end-to-end with one command and no Go toolchain:

docker compose -f deploy/compose/eval.yml up --build -d
# ~20s for the control plane to migrate + start, then:
docker compose -f deploy/compose/eval.yml --profile tools run --rm viewer

viewer prints the /v1/topology service map the control plane folded out of those sample flows — and that JSON is your first data (pretty-printed here; your at timestamp will differ):

{
  "at": "2026-06-10T18:42:07Z",
  "coverage": {
    "path_edges": 0, "flow_edges": 2, "routing_edges": 0, "device_edges": 0,
    "notes": [
      "no routing-plane (BGP) edges — prefix impact may be incomplete",
      "no device→hop interface links — device-level impact unavailable"
    ]
  },
  "edges": [
    {"from": "service:10.0.1.5", "to": "service:10.0.2.9", "kind": "flow"},
    {"from": "service:10.0.1.5", "to": "service:10.0.3.3", "kind": "flow"}
  ],
  "nodes": [
    {"id": "service:10.0.1.5", "kind": "service", "label": "10.0.1.5"},
    {"id": "service:10.0.2.9", "kind": "service", "label": "10.0.2.9"},
    {"id": "service:10.0.3.3", "kind": "service", "label": "10.0.3.3"}
  ],
  "topology_running": true
}

That's the whole pipeline in one read: the agent replayed three sample flows (one host talking to an HTTPS endpoint and a Postgres), the control plane folded them into two service edges, and the coverage block honestly reports which planes this little graph does not yet see.

This stack is evaluation-only (loopback dev-auth — every request is an unauthenticated admin — plus plaintext Kafka and a self-signed cert), so it's never reachable from your network and never for production. Walk it all the way through — including synthetic (canary) probes and the build-from-source path — in docs/getting-started.md, and meet every producer you can attach in docs/deploying-agents.md.

Production-shaped stack

The shipped all-in-one deploy is deploy/compose/probectl.yml: the control plane over HTTPS with a bundled Postgres (a self-signed cert is generated on first boot), no evaluation weakenings.

cp deploy/compose/.env.example deploy/compose/.env     # set POSTGRES_PASSWORD (required) + PROBECTL_ENVELOPE_KEY
docker compose -f deploy/compose/probectl.yml up -d
docker compose -f deploy/compose/probectl.yml cp control:/certs/ca.crt ./ca.crt
curl --cacert ./ca.crt https://localhost:8443/readyz

Once /readyz is green, open the UI at https://localhost:8443 — then (per the consumer/producer rule above) register an agent and run your first synthetic test, or ask the assistant a question. The API is HTTPS-only (no plaintext port). Full guide, real certificates, SSO, and the Kubernetes/Helm path: docs/install.md; day-2 operation (audit, roles, SSO): docs/admin.md.

Build from source

Prerequisites: Go 1.26+, Docker (with Buildx) for the dev stack and images, and Python 3.12+ for the analyzer tooling.

make build          # build all binaries into ./bin
make test           # unit tests across the workspace
make lint           # gofmt + go vet + golangci-lint, and ruff + black
make compose-up     # start the dev dependency stack (Postgres/Kafka/ClickHouse/Prometheus)
make run            # run probectl-control locally
make help           # list every target

Repository layout

cmd/            # binaries: probectl-control, probectl-agent, probectl-ebpf-agent,
                #           probectl-flow-agent, probectl-device-agent,
                #           probectl-endpoint, probectl-license, probectl (CLI)
internal/       # subsystem packages (control, tenancy, path, bgp, crypto, ai, ...)
ee/             # commercial tree (provider plane, white-label, metering, BYOK,
                #   remediation) — publicly readable; core never imports it
pkg/            # shared, public libraries
proto/          # protobuf schemas (gRPC + bus) — buf-managed
analyzer/       # Python BGP analyzer
migrations/     # sequential, idempotent SQL migrations
web/            # frontend (React + Vite + TypeScript, themeable design tokens)
deploy/         # compose (eval + production + dev stacks), docker, helm,
                #   agent hardening profiles, terraform, gitops
docs/           # configuration, development, architecture, runbooks
test/           # integration harness (separate Go module)

Documentation

New here? Start with Why probectl and How it works above, then walk the zero-to-first-data journey in getting started. Going deeper:

Topic Doc
Getting started (zero → first real data) docs/getting-started.md
Deploying agents & collectors (the producers) docs/deploying-agents.md
Install & deploy (compose / Helm / air-gapped) docs/install.md
Day-2 admin (audit, roles, SSO) docs/admin.md
Architecture deep-dives docs/architecture.md
Every config key docs/configuration.md
Editions & licensing model docs/editions.md
Tenant isolation (pooled/siloed/hybrid) docs/isolation.md
Provider / MSP plane docs/provider-plane.md
Using the AI (ask → local model → MCP, in 10 min) docs/ai-quickstart.md
AI RCA · semantic query · MCP docs/ai-rca.md · docs/ai-query.md · docs/mcp.md
Guarded remediation (policy) docs/remediation.md
FIPS / hardening · multi-region HA · BYOK docs/hardening.md · docs/multi-region.md · docs/byok.md
Development & CI docs/development.md
Vulnerability disclosure SECURITY.md

Getting help

Bug reports and questions: GitHub Issues on this repo. Security vulnerabilities: never a public issue — follow SECURITY.md. Most "how do I…" answers live in the documentation table above; start with docs/getting-started.md.

Contributing

Read CONTRIBUTING.md. Commits follow Conventional Commits (enforced by commitlint in CI) and carry a DCO sign-off (git commit -s). Before pushing, run make ci — it runs the linters, the unit tests, and the cross-tenant isolation gate, the same checks CI enforces on every pull request. The non-negotiable rules (tenant isolation, no phone-home, crypto only through internal/crypto, TLS on every listener) are summarized in CONTRIBUTING.md and enforced by standing CI gates.

License

Source-available — not open source (yet). The source is published to be read, audited, and self-hosted, but it is not released under an OSI-approved open-source license, and no open-source rights are granted at this time.

The license is intentionally TBD: the open-core / reseller boundary is still an open decision, with a Business Source License (BSL)–family, open-core model intended (a core that may open over time; commercial use of the provider/MSP and Enterprise features reserved). Until a grant is added here, treat the code as all rights reserved.

Rendered live from github.com/imfeelingtheagi/probectl — found a mistake? edit this page.