probectl threat model

This is the system-wide threat model: what probectl protects, who might attack it, and, boundary by boundary, what stops them and, honestly, what does not yet. It is versioned with the code and should be reviewed on any change to a trust boundary and at each release. Every mitigation cites the code, CI gate, or document that enforces it, so you can verify rather than trust.

Companion docs: agent-whitepaper.md (the agent in depth), incident-response.md (what happens when something breaks), ../hardening.md, ../isolation.md, and the project non-negotiables.

1. What we protect (assets)

Asset	Why it matters	Where it lives
Tenant telemetry (flows, probe results, paths, device/BGP/L7 events)	The product's reason to exist; cross-tenant leakage is the declared highest-severity failure	Kafka (transit), ClickHouse, Postgres, TSDB, object store
Tenant / config state (tenants, RBAC, SSO config, SLOs, incidents)	Controls who sees what	Postgres (RLS)
Audit chains (tenant + provider streams)	The forensic record; targeted by any competent attacker	Postgres hash chains + signed WORM exports to object storage (`internal/audit/worm.go`)
Secrets (DB/bus creds, SNMP/API credentials, license keys)	Lateral-movement fuel	envelope encryption via `internal/crypto`; reference-based resolution (`internal/secrets`)
The agent fleet	Privileged (`CAP_BPF`) code on customer hosts — the scariest asset to lose	operator-managed hosts; no self-update by design
AI prompts / evidence	The one place tenant data may deliberately leave the network	air-gapped built-in model by default; remote only behind three gates (../ai-egress.md)
The supply chain (source → CI → artifacts)	A compromise here multiplies into every deployment	GitHub + cosign-signed releases, SHA-pinned CI actions, digest-pinned base images

2. Trust boundaries

[tenant user] --TLS/session--> [control plane] <--mTLS+SPIFFE--> [agents]
      |                          |       ^  ^                      (CAP_BPF hosts)
 [provider operator]             |       |  +--TLS/auth-- [OTLP/webhook senders]
  (separate domain,              v       |
   break-glass only)        [bus (Kafka, TLS)]--> [stores: PG(RLS)/CH(row policy)/TSDB]
                                 |
                            [AI adapter] --three-gated TLS--> (optional remote LLM)
                            [MCP server] --tenant-then-RBAC--> callers

B1 tenant↔tenant (inside every shared component) · B2 agent↔control plane · B3 ingest surfaces (bus, OTLP, webhooks) · B4 control plane↔stores · B5 operator/provider plane↔tenant data · B6 AI/MCP↔models and callers · B7 build/release↔deployments · B8 agent↔monitored host.

3. Attacker profiles

External unauthenticated — internet/intranet reach to exposed listeners.
Malicious tenant (the defining multi-tenant adversary) — valid credentials, hostile intent, aims at B1.
On-path network attacker — can intercept/inject between components.
Compromised monitored host — controls traffic the agent observes and the local libssl the L7 probe attaches to.
Compromised agent node — has the agent's identity and CAP_BPF.
Malicious/compelled provider operator — legitimate provider-plane access, wants silent tenant-data reach (B5).
Supply-chain attacker — targets deps, CI, artifacts (B7).
Prompt-injection attacker — plants payloads in telemetry that the AI layer will read (B6); needs no credentials at all.

4. STRIDE by boundary — mitigations (evidence) and gaps (register IDs)

B1 — Tenant ↔ tenant (the defining boundary)

This is the boundary that matters most: a malicious tenant with valid credentials trying to reach another tenant's data. The full storage-layer mechanism is in tenant-isolation.md; the summary by threat:

Threat (STRIDE)	Mitigation — evidence
Info disclosure: cross-tenant read	RLS forced at the storage layer (`internal/tenancy`, migrations); the cross-tenant-isolation CI job runs the suite against real Postgres on every CI pass. ClickHouse adds `tenant_id` partition keys, `ErrNoTenant` pre-flight refusals, and DB-level row policies, gated against real ClickHouse (`internal/store//isolation_clickhouse_test.go`). For the TSDB, the query proxy forces the tenant label and refuses any unscoped forward (`internal/promapi/upstream.go`)
Tampering: writing into another tenant	tenant resolved at the edge and propagated API → bus → store; bus messages are tenant-keyed; consumers stamp and verify
DoS: noisy neighbor	per-tenant fairness gate ahead of the pipeline (`internal/fairness`); per-agent and per-tenant cardinality caps (`internal/pipeline/cardinality.go`); bounded async publish that sheds load with a counter, never silently (`internal/bus/kafka.go`); a noisy-neighbor SLO gate runs every CI pass at the documented latency floor (`internal/perf/scale.go`)
Elevation: tenant → other tenant via AI/MCP	the AI/MCP query layer enforces tenant first, then RBAC, on every call; an end-to-end tenancy assertion runs over the public API (`test/e2e`)
Repudiation	per-tenant tamper-evident audit chains; erasure produces store-by-store attestations
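
The pre-flight refusal in the first row can be pictured as a guard that runs before any store is touched. The following is a minimal sketch, not the code in `internal/tenancy`: `ErrNoTenant` borrows its name from the table, while `FlowQuery`, the `flows` table, and its columns are hypothetical.

```go
package main

import "errors"

// ErrNoTenant borrows the name from the table above; this definition is
// illustrative and not taken from the real code.
var ErrNoTenant = errors.New("query refused: no tenant in scope")

// flowsByTenant is a hypothetical statement. The tenant predicate is part of
// the fixed SQL text, so a caller cannot build a query without it.
const flowsByTenant = "SELECT ts, src, dst FROM flows WHERE tenant_id = ? AND ts >= ?"

// FlowQuery returns the statement and its bound arguments, or ErrNoTenant
// when the caller carries no tenant. The check happens before any I/O.
func FlowQuery(tenantID string, sinceUnix int64) (string, []any, error) {
	if tenantID == "" {
		return "", nil, ErrNoTenant
	}
	return flowsByTenant, []any{tenantID, sinceUnix}, nil
}
```

A guard like this is only the application-side half of the defense. The row policies and RLS described above still apply if a code path skips it.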

B2 — Agent ↔ control plane

Threat	Mitigation — evidence
Spoofed agent / spoofed control plane	mTLS with a SPIFFE-style tenant-bound identity (no plaintext agent transport exists) and a mandatory trust-domain pin — wrong trust domain, rejected handshake
Tampering in transit	TLS 1.2+/1.3 via the hardened configs in `internal/crypto`; no plaintext agent transport
Fleet takeover via updates	No self-update channel exists; upgrades are operator-driven waves of cosign-signed artifacts with registry verification and halt-on-error (`internal/agent/rollout.go`)
Rogue agent floods	fairness + cardinality caps as in B1; per-agent registry identity, version-skew-gated handshake (`internal/lifecycle/version.go`)
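
The trust-domain pin in the first row amounts to a string comparison on the peer's SPIFFE ID before anything else is trusted. The sketch below assumes a hypothetical path layout of `/tenant/<id>/agent/<name>`; the actual layout probectl uses is not stated here, and `TenantFromSPIFFE` is an illustrative name.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

var (
	ErrBadID       = errors.New("spiffe: malformed identity")
	ErrTrustDomain = errors.New("spiffe: trust domain does not match pin")
)

// TenantFromSPIFFE checks a SPIFFE ID against the pinned trust domain and
// returns the tenant it is bound to. The path layout is an assumption.
func TenantFromSPIFFE(rawID, pinnedDomain string) (string, error) {
	u, err := url.Parse(rawID)
	if err != nil || u.Scheme != "spiffe" || u.Host == "" {
		return "", ErrBadID
	}
	// Pin check first: an identity from another trust domain is rejected
	// no matter how well-formed the rest of the ID is.
	if u.Host != pinnedDomain {
		return "", ErrTrustDomain
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 4 || parts[0] != "tenant" || parts[2] != "agent" ||
		parts[1] == "" || parts[3] == "" {
		return "", ErrBadID
	}
	return parts[1], nil
}
```

In a TLS handshake, a check of this shape would typically run from `tls.Config.VerifyPeerCertificate` against the URI SAN in `x509.Certificate.URIs`, after normal chain verification.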

B3 — Ingest surfaces (bus, OTLP, webhooks)

Threat	Mitigation — evidence
On-path read/inject	Kafka requires TLS unless explicitly dev-flagged; plaintext is refused, both in code and at chart render (`internal/bus/security.go`; the agent chart fails closed)
Spoofed OTLP/webhook senders	the OTLP receiver is TLS-only, authenticates a bearer token to a tenant, rejects cross-tenant payloads, and treats the payload as untrusted with a bounded size (`internal/otel/otlp`, ../otlp.md); webhooks are HMAC-verified where the sender signs (`internal/change`) — a missing required signature fails closed
Malformed / poison input	fuzz smoke runs on the untrusted parsers every CI pass (`make fuzz-smoke`); malformed results are dropped without panicking; store-write failures are retried, then dead-lettered with the original bytes, so every failure is counted and none is silently lost
SSRF via probe targets	a canary/probe target guard blocks probes aimed at internal metadata endpoints and the like
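
The "missing required signature fails closed" behavior for webhooks can be illustrated with a small HMAC-SHA256 check. This is a sketch under assumptions: the hex encoding and the names `Sign` and `VerifyWebhook` are hypothetical, and in the real tree the primitives would sit behind `internal/crypto` rather than being imported directly.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
)

// ErrSignature covers missing, malformed, and mismatched signatures alike,
// so the caller cannot tell a sender which part failed.
var ErrSignature = errors.New("webhook: missing or invalid signature")

// Sign returns the hex HMAC-SHA256 of body. The encoding is an assumption.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// VerifyWebhook fails closed: an empty secret or absent signature is an
// error, never a pass.
func VerifyWebhook(secret, body []byte, sigHex string) error {
	if len(secret) == 0 || sigHex == "" {
		return ErrSignature
	}
	got, err := hex.DecodeString(sigHex)
	if err != nil {
		return ErrSignature
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	// hmac.Equal compares in constant time.
	if !hmac.Equal(got, mac.Sum(nil)) {
		return ErrSignature
	}
	return nil
}
```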

B4 — Control plane ↔ stores

Threat	Mitigation — evidence
On-path DB read	`sslmode=require` by default; ClickHouse/TSDB TLS via `https` URLs and a hardened client (`crypto.HardenedHTTPClient`)
Audit purge by DB owner	the provider audit chain is exported as Ed25519-signed WORM segments to object storage with continuous chain verification (`internal/audit/worm.go`; see ../hardening.md §0b)
Schema drift	sequential idempotent Postgres migrations + an expand/contract CI gate; ClickHouse migrations are versioned with a checksummed ledger — an edited shipped version is refused (`internal/store/chmigrate`)
Crypto misuse	all primitives sit behind `internal/crypto` (FIPS-swappable); a crypto-import CI guard (`scripts/check_crypto_imports.sh`) blocks direct primitive imports anywhere else
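
To make the audit-purge row concrete, here is a minimal sketch of a hash chain whose head is signed with Ed25519, which is the general shape of a signed, tamper-evident export. It is not the format used by `internal/audit/worm.go`; `Entry`, `Append`, `SignSegment`, `VerifySegment`, and `NewDemoKey` are all hypothetical, and real code would reach these primitives through `internal/crypto`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
)

// Entry is one audit record plus the hash linking it to its predecessor.
type Entry struct {
	Payload []byte
	Hash    [32]byte
}

// link computes SHA-256(prev || payload). prev has fixed length, so the
// concatenation is unambiguous.
func link(prev [32]byte, payload []byte) [32]byte {
	h := sha256.New()
	h.Write(prev[:])
	h.Write(payload)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Append adds a record chained to the current head.
func Append(chain []Entry, payload []byte) []Entry {
	var prev [32]byte // zero value anchors the first entry
	if n := len(chain); n > 0 {
		prev = chain[n-1].Hash
	}
	return append(chain, Entry{Payload: payload, Hash: link(prev, payload)})
}

// SignSegment signs the chain head; it returns nil for an empty chain.
func SignSegment(priv ed25519.PrivateKey, chain []Entry) []byte {
	if len(chain) == 0 {
		return nil
	}
	head := chain[len(chain)-1].Hash
	return ed25519.Sign(priv, head[:])
}

// VerifySegment recomputes every link, then checks the signature on the head.
func VerifySegment(pub ed25519.PublicKey, chain []Entry, sig []byte) bool {
	if len(chain) == 0 {
		return false
	}
	var prev [32]byte
	for _, e := range chain {
		if link(prev, e.Payload) != e.Hash {
			return false
		}
		prev = e.Hash
	}
	return ed25519.Verify(pub, prev[:], sig)
}

// NewDemoKey generates a throwaway key pair; nil selects crypto/rand.
func NewDemoKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil)
}
```

Because the signature covers the head hash, editing or dropping any record inside a signed segment invalidates it. That holds only while the verifier's public key and the exported segments live outside the database owner's reach.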

B5 — Provider/operator plane ↔ tenant data

Threat	Mitigation — evidence
Silent operator read of tenant telemetry	no implicit read access; break-glass is explicit, time-bounded, tenant-consented, and lands in a separate tamper-evident provider audit stream (proven by the `ee/provider` no-implicit-access test suite)
Operator-side credential abuse	auth fails closed by default; rate-limit + lockout + audit on auth; any `insecure_skip_verify` is admin-permission-gated and audited
Disgruntled-insider erasure	WORM export survives a DB purge (see B4); offboarding erasure is attested store-by-store
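
The three break-glass properties in the first row (explicit, time-bounded, tenant-consented) can be read as a single predicate. The sketch below is illustrative only: the `Grant` type and its fields are hypothetical and do not describe the `ee/provider` implementation.

```go
package main

// Grant is a hypothetical break-glass record. Its zero value allows nothing.
type Grant struct {
	Operator    string // who asked
	Tenant      string // whose data
	ConsentedBy string // tenant-side approver; empty means no consent
	ExpiresUnix int64  // hard expiry, Unix seconds
}

// Allows reports whether this grant covers the operator and tenant at the
// given time. Every condition must hold; there is no implicit default.
func (g Grant) Allows(operator, tenant string, nowUnix int64) bool {
	return g.ConsentedBy != "" &&
		operator != "" && g.Operator == operator &&
		tenant != "" && g.Tenant == tenant &&
		nowUnix < g.ExpiresUnix
}
```

In this sketch the absence of a grant and an expired grant are indistinguishable to the caller: both deny.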

B6 — AI / MCP

Threat	Mitigation — evidence
Tenant data exfiltration via the model	the built-in model is air-gapped by default; remote egress requires three gates — a boot-time operator acknowledgement env var, per-tenant default-deny consent, and a per-call audit event recording exactly which data categories left (../ai-egress.md)
PII leaving in prompts	a redaction pass runs before any remote prompt (`internal/ai/redact.go`): IPs and secrets masked, hostnames per policy
Prompt injection via telemetry	per-session random evidence IDs, structured delimiter framing with defanged escapes, and fail-closed citation grounding — a fully injected answer degrades to "insufficient evidence" rather than obeying the injection; the adversarial test suite includes a deliberately compromised model stand-in (`internal/ai/rca.go`)
MCP caller over-reach	tenant first, then RBAC, on every tool — enforced at the MCP layer and again at the stores
Model-as-actor	detection is a signal, never an IPS; remediation is observe-only / human-gated by default — both hard product guardrails
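
The prompt-injection row combines per-session random evidence IDs with fail-closed citation grounding. A minimal sketch of that combination follows, assuming a hypothetical `[ev-<16 hex>]` citation syntax; `Session`, `Issue`, and `Ground` are illustrative names and not the API of `internal/ai/rca.go`.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"regexp"
)

// Insufficient is what a caller sees when grounding fails.
const Insufficient = "insufficient evidence"

// Session tracks the evidence IDs handed to the model in one conversation.
type Session struct{ issued map[string]bool }

func NewSession() *Session { return &Session{issued: map[string]bool{}} }

// Issue mints an unpredictable ID, so text planted in telemetry ahead of
// time cannot cite evidence that will exist in a later session.
func (s *Session) Issue() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	id := "ev-" + hex.EncodeToString(b)
	s.issued[id] = true
	return id, nil
}

// citeRE matches the assumed citation syntax, e.g. [ev-0123456789abcdef].
var citeRE = regexp.MustCompile(`\[(ev-[0-9a-f]{16})\]`)

// Ground returns the answer only if it cites at least one ID and every
// cited ID was issued in this session. Anything else fails closed.
func (s *Session) Ground(answer string) string {
	cites := citeRE.FindAllStringSubmatch(answer, -1)
	if len(cites) == 0 {
		return Insufficient
	}
	for _, m := range cites {
		if !s.issued[m[1]] {
			return Insufficient
		}
	}
	return answer
}
```

The check is deliberately coarse. It does not judge whether the answer is right, only whether it is anchored to evidence the system itself supplied.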

B7 — Supply chain (build → release → deploy)

Threat	Mitigation — evidence
Malicious dependency / action	every workflow action is SHA-pinned and a CI gate enforces it (`scripts/check_action_pins.sh`); every PR runs the `dependency-scan` and `image-scan` CI jobs, and the weekly `security-scan` workflow re-runs govulncheck / npm audit / trivy on a schedule and archives the raw reports as evidence; base images are digest-pinned
Tampered release artifact	cosign keyless signing of binaries and images, with SPDX SBOMs (../ops/verify-artifacts.md); releases refuse to cut from a red CI run
Tampered eBPF object at run time	a SHA-256 manifest is baked in at build; loaders verify the embedded bytes before any kernel load and fail closed (`internal/ebpf/integrity.go`)
Unsigned bits reaching the fleet	rollout planning refuses any artifact without a recorded digest, verification method, and verifier
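
The eBPF integrity row describes verifying embedded bytes against a build-time SHA-256 manifest before any kernel load. A minimal sketch of that check follows. The `Manifest` shape, `VerifyObject`, `DigestHex`, and the object name used in the test are hypothetical, not the contents of `internal/ebpf/integrity.go`.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
)

// ErrIntegrity is returned for unknown objects and digest mismatches alike.
var ErrIntegrity = errors.New("ebpf: object missing from manifest or digest mismatch")

// Manifest maps an object name to its hex SHA-256, fixed at build time.
type Manifest map[string]string

// DigestHex computes the manifest form of an object's digest.
func DigestHex(obj []byte) string {
	sum := sha256.Sum256(obj)
	return hex.EncodeToString(sum[:])
}

// VerifyObject must succeed before the bytes are handed to a loader.
// An object absent from the manifest is rejected, not waved through.
func VerifyObject(m Manifest, name string, obj []byte) error {
	want, ok := m[name]
	if !ok {
		return ErrIntegrity
	}
	wantRaw, err := hex.DecodeString(want)
	if err != nil || len(wantRaw) != sha256.Size {
		return ErrIntegrity
	}
	got := sha256.Sum256(obj)
	if subtle.ConstantTimeCompare(got[:], wantRaw) != 1 {
		return ErrIntegrity
	}
	return nil
}
```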

B8 — Agent ↔ monitored host

Threat	Mitigation — evidence
Agent as an enforcement / attack tool	observe-only is CI-enforced: a static gate forbids enforcing eBPF program types (`internal/ebpf/observeonly_test.go`), and programs are additionally load-and-attach tested on real LTS kernels (the `ebpf-kernel-matrix` job). See agent-whitepaper.md §3
Privilege escalation from the agent	the minimal capability pair `CAP_BPF`+`CAP_PERFMON`, a default-deny seccomp profile, a read-only root, and a non-root systemd unit with ambient caps (`deploy/agent/`, `deploy/helm/probectl-agent`)
Sensitive payload capture	TLS-plaintext (L7) capture is off by default and requires explicit enable plus per-tenant consent naming the agent's tenant; bodies are zeroed at the redaction boundary by default (`internal/ebpf/l7policy.go`)
Host resource exhaustion	chart resource limits; ring-buffer drops are counted, never silent; overhead is benchmarked with a CI tripwire (../agent-overhead.md)
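
The sensitive-payload row stacks two conditions on L7 capture (an explicit enable and consent naming the agent's tenant) and zeroes bodies by default. The sketch below shows one way to encode that so the zero value is the safe state. `L7Policy` and its fields are hypothetical and do not describe `internal/ebpf/l7policy.go`.

```go
package main

// L7Policy is a hypothetical capture policy. Its zero value captures
// nothing and keeps no bodies.
type L7Policy struct {
	Enabled          bool            // explicit operator enable
	ConsentedTenants map[string]bool // tenants that consented to capture
	KeepBodies       bool            // opt-in; bodies are zeroed otherwise
}

// CaptureAllowed requires both the enable flag and consent that names the
// agent's own tenant. Reading a nil map yields false, so no consent denies.
func (p L7Policy) CaptureAllowed(agentTenant string) bool {
	return p.Enabled && agentTenant != "" && p.ConsentedTenants[agentTenant]
}

// Redact overwrites the body in place unless bodies are explicitly kept.
func (p L7Policy) Redact(body []byte) {
	if p.KeepBodies {
		return
	}
	for i := range body {
		body[i] = 0
	}
}
```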

5. Known gaps (the honest list — tracked, not hidden)

A threat model that claims no gaps is not honest. These are the open items as of this revision:

Gap	Status
Large / extra-large full-stack load numbers + SLO sign-off	the load harness and a CI smoke test have landed; the reference-hardware run is human-scheduled
Multi-region RTO/RPO at representative scale	the CI failover drill runs continuously; a representative-scale run and sign-off are pending
Reference-host agent overhead (live kernel ring buffer)	the userspace pipeline is measured; the on-host live row is pending
`LICENSE` is a placeholder pending counsel	a legal artifact owned outside the codebase

6. Review log

Version	Date	Change	Reviewer
1.0	2026-06-07	Initial model; mitigations cross-checked against code and CI at commit time	maintainer (solo); external review welcome via SECURITY.md