Tenant isolation at the storage layer
Tenant isolation is probectl's outermost security boundary — the first of the project's non-negotiables: one tenant must never read another tenant's data, full stop.
Most apps enforce that kind of rule in handler code — a WHERE tenant_id = ?
on every query. That works right up until the day someone forgets the WHERE,
an injection slips through, or a bug routes a query to the wrong place. probectl
treats that day as inevitable and pushes the boundary down into the database
itself, so the data store refuses to hand over another tenant's rows even when
the application code above it is wrong.
This page documents that database-layer defense store by store, and — because honesty is the whole point of a security doc — exactly what each store's privileged accounts can still do.
Scope note. This is the storage-layer threat model — the mechanism and its residual risk. For the three deployment shapes (pooled / siloed / hybrid) and how to choose between them, see ../isolation.md. The two docs are complementary; if anything here disagrees with the code, the code wins and that is a bug to file.
Postgres — durable state (tenants, agents, incidents, RBAC, audit, SLOs)
What: every tenant-owned table is invisible across tenants at the row level, enforced by Postgres itself, not by probectl.
How — Row-Level Security, forced. Each tenant-owned table carries a non-null
tenant_id and a tenant_isolation policy keyed to a per-transaction session
variable, the probectl.tenant_id GUC ("Grand Unified Configuration" — Postgres's
name for a runtime setting). tenancy.InTenant (in internal/tenancy/tenancy.go)
opens a transaction, runs SET LOCAL ROLE probectl_app, sets that variable to
the caller's tenant, and only then runs the repository code. The effect: even a
predicate-free SELECT * FROM incidents returns only the caller's rows, because
the policy silently ANDs tenant_id = current_setting('probectl.tenant_id')
onto every query.
Why it actually holds — the three things that make RLS more than a suggestion:
- The app role cannot bypass it.
probectl_appis createdNOSUPERUSER NOBYPASSRLS(migration0007_app_role.sql). RLS in Postgres is ignored by superusers and by any role with theBYPASSRLSattribute — so if probectl connected as one of those, every policy would be decorative. It does not. - Even the table owner is bound. Tenant tables use
FORCE ROW LEVEL SECURITY, not justENABLE. PlainENABLEexempts the table's owner from its own policies — a subtle gap a security audit named.FORCEcloses it. - A misconfigured deployment refuses to serve. Before the control plane
accepts traffic,
maincallstenancy.AssertIsolationPosture(internal/tenancy/posture.go). Inside a realprobectl_app-scoped transaction it checks that the effective role is non-super and non-bypass-RLS, and that every table with atenant_idcolumn hasFORCErow security. If any check fails — RLS silently off, migrations not applied, the wrong role — it is a fatal startup error. The control plane will not start. Thecross-tenant-isolationCI job proves this check passes on a correctly migrated database and rejects a deliberately unforced table, so the check is not a no-op.
The provider plane is a separate, weaker role. MSP operators run as
probectl_provider (also NOBYPASSRLS), granted only operational-metadata
tables (fleet, lifecycle) and never telemetry — see tenancy.InProvider. It
sets no tenant variable, so the per-tenant policies match nothing for it and only
the explicit provider grants apply. An operator literally cannot SELECT a
tenant's flows through this path.
ClickHouse — high-volume telemetry (flow, path, threat, change, cost)
ClickHouse holds the firehose: flows, L7 events, threat signals. It gets a layered defense because the pooled service account that ingests data is, by necessity, powerful.
Layer 1 — application scoping (always on). Every tenant-scoped query leads
with WHERE tenant_id = {tenant:String}, and an unscoped call is refused in code
with ErrNoTenant (internal/store/pathstore/clickhouse.go). This is the
primary boundary in a default deployment.
Layer 2 — values are server-bound, never string-concatenated. Every value in
a ClickHouse query travels as a bound parameter: a {name:Type} placeholder in
the SQL plus a matching HTTP param_name that the server substitutes
(ClickHouse's native parameterized-query mechanism — not client-side escaping).
A tenant id shaped like x' OR '1'='1 arrives as data, never as syntax, so it
cannot break out of the WHERE clause. The old hand-rolled escaping helpers were
deleted; the few things SQL cannot bind — table names, ClickHouse user names —
are validated against a strict regex and refused on mismatch. A CI gate
(scripts/check_stringbuilt_sql.sh, the no-stringbuilt-sql check) fails the
build if string-built ClickHouse SQL ever reappears.
Layer 3 — a DB-level row policy for direct access. EnsureRowPolicies
installs a probectl_tenant_isolation policy (USING tenant_id = currentUser())
so that an operator connecting with a per-tenant ClickHouse credential (the
siloed / direct-access convention; see ../isolation.md) is
constrained by ClickHouse itself and cannot cross tenants — independent of
probectl's code.
The honest gap (and its opt-in fix). probectl's own pooled deployment holds
one service credential, and that account is deliberately policy-exempt
(probectl_service_access USING 1) because it must insert, migrate, and run
admin counts across all tenants. So if Layer 1's WHERE scoping were ever
bypassed — a code bug, an injection that defeated Layer 2 — that single account
could read across tenants. Stated plainly so it is never a surprise.
The backstop, opt-in via PROBECTL_FLOWSTORE_TENANT_SCOPING=true, removes that
residual reach from the read path entirely:
- Every tenant-scoped read attaches a per-request custom setting,
SQL_probectl_tenant=<tenant>. (Admin / cross-tenant reads — migrations, global counts — pass no setting, by design.) - A dedicated reader user (
PROBECTL_FLOWSTORE_READER_USER, e.g.probectl_reader) getsEnsureReaderRowPolicy: a policyprobectl_reader_scope ... FOR SELECT USING tenant_id = getSetting('SQL_probectl_tenant'), with no permissive escape. Because the reader user's setting defaults to''server-side, an unset or dropped setting matches no rows — fail closed. - Production routes tenant data reads through the reader user; the
write/service user keeps full access for inserts and migrations only. A
compromised query path that omits the
WHEREstill cannot cross tenants — the reader policy constrains it at the database.
Operator prerequisites (documented, not auto-configured): allow the custom
setting prefix (<custom_settings_prefixes>SQL_</custom_settings_prefixes>),
create the reader user with a default SQL_probectl_tenant = '', and grant it
SELECT only. Until you enable this, Layer 1 is the boundary and the service
account remains read-capable across tenants.
What the service/write account can still do (residual, by design): insert into any tenant's partition, run cross-tenant counts, and — without the reader split — read across tenants. This is required for ingest, migrations, and retention. It is bounded by three things: the application never issuing an unscoped read, the secret-management of that one credential, and — when enabled — the reader split taking it off the read path entirely.
At-rest encryption of sensitive tenant values
Sensitive columns (alert-channel secrets, integration tokens, ...) are sealed
through internal/tenantcrypto — either the deployment-wide envelope key, or, on
the licensed tier, a per-tenant keyring / BYOK (see ../byok.md).
Keyless development runs pass the value through in the clear, but the stored
format is self-describing, so a sealed value can never be silently misread as
plaintext.
Fail closed in production. Setting PROBECTL_REQUIRE_AT_REST_ENCRYPTION=true
makes keyless passthrough a fatal startup error: the control plane refuses to
run without a resolvable key rather than silently writing plaintext. The shipped
hardened/regulated recipes set it. It is off by default only so keyless local dev
still boots. (See ../hardening.md §0c.)
How this is verified
These are the test suites that hold the claims above honest; most run on every CI pass against real datastores.
internal/tenancy(-tags isolation): the RLS posture check passes on a migrated DB and rejects an unforced table (non-vacuous); a cross-tenant query returns nothing. This is the standingcross-tenant-isolationCI gate.internal/store/flowstore: the read-path setting attach, the reader-policy DDL shape (no permissive escape), the empty-reader rejection; and (-tags isolation) a non-service ClickHouse reader issuing a predicate-free read sees only its own tenant's rows — proving the row policy, not the appWHERE, is what scopes it.internal/pipeline(-tags isolation): end-to-end ingest injection — a payload claiming another tenant is rejected and never lands in the victim's partition, against real ClickHouse and the RLS-scoped registry on real Postgres. Plus (no infra) siloed records route only to namespaced topics, with fail-closed construction.internal/otel/otlp(-tags isolation): an OTLP push authenticated as one tenant but naming another is rejected; the sink only ever sees the authenticated tenant.internal/config: the at-rest-required and ClickHouse-scoping knobs parse and default off.