Tenant isolation models
"Isolation model" answers a simple question: how physically separate is one
tenant's data from another's? probectl offers three answers, and you can pick
per deployment and per tenant — a single install can run most tenants pooled
and a few high-compliance ones siloed. This page covers the models; the
storage-layer enforcement mechanics they all share (forced RLS, partition keys,
the cross-tenant test suite) live in
security/tenant-isolation.md.
The mental model is a spectrum:
- Pooled: everyone shares the same tables/databases/topics, told apart by tenant_id. Cheapest, densest, and the default. Stays core (free).
- Siloed: each tenant gets its own Postgres schema, ClickHouse database, bus topics, and object-store namespace. The strongest separation.
- Hybrid: pooled control plane (the cheap, shared part) but a per-tenant ClickHouse database for the high-volume flow telemetry. A middle ground.
Siloed and hybrid are ee/ features, unlocked by the siloed_isolation license
feature (an MSP-tier capability); pooled always works regardless of license.
| Model | Postgres (control/config state) | ClickHouse (flows) | Bus topics | Object store |
|---|---|---|---|---|
| pooled (default) | shared tables, RLS keyed on the per-transaction tenant setting | shared table, tenant_id partition key | shared topics, tenant-keyed messages | shared backend, tenant/<id>/ key prefix |
| hybrid | pooled (shared control plane, by design) | per-tenant database, optionally on a residency data plane | per-tenant namespaced topics | per-tenant silo/<id>/ key namespace |
| siloed | per-tenant schema (tenant-owned tables copied in; RLS recreated inside) | per-tenant database (+ data plane) | per-tenant namespaced topics | per-tenant key namespace |
The key idea: physical separation is layered on top of the pooled scoping,
never instead of it. A siloed schema still re-creates the
tenant_isolation RLS policies, every transaction still binds the tenant setting,
bus messages stay tenant-keyed, and every read is still tenant-scoped at the query
layer. So even if routing sent a query to the wrong silo, the query would return
nothing rather than another tenant's rows — the defenses stack, they don't
replace each other.
Fail closed on routing — everywhere, including the bus. The isolation router
(ee/silo.Router, installed at the editions attach seam) resolves each tenant's
storage targets from the tenant registry. A routing error fails the
operation — a siloed tenant is never silently downgraded to the pooled stores,
and a pooled query can never reach a siloed tenant's stores.
(tenancy.InTenant resolves targets before opening the transaction; the flow
store splits a batch per target and fails the whole batch on a routing error.)
Bus lanes are no exception: if a siloed tenant's lane cannot be resolved when a
result arrives, the control plane drops that result with a loud error rather
than publishing it onto the shared topic
(internal/agenttransport/service.go) — a siloed tenant's telemetry must never
silently ride the shared lane. Availability comes from the agent, not from a
fallback: the agent's store-and-forward buffer retries delivery, so a transient
routing blip delays the data instead of mis-routing it.
How each leg works
- Postgres (siloed): each tenant gets a schema named t_<uuid> (the tenant UUID lowercased, dashes stripped; see silo.SchemaName) containing every tenant-owned table. The table set is derived live from information_schema: any public table with a tenant_id column, minus a provider-owned deny list. The silo therefore automatically tracks whatever tables the schema actually has. Each table is created LIKE public.<t> INCLUDING ALL, with the RLS policy recreated and the app-role grants applied. tenancy.InTenant routes a siloed tenant by running SET LOCAL search_path TO <schema>, public, so global tables (permissions, tenants) still resolve in public.
- ClickHouse: a per-tenant database probectl_t_<uuid> holding the same flow table; inserts are split per target and reads route by the query's tenant. With a residency pin, both run against that data plane's ClickHouse URL.
- Bus: topics gain a namespace segment, e.g. probectl.t-<slug>.network.results. The control plane publishes a siloed tenant's results/RUM onto its own lane and subscribes to every siloed lane known at startup. (A tenant siloed after boot is picked up from its lane after the next restart; the shared lanes stay subscribed throughout, so nothing is dropped.)
- Object store: keys move under silo/<tenant-id>/… (the pooled layout is tenant/<id>/…). Note the honesty caveat below: in this release that is a key namespace on the same backend, not a separate storage system.
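The naming rules in the list above are mechanical enough to sketch in Go. Only the schema rule (lowercase the UUID, strip dashes, prefix with t_) is spelled out in the text; this sketch assumes the ClickHouse database name applies the same compaction, and the helper names other than SchemaName (compact, FlowDatabase, ResultsTopic, ObjectPrefix, SearchPathSQL) are hypothetical. The sample UUID and slug are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// compact lowercases a tenant UUID and strips its dashes.
func compact(tenantUUID string) string {
	return strings.ReplaceAll(strings.ToLower(tenantUUID), "-", "")
}

// SchemaName follows the rule described for silo.SchemaName: "t_" + compacted UUID.
func SchemaName(tenantUUID string) string { return "t_" + compact(tenantUUID) }

// FlowDatabase assumes the ClickHouse database uses the same compacted UUID.
func FlowDatabase(tenantUUID string) string { return "probectl_t_" + compact(tenantUUID) }

// ResultsTopic inserts the per-tenant namespace segment into the topic name.
func ResultsTopic(slug string) string { return "probectl.t-" + slug + ".network.results" }

// ObjectPrefix returns the key prefix for the pooled or siloed layout.
func ObjectPrefix(tenantID string, siloed bool) string {
	if siloed {
		return "silo/" + tenantID + "/"
	}
	return "tenant/" + tenantID + "/"
}

// SearchPathSQL builds the per-transaction statement that puts the silo
// schema first while keeping public reachable for global tables.
func SearchPathSQL(schema string) string {
	return "SET LOCAL search_path TO " + schema + ", public"
}

func main() {
	id := "3F2504E0-4F89-11D3-9A0C-0305E82C3301"
	fmt.Println(SchemaName(id))
	fmt.Println(FlowDatabase(id))
	fmt.Println(ResultsTopic("acme"))
	fmt.Println(ObjectPrefix(id, true))
	fmt.Println(SearchPathSQL(SchemaName(id)))
}
```

Because the compacted UUID contains only hex digits, the resulting schema and database names need no quoting, which keeps the generated DDL simple.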
Residency: exactly what is and is not pinned
This section is deliberately precise because a residency claim you cannot back
up is a compliance liability. PROBECTL_DATAPLANES names the available planes
(e.g. eu=https://ch-eu:8123;us=https://ch-us:8123); a siloed or hybrid tenant
provisioned with residency: eu gets its ClickHouse database created on and
routed to that plane.
Pinned today: the tenant's ClickHouse flow data — the high-volume telemetry store, which is what residency rules usually care about most.
Not pinned today (and you should not claim otherwise): the Postgres
control/config state (it is a shared control plane), the metrics TSDB (metrics
stay tenant-labeled in the deployment's TSDB), the object store
(namespace-isolated on a single backend), and the bus brokers. Multi-region
control-plane and HA mechanics are covered separately in
multi-region.md; per-tenant encryption keys are a separate
capability (byok.md).
ClickHouse database-level tenancy
Code-level scoping is the first line of defense — every flow/path query pins a
tenant_id, and an unscoped call refuses with ErrNoTenant. But that only
protects access through probectl. To also protect a tenant credential used
directly against ClickHouse, EnsureRowPolicies installs database-level row
policies: per shared table, a policy filtering every user to tenant_id = currentUser() (by convention each per-tenant ClickHouse user is named exactly the
tenant id), plus a permissive policy for probectl's own service account. So even
someone holding a tenant's raw ClickHouse credentials cannot read another tenant's
rows — the guarantee holds independently of this codebase. The cross-tenant CI
gate exercises flowstore + pathstore against a real ClickHouse to prove it.
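To make the row-policy idea concrete, the following Go sketch generates the kind of ClickHouse DDL the paragraph describes. It is not the output of EnsureRowPolicies: the policy names (tenant_isolation, service_all), the RowPolicyDDL helper, and the sample database, table, and user names are hypothetical, and the sketch assumes tenant_id is a String column so it can be compared with currentUser() directly. It relies on ClickHouse combining permissive policies with OR, so the service account's always-true policy grants it every row while tenant users stay limited to their own.

```go
package main

import (
	"fmt"
	"regexp"
)

// ident is a conservative identifier check so names can be concatenated
// into DDL without quoting.
var ident = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// RowPolicyDDL returns two CREATE ROW POLICY statements for one shared table:
// one that limits every user to rows whose tenant_id equals their user name,
// and one that lets the service account read everything.
func RowPolicyDDL(database, table, serviceUser string) ([]string, error) {
	for _, id := range []string{database, table, serviceUser} {
		if !ident.MatchString(id) {
			return nil, fmt.Errorf("refusing unsafe identifier %q", id)
		}
	}
	target := database + "." + table
	return []string{
		"CREATE ROW POLICY IF NOT EXISTS tenant_isolation ON " + target +
			" FOR SELECT USING tenant_id = currentUser() AS PERMISSIVE TO ALL",
		"CREATE ROW POLICY IF NOT EXISTS service_all ON " + target +
			" FOR SELECT USING 1 AS PERMISSIVE TO " + serviceUser,
	}, nil
}

func main() {
	stmts, err := RowPolicyDDL("probectl", "flows", "probectl_svc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range stmts {
		fmt.Println(s)
	}
}
```

IF NOT EXISTS keeps the statements safe to re-run, which suits an "ensure" style helper that executes at startup.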
Lifecycle
Provisioning a tenant (POST /provider/v1/tenants with isolation_model +
residency, or the console's Isolation selector) creates the isolated stores
before the call returns — a siloed tenant never exists without its silo. A
provisioning failure is loud, and because the DDL is idempotent the call is simply
re-runnable. Offboarding tears the isolated stores down (DROP SCHEMA … CASCADE, DROP DATABASE) — they are per-tenant containers, safe to drop. Pooled
rows (a hybrid tenant's shared control state) are left untouched; their export and
verifiable deletion is the separate compliance flow. Teardown is idempotent too: a
partial failure is fixed by calling offboard again.
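As an illustration of what offboarding removes per model, here is a small Go sketch that plans the teardown statements. The Teardown type and TeardownPlan function are hypothetical, and the use of IF EXISTS is this example's assumption about how idempotency could be achieved; the text above only states that teardown is idempotent.

```go
package main

import "fmt"

// Teardown groups the statements to run against each store.
type Teardown struct {
	Postgres   []string
	ClickHouse []string
}

// TeardownPlan returns the statements that remove a tenant's isolated stores.
// Pooled state is never touched here: a pooled tenant has nothing to drop, and
// a hybrid tenant only loses its per-tenant ClickHouse database.
func TeardownPlan(model, schema, flowDB string) (Teardown, error) {
	switch model {
	case "pooled":
		return Teardown{}, nil
	case "hybrid":
		return Teardown{
			ClickHouse: []string{"DROP DATABASE IF EXISTS " + flowDB},
		}, nil
	case "siloed":
		return Teardown{
			Postgres:   []string{"DROP SCHEMA IF EXISTS " + schema + " CASCADE"},
			ClickHouse: []string{"DROP DATABASE IF EXISTS " + flowDB},
		}, nil
	default:
		return Teardown{}, fmt.Errorf("unknown isolation model %q", model)
	}
}

func main() {
	plan, err := TeardownPlan("siloed", "t_abc123", "probectl_t_abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(plan.Postgres)
	fmt.Println(plan.ClickHouse)
}
```

Because every statement tolerates an already-missing object, running the same plan again after a partial failure simply finishes the job.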
Migrations across silos (the operational cost of siloed)
This is the price you pay for siloing: migrations are written against public, so
every silo schema has to be brought up to the current shape separately. probectl
does this by catch-up — it re-derives the tenant-owned table set, creates any
missing tables (the same LIKE + RLS recipe), and adds any missing columns (an
information_schema diff). This works because the migration gate only admits
expand-only changes (see lifecycle.md): create-missing +
add-missing-columns covers every migration the gate allows. The rarer
destructive "contract" phases (drops/renames) are run by the operator across
silos. Catch-up runs automatically at startup for every siloed tenant and is
idempotent, and per-tenant drift is computable (DriftFor) so the lag is always
visible, never silent. The window between a freshly-deployed replica writing a
new public table and an old silo catching up is bounded by the deploy itself —
roll the control plane, then let catch-up converge.
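The catch-up step described above amounts to a one-directional diff, which the following Go sketch illustrates. It is a simplified model, not probectl's implementation: Shape, CatchUp, and sortedKeys are hypothetical names, the table and column names in main are invented, and the sketch omits the RLS policy and grant statements that the real recipe applies to newly created tables. An empty result means the silo has no drift.

```go
package main

import (
	"fmt"
	"sort"
)

// Shape maps table name -> column name -> SQL type, as one might read it
// from information_schema.
type Shape = map[string]map[string]string

// sortedKeys returns a map's keys in sorted order so the DDL is deterministic.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// CatchUp returns the DDL needed to bring a silo schema up to the shape of
// public. It is expand-only: it creates missing tables and adds missing
// columns, and never drops or renames anything.
func CatchUp(schema string, public, silo Shape) []string {
	var ddl []string
	for _, table := range sortedKeys(public) {
		siloCols, exists := silo[table]
		if !exists {
			ddl = append(ddl, fmt.Sprintf(
				"CREATE TABLE IF NOT EXISTS %s.%s (LIKE public.%s INCLUDING ALL)",
				schema, table, table))
			continue // LIKE copies every column, so no column diff is needed
		}
		for _, col := range sortedKeys(public[table]) {
			if _, ok := siloCols[col]; !ok {
				ddl = append(ddl, fmt.Sprintf(
					"ALTER TABLE %s.%s ADD COLUMN IF NOT EXISTS %s %s",
					schema, table, col, public[table][col]))
			}
		}
	}
	return ddl
}

func main() {
	public := Shape{
		"alerts": {"id": "uuid", "tenant_id": "uuid"},
		"probes": {"id": "uuid", "tenant_id": "uuid", "label": "text"},
	}
	silo := Shape{
		"probes": {"id": "uuid", "tenant_id": "uuid"},
	}
	for _, stmt := range CatchUp("t_abc", public, silo) {
		fmt.Println(stmt)
	}
}
```

Tables or columns that exist only in the silo are ignored by design, which matches the split described above: destructive contract phases are left to the operator.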
Configuration
| Variable | Default | Purpose |
|---|---|---|
| PROBECTL_DATAPLANES | (none) | named residency planes: name=clickhouseURL[;…] |
Everything else rides existing keys (PROBECTL_FLOWSTORE_*, and
PROBECTL_FLOW_RETENTION_DAYS applies to per-tenant databases too). Siloed and
hybrid provisioning requires a license granting siloed_isolation; without it
those models are refused (pooled always works). See configuration.md.
Tests
Unit tests cover the planner/catch-up/teardown DDL recipes, naming, drift
diffing, router fail-closed semantics, flow-store routing (per-target inserts,
pinned planes, malformed-name refusal), topic naming, and the provider-API
lifecycle (license gating, residency validation, teardown-on-offboard, and
pooled↔siloed handler parity). The headline integration test (live Postgres) is
TestSiloedPhysicalSeparation, which asserts: schema creation; physical
separation (a siloed tenant's rows exist only in its schema — zero in public,
and vice versa); pooled↔siloed parity of the same tenant-scoped operation;
in-silo RLS defense-in-depth; router correctness; catch-up after a simulated
later migration; and teardown (gone, idempotent, pooled data untouched).