Per-tenant metering, usage & billing export
What this is
When an MSP self-hosts probectl and serves many tenants, it needs to answer "how much did each tenant use this month?" in order to bill them. This is the plane that produces those numbers: per-tenant usage counters and snapshots, a usage/showback API, per-tenant creation quotas, and a billing-export feed that the MSP loads into its existing professional-services-automation (PSA) or billing system.
It is a commercial (Provider/MSP tier) feature. The implementation lives in
ee/billing and is unlocked by the metering license feature; the core
platform ships only an inert seam (internal/usage). A community or unlicensed
deployment therefore meters nothing — the seam is a no-op that records
nothing and allows everything — and the provider-console usage surfaces stay
hidden. (For why the line is drawn here, see
docs/editions.md.)
probectl deliberately does not build an invoicing engine — it exports. The first export target is generic CSV + JSON Lines: vendor-neutral, because every PSA imports CSV. Vendor-shaped connectors (ConnectWise, Autotask, Stripe) are follow-ups, to be built once a design partner names the one they need.
The meters
There are two kinds of meter, and the distinction drives everything downstream: a counter only ever goes up (you sum it over a period), and a gauge is a point-in-time level (you take the peak over a period).
| Meter | Kind | Unit | Source |
|---|---|---|---|
| agents | gauge | count | periodic snapshot, counted inside the tenant's own scope |
| tests | gauge | count | periodic snapshot, counted inside the tenant's own scope |
| results_ingested | counter | count | the result pipeline, as results flow |
| ingest_bytes | counter | bytes | result payload bytes, same stream |
| flow_events | counter | count | flow batches landing in the flow store |
| ai_calls | counter | count | AI assistant questions |
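The counter/gauge distinction can be made concrete with a small aggregation sketch. This is illustrative only, not the ee/billing implementation: the `METER_KINDS` mapping mirrors the table above, while `roll_up` and its input shape (a list of `(meter, value)` samples for one tenant and one period) are assumptions made for the example.

```python
# Meter kinds as listed in the table above.
METER_KINDS = {
    "agents": "gauge",
    "tests": "gauge",
    "results_ingested": "counter",
    "ingest_bytes": "counter",
    "flow_events": "counter",
    "ai_calls": "counter",
}


def roll_up(samples):
    """Aggregate (meter, value) samples for one tenant over one period.

    Hypothetical helper: counters are summed, gauges keep their peak.
    """
    totals = {}
    for meter, value in samples:
        kind = METER_KINDS[meter]
        if meter not in totals:
            totals[meter] = value
        elif kind == "counter":
            totals[meter] += value  # counters only go up: sum the deltas
        else:
            totals[meter] = max(totals[meter], value)  # gauges: take the peak
    return totals
```

Given three agent snapshots of 3, 7, and 5, the rolled-up value is 7; given two ingest_bytes deltas of 100 and 250, it is 350.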
The counters are derived from the tenant-tagged streams that are already
flowing — the core call sites call the internal/usage seam as results, flow
batches, and AI questions pass through. There is no parallel metering pipeline.
Counters are bucketed hourly at the moment of recording (so an hour boundary
is exact regardless of when the buffer flushes), buffered in memory, and flushed
to Postgres every minute. If a flush fails, the buffered deltas are merged
back and retried on the next tick. This is what makes the counters safe to
bill from: counts can be delayed, but a failed flush neither loses nor
double-counts them, because each flush is a single transaction.
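The record-then-flush behaviour described above can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the actual code: `UsageBuffer`, `hour_bucket`, and the `write_batch` callback (standing in for "one Postgres transaction") are hypothetical names.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


class UsageBuffer:
    """In-memory counter deltas keyed by (tenant_id, meter, hour). Hypothetical."""

    def __init__(self):
        self._deltas = Counter()

    def record(self, tenant_id, meter, amount, ts):
        # The bucket is chosen at record time, so a late flush cannot move a
        # count into the wrong hour.
        self._deltas[(tenant_id, meter, hour_bucket(ts))] += amount

    def flush(self, write_batch):
        """Hand all pending deltas to write_batch in one call.

        write_batch is assumed to apply the whole batch atomically and to
        raise on failure. On failure the deltas are merged back so the next
        tick retries them. Returns True when nothing is left unflushed.
        """
        pending, self._deltas = self._deltas, Counter()
        if not pending:
            return True
        try:
            write_batch(dict(pending))
        except Exception:
            # Merge back; anything recorded in the meantime is kept as well.
            self._deltas.update(pending)
            return False
        return True
```

Because the batch is either applied as a whole or merged back as a whole, a failed flush leaves every delta pending exactly once.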
Why the gauges are exact
The gauges (agents, tests) are the source-of-truth counts, and they are
collected carefully. A snapshot collector lists the tenants, then counts each
tenant's resources by running count(*) inside that tenant's own scope
(tenancy.InTenant: row-level-security-bound for pooled tenants, schema-routed
for siloed tenants). There is no cross-tenant read path at all, so a siloed
tenant's resources are counted exactly once, in its own schema, and pooled and
siloed tenants cannot double-count each other by construction.
Usage API + export feed
These are provider-plane routes (operator session). When the metering feature
is not licensed they are hidden — a request gets a 404, not a 403, so the
feature's existence isn't even advertised.
- GET /provider/v1/usage?from&to&tenant_id&rollup=hour|day: usage records. Defaults are month-to-date with day rollup. Counters sum across the period; gauges take the peak (the fair capacity snapshot: you bill for the most agents a tenant ran, not their average).
- GET /provider/v1/usage/export?format=csv|jsonl&…: the billing feed (csv is the default). The column set is a stable contract, and only additive changes are allowed, so an importer never breaks:
tenant_id,tenant_slug,meter,kind,period_start,period_end,value,unit
Timestamps are RFC 3339 in UTC. JSON Lines carries the same field names, one object per line.
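An importer that respects the additive-only contract might look like the sketch below. It is a minimal example, not a supported client: it assumes the CSV carries a header row with the column names above and that `value` is an integer, and the function names are hypothetical.

```python
import csv
import io
from datetime import datetime

# The stable column set from the export contract above.
EXPECTED_COLUMNS = [
    "tenant_id", "tenant_slug", "meter", "kind",
    "period_start", "period_end", "value", "unit",
]


def parse_ts(text):
    """Parse an RFC 3339 UTC timestamp; accepts a trailing 'Z'."""
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def parse_usage_csv(text):
    """Parse an export body into a list of dicts (hypothetical helper).

    Unknown extra columns are ignored, so additive changes to the feed do
    not break the importer. A missing contract column raises ValueError.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = []
    for raw in reader:
        row = {c: raw[c] for c in EXPECTED_COLUMNS}
        row["value"] = int(row["value"])  # assumption: values are integers
        row["period_start"] = parse_ts(row["period_start"])
        row["period_end"] = parse_ts(row["period_end"])
        rows.append(row)
    return rows
```

Selecting columns by name rather than by position is what lets the importer survive new columns being appended later.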
Records persist in the usage_records table (migration 0026_metering.sql).
This is provider-plane billing data about tenants: it is written and read by
the probectl_provider database role through an explicit row-level-security
policy, but it still carries the standard per-tenant policy too, so a tenant can
read its own usage through tenant-scoped paths. It is never copied into silo
schemas — billing stays pooled by design.
Quotas
The tenant_quotas table (one row per tenant) holds max_agents and
max_tests; null means unlimited. Quotas are managed by an admin operator via
GET/PUT /provider/v1/tenants/{id}/quotas or the console's Usage card.
Restricting this to admins keeps separation of duties, each change is audited
as provider.quota_set, and quota writes are blocked while the license has
lapsed into read-only degrade.
What a quota does — and deliberately does not do:
- It gates control-plane resource creation only: creating a test (denied with 403 quota_exceeded) and registering a new agent (denied with the gRPC ResourceExhausted status). An existing agent re-registering is never rejected: a running fleet must not break on a restart.
- Telemetry is never quota-dropped. Observability must not silently lose data; throttling pooled ingest is the fairness layer's job, not the quota layer's (see docs/fairness.md).
- Enforcement counts live state inside the tenant's own scope (exact, not cached); quota lookups cache for 30 seconds and invalidate immediately on update.
- An infrastructure failure (a database blip) degrades open: the create is allowed. A quota is a billing control, not a security boundary, and the metering trail still records what actually happened.
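The rules above reduce to a small decision function. The sketch below is an illustration under stated assumptions, not the enforcement code: `allow_create` and its parameters are hypothetical, the boundary (deny once the live count has reached the limit) is an assumption the text does not spell out, and a real implementation would catch only infrastructure errors rather than every exception.

```python
def allow_create(limit, count_live, is_reregistration=False):
    """Decide whether a create may proceed (hypothetical helper).

    limit: max_agents or max_tests for the tenant; None means unlimited.
    count_live: callable returning the live count in the tenant's own
        scope; it may raise on an infrastructure failure.
    is_reregistration: True when an existing agent is re-registering.
    """
    if is_reregistration:
        return True  # a running fleet must not break on a restart
    if limit is None:
        return True  # null quota: unlimited
    try:
        current = count_live()
    except Exception:
        # Degrade open: a quota is a billing control, not a security boundary.
        return True
    # Assumption: creating one more is allowed only while below the limit.
    return current < limit
```

A denied test creation would then surface as 403 quota_exceeded, and a denied agent registration as the gRPC ResourceExhausted status.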
Console
The provider console's Usage & showback card shows month-to-date per-tenant
meters, offers one-click CSV/JSONL export, and, for admins, includes the
per-tenant quota editor. It is hidden entirely when the metering feature is not
licensed.
Configuration
There are no configuration keys. The flush cadence (1 minute) and snapshot
cadence (15 minutes) are fixed; the feature activates when the license grants
metering (Provider/MSP tier). Quotas and usage live in Postgres alongside the
tenant registry.