probectl performance baseline

What this is

This is probectl's checked-in performance baseline — a cheap, repeatable early-warning system that runs in CI, not a full scale validation. Think of it as a smoke detector, not a fire drill: it exists to catch a regression now, the moment it lands, rather than discovering it later at the full scale gate.

The harness lives in internal/perf. It is the reusable engine that the full load gate and the per-tenant fairness work both build on.

The guiding idea is a smoke, not a soak: the thresholds below have generous headroom on purpose. The guard is meant to trip on an order-of-magnitude regression — a pooled-tenancy cardinality blow-up, or a Postgres row-level- security cost problem — and to stay quiet through ordinary CI timing jitter. If it goes red, something is genuinely, badly slower.

What the harness drives

Two drivers exercise the core path agents → bus → stores → query:

Ingest baseline (DriveIngest, no services) — publishes synthetic probe results through the lightweight path (bus → pipeline consumer → in-memory TSDB) at a defined scale, and measures end-to-end throughput and publish latency. It also asserts correctness: every result is ingested, and each tenant's series land under its own tenant_id (no cross-tenant label mixing).
Pooled multi-tenant smoke (DrivePooled, against Postgres) — K tenants share the pooled stores; every tenant-scoped query runs concurrently (mixed-tenant load) and must return exactly its own rows. This is the isolation-under-load assertion (a cross-tenant leak would inflate a count; a scoping bug would deflate it), and it is the first place a row-level-security cost problem shows up.

Scenarios (CI smoke sizes)

Scenario	Shape	Volume
Ingest baseline	4 tenants × 5 agents × 5 tests × 20 results	2,000 results → 6,000 series
Pooled multi-tenant	20 tenants × 100 rows, 10 query-reps, 16 concurrent	2,000 rows, 200 tenant-scoped queries

(Each result produces three series — probe success, probe duration, and one custom metric — which is why 2,000 results become 6,000 series.) These sizes are intentionally small (sub-second to a few seconds) so the job stays a CI smoke. The full L/XL multi-region sizes and the formal noisy-neighbor gate live in the full-stack load gate — see scale-gate.md.

Regression-guard thresholds

The guard is perf.M6Baseline() in code, asserted by the smoke tests. Update the code and this table together when the numbers move materially.

Metric	Threshold	Rationale
Ingest throughput	≥ 3,000 results/sec (floor)	the lightweight path; catches a ~10–50× regression
Pooled query p95	≤ 250 ms (ceiling)	a tenant-scoped list under mixed load; catches a row-level-security cost blow-up
Pooled isolation	0 mismatches (hard)	any wrong row count is a correctness failure, never tolerated

Illustrative numbers

The figures below are illustrative, not guarantees — they come from one development run on Apple-class arm64, in-process, and they exist only to show the shape of the headroom behind the thresholds. They are not asserted by any test (only the floors and ceilings above are), and they will differ on your hardware. CI runners are slower, which is exactly why the thresholds carry so much room. Refresh this section from a green perf-smoke CI run if the baseline shifts.

Metric	Observed (dev run)	Threshold	Environment
Ingest throughput	~188,000 results/sec (no race); ~50,000 with `-race`	≥ 3,000	dev arm64, memory bus + memory TSDB
Ingest publish p95	~11 µs	—	dev arm64
Pooled query p50 / p95 / p99	~2.4 ms / ~11 ms / ~16 ms	p95 ≤ 250 ms	dev arm64, Postgres, 20 tenants × 100 rows, 16 concurrent
Pooled query throughput	~4,600 queries/sec	—	dev arm64, Postgres
Pooled isolation	0 mismatches (200/200 queries correct)	0	dev arm64, Postgres

These are the lightweight single-consumer path. The Kafka / multi-consumer profile and real-TSDB remote-write numbers belong to the full-stack load gate, not here.

Running it

make perf-smoke          # ingest baseline (no services) + pooled (uses PROBECTL_DATABASE_URL)

The ingest baseline needs no services. The pooled smoke uses PROBECTL_DATABASE_URL and skips when no database is reachable, so the target is safe to run locally without the dev stack. In CI the perf-smoke job runs both against a TLS Postgres started for the run (verify-full, like production). The pooled run also exercises the cross-tenant isolation property under concurrency, complementing the dedicated cross-tenant-isolation gate.

Scope / placement notes

The harness is a main-module library (internal/perf), not the black-box test/ module — matching how every other integration test in this repo is structured (in-module, integration-tagged, reaching Postgres via PROBECTL_DATABASE_URL). The test/ module stays reserved for the full-stack soak.
Resource-use capture (control-plane + store CPU/memory under sustained load) is deferred to that full-stack soak, where it is meaningful against real services. This baseline focuses on the load-bearing early-warning signals — throughput, query latency, and isolation correctness.