ADR: rebuild-on-restart for the in-process derived stores
- Status: Accepted — 2026-06-07. Addendum: the one durability gap this ADR tracked (alert silences/acks, the exception below) has since been closed — they now persist in a tenant-scoped Postgres table and are restored on boot. The rebuild-on-restart decision for the three derived stores is unchanged.
- Context: a review noted that topology, threat detections, and alert firing state live in process memory and are lost when the control plane restarts. This ADR decides whether that is a bug to fix or a property to accept.
The plain version
Some of probectl's state lives only in RAM inside the control plane. When the process restarts, that RAM is gone. The question is: do we add a database to persist it, or is losing it actually fine?
The answer for three of these stores is: it's fine, because they are derived — they are a cache of something durable, not the original copy. Think of them like a search index over your email: if the index is wiped, you don't lose any email; the index just rebuilds itself by re-reading the mailbox. Same here. So we formally adopt rebuild-on-restart for them rather than adding persistence — and we prove the cold-start behavior with tests. One genuinely non-derivable piece of state (alert silences/acks) is the exception: it was called out as a tracked gap rather than silently dropped, and has since been fixed with a small persisted table (see the exception section).
Decision
Three in-process stores are derived views of a durable source, not systems of record:
| Store | What it holds | Durable source it re-derives from | Restart behavior |
|---|---|---|---|
internal/topology (MemoryStore / IndexedStore) |
the tenant-scoped temporal graph | the bus stream (probectl.ebpf.flows and friends) plus the path/flow stores it observes; the topology consumer re-subscribes at boot |
rebuilds as new observations arrive; a cold start is an empty graph that refills within the stream/retention window |
internal/threat (DetectionStore) |
recent NDR/posture detections per tenant, bounded | the detection stream; the durable trail is the incident record + SIEM export, which is already persisted | rebuilds from the stream; the forensic copy already left via incidents/SIEM, so nothing forensic is lost |
internal/alert (Engine firing state) |
per-series firing/pending state | re-derived on the next evaluation against the metric source | firing state re-derives automatically (see the exception below) |
Why rebuild rather than persist
These are caches of a stream, not the system of record. The authoritative data already lives durably: flows and paths in ClickHouse, metric series in the TSDB, the detections' forensic copy in incidents plus the SIEM export, and the audit trail in its tamper-evident chain.
Persisting a second copy of a derived view would buy us nothing and cost three things:
- Write-path cost — every observation would have to be written twice.
- A schema to migrate — more migrations, more surface.
- A consistency problem — the persisted snapshot and the live stream can disagree, and now you have to reconcile them.
…all for state that is correct again within one evaluation or observation cycle. Rebuild-on-restart is the simpler, correct choice — and it is now a decided, tested property instead of an implicit one.
Cross-tenant isolation is unaffected. Each store is keyed by tenant, and a cold start cannot surface another tenant's data: it surfaces nothing until it has re-derived from that tenant's own inputs.
The exception: alert silences and acknowledgements
Silences and acks are operator inputs, not derivable from any stream. Re-deriving firing state can reconstruct what is firing, but it cannot reconstruct "operator X silenced this alert until 3pm" — that fact existed only because a human typed it. When this ADR was accepted, silences/acks were in-process and lost on restart — deliberately failing in the safe direction, louder, not quieter: a restart could never hide a firing alert; the worst case was a human re-applying a silence.
That tracked follow-up has since landed. Silences and acks now ride a small
tenant-scoped Postgres table (alert_ops, RLS-confined) the same way alert
rules already do: written when the operator acts, reloaded at boot,
re-applied when their series fires again, and deleted when the episode
resolves — so restart-restored state never outlives the episode semantics the
in-memory engine always had. The fail-louder ordering is preserved: if the
reload itself fails, alerting continues without the saved silences (logged
loudly) rather than blocking on the table.
Consequences
- A control-plane restart shows a briefly empty topology/detection view while the streams re-fill — expected and documented. Firing alerts re-derive on the first evaluation, with persisted silences/acks re-applied as their series fire.
- No new migration, write path, or snapshot-consistency surface for derived state. (The silences/acks fix is exactly the carve-out the exception predicted: one small table for non-derivable operator input — not a second copy of any stream.)
- The cold-start contract is enforced by tests in
internal/topologyandinternal/alert: a fresh store is empty and correct, and re-derives from its inputs.
Tests proving cold start
internal/topology: a newMemoryStorereturns an empty snapshot for any tenant (no stale data, no panic) and rebuilds the graph as observations replay.internal/alert: a newEnginehas no active alerts and no silences/acks at construction, and re-derives firing state from the metric source on the first evaluation after a restart (modeled as a fresh engine). The engine itself stays memory-only by design — durability is the wiring around it, which persists operator actions and restores them at boot.