Incident response plan
How a probectl security incident is detected, triaged, handled, evidenced, and disclosed. Companion to SECURITY.md (reporting intake + coordinated disclosure) and threat-model.md. Two scopes, kept deliberately distinct:
- Product/vendor incidents — a vulnerability or compromise in probectl code, releases, or build infrastructure. Vendor-led; affects all deployments.
- Deployment incidents — a compromise inside one operator's installation. Operator-led under their own IR; probectl's job is to give them the evidence tooling and this playbook's deployment sections.
1. Severity matrix
| Sev | Definition | Examples | Acknowledge | Mitigation target |
|---|---|---|---|---|
| SEV-1 | Cross-tenant data exposure (the declared highest-severity failure — tenant isolation is the outermost boundary); RCE in agent/control plane; compromised release artifact or signing path | tenant A reads B; malicious artifact signed | 4h | fix or kill-switch ≤ 24h; advisory ≤ 72h |
| SEV-2 | Auth/RBAC bypass within a tenant; audit-chain integrity break; secret exposure; AI egress gate bypass | consent gate skipped; WORM verify fails | 24h | ≤ 7 days |
| SEV-3 | Vulnerability with significant mitigating factors; DoS of a plane; dependency CVE in a reachable path | fairness bypass; exploitable parser crash | 72h | next patch release |
| SEV-4 | Hardening gap, defense-in-depth finding, non-reachable dep CVE | scanner findings | 7 days | scheduled |
Cross-tenant anything starts at SEV-1 until proven otherwise. When in doubt, classify up.
2. Roles
Solo-founder reality, stated plainly: the maintainer is Incident Commander, investigator, and fixer. The compensating structure is this written plan; predefined external deputies (counsel for notification-duty questions; the affected operator's security team for deployment incidents — co-responders by design, since they hold the infrastructure); and an immutable trail (the signed WORM audit export, see ../hardening.md §0b) so every action is reviewable after the fact. The role table will be re-cut at the first security hire.
| Role | Holder today | Duties |
|---|---|---|
| Incident Commander | maintainer | declare sev, run timeline, own comms |
| Investigator/fixer | maintainer (+ operator's team for deployment scope) | root cause, patch, verify |
| Counsel | external (engaged per incident) | notification obligations, advisory wording |
3. Response flow
- Intake. A report through the SECURITY.md channels (private advisory /
security.txtcontact), a CI security gate (security-scan.yml), the WORM chain-verification alarm, or an operator report. - Declare + log. Assign a severity; open a private timeline doc; record UTC timestamps for every action from this point on.
- Contain.
- Product scope: pull or revoke the affected artifacts (cosign identities make "which artifact" provable) and gate the release lane.
- Deployment scope: the operator isolates per the runbooks — agents
halt-on-error via the rollout machinery, credentials rotate, and any
insecureoverride is audited.
- Preserve evidence — before remediation mutates state.
- Export the audit chains: the provider stream's signed WORM segments are
already off-database and chain-verified (
internal/audit/worm.go); snapshot the tenant streams. - Run
scripts/backup_postgres.shandscripts/backup_clickhouse.shto capture store state with SHA-256 manifests. - Generate a support bundle (
internal/support) for config/version posture. - Hash and retain the agent binaries/objects in question, comparing against the baked-in integrity manifest and the release signatures.
- Export the audit chains: the provider stream's signed WORM segments are
already off-database and chain-verified (
- Eradicate + recover. Fix forward; release through the normal lane (the red-CI refusal stays on); upgrade the fleet as staged waves of signed artifacts with halt-on-error — an emergency changes the wave sizes, never the verification.
- Notify (§4) and disclose (§5).
- Post-incident review within 7 days: timeline, root cause, a tracked follow-up for every systemic gap found, and a re-review of the threat model if a trust boundary was involved.
4. Customer notification
The self-hosted nuance: the vendor cannot see deployments, so notification means telling operators what to check and what to ship, fast.
- Channels: GitHub Security Advisory (the canonical record) plus release notes; direct contact for known enterprise/MSP operators.
- SEV-1: an initial notice within 72h of confirmation, even if incomplete — affected versions, severity, interim mitigations, and indicators of compromise (how an operator checks their own audit chains and registry for the pattern). Update cadence at least every 72h until resolved.
- SEV-2: notice with the fix release.
- Content discipline: facts only — affected versions, the upgrade path (../ops/fleet-rollout.md), and verification steps (../ops/verify-artifacts.md). MSP operators re-notify their tenants under their own obligations; the per-tenant audit trail gives them the evidence to do it.
5. Disclosure
Coordinated disclosure per SECURITY.md: reporter credit, advisory published when a fix is available, CVE requested via GitHub for qualifying issues. Embargo target ≤ 90 days, shorter when exploitation is observed.
6. Drill
Tabletop this plan twice yearly (one product-scope, one deployment-scope using the backup/failover drills as the recovery legs) and after any SEV-1/2. Record outcomes in the review log below.
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-07 | Initial plan |