Agent enrollment & SVID rotation

How an agent gets — and keeps — its cryptographic identity. This is the operator how-to; the decision and threat model behind it are in adr/agent-enrollment.md.

The intuition: an agent is useless until it has an SVID (a short-lived mTLS client certificate whose SPIFFE identity names its tenant and agent id). Until then the mTLS transport refuses its connection, the ingest path won't vouch for it, and nothing it sends lands anywhere. The trust root is repo-managed — you do not hand-distribute certificates.

The lifecycle is four steps: set up the CA once, mint a join token, redeem it on the agent, and let the runtime rotate forever after.

One-time deployment setup

probectl-control agent-ca init

This generates the certificate hierarchy: root (10y, signs intermediates only) → issuing intermediate (1y, sealed at rest via the deployment envelope) → leaf SVIDs (24h). The ROOT private key is printed once to stdout for offline custody (HSM / sealed envelope / offline vault) and is never stored — runtime operation never needs it. Re-running refuses to overwrite the trust root.

The control plane's agent gRPC listener verifies every connecting agent's certificate against this agent CA, which it reads from a file (PROBECTL_AGENT_TLS_CA_FILE). Export that public trust bundle (root + intermediate — never a key) with:

probectl-control agent-ca export /etc/probectl/agent-ca.crt   # "-" writes to stdout

Point PROBECTL_AGENT_TLS_CA_FILE at the result. export copies only public certificates, so it needs no envelope key and works anywhere the database is reachable (set PROBECTL_DATABASE_URL). It writes one world-readable file and does not create parent directories — the target directory must already exist.

Enrolling an agent

1. Mint a join token (operator action; both surfaces store only a hash of the token):

# CLI — talks directly to the control plane's DATABASE, not the API
# (set PROBECTL_DATABASE_URL; works even while the API is down)
probectl-control enroll-token -tenant <tenant-uuid> [-agent <id>] [-name <label>] [-ttl 1h]

# or the admin API (requires the agent.write permission; audited, and the
# token is scoped to the CALLER's tenant)
POST /v1/agents/enroll-tokens   {"agent_id": "...", "ttl_seconds": 3600}

The token (pjt_…) is shown once, is single-use, expires (default 1h), and is tenant-scoped — the token, not the agent, names the tenant. The CLI also prints the server-certificate pin for first contact — but only when PROBECTL_TLS_CERT_FILE points at the serving certificate; without it, no pin prints and you use --ca-file in step 2 instead.

2. Redeem it on the agent host:

probectl-agent enroll \
  --server https://control.example:8443 \
  --token pjt_... \
  --dir /var/lib/probectl-agent/identity \
  --ca-pin <hex sha256>        # for self-signed quickstarts; or --ca-file ca.crt

The agent generates its private key locally (it never leaves the host), sends a CSR, and receives: the leaf SVID (SPIFFE URI spiffe://probectl/tenant/<t>/agent/<a> — client-auth only, with the SAN set by the server), the intermediate, and the trust bundle — all written 0600 into --dir. The agent is simultaneously registered in its tenant's registry, so ingest verification vouches for it immediately. A provided --ca-pin that mismatches refuses the connection — there is no trust-on-first-use fallback. With neither --ca-pin nor --ca-file, the system trust roots verify the server (the right choice when the control plane serves a publicly-issued certificate).

3. Point the agent config at the identity (the paths enroll just wrote):

tls:
  cert_file: /var/lib/probectl-agent/identity/cert.pem
  key_file:  /var/lib/probectl-agent/identity/key.pem
  ca_file:   /var/lib/probectl-agent/identity/ca.pem
identity:
  server: https://control.example:8443   # enables automatic rotation

One subtlety about tls.ca_file: it is what the agent uses to verify the control plane's server certificates (the gRPC listener, and the HTTPS endpoint that rotation calls). The enrollment-written ca.pem — the agent CA bundle — verifies them only if you issued those server certificates from the agent CA. If they come from a different CA (for example the gen-cert quickstart CA), point ca_file at that CA instead. Two trust checks, two CAs: the agent verifies the server against tls.ca_file; the server verifies the agent against PROBECTL_AGENT_TLS_CA_FILE. The worked laptop example is in getting-started.md.
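To make the identity format from step 2 concrete, here is a minimal sketch of splitting an agent SVID's SPIFFE URI into its tenant and agent parts. The helper name parse_agent_spiffe_id is hypothetical and is not part of probectl; only the URI layout spiffe://probectl/tenant/<t>/agent/<a> comes from the text above.

```python
from urllib.parse import urlparse
from typing import Tuple


def parse_agent_spiffe_id(uri: str) -> Tuple[str, str]:
    """Return (tenant, agent) from spiffe://probectl/tenant/<t>/agent/<a>.

    Hypothetical helper for illustration; raises ValueError on any other shape.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "spiffe" or parsed.netloc != "probectl":
        raise ValueError(f"unexpected scheme or trust domain: {uri!r}")
    # A well-formed path splits into ['', 'tenant', <t>, 'agent', <a>].
    parts = parsed.path.split("/")
    if len(parts) != 5 or parts[1] != "tenant" or parts[3] != "agent":
        raise ValueError(f"unexpected path layout: {uri!r}")
    tenant, agent = parts[2], parts[4]
    if not tenant or not agent:
        raise ValueError(f"empty tenant or agent id: {uri!r}")
    return tenant, agent
```

A check like this is handy when inspecting an issued cert.pem by hand: the tenant in the URI should match the tenant the join token was minted for, since the server sets the SAN.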

Enroll on first boot (token-on-boot)

Steps 2–3 can also happen automatically on startup, which suits containers and DaemonSets: ship a join token instead of a pre-provisioned identity, and the agent enrolls itself the first time it boots. On startup, if no identity exists yet (cert.pem + key.pem are absent) and a token is available, the agent enrolls — writing the identity into the directory of tls.cert_file — and then runs. The full config is still required (the normal tls: paths name where the identity will land; keep the cert.pem/key.pem filenames, since those are what enrollment writes):

control_plane:
  grpc_addr: control.example:9443
tls:
  cert_file: /var/lib/probectl-agent/identity/cert.pem   # enrollment writes here
  key_file:  /var/lib/probectl-agent/identity/key.pem
  ca_file:   /etc/probectl/control-ca.crt   # must EXIST at first boot (see below)
identity:
  server: https://control.example:8443
enroll:
  token_file: /var/run/secrets/probectl/join-token   # or the env var below
  # ca_pin: <hex sha256>   # alternative first-contact trust for self-signed deploys

# equivalently, env-only (e.g. a token mounted from a Kubernetes Secret):
PROBECTL_AGENT_JOIN_TOKEN=pjt_...  probectl-agent -config agent.yml

PROBECTL_AGENT_JOIN_TOKEN takes precedence over enroll.token_file. The enrollment target defaults to identity.server; enroll.server overrides it. Each key also has an env form (PROBECTL_AGENT_ENROLL_TOKEN_FILE, PROBECTL_AGENT_ENROLL_SERVER, PROBECTL_AGENT_ENROLL_CA_PIN) — all documented in configuration.md.

First-contact trust still applies on boot. The boot enrollment verifies the control plane with enroll.ca_pin if set, else with the file at tls.ca_file — which must therefore already exist at first boot (mount it alongside the token) — else with the system roots. A missing ca_file is treated as a transient failure: the agent retries and eventually gives up rather than ever connecting unverified.

Boot enrollment is idempotent and fail-closed. An existing identity is never overwritten (renewal stays the rotation loop's job). A transient failure (e.g. the control plane isn't up yet, or a 5xx) retries with capped backoff (1 s, doubling up to 30 s) for up to five minutes, then exits with an error. A definitive rejection (an HTTP 4xx: a used, expired, invalid, or revoked token; a malformed CSR) exits immediately with a clear error instead of looping; mint a fresh single-use token and retry. The token is never logged. With no token, behavior is unchanged: you enroll out of band with the steps above.
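For a feel of what the transient-failure retry schedule looks like, here is a small sketch that generates the sleep intervals from the numbers above (1 s start, doubling, 30 s cap, five-minute budget). It is an approximation under stated assumptions: the function name is hypothetical, it ignores the time each attempt itself takes, and trimming the final wait to fit the budget is a choice made for this example.

```python
from typing import Iterator


def backoff_delays(initial: float = 1.0, cap: float = 30.0, budget: float = 300.0) -> Iterator[float]:
    """Yield sleep intervals in seconds until the total reaches the budget."""
    elapsed, delay = 0.0, initial
    while elapsed < budget:
        wait = min(delay, budget - elapsed)  # trim the last wait to the budget
        yield wait
        elapsed += wait
        delay = min(delay * 2, cap)  # double, but never beyond the cap


if __name__ == "__main__":
    print(list(backoff_delays()))
```

Under these assumptions an agent that never reaches the control plane waits 1, 2, 4, 8, 16, then 30 seconds repeatedly, so a DaemonSet pod started a minute or two before the control plane should still enroll without a restart.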

Rotation

SVIDs live 24h. With identity.server set, the runtime rotates automatically at roughly 2/3 of the lifetime (checked once a minute): it generates a fresh key, proves possession of the current one (an ECDSA signature over the new CSR), and calls POST /enroll/agent/rotate over HTTPS. That call is verified against tls.ca_file; the pin is first-contact only. The server verifies the presented chain against its own hierarchy, verifies the proof, checks that the presented certificate's serial is one it recorded, and checks the revocation list. The identity can never change on rotation: the server sets the SAN from the proven identity, and CSR-requested names are ignored. Files are replaced atomically and the mTLS client hot-reloads them on the next handshake, so there is no restart and no ingest gap. A failed rotation retries every minute while the current SVID is still valid, logging loudly.
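As a worked example of the timing: with a 24h SVID, 2/3 of the lifetime is 16h, which leaves about 8h of once-a-minute retries before the certificate expires. The sketch below computes that threshold from a certificate's validity window. The function names are hypothetical, and since the text says "roughly 2/3", the exact trigger point is an assumption.

```python
from datetime import datetime, timedelta, timezone


def rotation_due_at(not_before: datetime, not_after: datetime) -> datetime:
    """Instant at which 2/3 of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


def should_rotate(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """What a once-a-minute check would evaluate (assumed exact 2/3 threshold)."""
    return now >= rotation_due_at(not_before, not_after)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=24)
    print(rotation_due_at(issued, expires))  # 16 hours after issuance
```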

Security properties (what to rely on)

Property	Mechanism
Replay-proof bootstrap	single-use token, consumed atomically (a replay finds no row); hash-at-rest; short expiry (default 1h); an unused token can additionally be voided in the database
Tenant binding	the SPIFFE URI SAN is set by the SERVER from the token's tenant; an agent cannot request one
Key custody	agent keys are generated on the agent (CSR flow); the root key lives offline; the intermediate key is sealed at rest
Bounded theft	24h leaf TTL; every issued serial is recorded and feeds the handshake revocation list
Throttled bootstrap surface	/enroll/agent and /enroll/agent/rotate ride the per-IP login throttle; no signing happens before the token/proof check
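The "replay-proof bootstrap" row is easiest to see in miniature. The sketch below stores only a hash of the token and uses a conditional DELETE as the atomic gate, so a second redemption finds no row. Everything concrete here is an assumption for illustration: the table and column names, SQLite as the store, and SHA-256 as the hash are not taken from probectl.

```python
import hashlib
import sqlite3
from typing import Optional


def token_hash(token: str) -> str:
    # Assumption: SHA-256 hex digest; the real hash function is not documented here.
    return hashlib.sha256(token.encode()).hexdigest()


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE join_tokens ("
        "token_hash TEXT PRIMARY KEY, tenant TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    return db


def mint(db: sqlite3.Connection, token: str, tenant: str, expires_at: float) -> None:
    """Persist only the hash; the plaintext token is shown once and discarded."""
    db.execute("INSERT INTO join_tokens VALUES (?, ?, ?)", (token_hash(token), tenant, expires_at))
    db.commit()


def redeem(db: sqlite3.Connection, token: str, now: float) -> Optional[str]:
    """Return the token's tenant exactly once; None for unknown, used, or expired."""
    h = token_hash(token)
    row = db.execute(
        "SELECT tenant FROM join_tokens WHERE token_hash = ? AND expires_at > ?", (h, now)
    ).fetchone()
    if row is None:
        return None
    # The DELETE is the gate: only the caller that removes the row may proceed.
    cur = db.execute("DELETE FROM join_tokens WHERE token_hash = ? AND expires_at > ?", (h, now))
    db.commit()
    return row[0] if cur.rowcount == 1 else None
```

The tenant comes back from the stored row rather than from anything the caller sent, which mirrors the "tenant binding" row: the token, not the agent, names the tenant.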

Revoking an agent

probectl-control revoke-agent -tenant <uuid> -agent <id>     # CLI (database-direct)
POST /v1/agents/{id}/revoke                                  # admin API (agent.write, audited)

Both persist the revocation (so it survives a restart) and feed the mTLS handshake deny-list. The API pushes it live immediately; the running control plane also reloads the persisted list every 30s, which is how CLI-side revocations propagate. From the next connection, a revoked agent's handshakes are refused, its live serials are denied, and its SPIFFE id is denied (so even a re-issued cert is refused) — and enrollment and rotation both refuse the identity. There is no resurrection path short of an operator un-revoking it in the database.
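The reason a revoked agent cannot come back with a fresh certificate is that the deny-list is keyed on both the serials and the SPIFFE id. A minimal in-memory sketch of that check follows; the class and method names are hypothetical, and the real control plane persists this list and reloads it as described above.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class DenyList:
    """Illustrative handshake deny-list: denied serials plus denied identities."""

    serials: Set[str] = field(default_factory=set)
    spiffe_ids: Set[str] = field(default_factory=set)

    def revoke(self, spiffe_id: str, live_serials: Iterable[str]) -> None:
        # Deny every serial issued so far, and the identity itself.
        self.serials.update(live_serials)
        self.spiffe_ids.add(spiffe_id)

    def allows(self, serial: str, spiffe_id: str) -> bool:
        # Either match is enough to refuse the handshake.
        return serial not in self.serials and spiffe_id not in self.spiffe_ids
```

For example, after revoke("spiffe://probectl/tenant/t1/agent/a1", ["01"]), a handshake with serial "01" is refused, and so is one presenting a new serial "02" under the same SPIFFE id.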

For the full threat-model delta and the stated residuals, see adr/agent-enrollment.md.