CI pipeline overview
A deeply-technical-ELI5 map of what happens every time you push — what runs,
why each gate exists, and what "green" actually proves. The source of truth is
.github/workflows/ci.yml; this doc explains it.
When it runs
ci.yml triggers on push to main and on every pull request. A
concurrency group keyed on the ref cancels any in-progress run when you push
again, so a rapid second push abandons the first (no wasted runners). The
workflow starts with only contents: read permission — it can read the repo but
not write to it, so CI can never push to main.
The repo's other three workflows do not run on a normal push:
| Workflow | Trigger | Purpose |
|---|---|---|
ci.yml |
push to main + every PR |
the gate described here |
release.yml |
push of a v* tag |
build/sign/publish a release — its require-green-ci job refuses any tag whose commit doesn't have a green ci run |
nightly.yml |
daily 03:17 UTC (+ manual) | the slow stuff: e2e (black-box full stack against the real compose dependencies), ingest-bench (consumer hot-path benchmarks), and scale-gate-m (the M-profile SLO regression guard, make scale-gate-m, plus an M-tier full-stack run against real Kafka + Prometheus) |
security-scan.yml |
weekly, Mondays 05:17 UTC (+ manual) | scheduled dependency-CVE scans (govulncheck, npm-audit, trivy-fs) that upload evidence artifacts and fail on Critical findings — so a new CVE in an unchanged repo still goes red |
The shape: fan-out, then one umbrella
A push is like sending the change through a checkpoint with 32 specialist
inspectors, each examining one thing, all in parallel. A 33rd job —
verify-all — is the supervisor: it needs: 28 of them, runs with
if: always(), writes a receipt artifact, and fails loudly listing any
non-green gate. It is the one status you actually watch: green verify-all =
the whole pipeline passed.
Four jobs sit outside the roll-up and report as their own checks: commitlint
and dco run only on pull requests (on a push to main they'd register as
"skipped" and falsely redden the umbrella), and image-scan and sbom are the
image-vulnerability and bill-of-materials jobs. If you turn on branch
protection, docs/ops/branch-protection.md
recommends requiring verify-all plus those four explicitly.
The guiding principle, visible throughout the repo: every rule in the Non-negotiables has a matching CI gate, and the gates execute the thing rather than just inspecting it — they boot real kernels, real databases, real Kafka, and drill real failovers. Verification you ran, not verification you described.
The jobs, by stage
1. Cheap gatekeeping — "is this change even well-formed?"
- action-pins — every GitHub Action is pinned to a commit SHA, not a moving
tag, so a hijacked upstream action can't slip in; a second pass
(
scripts/check_supply_pins.sh) rejects:latestimage tags and unpinned tool installs anywhere in the workflows (supply-chain). - secret-scan — gitleaks: no credentials/keys committed. Deliberate
redaction-test fixtures are allowlisted in
.gitleaks.toml. - commitlint — commit messages follow Conventional Commits (
feat(...),fix(ci): ...). - dco — every commit carries a
Signed-off-by(the Developer Certificate of Origin; whatgit commit -sadds). - no-devauth-in-release — proves the dev-auth bypass is physically absent
from a release binary (symbol + literal absence + a boot-time refusal of
PROBECTL_AUTH_MODE=dev), and compiles a tagged build to confirm the check isn't vacuous.
2. Static checks — compile, lint, spec; no services yet
- lint-go / lint-python — gofmt +
go vet+ golangci-lint on Go, plus the repo's guard scripts (crypto primitives only viainternal/crypto, no swallowed errors, hardened HTTP clients, no string-built SQL, unified TLS config); ruff/black on the BGP analyzer. - editions-gate — enforces the open-core boundary via
make editions-gate:scripts/check_editions_imports.sh(self-tested) proves core never importsee/, then the core-only build (-tags probectl_core) compiles and passes its tests with theee/tree linked out entirely. - fips-gate — the FIPS artifact (
-tags probectl_fips, validated Go Cryptographic Module viaGOFIPS140) builds, and its power-on self-test passes with the validated module active — the build is real, not just tagged. - proto —
buf lint, thenbuf breakingagainstmain(the schemas are the agent/bus wire contract, so incompatible changes block the merge — see CONTRIBUTING.md), then regenerates and asserts the committed*.pb.gomatch the.protofiles (no drift, no codegen poisoning). - openapi-gate — every registered
/v1route exactly matches the OpenAPI 3.1 spec; no undocumented routes ship. - migration-gate — DB migrations are additive / expand-only: it rejects
destructive or blocking changes (drop column, column-type change, rename,
adding
NOT NULL) so release N's schema still works with N-1's code during a rolling upgrade.
3. Tests + coverage
- test-go — the Go unit suite across every workspace module, run with
-race, plus a fuzz smoke over the untrusted-input parsers, the backup-encryption/erasure gate, an agent-overhead bench smoke, and cross-compile checks (linux amd64+arm64; the endpoint agent additionally builds for macOS and Windows). - test-python — the BGP analyzer's tests, installed from a hash-locked
requirements file (which is itself checked against
pyproject.tomlfor drift), with an 85% coverage floor. - coverage — per-package statement-coverage floors on the service-free
logic/parser/probe packages (
scripts/check_coverage.sh); retainscoverage.out+ a per-package summary as a downloadable receipt artifact (seedocs/quality/coverage.md). - web — the frontend: typecheck, lint, production build; the test step is the WCAG 2.2 AA accessibility + theme-token gate.
- browser-worker — the Playwright browser-synthetic worker, run inside the official Playwright image (a real scripted login, timings + failure screenshot).
4. eBPF — building and loading real BPF objects
- ebpf-kernel-matrix — boots real LTS kernels (5.15 and 6.6) in QEMU (via
vimto) and actually loads + attaches the BPF programs (tracepoint + uprobes, one flush cycle). One matrix entry raises kernel lockdown to INTEGRITY inside the VM and proves load+attach still works on a hardened kernel. amd64 runs under KVM; the arm64 runner has no/dev/kvm, so it compiles + digest-verifies the arm64 objects but skips the (too-slow emulated) boot. - ebpf-image-live — the shipped
probectl-ebpf-agentimage must carry the live CO-RE loader (built fromDockerfile.ebpf, asserting-tags=ebpf), not the fixture replayer.
5. Real-infrastructure suites — spin up actual Postgres / Kafka / ClickHouse
- cross-tenant-isolation — the
tenant-isolation non-negotiable,
executed (
make test-isolation): RLS posture + cross-tenant injection against a real Postgres (over TLS,sslmode=verify-fulllike production) and a real ClickHouse (including its row-policy DDL). - integration — boots the control plane against real Postgres (TLS,
verify-full) plus a remote-write Prometheus; migrations are idempotent,/readyzpasses, the result pipeline round-trips — andinternal/storemust hold its 60% integration-coverage floor. - perf-smoke — a cheap, repeatable latency/throughput baseline; the first place a pooled-cardinality or RLS-cost regression would surface.
- backup-drill — backup → wipe → restore actually runs; nonce-marked rows must survive a full database drop.
- failover-drill — kills the primary, promotes the streaming standby, and times RTO (first write on the new primary) / RPO (acked rows lost).
- load-smoke — synthetic agents → real Kafka → the production consumer (retry/DLQ + cardinality caps) → Prometheus remote-write → tenant-scoped PromQL, end to end.
6. AI quality
- rca-eval — a blocking evaluation of the AI root-cause analysis against fixtures, with quality floors (0.85 / 0.85). If RCA quality regresses, the build fails — the AI is held to a measured bar, not just "it compiled."
7. Deploy / packaging config
- helm-gate — the Helm charts lint for every reference profile and uphold
the secure-by-default invariants (non-root/read-only pod,
NetworkPolicy/PDB/HPA, drain probe, HSTS, no default credentials); the
rendered manifests are schema-validated with
kubeconform, the GitOps (ArgoCD/Flux) manifests are checked, and the shipped compose file must passdocker compose config. - terraform-gate — the Terraform module is
fmt-clean andterraform validates via an example root that consumes it.
8. Supply chain & artifacts
- dependency-scan — vulnerability scan of the dependency graph.
- build-images — builds the multi-arch container images.
- image-scan — scans those built images for known vulnerabilities.
- sbom — emits a software bill of materials for the build.
9. The umbrella
- verify-all — depends on the gates above, writes
verify-all-summary.jsonas a receipt, and goes red listing any gate that wasn't green. This is the single signal of overall pipeline health.
What a green pipeline proves
When verify-all is green for a commit, you've demonstrated — on that exact
commit — that the code:
- compiles and passes lint in every mode (normal, FIPS, eBPF);
- passes unit, real-Postgres, real-ClickHouse, and real-Kafka tests;
- keeps tenants isolated at the storage layer (RLS, enforced and tested);
- ships container images that scan clean, with an SBOM and SHA-pinned actions;
- didn't drift its API spec, protobufs, or migrations, and keeps zero-downtime upgrade compatibility;
- holds the AI's root-cause quality above its floor;
- survives a backup/restore and a primary failover.
That is a heavier pipeline than most projects run — closer to what a regulated enterprise expects — which matches where probectl is aimed (sovereign, multi-tenant, source-available).
What the pipeline cannot self-apply
Branch-protection enforcement (making GitHub refuse to merge a red PR) is a
server-side repository setting, not something a workflow in the tree can turn
on. Until an admin configures it, every check above is advisory at the merge
button — the recommended ruleset (require verify-all, plus the four jobs
outside its roll-up) and the one-time console steps are in
docs/ops/branch-protection.md. The release path
has its own independent backstop either way: release.yml refuses to publish a
tag whose commit lacks a green ci run.