Cost, reliability, chaos & carbon intelligence
What it is
Four engines that turn the network telemetry probectl already collects into answers the business cares about: what is this traffic costing us, are we keeping our reliability promises, will our monitoring actually catch a real outage, and what is the environmental footprint of moving these bytes? All four read the same observed data — none of them adds a new agent or a new probe — and all four are scoped to one tenant (one isolated customer or organization) at a time.
- Network cost intelligence (FinOps) puts a dollar figure on observed traffic and attributes it to a service and team. FinOps is the practice of making cloud spend visible and tying it to the team that creates it.
- Service Level Objectives (SLOs) turn a reliability promise — "checkout succeeds 99% of the time" — into something probectl watches continuously and rolls up to the business owner. Definitions are written in OpenSLO, an open, vendor-neutral SLO file format.
- Network chaos testing deliberately injects a known network fault — added delay, packet loss, a full outage — to prove your monitoring and SLO alerts actually fire when the network breaks.
- Carbon and power reporting estimates the electricity and carbon spent moving your bytes, for sustainability disclosure.
Terms of art (FinOps, egress, SLO/SLI, error budget) are defined in the glossary.
Why it exists
Bytes are easy to count; the things that matter about them are not. A flow record tells you a megabyte crossed the wire — it does not tell you that the megabyte crossed a region boundary and therefore cost real money, that it ate into checkout's reliability budget, or that it burned a measurable amount of grid electricity. These four engines attach that missing meaning.
- Cost. Clouds bill for egress — traffic that leaves a boundary such as an availability zone, a region, or the provider's network entirely — so where a byte travels matters far more than how many bytes there are. A "chatty" pair of services quietly talking across a region boundary can run up a bill nobody notices until the invoice lands weeks later. This engine surfaces that leak in close to real time and points at the team responsible.
- Reliability. A dashboard full of green checkmarks does not answer the question an executive asks: how much room do we have left before we break our promise, and is something burning that room down right now? An SLO turns reliability into a spendable balance you can reason about.
- Chaos. A monitoring platform that has never been shown a real failure is itself an untested alarm. You test a smoke detector with a match under a test rig, not by setting the corridor on fire — chaos testing is that controlled match for your observability.
- Carbon. Sustainability disclosure (often called environmental, social, and governance, or ESG, reporting) increasingly asks "what is your network footprint?" The data to estimate it already flows through probectl.
None of these engines acts on the network. Cost never throttles traffic, SLOs never block anything, and the carbon view never changes a thing — every output is a signal a human reads, never an enforcement point.
How it works
Cost: classify, then price, then attribute. Each observed flow has two ends.
probectl resolves each end to a zone or region using operator-declared rules in
Classless Inter-Domain Routing (CIDR) notation — a way of naming an address
range by prefix, e.g. 10.0.1.0/24 is 256 addresses. probectl cannot guess your
subnet layout, so you declare it. The most specific rule wins (a /24 beats an
overlapping /16). Each flow then gets a traffic class, the bytes are
multiplied by a per-class price, and the result is attributed to a service and
team:
| Class | Meaning | Default rate ($/GiB) |
|---|---|---|
same_zone |
both ends in the same zone | 0 (free on the major clouds) |
inter_az |
same region, different zones | 0.01 |
inter_region |
different regions | 0.02 |
internet_egress |
leaves to a public address | 0.09 |
unknown |
zones unmapped or unresolvable | unpriced — volume still tracked |
The engine prices observed volume against published list rates — think of it as
the taxi's own meter, not your credit-card statement. It tells you who rode where
in real time; reconciling that against the actual cloud invoice stays your
finance team's job. The honesty rules are strict: with no price table it runs in
volume-only mode (bytes attributed, dollars never invented), and every response
carries flags (priced, zones_mapped) plus the pricing source and an as-of
date, so staleness is visible rather than hidden.
SLO: a spendable error budget, watched on two clocks. The thing you measure is the service-level indicator (SLI) — the ratio of good probes to total probes. The SLO is the target you hold it to. If your target is 99%, you are allowed to fail 1% of the time; that 1% is your error budget, a balance you spend as failures happen. Burn rate is how fast you spend it:
burn rate = errorRate(window) / (1 − target)
Burn rate 1 spends exactly the whole budget over the SLO window and lands on empty right at the end — sustainable. Burn rate 14.4 spends the whole month's budget in about two days — an emergency. The hard part is telling a real outage from a blip, so probectl requires two windows — a long one and a short one — to both exceed the threshold before it alerts (a logical AND, the established Google site-reliability-engineering method). The long window proves the problem is sustained; the short window proves it is still happening now:
| Tier | Long window | Short window | Burn ≥ | Severity |
|---|---|---|---|---|
| fast | 1h | 5m | 14.4 | page |
| medium | 6h | 30m | 6 | page |
| slow | 3d | 6h | 1 | ticket (warning) |
A brand-new SLO with almost no data would trip every threshold on a single
failure, so the engine stays quiet (reporting cold_start: true) until an SLO
has seen at least 50 events in its full window. Alerts latch per episode and
re-arm only after the long window drops back under the threshold, so one
outage cannot flap out a stream of pages.
Chaos: a blast radius the design makes impossible to widen. The chaos injector is an in-process relay for datagrams (small connectionless network packets) that perturbs only traffic explicitly addressed to its own listener. It touches no kernel state, no firewall rules, no agent or tenant traffic, and it is not reachable from any application programming interface (API) — actions against the network are human-gated by design in probectl, and this one is not even wired in. A chaos run has a fixed shape: healthy baseline → inject → observe → heal → observe. The baseline-first shape is what makes the result evidence — an alert already firing before the fault proves nothing.
Carbon: two multiplications, honestly labeled. The carbon engine reuses the same flow stream and the same zone and owner attribution as the cost engine, so the dollar view and the carbon view always agree on who owns which traffic:
kWh = (bytes / GiB) × energy-coefficient(traffic class)
gCO2e = kWh × grid intensity (gCO2e/kWh)
The energy coefficients (kilowatt-hours per gigabyte) come from published
transmission-energy literature and scale with path locality. Grid intensity (the
carbon your electricity carries) is operator-set — set your real figure, since it
varies enormously by region. Every response is labeled measured: false: these
are coefficient-based estimates of transmission energy, good for relative
attribution and trends, never an audited absolute footprint. The engine never
pretends otherwise, and like everything else here it is fully local — nothing is
fetched and nothing leaves your network.
Use it
Wire up cost classification and attribution with operator-declared rules, then set monthly budgets:
# CIDR → zone (region derived from the trailing zone letter, or set explicitly)
export PROBECTL_COST_ZONES="10.0.1.0/24=us-east-1a,10.9.0.0/16=eu-west-1a"
# CIDR → service:team (this is what makes per-team showback possible)
export PROBECTL_COST_SERVICES="10.0.1.0/24=checkout:payments,10.0.2.0/24=inventory:logistics"
# Monthly USD budgets; a breach raises ONE signal per budget per month
export PROBECTL_COST_BUDGETS="team:payments=500,service:checkout=120"
Read the tenant's cost summary:
curl --cacert ca.crt "https://control.example/v1/cost/summary"
What you should observe: totals broken down by class, service, and team; the top
"chatty" zone-pair conversations (a pair is flagged once it crosses 1 GiB of
paid cross-zone or cross-region traffic — same-zone and internet traffic do not
count, since the point is internal leaks); a 7-day hourly trend; budget status;
and the honesty flags priced and zones_mapped. If you have not supplied a
price table, you will see priced: false and byte counts with no dollars — that
is correct, not a failure.
Define an SLO as a standard OpenSLO document and point PROBECTL_SLO_DIR at
the directory holding it. probectl evaluates a deliberate subset and rejects
anything outside it loudly at load time — an SLO you think is being tracked
must actually be tracked:
apiVersion: openslo/v1
kind: SLO
metadata:
name: checkout-availability
labels:
team: payments # the business-unit mapping for roll-up
spec:
service: checkout
indicator:
spec:
ratioMetric:
good:
metricSource:
type: probectl
spec: { canary_type: http, target: checkout.acme.example, outcome: success }
total:
metricSource:
type: probectl
spec: { canary_type: http, target: checkout.acme.example }
timeWindow: [{ duration: 30d, isRolling: true }]
budgetingMethod: Occurrences
objectives: [{ target: 0.99 }]
Read the tenant's SLO status at GET /v1/slos. What you should observe:
attainment versus objective, error budget remaining, total events, the
cold_start flag, and per-window burn rates with their firing state. A new SLO
reports cold_start: true until it has 50 events.
Run a chaos self-test against your own echo path in your own test harness —
start a proxy, point a udp or voice probe at it, and flip the fault on:
healthy baseline → SLO quiet, probes pass, latency normal
inject a partition → probes fail for real, the multi-window burn alert fires
heal → attainment recovers, the alert clears
If injecting a known fault does not make the alert fire, that is a failure of the platform's core promise — which is exactly what the self-test is built to catch.
Read the carbon estimate at GET /v1/carbon, after setting your grid figure:
export PROBECTL_CARBON_GRID_GCO2E=230 # your utility's published gCO2e/kWh
curl --cacert ca.crt "https://control.example/v1/carbon"
What you should observe: totals in kilowatt-hours and grams of carbon, the same
class/service/team breakdown as cost, a 7-day trend, and a methodology block
stating measured: false with the coefficient source and grid figure used.
Pitfalls & limits
- Cost is attribution, not your invoice. The engine prices observed volume at list rates; it does not reconcile your actual cloud bill, and full billing-API reconciliation is out of scope by design. Treat it as "who is generating expensive traffic, and roughly what does it cost."
- No zone rules means
unknown, not a guess. Without CIDR→zone rules every flow lands in theunknownclass with bytes tracked but no dollars. probectl never invents a rate or guesses your subnet layout. - A malformed price table fails startup on purpose. Silently mispriced cost data is worse than none, so a broken override file refuses to boot. To run with no pricing at all, switch to volume-only mode explicitly.
- SLOs cover the network planes, not your application. probectl evaluates a subset of OpenSLO (ratio metrics, occurrence budgeting, a single rolling window, one objective). Application-level instrumentation is out of scope — probectl correlates the network, it does not own your app's metrics.
- Cold-start silence is deliberate. A low-cadence probe may take a while to
reach 50 events; until then the SLO honestly reports
cold_start: truerather than paging on statistical noise. - Chaos is a self-test, not an API. The injector cannot be triggered remotely and never mutates a live cluster. Transmission-Control-Protocol and Hypertext-Transfer-Protocol stream faults and cluster-level chaos orchestration (killing pods and nodes) are out of scope — dedicated chaos tools own those.
- Carbon is an estimate, structurally. Every response is labeled
measured: false. No supported device reports actual wattage today, and the engine never fabricates it. Good for relative comparison and trends; not an audited footprint. A single grid figure applies per deployment — there is no per-region split yet.
Reference
- Cost endpoint:
GET /v1/cost/summary(needs the metrics-read permission). A budget breach raises onecost.budget_exceededsignal per budget per month into the incident pipeline; it never throttles traffic. - Cost config:
PROBECTL_COST_ENABLED,PROBECTL_COST_ZONES,PROBECTL_COST_SERVICES,PROBECTL_COST_BUDGETS,PROBECTL_COST_PRICES_FILE(JSON override of the built-in public list rates),PROBECTL_COST_PRICED(falsefor volume-only mode). - SLO endpoints:
GET /v1/slos(tenant-scoped statuses),GET /v1/slos/openslo(the loaded definitions as an OpenSLO YAML stream). Config:PROBECTL_SLO_ENABLED,PROBECTL_SLO_DIR. Burn-rate crossings raise anslo.burn_ratesignal into the incident pipeline. - What-if tie-in: the read-only topology what-if simulation reports
impacted_slos— the SLOs whose service or target sits inside a simulated failure's blast radius — so "what breaks if this link dies?" answers in SLO terms. - Carbon endpoint:
GET /v1/carbon(needs the metrics-read permission). Config:PROBECTL_CARBON_ENABLED,PROBECTL_CARBON_GRID_GCO2E(accepted range 1–5000). - Deep dashboarding for cost is federated to Grafana over the same flow series probectl already exposes — there is no separate dashboard surface to maintain. See the ecosystem-integrations material.
- Related capabilities (separate pages): Alerting and incidents (where these signals land); Topology & change (the what-if simulation that reports impacted SLOs); Flow analytics (the byte stream cost and carbon both read).
Covers: F41, F42, F47, F48