Topology graph
What this is
A network has a shape: agents reach targets through a chain of router hops,
services call other services, autonomous systems originate prefixes, devices
carry interfaces. internal/topology is probectl's live model of that shape — a
tenant-scoped, versioned (time-travelling) graph that stitches together
the signals the other planes already produce:
- path discoveries (traceroute) → agent → hop → hop → host adjacency;
- the eBPF service map → service → service call edges;
- BGP routing events → autonomous-system → prefix origin edges;
- device telemetry → device nodes (and device → hop links where the telemetry exposes interface IPs).
It is the substrate two things sit on top of: the AI semantic-query / root-cause layer traverses it to explain why an incident happened, and the Topology page in the UI renders it. Because it is one shared graph, a single failure can be reasoned about across planes — "this hop went down, which broke these paths, which back these services, which back these SLOs."
Model
A node and an edge each have a kind and a stable id, so the same router or service observed by two different planes folds into one vertex.
- Nodes (
NodeKind):agent,hop(a traceroute responder / L3 hop),host(a path target),service(an eBPF workload),prefix(a BGP prefix),as(an autonomous system),device(a managed network device). Ids are derived from the identity, so they're stable across observations:agent:<id>,hop:<ip>,host:<ip>,service:<workload>,prefix:<cidr>,as:<asn>,device:<address>. - Edges (
EdgeKind):path(hop → hop adjacency),flow(service → service),routing(as → prefix),device(device → the hop it carries, via an interface IP). An edge's canonical id isfrom|kind|to. Edge attributes follow OTel conventions where they exist — aflowedge carriesdestination.port,network.transport, andnetwork.protocol.name.
Versioning / temporal (designed in, not bolted on)
Every node and edge carries a validity interval [FirstSeen, LastSeen].
Re-observing an element extends its interval and merges its attributes, so the
graph remembers when each thing existed — which is exactly what root-cause
analysis needs (the graph as it was at the incident moment, not as it is now).
SnapshotAt(tenant, t)returns the graph as it was at timet.Latest(tenant)returns the full current graph.
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
P["path plane"] --> B
F["eBPF service map"] --> B
R["BGP routing"] --> B
D["device telemetry"] --> B
B["Observe{Path,ServiceEdge,Routing,Device}"] --> G["temporal graph<br/>nodes + edges with [first,last] seen"]
G --> Q["query API: Snapshot/Latest · Neighbors · Traverse"]
Q --> AI["AI semantic-query / RCA layer<br/>tenant-then-RBAC"]
Q --> UI["Topology view + what-if"]
Query API (the contract everything else consumes)
The Store interface (internal/topology/store.go) is tenant-scoped: every
method takes a tenant and never returns another tenant's graph — tenant
isolation is probectl's outermost boundary (see
security/tenant-isolation.md).
SnapshotAt(tenant, t)/Latest(tenant)— the graph, or its state att.Neighbors(tenant, nodeID, t)— a node's adjacency att.Traverse(tenant, from, to, t)— the shortest directed path between two nodes (the traversal RCA walks).Observe{Path,ServiceEdge,Routing,Device}(tenant, …, at)— fold one plane's telemetry into the graph.
MemoryStore is the simple in-memory implementation. The From{Path, ServiceEdge,BGPEvent} adapters (internal/topology/adapters.go) map the real
signal types into the builder inputs, so a bus consumer can feed the graph
straight from live telemetry. The indexed engine below and any external
graph-database adapter implement this same interface — callers never change.
All planes feed the live graph
TopologyConsumer (internal/control/topologyapi.go) folds the streams the
control plane already receives into the graph: eBPF service edges
(probectl.ebpf.flows), BGP routing events (probectl.bgp.events), and device
telemetry (probectl.device.metrics). Path discoveries fold in at save time.
Every batch's claimed tenant is verified against the agent registry; unscoped or
unverifiable records are dropped (guardrail 1).
Device → hop linkage depends on the telemetry exposing interface IPs. When it
does (ObserveDevice with InterfaceIPs), the device node links to the hops it
carries. When it does not — which is the case for the SNMP/gNMI device telemetry
today — the device node still exists, but without links, and that gap is
reported as a coverage note (below), never silently treated as "complete."
What-if / impact simulation
Simulate answers: "if node or link X fails, what breaks?" It runs on a
snapshot of the versioned graph (a zero time = the live graph), removes the
failed element, and recomputes reachability. Fail any node (hop:…, service:…,
as:…, prefix:…, device:…, agent:…) or any edge (from|kind|to) and you
get:
- broken agent→target paths — routes with no surviving alternative;
- rerouted paths — with the surviving route returned alongside the original;
- impacted services — the transitive callers of a failed service/host
(reverse reachability over
flowedges); - impacted prefixes — prefixes a failed AS originated (a failed prefix is its own impact);
- disconnected nodes — reachable from some agent before, from none after;
- SLO impact — via the
SLOSourceseam (when the SLO engine is wired; absent = an explicit coverage note, never a silent empty).
Two honesty rules matter. First, an unknown target is an error, never an
empty "no impact" — a typo in a simulation must not look like a clean result.
Second, simulation accuracy depends on graph completeness, so every result
carries a coverage block — per-plane edge counts plus notes for missing
planes ("no flow-plane (eBPF) edges — service impact may be incomplete"). The
simulation is strictly read-only: it runs on a copy and never mutates the
graph. Acting on a prediction is a separate, human-gated capability (see
remediation.md) — probectl predicts, a human decides.
Visualization
ToViz(snapshot) projects a snapshot to a layout-agnostic Viz shape — just
nodes (id, kind, label) and edges (from, to, kind, label). The UI
computes positions client-side, so the server stays layout-agnostic.
Engine: in-memory vs indexed
IndexedStore implements the same Store contract as MemoryStore, but backs
it with forward/reverse adjacency indexes, so Neighbors and Traverse are
proportional to a node's degree instead of the whole edge set — the behaviour
large graphs need. The engine is selected by PROBECTL_TOPOLOGY_ENGINE
(indexed, the default | memory); the switch is transparent behind the query
API. A scale test exercises both correctness and interactivity at roughly 30k
nodes. An external graph-database adapter can implement the same interface when
a deployment outgrows a single process.
API + surface
GET /v1/topology[?at=RFC3339]— the caller's tenant graph (live, or as it was at?at=), in the layout-agnostic node/edge shape plus the coverage block. Permission:topology.read. (When topology isn't wired on a deployment, the endpoint returnstopology_running: falsewith empty node/edge lists.)POST /v1/topology/whatif {target, at?}— the simulated impact for failing one node or edge. Permission:topology.read. An unknown target is a404.- The Topology page renders the layered graph (columns by kind, capped for legibility on dense graphs with an honest "showing N of M"), node drill-down, time travel, and the what-if overlay (failed element dashed, impacted elements highlighted, broken/rerouted lists with their routes).
Out of scope (by design)
Acting on what-if predictions (that's a separate, human-gated remediation capability); dependency mapping beyond the signals the planes actually emit (probectl links what it observes, and reports the gaps where it can't). The AI/RBAC-aware query layer sits on top of this graph, enforcing tenant first, then RBAC — the topology store itself is the tenant-scoped foundation.