NDR-lite detection engine
What this is
The threat-intel layer (see threat-intel.md) catches
known-bad things by name. This engine catches suspicious-looking behavior — patterns that look
like an attack even when no name is on a list yet. That is the "NDR" idea:
Network Detection and Response, the behavioral half of threat detection.
probectl's version is NDR-lite: it runs behavioral detectors over telemetry the platform already collects — DNS lookups (from synthetic DNS canaries and eBPF L7), flow records (device flows and eBPF flows), TLS posture, threat-intel, and the topology service map — and turns suspicious patterns into confidence-scored signals that flow into incidents, the triage surface, and the SIEM.
It lives in internal/threat/ — the engine in ndr.go, the rule model in
rules.go, the default ruleset in rules/ndr-default.yaml.
The one rule that shapes everything: signals, never blocks
Like threat-intel, this engine is never an IPS — detection is a signal,
never an enforcement action, one of probectl's
non-negotiables. It
does not block traffic, terminate connections, or act inline. The engine emits
incident.Signal values and nothing else — there is literally no enforcement
code path in the package. A behavioral detector firing on a public network will
sometimes be wrong, and a tool that auto-blocks on "looks weird" is a tool that
takes down production. So probectl flags; a human decides.
Status: on by default
The engine ships on (PROBECTL_NDR_ENABLED defaults to true; set it to
false to turn it off). Unlike threat-intel, it is safe to run by default
because it makes no outbound calls — it only processes telemetry probectl
already gathered locally (sovereignty-safe). Threat-intel enrichment, which
does reach out to feeds, stays a separate opt-in.
The detectors
Each detector watches for one shape of bad behavior. All thresholds are tunable per the rule model (no code change required).
| Kind | Behavior detected | Key tunables |
|---|---|---|
dns_dga |
many distinct high-entropy ("algorithmically generated") lookups from one source | min_names, entropy, ratio, window_s |
dns_exfil |
high unique-subdomain volume under one domain (payload smuggled in query names) | min_queries, qname_bytes, unique_ratio, window_s |
beaconing |
metronome-regular callbacks to one dst:port (a C2 heartbeat); confidence rises with regularity and with the destination's threat-intel reputation | min_samples, max_jitter, min_interval_s, max_interval_s |
egress_volume |
egress bytes far above the source's own moving-average baseline | min_samples, spike_factor, min_bytes |
egress_intel |
egress to Tor exits / IOC-listed hosts / configured bad ASNs | min_confidence, lists.bad_asns |
lateral |
east-west fan-out on service ports; topology-known service relationships are excluded | fanout, window_s, lists.ports |
A few of these reward a closer look at the mechanism:
beaconingmeasures jitter — the coefficient of variation (stddev ÷ mean) of the gaps between callbacks (intervalStatsinndr.go). Malware phoning home tends to be suspiciously regular; a human browsing is not. Low jitter plus a known-bad destination is a strong signal.dns_dgascores the Shannon entropy of the first DNS label — random strings likex7f3q9zk.examplecarry more bits-per-character than real words, which is the fingerprint of domain-generation-algorithm malware.egress_volumecompares a flow against the source's own EWMA (exponentially-weighted moving average) baseline, and crucially updates that baseline after judging, so a spike can never quietly raise its own bar.lateralexcludes destinations that the topology service map already knows are normal neighbors, so expected service-to-service traffic does not read as an attacker spreading sideways.
False-positive management is the product
The hard part of NDR is not detecting things — it is detecting them without drowning the analyst. Every layer here is tunable without touching code:
- Cold-start guards — a detector will not judge until it has a minimum
number of samples/names; an empty baseline never fires. (In
observeDGA, for instance, a window with fewer thanmin_namesdistinct names is skipped outright.) - Confidence scoring — each detection carries
detector.confidence(the rule's base, plus evidence strength, plus a threat-intel boost) and the evidence itself (beacon.jitter,dns.generated_ratio,egress.ratio, …) so the analyst sees why it fired. - Suppression — a per-
(rule, tenant, entity)re-fire window (suppress): once an entity trips a rule, it stays quiet until the window passes, so a persisting behavior re-raises occasionally, not on every single observation. - Detection-as-code — rules are Sigma-style versioned YAML. Override any
rule by ID (including
enabled: falseto switch one off) or add new ones.
Detection-as-code
The embedded default ruleset lives at internal/threat/rules/ndr-default.yaml
and ships with the binary (the engine works out of the box). Operators overlay
it by pointing PROBECTL_NDR_RULES_DIR at a directory of *.yaml/*.yml
files. A rule in your directory with the same id as a default
replaces that default; a new id adds a detector:
# /etc/probectl/ndr/tuning.yaml
rules:
- id: ndr-beaconing-default # same id → REPLACES the default
version: 2 # bump on every change
kind: beaconing
name: Periodic beaconing (tuned)
severity: warning
base_confidence: 45
suppress: 4h
thresholds: { min_samples: 12, max_jitter: 0.08, min_interval_s: 10, max_interval_s: 3600 }
- id: ndr-egress-intel-default
version: 2
kind: egress_intel
name: Egress to hostile infrastructure
severity: critical
base_confidence: 60
suppress: 30m
lists: { bad_asns: ["64496"] }
A malformed rules directory fails startup (fail closed — tuning the
operator believes is live must actually be live, not silently dropped). Rules
are validated in validateRules: non-empty unique IDs, a known kind,
version >= 1, severity in {info, warning, critical}, and
base_confidence in 0..100. Two rules sharing one ID in a single file is
treated as an operator mistake and rejected, not guessed at.
Pipeline
%%{init: {'theme':'base','themeVariables':{'background':'#0d1117','primaryColor':'#161b22','primaryTextColor':'#e6edf3','primaryBorderColor':'#3b82f6','lineColor':'#8b949e','secondaryColor':'#21262d','tertiaryColor':'#0d1117','clusterBkg':'#161b22','clusterBorder':'#30363d','fontFamily':'ui-monospace, SFMono-Regular, Menlo, monospace'},'flowchart':{'curve':'basis','nodeSpacing':55,'rankSpacing':55,'padding':12}}}%%
flowchart LR
DNS[DNS canaries] --> C[NDRConsumer]
L7[eBPF L7 DNS] --> C
EB[eBPF flows] --> C
FL[device flows] --> C
C --> E[threat.Engine<br/>tenant-scoped state · rules]
TI[IOC store] -.reputation.-> E
TOPO[service map] -.expected edges.-> E
E -->|"incident.Signal (ndr.*)"| INC[incidents]
E --> TRI[triage /security]
E --> SIEM[SIEM export]
Every observation and every piece of detector state is partitioned by
tenant (tenantState in ndr.go); a record without a tenant_id is dropped
at the boundary, so one tenant's entities can never influence another's
detections (tenant isolation is the outermost boundary — see
security/tenant-isolation.md). State is
bounded — per-tenant entity maps
cap out and evict the stalest entry, and windows are fixed-size rings — so a
cardinality flood cannot exhaust memory. The whole engine is rebuildable from
the stream; nothing here is the durable record.
The threat-intel store and the topology map are optional context: a nil
intel source or topology source simply disables the respective boost/exclusion
(graceful degradation), it never errors.
Configuration
| Variable | Default | Purpose |
|---|---|---|
PROBECTL_NDR_ENABLED |
true |
the engine + consumers (local-only processing) |
PROBECTL_NDR_RULES_DIR |
(none) | detection-as-code overlay directory (*.yaml/*.yml) |
Detections surface as ndr.<kind> threat-plane signals in three places: the
Security triage view (/security) with rule/confidence/evidence
provenance, the incident timeline, and the SIEM export. They share the
detection store with threat-intel matches, because both are recognized by one
signal recognizer (see threat-intel.md).