Voice / RTP tests
What it is
The voice canary answers one question: if you placed a phone call across
this path right now, how good would it sound? It does that not by guessing from
ping times, but by sending real RTP packets — the same packet format and
timing a softphone uses — to a target that echoes them back, then scoring the
echoes the way a phone's own quality meter would.
From those echoes it computes three numbers operators recognize:
- MOS (Mean Opinion Score, 1.0–4.5) — the headline "call quality" score, derived from the ITU-T G.107 E-model.
- Jitter — how unevenly packets arrived (RFC 3550 interarrival estimator).
- Packet loss — how many of the RTP packets never came back.
It is a standard canary like ICMP or HTTP: you create a test, it runs on a
schedule, and the result flows through the normal pipeline (the
agent-to-first-result journey is getting-started.md).
The implementation lives in internal/canary/voice.go.
Creating a test
POST /v1/tests
{"name": "voip to pbx", "type": "voice", "target": "pbx.acme.example:5004",
"interval_seconds": 60, "timeout_seconds": 3,
"params": {"codec": "g711", "duration_seconds": "3", "dscp": "46"}}
| Param | Default | Meaning |
|---|---|---|
codec |
g711 |
codec preset: g711 (PCMU, 50 packets/s, 160 B payload) or g729 (50 packets/s, 20 B payload) |
duration_seconds |
3 |
simulated call length (1–10 s; the codec sends 50 packets/s) |
dscp |
46 |
DSCP marking (EF = 46 by convention for voice) |
The target must echo the UDP datagrams back — that can be a probectl agent acting as a responder, or any plain UDP echo service. A target that does not echo shows up as 100% loss and an explicit "voice path unmeasurable" failure. It never invents a score from a silent target.
How it works — sending a call and listening to the echo
The canary opens a UDP connection and, for the configured duration, streams RTP-framed packets at the codec's cadence:
- Frame and send. Each packet carries a real RTP header (version 2, the codec's payload type, a per-packet sequence number, and an 8 kHz timestamp) followed by a payload of the codec's size. For G.711 that's a 12-byte header + 160-byte payload every 20 ms — exactly 50 packets per second.
- Match the echoes. A reader goroutine receives the reflected packets and matches each one to the packet it sent, keyed by the stream's SSRC and the RTP sequence number. The send time and the receive time of every matched packet are recorded.
- Score from the round-trips. From those send/receive pairs the canary
derives per-packet RTT, loss, and jitter, then runs the E-model to turn delay
- loss + codec into an R-factor and finally a MOS.
The DSCP marking is applied to the outgoing socket so the packets traverse the network in the same QoS class real voice would — testing the path voice actually takes, not a best-effort one.
The model (and its honesty)
A MOS number looks authoritative, and the biggest risk with voice scoring is
that people over-trust it. So probectl uses a deliberately simplified,
transport-only E-model and says so on every result
(voice.model = itu-t-g107-e-model-simplified). Here's the actual math the code
runs, and what each piece assumes:
- The rating
R = 93.2 − Id − Ie,eff. The93.2is the default-parameter R0 (no advantage factor, no simultaneous-impairment terms — the conservative baseline). - Delay impairment
Id = 0.024·d + 0.11·(d − 177.3)·H(d − 177.3)— the G.107 delay curve, with its characteristic 177.3 ms "knee" past which delay hurts much more. The one-way delaydis estimated (you can't measure one-way delay from an echo) asRTT/2 + codec delay + a modeled jitter buffer— and the result discloses that estimate (voice.one_way_estimate). The jitter buffer is modeled as twice the measured jitter, floored at 40 ms and capped at 120 ms, matching what a real endpoint would add. - Loss impairment
Ie,eff = Ie + (95 − Ie)·Ppl/(Ppl + Bpl)— using the G.113 codec parameters (g711: Ie 0, Bpl 25.1;g729: Ie 11, Bpl 19) under a random-loss assumption. - MOS is mapped from R via the G.107 Annex-B cubic, clamped to a 1.0 floor and a 4.5 ceiling.
- Jitter is the RFC 3550 §6.4.1 interarrival estimator over the received
packets (
J += (|D| − J)/16), reported asvoice.jitter.ms.
Narrowband codecs only. That 93.2 baseline is the narrowband model.
Wideband / HD-voice codecs (G.722, Opus) need the wideband E-model variant, which
is a different calculation — so it's marked out of scope and documented rather
than approximated with the wrong formula.
Result fields (the contract)
Metrics: voice.mos, voice.r_factor, voice.jitter.ms, voice.one_way.ms,
voice.loss.pct, plus the standard latency family
(rtt.min/avg/max/stddev.ms, loss.ratio, packets.sent/received).
Attributes: voice.codec, voice.model, voice.method
(rtp-udp-echo-reflection), voice.jitter_buffer_ms, voice.one_way_estimate.
Results render in the result-detail view (reached from the Targets & Tests pages) with MOS up front (≥4.0 good / ≥3.6 fair / below = poor) and the model named alongside it — so a computed score is never presented as a measured listening test.
What it deliberately does not do
- No SIP / call control. There's no signaling — this measures RTP transport quality only. It does not register, dial, or tear down a call.
- No wideband codecs (see the narrowband note above).
- No reflector infrastructure to own. Agent-to-agent reflection reuses the existing a2a responder rather than standing up a separate echo fleet.