Interrupts & barge-in
Voice-agent UX lives or dies on how fast the pipeline shuts up when the user starts talking. This page documents how EdgeVox's InterruptController coordinates barge-in across TTS, the LLM backend, and the agent loop — and the hard latency budget it enforces.
Latency budget
| Stage | Target | Where it's enforced |
|---|---|---|
| VAD → InterruptController.trigger() | <20 ms | audio worker, pure-Python RMS / ONNX VAD |
| TTS flush | <100 ms | TTS worker observes interrupted.is_set() |
| LLM generation stops | ≤40 ms after trigger | cancel_token piped into llama-cpp stopping_criteria |
| Skill cancel (opt-in) | <200 ms | poll loop inside _dispatch_skill |
The LLM number is the one that matters — without the cancel_token threaded into stopping_criteria, a barge-in during a long reply would leave the LLM grinding through max_tokens for seconds.
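To make the mechanism concrete, here is a minimal sketch of an event-backed stopping criterion. It assumes the llama-cpp-python convention that a stopping criterion is a callable taking `(input_ids, logits)` and returning `True` to stop. The `make_stop_criterion` helper and the `fake_generate` loop are hypothetical stand-ins for illustration, not EdgeVox code.

```python
import threading
from typing import Callable, List, Sequence

# Assumed shape of a llama-cpp-python stopping criterion:
# callable(input_ids, logits) -> bool, evaluated once per decode step.
StopCriterion = Callable[[Sequence[int], Sequence[float]], bool]


def make_stop_criterion(cancel_token: threading.Event) -> StopCriterion:
    """Return a criterion that stops generation once cancel_token is set."""
    def _criterion(input_ids: Sequence[int], logits: Sequence[float]) -> bool:
        return cancel_token.is_set()
    return _criterion


def fake_generate(max_tokens: int, criterion: StopCriterion, on_token=None) -> int:
    """Hypothetical stand-in for a decode loop; returns tokens produced."""
    produced: List[int] = []
    for step in range(max_tokens):
        if criterion(produced, []):  # checked before each decode step
            break
        produced.append(step)
        if on_token is not None:
            on_token(step)
    return len(produced)


cancel_token = threading.Event()
criterion = make_stop_criterion(cancel_token)

# Simulate a barge-in arriving right after the 5th token (step index 4).
count = fake_generate(
    max_tokens=512,
    criterion=criterion,
    on_token=lambda step: cancel_token.set() if step == 4 else None,
)
print(count)  # 5 tokens instead of 512
```

Because the flag is checked once per decode step, cancellation latency is bounded by the time of a single step. With the real library, such a callable is typically wrapped in a `StoppingCriteriaList` and passed as `stopping_criteria`; check the signature of your installed version.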
Two signals
InterruptController exposes two threading.Event channels:
- interrupted: general "stop what you're doing". TTS, the agent loop between hops, and skill dispatch poll or wait on this.
- cancel_token: dedicated channel fed into llama_cpp.Llama's stopping_criteria via LLM.complete(stop_event=…). Only set when InterruptPolicy.cancel_llm=True. Gives us enforceable mid-generation cancellation (one decode step of latency).
reset() clears both events and drops latest so a stale interrupt can't leak into the next turn. history is retained but ring-buffered to 500 entries so memory stays bounded in long voice sessions.
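The behaviour described above can be sketched roughly as follows. This is not the EdgeVox implementation: `SketchInterruptController` and `InterruptEvent` are hypothetical names, and the only details taken from the text are the two `threading.Event` channels, the `latest` slot, and the 500-entry history cap.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InterruptEvent:
    reason: str
    timestamp: float = field(default_factory=time.monotonic)


class SketchInterruptController:
    """Illustrative only: two Event channels plus a bounded history."""

    def __init__(self, cancel_llm: bool = True, history_size: int = 500) -> None:
        self.interrupted = threading.Event()   # general "stop" signal
        self.cancel_token = threading.Event()  # LLM-only cancellation channel
        self.cancel_llm = cancel_llm
        self.latest: Optional[InterruptEvent] = None
        # deque(maxlen=...) drops the oldest entry once full (ring buffer).
        self.history = deque(maxlen=history_size)
        self._lock = threading.Lock()

    def trigger(self, reason: str) -> InterruptEvent:
        event = InterruptEvent(reason)
        with self._lock:
            self.latest = event
            self.history.append(event)  # repeat triggers are still recorded
            self.interrupted.set()      # setting an already-set Event is a no-op
            if self.cancel_llm:
                self.cancel_token.set()
        return event

    def reset(self) -> None:
        with self._lock:
            self.interrupted.clear()
            self.cancel_token.clear()
            self.latest = None  # history is intentionally kept


ic = SketchInterruptController()
for i in range(600):
    ic.trigger(reason=f"barge_in_{i}")
ic.reset()
print(len(ic.history), ic.interrupted.is_set())
```

Using `deque(maxlen=...)` keeps the cap O(1) per append, which matters because `trigger()` sits on the latency-critical path.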
Wiring
InterruptPolicy
Tunable thresholds. Defaults reflect typical robot voice UX:
from dataclasses import dataclass

@dataclass
class InterruptPolicy:
    min_duration_ms: int = 250  # sustained speech energy before trigger
    energy_threshold: float = 0.012  # normalized float32 RMS (-38 dBFS)
    cancel_llm: bool = True  # set cancel_token on trigger
    cancel_skills: bool = False  # preserve mid-grasp skills through brief "um"s
    cut_tts_immediately: bool = True  # drop in-flight TTS sentence
    # Echo-aware (used by EnergyBargeInWatcher when no AEC is in front):
    echo_suppression_ratio: float = 2.0  # mic must be N x louder than ref
    echo_floor_window_ms: int = 200  # prefix window for floor calibration
    tts_release_ms: int = 200  # refractory after TTS stops

cancel_skills=False is deliberate: interrupting a Panda mid-grasp because the user said "uh" is worse than letting the grasp finish. Opt in only when the skill surface is short (<200 ms).
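The energy_threshold comment in the policy quotes the value in dBFS. When tuning, it can help to convert between the two scales. The helpers below are a small illustration (the function names are hypothetical) of the standard relation dBFS = 20 · log10(rms) for a normalized signal where 1.0 is full scale.

```python
import math


def rms_to_dbfs(rms: float) -> float:
    """Convert a normalized RMS amplitude (1.0 = full scale) to dBFS."""
    if rms <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms)


def dbfs_to_rms(dbfs: float) -> float:
    """Inverse mapping, handy when you think in dB but configure in RMS."""
    return 10.0 ** (dbfs / 20.0)


print(round(rms_to_dbfs(0.012), 1))  # -38.4, the default energy_threshold
```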
Producer side
A VAD or GUI-button worker calls trigger():
ic = InterruptController()
# ... attach to the agent context:
ctx = AgentContext(interrupt=ic)

# mic worker
for frame in mic_stream():
    if vad.is_speech(frame) and tts.is_playing():
        # rms is assumed to be the RMS energy of the current frame
        ic.trigger(reason="user_barge_in", rms=rms)

trigger() is idempotent: repeat calls while already interrupted still append to history but reuse the event flag. Subscribers (log workers, analytics) are notified synchronously, so keep them fast.
Consumer side
The TTS worker waits on the event and flushes:
while not ic.interrupted.is_set() and pending:
    play(pending.pop(0))
if ic.interrupted.is_set():
    stop_stream()  # drop buffered audio

The agent loop (LLMAgent._drive) calls ctx.should_stop() between hops, which checks both ctx.stop and ctx.interrupt.should_stop(), and threads ctx.interrupt.cancel_token into every llm.complete call:
cancel_token = None
if ctx.interrupt is not None and ctx.interrupt.policy.cancel_llm:
    cancel_token = ctx.interrupt.cancel_token
result = llm.complete(messages, tools=..., stop_event=cancel_token)

Skill dispatch polls ctx.should_stop() every 50 ms and calls handle.cancel() on hit.
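As an illustration of the skill-dispatch polling just described, the sketch below waits on a running skill and cancels it when a stop predicate fires. `FakeSkillHandle` and `wait_for_skill` are hypothetical; only the 50 ms poll interval and the `cancel()` call come from the text.

```python
import threading
import time
from typing import Callable


class FakeSkillHandle:
    """Hypothetical handle: a skill that would run for `duration` seconds."""

    def __init__(self, duration: float) -> None:
        self._deadline = time.monotonic() + duration
        self.cancelled = False

    def done(self) -> bool:
        return self.cancelled or time.monotonic() >= self._deadline

    def cancel(self) -> None:
        self.cancelled = True


def wait_for_skill(handle: FakeSkillHandle,
                   should_stop: Callable[[], bool],
                   poll_s: float = 0.05) -> bool:
    """Poll until the skill finishes; cancel it if should_stop() fires.

    Returns True if the skill completed, False if it was cancelled.
    """
    while not handle.done():
        if should_stop():
            handle.cancel()
            return False
        time.sleep(poll_s)
    return True


interrupted = threading.Event()
handle = FakeSkillHandle(duration=5.0)
# Simulate a barge-in 120 ms into the skill.
threading.Timer(0.12, interrupted.set).start()
started = time.monotonic()
completed = wait_for_skill(handle, interrupted.is_set)
elapsed = time.monotonic() - started
print(completed, handle.cancelled)
```

With a 50 ms poll, the worst-case detection delay is roughly one poll interval plus scheduler jitter, which leaves headroom inside the 200 ms skill-cancel budget. Replacing `time.sleep` with `Event.wait(timeout)` would be one way to react sooner.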
Defaults at a glance
EdgeVox enables echo cancellation by default so barge-in works out of the box on typical USB-mic + laptop-speaker setups. The chain:
- AEC = specsub (frequency-domain spectral subtraction, pure numpy, no extra deps). Set by both edgevox-cli --aec ... and the TUI. Pass --aec none to opt out.
- Energy-ratio gate in AudioRecorder._process_loop. Even after AEC, the mic must clearly dominate the speaker reference (mic_rms ≥ 3 x player.last_output_rms) for VAD to be trusted. This is the defense against "AEC residual fools VAD", the most common failure mode without it.
- VAD on cleaned audio (Silero, run on the AEC-cleaned chunk).
- Sustained-speech window (INTERRUPT_SPEECH_FRAMES = 8, ~256 ms) to suppress one-off noise (door slam, cough).
- Echo cooldown (ECHO_COOLDOWN_SECS = 1.5) after TTS stops, so the mic isn't trusted while reverb / AEC tail dies down.
When the speaker is effectively silent (player.last_output_rms < 0.005) the energy-ratio gate is bypassed so quiet user speech still triggers — sensitivity isn't traded against the anti-self-trigger work.
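The gate logic described above can be sketched as follows. The constants are the values quoted in the text; `mic_dominates` and `SustainedSpeechGate` are hypothetical names, and resetting the counter on any failing frame is an assumption about how the sustained-speech window behaves.

```python
from dataclasses import dataclass

# Values quoted from the text; the names below are illustrative.
INTERRUPT_RMS_RATIO = 3.0
SILENT_OUTPUT_RMS = 0.005
INTERRUPT_SPEECH_FRAMES = 8


def mic_dominates(mic_rms: float, output_rms: float) -> bool:
    """Energy-ratio gate, bypassed when the speaker is effectively silent."""
    if output_rms < SILENT_OUTPUT_RMS:
        return True
    return mic_rms >= INTERRUPT_RMS_RATIO * output_rms


@dataclass
class SustainedSpeechGate:
    needed: int = INTERRUPT_SPEECH_FRAMES
    run: int = 0

    def update(self, mic_rms: float, output_rms: float, vad_says_speech: bool) -> bool:
        """Feed one frame; return True once `needed` consecutive frames pass."""
        if vad_says_speech and mic_dominates(mic_rms, output_rms):
            self.run += 1
        else:
            self.run = 0  # assumption: any failing frame restarts the window
        return self.run >= self.needed


gate = SustainedSpeechGate()
# Seven loud frames, one echo-dominated frame, then eight loud frames.
frames = [(0.09, 0.02, True)] * 7 + [(0.03, 0.02, True)] + [(0.09, 0.02, True)] * 8
fired_at = next(i for i, f in enumerate(frames) if gate.update(*f))
print(fired_at)  # 15: the echo-dominated frame reset the run
```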
If you write your own pipeline and don't want the recorder, the standalone EnergyBargeInWatcher adds the same protections: pass tts_energy_provider=lambda: player.last_output_rms to give it the live reference signal.
Tuning checklist
If barge-in is still self-triggering with the defaults:
- Confirm AEC is actually active: edgevox-cli --aec specsub (or --aec dtln for a stronger but heavier model).
- Lower the speaker volume by 6–10 dB; mic input gain often clips on cheap hardware.
- Raise INTERRUPT_RMS_RATIO from 3.0 to 3.5–4.0 in edgevox/audio/_original.py (constants block at the top).
If real user speech is not triggering:
- Lower InterruptPolicy.energy_threshold (default 0.012); try 0.008 for very quiet rooms.
- Reduce min_duration_ms (default 250) to 200 if users speak in short bursts.
- Lower INTERRUPT_MIN_RMS (default 0.01) if your mic gain is unusually low.
Repeatable interrupts
Back-to-back barge-ins must re-arm cleanly without depending on the consumer (TUI / VoiceBot) calling force_resume. The recorder owns the post-interrupt re-arm itself:
- resume_after_interrupt(delay=0.15): fired automatically by _process_loop the moment _on_interrupt returns. After 150 ms (long enough for PortAudio's output ring + room reverb to die down) it sets _suppressed = False and _interrupt_detect = False, freeing the recorder to flush the captured user speech into the next STT pass.
- Critical difference vs force_resume: resume_after_interrupt does not drain the audio queue. After a barge-in the user is typically still talking; draining would lose those samples and force them to re-speak. force_resume (used after a normal turn finishes) still drains because the queue holds nothing but echo at that point.
- Generation-counter invalidation: every state-clear path (play()'s 1.5 s resume_after_cooldown, force_resume, resume_after_interrupt) bumps _suppress_gen and checks it before applying. Whichever fires first wins; the others no-op cleanly.
This is what stopped the "interrupt only works once" failure mode where the recorder got stuck in _suppressed=True between Turn 1's interrupt and Turn 2's expected barge-in.
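One plausible reading of the generation-counter scheme is sketched below. It is an assumption about the mechanics, not the recorder's actual code: each path captures the current generation when it is armed, and on firing it applies only if that generation is still current, bumping the counter so every other armed path becomes stale. `SketchRecorderState` and `arm_resume` are hypothetical names.

```python
import threading
from typing import Callable


class SketchRecorderState:
    """Illustrative only: suppression flags guarded by a generation counter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._suppress_gen = 0
        self._suppressed = True
        self._interrupt_detect = True

    def arm_resume(self) -> Callable[[], bool]:
        """Capture the current generation; return a callback for a timer."""
        with self._lock:
            armed_gen = self._suppress_gen

        def _fire() -> bool:
            with self._lock:
                if armed_gen != self._suppress_gen:
                    return False         # another path already won: no-op
                self._suppress_gen += 1  # invalidate every other armed path
                self._suppressed = False
                self._interrupt_detect = False
                return True

        return _fire


state = SketchRecorderState()
cooldown = state.arm_resume()         # e.g. the 1.5 s post-playback cooldown
after_interrupt = state.arm_resume()  # e.g. the 150 ms post-interrupt resume

first = after_interrupt()  # the shorter delay fires first and applies
second = cooldown()        # stale generation, so this is a no-op
print(first, second)
```

In a real recorder the callbacks would be handed to something like `threading.Timer(0.15, after_interrupt)`; they are called directly here so the ordering is deterministic.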
VAD backends
Four watchers ship today, all implementing the BargeInVADWatcher Protocol:
| Backend | Class | Install | Latency | Notes |
|---|---|---|---|---|
| Energy | EnergyBargeInWatcher | built-in | <1 ms | pure RMS threshold; no deps. 5-15% false triggers in noisy rooms |
| WebRTC | WebRTCVADWatcher | edgevox[voice-vad] (BSD) | <1 ms | Google's GMM baseline; much better than RMS |
| Silero v6 | SileroVADWatcher | no extra install | ~1 ms | reuses the ONNX already bundled with faster-whisper; same model the production AudioRecorder uses, so no new download |
| TEN | TENVADWatcher | onnxruntime (core dep) | <1 ms | 306 KB Apache-2 model from Tencent; fetched from nrl-ai/edgevox-models with fallback to TEN-framework/ten-vad upstream |
Pick one directly, or use the factory:
import threading

from edgevox.agents import create_vad_watcher

watcher = create_vad_watcher(
    "silero",  # or "energy" / "webrtc" / "ten"
    controller,
    is_tts_playing=player.is_playing,
)
threading.Thread(target=watcher.run, args=(mic_stream,), daemon=True).start()

All four share the echo-defence scaffolding (sustained-speech window, default 120 ms; TTS-release refractory, default 180 ms), so switching backends doesn't change barge-in behaviour, only the speech/non-speech classifier. BargeInVADWatcher is a runtime_checkable Protocol; your own watcher just needs a run(frames) method and a stop() flag.
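To show what "just needs a run(frames) method and a stop()" can look like, here is a toy watcher checked against a locally defined Protocol. `WatcherProtocol` is a hypothetical stand-in for BargeInVADWatcher (whose exact signatures may differ), and treating stop() as a method that sets a flag is an assumption.

```python
import threading
from typing import Callable, Iterable, List, Protocol, runtime_checkable


@runtime_checkable
class WatcherProtocol(Protocol):
    """Local stand-in for BargeInVADWatcher; the real signatures may differ."""

    def run(self, frames: Iterable[float]) -> None: ...
    def stop(self) -> None: ...


class PeakWatcher:
    """Toy watcher: fires when a frame level exceeds a fixed threshold."""

    def __init__(self, on_trigger: Callable[[float], None], threshold: float = 0.5) -> None:
        self._on_trigger = on_trigger
        self._threshold = threshold
        self._stopped = threading.Event()

    def run(self, frames: Iterable[float]) -> None:
        for level in frames:
            if self._stopped.is_set():
                break
            if level > self._threshold:
                self._on_trigger(level)

    def stop(self) -> None:
        self._stopped.set()


hits: List[float] = []
watcher = PeakWatcher(on_trigger=hits.append)
watcher.run([0.1, 0.7, 0.2, 0.9])
print(isinstance(watcher, WatcherProtocol), hits)
```

Note that `isinstance` against a `runtime_checkable` Protocol only verifies that the methods exist, not that their signatures match.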
Subscribing to events
InterruptController.subscribe(handler) lets ad-hoc observers log or react. Handlers run on the triggering thread (synchronous) — keep them tiny. Returns an unsubscribe callable:
unsub = ic.subscribe(lambda ev: metrics.inc("interrupts", reason=ev.reason))
...
unsub()

Handler exceptions are logged but don't propagate.
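The subscribe contract described here (synchronous delivery, an unsubscribe callable, exceptions logged rather than raised) could be implemented along these lines. `SketchSubscribers` is a hypothetical illustration, not the controller's code, and handlers receive a plain string here instead of an event object.

```python
import logging
import threading
from typing import Callable, List

log = logging.getLogger("interrupt_sketch")
Handler = Callable[[str], None]


class SketchSubscribers:
    """Illustrative only: synchronous fan-out with exception isolation."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []
        self._lock = threading.Lock()

    def subscribe(self, handler: Handler) -> Callable[[], None]:
        with self._lock:
            self._handlers.append(handler)

        def _unsubscribe() -> None:
            with self._lock:
                if handler in self._handlers:  # safe to call twice
                    self._handlers.remove(handler)

        return _unsubscribe

    def notify(self, reason: str) -> None:
        with self._lock:
            handlers = list(self._handlers)  # snapshot; lock not held in callbacks
        for handler in handlers:
            try:
                handler(reason)
            except Exception:
                log.exception("interrupt subscriber failed")


def broken_handler(reason: str) -> None:
    raise RuntimeError("boom")


seen: List[str] = []
subs = SketchSubscribers()
subs.subscribe(broken_handler)
unsub = subs.subscribe(seen.append)
subs.notify("user_barge_in")  # the broken handler is logged; seen is still updated
unsub()
subs.notify("gui_button")     # seen no longer receives events
print(seen)
```

Taking a snapshot of the handler list before calling out means a handler can unsubscribe itself without deadlocking on the lock.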
See also
- agent-loop: where the loop checks ctx.should_stop().
- pipeline: how the audio stack plugs in.