Component Design
A top-down map of every package inside edgevox/. One section per module: what it's for, how the pieces fit, and where you plug in your own implementation.
If you want the dataflow-level view instead, see Architecture and Voice Pipeline. This page is the structural view — classes, protocols, and swap points.
Package map at a glance
Everything above a horizontal line calls into the line below it. Components at the same level don't depend on each other — that's what keeps them swappable.
edgevox/audio/ — capture, playback, VAD, AEC
The job. Turn a microphone and a speaker into two tidy streams of int16 samples at 16 kHz, with voice-activity detection on the way in and echo cancellation on the way out. Everything else in the pipeline assumes this layer just works.
Key pieces.
AudioRecorder— listens on a mic, runs Silero VAD on 32 ms frames, emits a full utterance once speech ends. Also exposeson_levelfor UI meters andon_interruptso the player can duck while the user is talking.InterruptiblePlayer/play_audio— opens one persistent PortAudio output stream and a numpy ring buffer. A callback drains the buffer on every audio tick and pads silence on underrun.interrupt()flushes the buffer; it never touches the stream itself — opening/closing streams is slow enough to starve the event loop.create_aec(backend)— strategy-pattern factory for echo cancellation. Implementations inaec.py:none,nlms,specsub,dtln. The player pushes the played signal into a lock-free_RefBufferthat the recorder consumes as the AEC reference.WakeWordDetector— optional openWakeWord-style ONNX model that can gate the pipeline until a trigger phrase.
Swap point. Subclass AECBackend and register a new choice, or replace AudioRecorder wholesale — the rest of the pipeline only sees numpy arrays.
edgevox/stt/ — speech to text
The job. Turn a chunk of 16 kHz audio into a string, with the right backend for the language.
Key pieces.
BaseSTT— three-line contract:transcribe(audio, language) -> str, plus adisplay_namefor the TUI's model panel.create_stt(language, model_size=None, device=None)— consults the language config, picks a backend, and falls back to Whisper if the preferred backend fails to load. That fallback is why a user with no ONNX runtime still gets Vietnamese (slower, but working).WhisperSTT— auto-sizes the model based on VRAM and RAM:large-v3-turboon an 8 GB GPU,smallon a laptop CPU.SherpaSTT,ChunkFormerSTT— specialist Vietnamese backends, 30 M / 110 M params,int8on CPU.
Swap point. Write a class with transcribe() and either inject it directly into the agent context or add a branch to create_stt() plus a stt_backend value in core/config.py.
edgevox/tts/ — text to speech
The job. Turn a string into a numpy waveform the player can emit. Streaming-aware: long replies yield one chunk at a time so playback can start during LLM decode.
Key pieces.
BaseTTS—synthesize(text) -> np.ndarray(required) andsynthesize_stream(text)(defaults to a single-chunk yield).create_tts()— same language-driven factory pattern as STT. Thebackend=override lets you pick Piper for English or Kokoro for something Piper doesn't ship.- Each backend owns its own
sample_rate; the player resamples to the device rate viasounddevice. - Models pull from
nrl-ai/edgevox-modelson Hugging Face with upstream fallbacks, so first run downloads once and caches.
Swap point. Subclass BaseTTS and register in the factory. Streaming-aware backends should implement synthesize_stream() to avoid the "wait for full sentence" tax.
edgevox/core/ — frames, pipeline, language config
The job. The wiring layer. Defines the typed frames that flow through the pipeline, a dead-simple Processor base class, and the per-language configuration every backend factory reads.
Key pieces.
Frameand its subclasses inframes.py—AudioFrame,TranscriptionFrame,TextFrame,SentenceFrame,TTSAudioFrame,InterruptFrame,StopFrame,EndFrame,MetricsFrame. Everything in the pipeline speaks these.Processor— subclass it, overrideprocess(frame)as a generator. Patterns supported: 1:1, 1:N, N:1 (buffering), passthrough.Pipeline([...])— chains processors into one generator stream.interrupt()sets anInterruptToken, callson_interrupt()on every processor, and lets the next yield produce anInterruptFrame.StreamingPipeline+stream_sentences()inpipeline.py— the higher-level helper that splits an LLM token stream on sentence boundaries (.,!,?) so TTS can start early.config.py—LANGUAGES/LanguageConfig/get_lang(). Single source of truth for:stt_backend,tts_backend,default_voice,kokoro_langcode, test phrases. Add a new language by adding a row here; no factory changes needed.gpu.py— cheap VRAM / RAM detection used by STT/LLM autoconfig.
Swap point. Add a new Frame subtype, a new Processor, or a new row in LANGUAGES. core/ has no hard-coded references to the specific backends — it just moves typed dataclasses.
edgevox/llm/ — inference, tools, tool parsing
The job. Wrap llama-cpp-python in something the agent layer can drive, and recover structured tool calls from whatever format the current model happens to emit.
Key pieces.
LLM(llamacpp.py) — thread-safe wrapper. Core methods:chat_stream(messages, stop_event=…),count_tokens(). Thestop_eventthreadsctx.interrupt.cancel_tokenstraight into llama.cpp'sstopping_criteriaso barge-in halts generation within one decode step.Tool,ToolRegistry,@tool(tools.py) — decorator-based tool registry.load_entry_point_tools()lets third-party packages ship tools via Python entry points.ModelPreset,PRESETS,resolve_preset()(models.py) — every preset declares its chat template, stop tokens, andtool_call_parsers=(...).resolve_preset()validates every parser name against the detector registry at load time, so a typo fails loudly.tool_parsers/— a chain of detectors, one per model family. Critical detail:parse_tool_calls_from_contenttries detectors against raw content before stripping<think>blocks, because Qwen3 emits tool calls inside reasoning blocks.grammars.py— GBNF grammar builders for grammar-constrained decoding.hooks_slm.py—default_slm_hooks()bundle that hardens small models (output repair, JSON coaxing, token budgets)._agent_harness.py— internal harness used byLLMAgent. Prefer the public surface inedgevox.agents.
Swap point. Add a new model: register a ModelPreset and, if its tool-call format is novel, add a detector in tool_parsers/ and list its name in the preset.
edgevox/agents/ — the agent framework
The job. Everything above llm/ that turns "run a chat loop" into "run an agent with hooks, memory, tools, skills, workflows, interrupts, and handoffs." This is the largest package and the most-customized surface.
Key pieces.
Agent(Protocol),LLMAgent,Session,AgentContext,AgentResult,Handoff(base.py) — the polymorphic heart. Workflows are agents too;Sequence([a, b])is itself anAgent.- Hooks — the main extension surface. See Hooks for the full matrix. Hook-owned state lives under
ctx.hook_state[id(self)]. Skill/@skill/GoalHandle— cancellable async actions, distinct from@tool. Cancellation is real:ctx.stopthreads into the skill's cooperative check.- Workflows (
workflow.py) — Behaviour-Tree-flavoured combinators. Nothing clever; they just schedule child agents. - Memory (
memory.py) —MemoryStore(long-term facts),SessionStore(turn history),NotesFile(human-editable scratchpad),Compactor(LLM-driven summarization when the window fills). - Multi-agent (
multiagent.py) —Blackboardfor shared state,BackgroundAgent/AgentPoolfor parallel agents, inbox messaging. InterruptController(interrupt.py) — the barge-in coordinator. Plumbing is covered in Interrupt.ArtifactStore(artifacts.py) — file-like store for structured agent-to-agent handoffs.sim.py—SimEnvironmentprotocol andToyWorldstdlib reference env for tests.
Swap point. Add a hook, a workflow, a memory backend, a skill, or a whole new Agent implementation. Core rule: don't write magic keys into ctx.state from framework code — that field is user scratch only.
edgevox/server/ — FastAPI + WebSocket
The job. Expose the voice pipeline over a WebSocket so a browser (or any networked client) can hold a continuous conversation with the local AI.
Key pieces.
ServerCore(core.py) — holds process-wide singletons of the heavy models (one STT, one LLM, one TTS) plus a global inference lock. The lock serializes calls into llama.cpp so multiple WebSocket clients can coexist without VRAM contention.SessionState(session.py) — per-connection mirror ofAudioRecorder's VAD state machine, applied to wire-delivered audio. Also holds per-session chat history that is swapped in/out of the sharedLLMunder the lock.ws.py— WebSocket handler. Text frames carry control JSON (language switch,/say,/reset); binary frames carryint16mono PCM in and WAV chunks out.main.py— Uvicorn launcher. Withedgevox-serve --agent module:factory, it binds a user-suppliedLLMAgentfactory toServerCore.agentso every turn runs through the full harness; without it, the server uses the legacy streaming path (lower first-token latency, no hooks/tools).audio_utils.py— resampling helpers shared between audio-in and the pipeline's 16 kHz contract.
Swap point. Point --agent at your own factory to inject a custom LLMAgent with your hooks, tools, and memory. The transport layer doesn't care.
edgevox/cli/, edgevox/tui.py, edgevox/ui/ — entry points
The job. The three things a user actually runs: the Textual TUI (default), the headless CLI, and the launcher hooks in pyproject.toml.
Key pieces.
tui.py— Textual app. Layers a live waveform, model-info panel, sparkline latency history, and slash commands (/model,/voice,/lang,/reset) on top of the pipeline. See TUI Commands.cli/main.py— minimal voice bot and text bot for scripting / headless boxes.ui/— placeholder for reusable TUI widgets (currently empty; widgets live insidetui.pyfor now).- Console scripts in
pyproject.toml:edgevox→tui:main,edgevox-cli→cli.main:main,edgevox-setup→setup_models:main.
Swap point. Write your own entry point — the pipeline is just Python. Every entry point is a thin argument parser around create_stt() / create_tts() / LLMAgent.
edgevox/integrations/ — ROS2, chess, simulation
The job. Optional integrations that don't belong in the core (they'd bloat install), each self-contained.
Key pieces.
integrations/ros2_bridge.py,ros2_actions.py,ros2_robot.py,ros2_qos.py— mapsSkillgoals to ROS2 actions with consistent QoS settings. See ROS2 Integration.integrations/sim/—IrSimEnvironment(2D, matplotlib),MujocoArmEnvironment,MujocoHumanoidEnvironment,ExternalROS2Environment. Each implements theSimEnvironmentprotocol fromedgevox.agents.simso agent code is sim-agnostic.integrations/chess/— reference desktop application (see RookApp). Persona + engine plug-ins are themselvesAgentimplementations.
Swap point. Every integration is opt-in and lives behind its own dependency. Add a new one by implementing the relevant protocol (SimEnvironment, Skill, Agent) and shipping it as its own package if you like.
Where to go next
- Architecture — streaming pipeline and latency budget
- Voice Pipeline — agent path vs legacy streaming path
- Agent loop — the six fire points in
LLMAgent.run() - Hooks — extension surface reference
- Tool calling — parser chain and GBNF roadmap