Introduction
EdgeVox is an offline voice agent framework for robots — agents, skills, workflows, and a streaming voice pipeline (sub-second target on consumer GPUs), all running locally on CPU / CUDA / Metal with no cloud dependencies.

What is EdgeVox?
EdgeVox combines two things:
- An agent framework —
@tooland@skilldecorators,LLMAgentwith handoffs, behavior-tree workflows (Sequence,Fallback,Loop,Parallel,Router,Supervisor,Orchestrator,Retry,Timeout), cancellable skills withGoalHandle, and aSafetyMonitorthat preempts before the LLM is consulted. - A streaming voice pipeline — Mic → VAD → STT → LLM → TTS → Speaker. Designed for sub-second first-audio on consumer GPUs (target — measured perf is published in
benchmarks/once a run lands). The pipeline is the substrate that agents run on top of.
Agent code is sim-agnostic: the same Python works on ToyWorld (stdlib), IrSimEnvironment (2D navigation), MujocoArmEnvironment (3D pick-and-place), MujocoHumanoidEnvironment (Unitree G1 / H1), and ExternalROS2Environment (any Gazebo / Isaac / real robot over ROS2).
Key design principles
- Voice is the interface — sub-second streaming pipeline (design target on consumer GPUs), Jetson and CPU fallback paths
- Agents are the program model — write
@tooland@skillfunctions; compose with workflows; delegate across agents with handoffs - Robots are the target — cancellable skills, safety monitor, three simulation tiers, ROS2 bridge
- Everything is offline — no cloud APIs, no telemetry, no vendor lock-in
Simulation tiers
| Tier | Sim | Dependencies | Role | Status |
|---|---|---|---|---|
| 0 | ToyWorld | stdlib only | unit tests, trivial examples | shipped |
| 1 | IrSimEnvironment | pip install ir-sim | 2D visual demo (matplotlib, diff-drive, LiDAR) | shipped |
| 2a | MujocoArmEnvironment | pip install mujoco | 3D physics, Franka pick-and-place | shipped |
| 2b | MujocoHumanoidEnvironment | pip install mujoco | Unitree G1 / H1 from Menagerie, procedural gait + ONNX policy slot | shipped |
| 3 | ExternalROS2Environment | sourced ROS2 workspace | drive Gazebo / Isaac / real robots over standard topics | shipped |


Voice pipeline components
| Component | Default model | Purpose |
|---|---|---|
| VAD | Silero VAD v6 | Voice activity detection (32 ms chunks) |
| STT | Faster-Whisper | Speech-to-text (auto-sized by VRAM) |
| LLM | Gemma 4 E2B IT Q4_K_M | Chat via llama-cpp-python |
| TTS | Kokoro-82M | Text-to-speech (16 languages, 56 voices) |
Shipping a desktop app
EdgeVox is not just a library — RookApp is a reference PySide6 desktop application built on the same LLMAgent you use for robots. One Python process hosts the Qt UI, llama-cpp, and a Stockfish subprocess. No browser, no web server, no Node toolchain, no Tauri.

Next steps
Getting started
- Quick start — install and run in 5 minutes
- Architecture — deep dive into the streaming pipeline
- Component Design — per-module design with mermaid diagrams
Features
- Languages — all 16 supported languages and backends
- Voice pipeline — agent path vs legacy streaming path
- Agents & tools — full agent framework reference
- TUI commands — slash commands, voice switching, debugging
- ROS2 integration — topics, services, action servers
- Robotics examples — IR-SIM mobile robot, MuJoCo arm + humanoid, ROS2 bridge, stdlib scout
- RookApp (desktop) — shipping EdgeVox as a native PySide6 app
Harness architecture
- Agent loop — six fire-points, parallel dispatch, handoff short-circuit
- Hooks — authoring contract, built-ins, ordering rules
- Memory —
MemoryStore,SessionStore,NotesFile,Compactor - Multi-agent —
Blackboard,BackgroundAgent,AgentPool - Interrupt & barge-in — cancel-token plumbing
- Tool calling — parser chain + GBNF grammar roadmap
Reports
- SLM tool-calling benchmark — 18 GGUF presets measured on RTX 3090, parser chain + SLM harness results