
Introduction

EdgeVox is an offline voice agent framework for robots — agents, skills, workflows, and a streaming voice pipeline (sub-second target on consumer GPUs), all running locally on CPU / CUDA / Metal with no cloud dependencies.

[Figure: EdgeVox TUI screenshot]

What is EdgeVox?

EdgeVox combines two things:

  1. An agent framework: @tool and @skill decorators, LLMAgent with handoffs, behavior-tree workflows (Sequence, Fallback, Loop, Parallel, Router, Supervisor, Orchestrator, Retry, Timeout), cancellable skills with GoalHandle, and a SafetyMonitor that preempts before the LLM is consulted.
  2. A streaming voice pipeline: Mic → VAD → STT → LLM → TTS → Speaker. It is designed for sub-second first-audio latency on consumer GPUs; this is a target, and measured performance will be published in benchmarks/ once a run lands. The pipeline is the substrate that agents run on.
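
The streaming idea behind the pipeline can be sketched with chained Python generators, where each stage hands results downstream as soon as it has them. This is a minimal illustration only: every function below is a hypothetical stand-in, not an EdgeVox API, and the "audio" is plain strings and bytes.

```python
from typing import Iterable, Iterator

# All stage functions are hypothetical stand-ins, not EdgeVox APIs.

def vad(frames: Iterable[str]) -> Iterator[str]:
    """Drop silent frames (modelled here as empty strings)."""
    for frame in frames:
        if frame:
            yield frame

def stt(frames: Iterable[str]) -> Iterator[str]:
    """Pretend transcription: each voiced frame already carries one word."""
    for frame in frames:
        yield frame.lower()

def llm(words: Iterable[str]) -> Iterator[str]:
    """Pretend LLM: read the whole utterance, then stream a reply token by token."""
    utterance = " ".join(words)
    for token in f"you said {utterance}".split():
        yield token

def tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Pretend synthesis: emit one audio chunk per token as soon as it arrives."""
    for token in tokens:
        yield token.encode("utf-8")

def run_pipeline(mic_frames: Iterable[str]) -> Iterator[bytes]:
    """Mic -> VAD -> STT -> LLM -> TTS; the caller plays the chunks."""
    return tts(llm(stt(vad(mic_frames))))

mic = ["Move", "", "FORWARD", ""]
chunks = list(run_pipeline(mic))
```

Because `tts` yields per token, the first audio chunk is available before the full reply has been generated, which is the property a first-audio latency target depends on.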

Agent code is sim-agnostic: the same Python works on ToyWorld (stdlib), IrSimEnvironment (2D navigation), MujocoArmEnvironment (3D pick-and-place), MujocoHumanoidEnvironment (Unitree G1 / H1), and ExternalROS2Environment (any Gazebo / Isaac / real robot over ROS2).
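
One common way to get this kind of sim-agnostic agent code is to write the agent against a small structural interface and let each backend implement it. The sketch below assumes such an interface; `Environment`, `ToyGrid`, `LoggingEnv`, and `patrol` are hypothetical names for illustration and are not taken from the EdgeVox codebase.

```python
from typing import Dict, List, Protocol

class Environment(Protocol):
    """Hypothetical minimal contract an agent needs from any backend."""
    def apply(self, action: str) -> Dict[str, int]: ...

class ToyGrid:
    """A stdlib-only backend: a robot on a 1D line."""
    def __init__(self) -> None:
        self.x = 0

    def apply(self, action: str) -> Dict[str, int]:
        if action == "forward":
            self.x += 1
        elif action == "back":
            self.x -= 1
        return {"x": self.x}

class LoggingEnv:
    """A second backend that only records what it was asked to do."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def apply(self, action: str) -> Dict[str, int]:
        self.log.append(action)
        return {"steps": len(self.log)}

def patrol(env: Environment) -> Dict[str, int]:
    """Agent-side code: it never checks which backend it is driving."""
    state: Dict[str, int] = {}
    for action in ("forward", "forward", "back"):
        state = env.apply(action)
    return state

toy_state = patrol(ToyGrid())
log_state = patrol(LoggingEnv())
```

The same `patrol` function runs unchanged against both backends; swapping in a physics simulator or a ROS2 bridge would only require another class with a matching `apply` method.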

Key design principles

  • Voice is the interface — sub-second streaming pipeline (design target on consumer GPUs), Jetson and CPU fallback paths
  • Agents are the program model — write @tool and @skill functions; compose with workflows; delegate across agents with handoffs
  • Robots are the target — cancellable skills, safety monitor, three simulation tiers, ROS2 bridge
  • Everything is offline — no cloud APIs, no telemetry, no vendor lock-in
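
To make "compose with workflows" concrete, here is a minimal sketch of the two most basic behavior-tree nodes, Sequence and Fallback, written as plain Python closures. It illustrates the general technique only; the function names and the skill names in the example plan are hypothetical and do not reflect EdgeVox's actual workflow classes.

```python
from typing import Callable, List

Node = Callable[[], bool]  # a node returns True on success, False on failure

def sequence(*children: Node) -> Node:
    """Succeeds only if every child succeeds; stops at the first failure."""
    def run() -> bool:
        return all(child() for child in children)
    return run

def fallback(*children: Node) -> Node:
    """Tries children in order; succeeds as soon as one succeeds."""
    def run() -> bool:
        return any(child() for child in children)
    return run

trace: List[str] = []

def step(name: str, ok: bool) -> Node:
    """A leaf that records its name and reports a fixed outcome."""
    def run() -> bool:
        trace.append(name)
        return ok
    return run

plan = sequence(
    step("locate_cup", True),
    fallback(step("grasp_top", False), step("grasp_side", True)),
    step("lift", True),
)
result = plan()
```

Here the top grasp fails, the fallback recovers with the side grasp, and the sequence continues to the lift step.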

Simulation tiers

| Tier | Sim | Dependencies | Role | Status |
|------|-----|--------------|------|--------|
| 0 | ToyWorld | stdlib only | unit tests, trivial examples | shipped |
| 1 | IrSimEnvironment | pip install ir-sim | 2D visual demo (matplotlib, diff-drive, LiDAR) | shipped |
| 2a | MujocoArmEnvironment | pip install mujoco | 3D physics, Franka pick-and-place | shipped |
| 2b | MujocoHumanoidEnvironment | pip install mujoco | Unitree G1 / H1 from Menagerie, procedural gait + ONNX policy slot | shipped |
| 3 | ExternalROS2Environment | sourced ROS2 workspace | drive Gazebo / Isaac / real robots over standard topics | shipped |

[Figure: MuJoCo Franka pick-and-place]

[Figure: Unitree G1 humanoid]

Voice pipeline components

| Component | Default model | Purpose |
|-----------|---------------|---------|
| VAD | Silero VAD v6 | Voice activity detection (32 ms chunks) |
| STT | Faster-Whisper | Speech-to-text (auto-sized by VRAM) |
| LLM | Gemma 4 E2B IT Q4_K_M | Chat via llama-cpp-python |
| TTS | Kokoro-82M | Text-to-speech (16 languages, 56 voices) |
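
As a worked example of the 32 ms VAD chunk size: the table does not state a sample rate, so the sketch below assumes 16 kHz mono audio, which gives 512 samples per chunk. The helper `split_chunks` is a hypothetical name, not an EdgeVox function.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: the sample rate is not stated in the table
CHUNK_MS = 32            # from the table above

# 16000 samples/s * 0.032 s = 512 samples per chunk
samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MS // 1000

def split_chunks(samples: Sequence[float],
                 size: int = samples_per_chunk) -> Iterator[Sequence[float]]:
    """Yield fixed-size chunks, dropping a trailing partial chunk."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]

one_second: List[float] = [0.0] * SAMPLE_RATE_HZ
chunks = list(split_chunks(one_second))
```

Under that assumption, one second of audio yields 31 full chunks, with the final 128 samples left over for the next read.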

Shipping a desktop app

EdgeVox is not just a library — RookApp is a reference PySide6 desktop application built on the same LLMAgent you use for robots. One Python process hosts the Qt UI, llama-cpp, and a Stockfish subprocess. No browser, no web server, no Node toolchain, no Tauri.

[Figure: RookApp, a PySide6 desktop chess robot]

Next steps

Getting started

Features

Harness architecture

  • Agent loop: six fire-points, parallel dispatch, handoff short-circuit
  • Hooks: authoring contract, built-ins, ordering rules
  • Memory: MemoryStore, SessionStore, NotesFile, Compactor
  • Multi-agent: Blackboard, BackgroundAgent, AgentPool
  • Interrupt & barge-in: cancel-token plumbing
  • Tool calling: parser chain + GBNF grammar roadmap
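
The cancel-token idea behind interrupt and barge-in can be sketched with a thread-safe flag that long-running work polls between steps. This is a generic illustration under stated assumptions: `CancelToken`, `speak`, and `user_interrupts` are hypothetical names, and the real plumbing is described on the linked page.

```python
import threading
from typing import Callable, List, Optional

class CancelToken:
    """A thread-safe flag that any thread can set and any worker can poll."""
    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()

def speak(sentences: List[str], token: CancelToken,
          on_sentence: Optional[Callable[[str], None]] = None) -> List[str]:
    """Speak sentence by sentence, checking the token before each one."""
    spoken: List[str] = []
    for sentence in sentences:
        if token.cancelled:
            break  # barge-in: stop before starting the next sentence
        spoken.append(sentence)
        if on_sentence is not None:
            on_sentence(sentence)
    return spoken

token = CancelToken()

def user_interrupts(sentence: str) -> None:
    # Simulate the user starting to talk while the second sentence plays.
    if sentence == "Second.":
        token.cancel()

spoken = speak(["First.", "Second.", "Third."], token, user_interrupts)
```

The granularity of the check determines how quickly speech stops; polling per sentence is the coarsest reasonable choice, and a real pipeline would likely check per audio chunk.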

Reports
