RookApp — Desktop Chess Robot
Voice-controlled offline chess partner. Pure PySide6 app — no browser, no web server, no Node toolchain, no Tauri. One Python process hosts the Qt UI, the LLMAgent, llama-cpp, and the Stockfish subprocess.

Install
# 1. Install the desktop extra.
uv pip install -e '.[desktop]'
# 2. Stockfish must be on $PATH at runtime (GPL — we only speak UCI
# over a pipe, so the app stays MIT).
sudo apt-get install -y stockfish # Linux
brew install stockfish # macOS
# Windows: https://stockfishchess.org/download/ — drop on PATH.
# 3. Download EdgeVox STT + LLM + TTS weights (~3 GB, one time).
edgevox-setup
The desktop extra pulls in (licence in brackets):
| Package | Purpose |
|---|---|
| PySide6>=6.6 [LGPL-3, dynamic] | UI toolkit |
| qtawesome>=1.4 [MIT] | Font Awesome + Phosphor icon glyphs |
| rlottie-python>=1.3 [LGPL-2.1, dynamic] | Optional — Lottie-backed robot face |
| pillow>=10 [MIT/HPND] | Frame rendering for rlottie |
rlottie-python is optional. When it's missing or the asset bundle isn't available, the face widget falls back to a pure-Qt RobotFaceWidget (checked at runtime via lottie_face.is_available()).
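The runtime check can be pictured as a probe for the optional module. This is a minimal sketch, not the real `lottie_face.is_available()`: `pick_face_backend` is a hypothetical helper, the import name `rlottie_python` is an assumption, and the asset-bundle check is left out.

```python
import importlib.util


def pick_face_backend(module_name: str = "rlottie_python") -> str:
    """Return "lottie" when the optional module is importable, else "qt"."""
    # find_spec returns None for a missing top-level module instead of raising,
    # so the probe never imports (or crashes on) the optional dependency.
    if importlib.util.find_spec(module_name) is not None:
        return "lottie"
    return "qt"
```

Probing with `find_spec` keeps the check cheap: nothing is imported until the Lottie face is actually chosen.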
Launch
edgevox-chess-robot
The window paints immediately and the status pill cycles from loading… through step labels ("downloading …", etc.) to online once llama-cpp + Stockfish are up. Model load is handled by a QThreadPool worker so the event loop never stalls.
CLI flags
All flags are optional — every knob also has an env var and a persisted setting, in that priority order (CLI > env > QSettings > default).
# --persona: grandmaster | casual | trash_talker
# --user-plays: white (default) | black
# --engine: stockfish | maia
# --stockfish-skill: 0–20
# --maia-weights: required with --engine maia
# -v / --verbose: debug logging
edgevox-chess-robot \
  --persona trash_talker \
  --user-plays black \
  --engine stockfish \
  --stockfish-skill 12 \
  --maia-weights ~/maia-1500.pb.gz \
  -v
Env vars
Same surface as RookConfig.from_env, so migrating from the old chess_robot server flow doesn't require renaming anything:
| Env var | Maps to |
|---|---|
| EDGEVOX_CHESS_PERSONA | --persona |
| EDGEVOX_CHESS_USER_PLAYS | --user-plays |
| EDGEVOX_CHESS_ENGINE | --engine |
| EDGEVOX_CHESS_STOCKFISH_SKILL | --stockfish-skill |
| EDGEVOX_CHESS_MAIA_WEIGHTS | --maia-weights |
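The CLI > env > QSettings > default order can be sketched as a single lookup. This is an illustration only: `resolve_setting` is a hypothetical helper, and a plain dict stands in for QSettings so the snippet runs without Qt.

```python
import os


def resolve_setting(cli_value, env_name, persisted, key, default):
    """Pick the first source that has a value: CLI > env > persisted > default."""
    if cli_value is not None:          # flag given on the command line
        return cli_value
    env_value = os.environ.get(env_name)
    if env_value:                      # env var set and non-empty
        return env_value
    if key in persisted:               # stand-in for a QSettings lookup
        return persisted[key]
    return default
```

For example, `resolve_setting(None, "EDGEVOX_CHESS_PERSONA", {"persona": "grandmaster"}, "persona", "casual")` returns `"grandmaster"` when the env var is unset, because the persisted value outranks the default.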
Models
| Role | Default | Notes |
|---|---|---|
| LLM | gemma-4-e2b (preset slug) | MoveInterceptHook handles chess tools deterministically, so the LLM only has to talk naturally. Gemma 4 E2B is the default (~1.8 GB at Q4_K_M) — picked by the chess-commentary benchmark as the best quality/speed point. Users can switch from the Chat model row in the Settings dialog to qwen3-1.7b (~1.1 GB, Apache-2.0 alternative), llama-3.2-3b (~2.0 GB, larger alt), llama-3.2-1b (~0.8 GB, fastest), or qwen2.5-1.5b (~1.0 GB, Apache-2.0 tiny). |
| STT | Whisper (lazy) | Loaded on first mic click — text-only users never pay the cost. |
| TTS | Kokoro (lazy) | Loaded on first reply; muted → not loaded at all. |
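The lazy rows above amount to "load on first use, and skip TTS entirely when muted". A rough sketch of that pattern, assuming hypothetical loader callables rather than the real EdgeVox loaders:

```python
from functools import cached_property


class LazyVoiceModels:
    """Defer STT/TTS loading until the first access of each attribute."""

    def __init__(self, load_stt, load_tts, muted=False):
        self._load_stt = load_stt
        self._load_tts = load_tts
        self._muted = muted

    @cached_property
    def stt(self):
        # Runs once, on the first mic click; later accesses reuse the result.
        return self._load_stt()

    @cached_property
    def tts(self):
        # Muted sessions never call the loader at all.
        return None if self._muted else self._load_tts()
```

One simplification to note: the muted flag is read once, so this sketch does not model unmuting mid-session.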
In-app controls
The title bar exposes four icon buttons: 🎤 mic, ↻ new game, ☰ menu, and the window controls.
The ☰ button opens a dropdown:
- New game — wipes memory + notes + chat history + persisted session, then prompts Rook to announce the new match
- Settings… — preferences dialog
- About RookApp — brief status line in the title bar
Keyboard shortcut: Ctrl+N / Cmd+N for new game.
Settings dialog
| Field | Options | Applies |
|---|---|---|
| Persona | casual, grandmaster, trash_talker | live — swaps agent instructions, face hook, accent colour. Engine strength waits for next new game so the in-progress board isn't clobbered. |
| Chat model | ⭐⭐⭐ gemma-4-e2b (default) · ⭐⭐ qwen3-1.7b · ⭐⭐ llama-3.2-3b · ⭐ llama-3.2-1b · ⭐ qwen2.5-1.5b | next launch — swapping the GGUF requires a fresh llama-cpp load. Star ratings match the chess-commentary benchmark scoreboard. |
| Piece set | Fantasy (default) · Celtic · Spatial | live |
| Board theme | Wood · Green · Blue · Gray · Dark wood · Night | live |
| Enable voice input | on / off | next launch |
| Mute sound effects | on / off | live (controls whether Kokoro loads at all) |
| Debug mode | on / off | live — two surfaces: (1) taps before_llm / after_llm / on_run_end and dumps the messages array + raw reply + final reply into the chat as monospace bubbles; (2) renders the per-turn analytics breakdown (YOU / ROOK / engine eval) as system-info bubbles. Off by default — the regular chat stays clean. |
| Microphone | PortAudio input devices | next launch |
| Speaker | PortAudio output devices | next launch |
Preferences persist via QSettings("EdgeVox", "RookApp"). A live preview strip in the dialog shows the selected theme + piece set together before you hit OK.
Persona accents
Each persona carries a colour that threads through the title bar, chat chips, persona label, and face highlight:
| Persona | Accent |
|---|---|
| grandmaster | #7aa8ff (blue) |
| casual | #ffb066 (orange) |
| trash_talker | #ff5ad1 (magenta) |
On-disk state
Everything is stored under Qt's per-user AppDataLocation (falls back to ~/.rookapp on bare headless CI):
- memory.db — SQLiteMemoryStore in WAL mode for long-term facts, crash-safe atomic writes. Older installs that wrote memory.json are migrated transparently on first launch; the legacy file is renamed to memory.json.migrated and left in place as a backup.
- notes.md — NotesFile scratchpad the NotesInjectorHook reads
- sessions.json — JSONSessionStore chat history, restored on next launch
- game.json — board + move history (FEN + SAN), so a crashed match resumes exactly where it was
- QSettings — platform-native registry/plist/INI for UI preferences (piece set, board theme, audio devices)
New game wipes all four files so commentary from a previous game can't leak into a fresh board.
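The WAL store and the one-time memory.json migration might look roughly like the sketch below. The `facts` table, the flat key/value JSON shape, and the `open_memory_store` name are all assumptions for illustration; the real SQLiteMemoryStore schema is not shown in this document.

```python
import json
import sqlite3
from pathlib import Path


def open_memory_store(data_dir: Path) -> sqlite3.Connection:
    """Open memory.db in WAL mode and import a legacy memory.json if present."""
    data_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(data_dir / "memory.db")
    conn.execute("PRAGMA journal_mode=WAL")  # write-ahead log for crash safety
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)"
    )
    legacy = data_dir / "memory.json"
    if legacy.exists():
        facts = json.loads(legacy.read_text(encoding="utf-8"))
        with conn:  # single transaction: all rows land or none do
            conn.executemany(
                "INSERT OR REPLACE INTO facts VALUES (?, ?)",
                [(key, json.dumps(value)) for key, value in facts.items()],
            )
        # Keep the old file as a backup instead of deleting it.
        legacy.rename(data_dir / "memory.json.migrated")
    return conn
```

Renaming only after the transaction commits means a crash mid-migration leaves memory.json in place, so the next launch simply retries.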
Architecture
Blocking agent turns run on a QThreadPool worker; agent events become Qt signals via the bridge's _Signals bus (state_changed, chess_state_changed, face_changed, reply_finalised, user_echo, error, ready, load_progress, debug_event).
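The worker-plus-signals shape can be illustrated without Qt. The sketch below swaps in the standard library: `ThreadPoolExecutor` stands in for QThreadPool and a tiny `Signal` class stands in for Qt signals, so all names here are hypothetical. Unlike Qt's queued connections, this stand-in calls the slots on the worker thread.

```python
from concurrent.futures import ThreadPoolExecutor


class Signal:
    """Tiny stand-in for a Qt signal: connect callbacks, then emit to them."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in self._slots:
            slot(*args)


class Bridge:
    """Run blocking agent turns off the caller's thread and report via signals."""

    def __init__(self, agent_turn):
        self._agent_turn = agent_turn
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.reply_finalised = Signal()
        self.error = Signal()

    def submit(self, text):
        # Returns immediately; the blocking turn runs on the pool thread.
        return self._pool.submit(self._run, text)

    def _run(self, text):
        try:
            self.reply_finalised.emit(self._agent_turn(text))
        except Exception as exc:  # surface failures as an event, not a crash
            self.error.emit(str(exc))
```

The UI side only ever connects slots and calls `submit`, which is what keeps the event loop free while the model is generating.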
Barge-in
Voice interrupt runs through the same InterruptController the rest of EdgeVox uses:
1. AudioRecorder energy-ratio gate detects the user speaking over TTS.
2. VoiceWorker.barge_in signal reaches RookWindow._on_barge_in.
3. TTSWorker.interrupt() cuts playback; Bridge.cancel_turn() trips the controller which plumbs cancel_token into LLM.complete(stop_event=…) — llama-cpp halts within one decode step.
4. ctx.stop flips so the agent loop exits between hops.
The recorder is linked to the global InterruptiblePlayer, so TTS playback already pauses the mic queue at the source — no double-gating.
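The "halts within one decode step" behaviour comes from checking a shared flag between tokens. A toy sketch of that idea, where `decode` is a hypothetical stand-in for the real generation loop and a `threading.Event` plays the role of the stop event:

```python
import threading


def decode(tokens, stop_event: threading.Event):
    """Yield tokens one at a time, checking the stop flag before each step."""
    for token in tokens:
        if stop_event.is_set():
            return  # barge-in: abandon the rest of the reply
        yield token


def speak_until_interrupted(tokens, interrupt_after):
    """Collect tokens, simulating a barge-in right after one given token."""
    stop = threading.Event()
    spoken = []
    for token in decode(tokens, stop):
        spoken.append(token)
        if token == interrupt_after:
            stop.set()  # in the app this would be set from another thread
    return spoken
```

Because the flag is polled once per token, at most one extra token is produced after the interrupt is raised.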
Hooks installed on the agent
- MoveInterceptHook — deterministic move application so a missed tool call can't freeze the board
- CommentaryGateHook — the brains of Rook's commentary. Reads the post-move snapshot, decides whether to speak, and stashes a grounded commentary_directive for the briefing when it does. See Commentary quality below.
- RichChessAnalyticsHook — hidden system-role briefing. Renders either the slim FACTS + SITUATION shape (when the gate set a directive) or the legacy rich card (fallback).
- RobotFaceHook — emits robot_face events → translated to the face_changed Qt signal
- MoveCommentaryHook — captures the latest move outcome
- SilenceSentinelHook, ThinkTagStripHook, VoiceCleanupHook, SentenceClipHook, BriefingLeakGuard — AFTER_LLM sanitation stack before reply reaches the chat bubble
- MemoryInjectionHook, NotesInjectorHook, ContextCompactionHook, TokenBudgetHook, PersistSessionHook — standard memory plumbing
- default_slm_hooks() — the SLM hardening stack for 1B-class models
- DebugTapHook — always installed; emits only when Debug mode is on (zero-cost path otherwise). Pulled out of the RookApp bridge into edgevox.agents.hooks_builtin so TUI / server / CLI can reuse the same tap via enabled=<callable> predicate.
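The `enabled=<callable>` idea behind the debug tap can be sketched as follows. `DebugTap` and its `after_llm` method are hypothetical names for illustration; the real DebugTapHook interface is not reproduced here.

```python
class DebugTap:
    """Always installed; forwards events only while the predicate says so."""

    def __init__(self, emit, enabled):
        self._emit = emit        # callback that receives debug events
        self._enabled = enabled  # zero-argument callable, checked per call

    def after_llm(self, raw_reply):
        # When debugging is off this is a single predicate call and nothing else.
        if self._enabled():
            self._emit({"point": "after_llm", "raw": raw_reply})
        return raw_reply  # the reply itself is never modified
```

Passing a callable rather than a boolean lets each front end (Qt settings, a CLI flag, a server config) decide at call time whether the tap is live.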
Commentary quality & evaluation
Small LLMs (1-2 B params) are unreliable at two things the naive prompt asks of them:
- Staying silent on routine moves. They emit filler every turn, no matter how many times the prompt says "you don't have to speak".
- Not inventing tactical claims. Given a FEN + eval briefing they'll cheerfully claim pins, forks, attacks-on-squares that don't exist. Real bug we hit: 1B model said "your knight is pinning my queen" when the user had played a bishop and no pin existed.
RookApp solves both deterministically rather than trusting the model. CommentaryGateHook does three things every turn:
1. Decide whether Rook should speak
Gate fires at ON_RUN_START, priority 85 (after MoveInterceptHook at 90 applies the user's move + engine reply). It inspects:
| Signal | Triggers speech |
|---|---|
| Game over / checkmate / stalemate | Always |
| Check given or received (SAN + / #) | Always |
| Any capture (SAN contains x) | Always |
| Classification: inaccuracy / mistake / blunder | Always |
| Promotion (SAN contains =) | Always |
| Quiet move (classification best/good, no capture/check) | Silent — increment quiet_streak |
| quiet_streak reaches 3 | Force a low-intensity keepalive remark |
| First turn of the game (greeted flag unset) | Always — persona greeting |
Silent turns end the run with HookResult.end("") before the LLM even loads a message: zero inference cost, zero fabrication risk.
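The signal table can be read as a small decision function. The sketch below is one plausible encoding, not the actual CommentaryGateHook: `GateState` and `decide` are hypothetical names, and the rule ordering and the streak reset on noteworthy moves are assumptions.

```python
from dataclasses import dataclass

NOTEWORTHY = {"inaccuracy", "mistake", "blunder"}
KEEPALIVE_AFTER = 3  # quiet turns before a forced low-intensity remark


@dataclass
class GateState:
    greeted: bool = False
    quiet_streak: int = 0


def decide(state, san, classification, game_over=False):
    """Return a reason string if Rook should speak, or None to stay silent."""
    if game_over:
        return "game_over"
    if not state.greeted:
        state.greeted = True
        return "greeting"
    # SAN markers: + check, # mate, x capture, = promotion.
    if any(mark in san for mark in "+#x=") or classification in NOTEWORTHY:
        state.quiet_streak = 0  # assumption: noteworthy moves reset the streak
        return "noteworthy"
    state.quiet_streak += 1
    if state.quiet_streak >= KEEPALIVE_AFTER:
        state.quiet_streak = 0  # reset after the keepalive fires
        return "keepalive"
    return None
```

A `None` result corresponds to the silent path: the run ends before any model call is made.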
2. Build a FACTS + SITUATION block
When Rook speaks, the gate assembles a slim, declarative directive. Structure after the prompt-ablation sweep consolidated redundant sections (see chess-commentary-benchmark §7.1):
[CHESS BRIEFING — internal context, do not read aloud verbatim]
FACTS — just happened: The user's move: bishop from f1 to a6 (Ba6).
My move (Rook): knight from b8 to a6, capturing a bishop (Nxa6).
Material change this turn: YOU gained 3 points of material (the user
came out worse in the exchange). React accordingly — this is a good
turn for you. Engine evaluation (from your side): +3.50 pawns — you
are winning decisively.
SITUATION: Rook gained material this turn (+3 points). Tone:
confident; do not praise the user's move.
[END BRIEFING]
Every claim is derived from verified env state. The model can only narrate facts that are already in the block; it can't invent a pin because there's no pin line to invent from.
Pronoun discipline, "no markdown / no SAN", the <silent> fallback, and the opening-turn greeting cue all live once in the system-prompt preamble (edgevox.examples.agents.chess_robot.prompts.ROOK_TOOL_GUIDANCE) — repeating them in the per-turn briefing added ~130 tokens per turn without improving quality on Gemma 4 E2B or Llama 3.2 1B.
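To make the "facts only" idea concrete, here is a rough sketch of assembling a capture briefing from already-verified inputs. `build_capture_briefing` is a hypothetical helper, the wording is abridged from the example block, and the material numbers use the conventional piece values (pawn 1, knight 3, bishop 3, rook 5, queen 9).

```python
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}


def build_capture_briefing(user_move: str, rook_move: str, captured: str) -> str:
    """Assemble a FACTS + SITUATION block for a turn where Rook won material."""
    gain = PIECE_VALUES[captured]  # KeyError on an unknown piece, by design
    facts = [
        f"The user's move: {user_move}.",
        f"My move (Rook): {rook_move}.",
        f"Material change this turn: YOU gained {gain} points of material.",
    ]
    situation = (
        f"Rook gained material this turn (+{gain} points). "
        "Tone: confident; do not praise the user's move."
    )
    return "\n".join(
        [
            "[CHESS BRIEFING — internal context, do not read aloud verbatim]",
            "FACTS — just happened: " + " ".join(facts),
            "SITUATION: " + situation,
            "[END BRIEFING]",
        ]
    )
```

Every sentence in the output is templated from an argument, so there is no free-form slot where a tactical claim could appear.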
3. Emit a chat-visible analytics bubble
The gate emits a move_analytics event every turn, whether Rook speaks or not. The bridge translates it into a subdued system-info bubble in the chat — structured per-turn breakdown (piece names, squares, captures, classification, eval from the user's POV) so the user can always see what's happening even during silent phases.
Sign-flip safety
The LLM directive uses Rook's pronouns (you = Rook). The chat bubble uses the user's pronouns (you = user). These are generated from the same eval but with flipped sign and inverted "who is winning" text — _score_line vs _score_line_user_facing in commentary_gate.py. A pair of regression tests (TestScenarioSignFlipUserWhite / TestScenarioSignFlipUserBlack) locks both forms in place so a future edit can't silently send Rook's "you are winning" text to a losing user.
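The two perspectives can be sketched as a pair of functions over the same evaluation. These are illustrative stand-ins, not the real `_score_line` / `_score_line_user_facing`; the function names and the exact wording are assumptions.

```python
def _describe(value: float) -> str:
    """Describe the evaluation from the point of view of whoever "you" is."""
    if value > 0:
        return "you are ahead"
    if value < 0:
        return "you are behind"
    return "the position is level"


def score_line_for_rook(eval_rook_pov: float) -> str:
    """Directive text: "you" means Rook, so the eval is used as-is."""
    return f"Engine evaluation (from your side): {eval_rook_pov:+.2f} pawns, {_describe(eval_rook_pov)}."


def score_line_for_user(eval_rook_pov: float) -> str:
    """Chat-bubble text: "you" means the user, so the sign is negated."""
    user_eval = -eval_rook_pov if eval_rook_pov else 0.0  # avoid printing -0.00
    return f"Engine evaluation (from your side): {user_eval:+.2f} pawns, {_describe(user_eval)}."
```

Keeping the negation in one place is the point: both the number and the "who is winning" phrase are derived from the same flipped value, so they cannot disagree.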
Scripted scenario test harness
tests/chess_robot/test_scripted_scenarios.py drives the gate through full games turn-by-turn against a fake environment. Each scenario is a specific failure-mode probe, not a happy-path check — modelled after actual bugs we've hit or ones that are plausible given the code shape:
| Scenario | What it guards |
|---|---|
| TestScenarioSignFlipUserWhite / …Black | Chat bubble must address the user, not the LLM (regression for the 2026-04-19 eval-sign bug). |
| TestScenarioGreetingExactlyOnce | Greeting fires on move 1 and never re-fires in the same game. |
| TestScenarioQuietStreakKeepalive | After three silent turns the gate forces a keepalive remark, then resets. |
| TestScenarioCheckmate | Game-over branch owns terminal turns. |
| TestScenarioCaptureDescriptions | Directive names real moving + captured pieces via pre-move board replay. |
| TestScenarioClassificationAttribution | "that last move by you" vs "by the user" matches actual parity. |
| TestScenarioBlunderHungBishop | Material-change line + SITUATION confidence cue present (regression for "bold move, you're gaining the initiative" on a losing position). |
| TestScenarioDirectiveShape | Directive leads with FACTS and omits the deprecated YOUR ROLE / <silent> footer (kept in ROOK_TOOL_GUIDANCE instead). |
| TestScenarioIdempotentOnRepeatState | Non-move user input (e.g. "what's the score?") doesn't re-emit analytics or re-inject stale directives. |
Improvement log
Historical fixes the harness now guards against:
- Briefing leak — 1B model parroted the [CHESS BRIEFING] block back as its reply. Added BriefingLeakGuard at priority 68 (between ThinkTagStripHook and VoiceCleanupHook) to strip the block plus header-less leaks where the model dropped the opening marker but kept the closing one.
- Comment every move — dialled back by adding CommentaryGateHook's noteworthy-signal filter.
- Hallucinated tactics — replaced FEN/PV-heavy briefing with a focused shape that names every allowed claim. A later prompt-ablation sweep (scripts/bench_prompt_ablation.py) found the YOUR ROLE and anti-fabrication footer were duplicating rules already in ROOK_TOOL_GUIDANCE; the briefing was slimmed to FACTS + SITUATION, saving ~130 tokens per turn at equal or better heuristic quality on Gemma 4 E2B and Llama 3.2 1B.
- Eval sign flip in chat bubble — split _score_line into two perspective-aware variants.
- Loading panel never hides — migrated the board-area stack from QStackedLayout to QStackedWidget so the non-current widget is actually hide()'d.
- Template leakage — removed "bold", "ouch", "nice one" as literal examples in the directive; the model was copy-pasting them verbatim on losing positions (calling a user blunder a "bold move").
- Non-move input duplicates — gate tracks last_gate_ply in session state and short-circuits when the board hasn't advanced.
- Game-over attribution inversions — small LLMs frequently said "I'll keep playing" after being mated, or thanked the user for a blunder. Terminal turns (mate / stalemate / draw) now bypass the LLM entirely: CommentaryGateHook._canned_game_end picks a per-persona templated closer ("Mate. Well played.", "You got me. This time.", "Stalemate — fair enough.") and returns via HookResult.end. Zero latency, zero attribution risk.
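The briefing-leak fix from the first item can be approximated with a small text filter. This is a sketch of the idea, not the real BriefingLeakGuard: `strip_briefing` is a hypothetical name, and treating everything before a stray closing marker as leaked briefing is an assumption.

```python
import re

# A complete leaked block: opening marker, any content, closing marker.
_FULL_BLOCK = re.compile(r"\[CHESS BRIEFING[^\]]*\].*?\[END BRIEFING\]", re.DOTALL)
_CLOSING = "[END BRIEFING]"


def strip_briefing(reply: str) -> str:
    """Remove a parroted briefing block, with or without its opening marker."""
    cleaned = _FULL_BLOCK.sub("", reply)
    # Header-less leak: the opening marker was dropped but the closing one
    # survived, so discard everything up to and including the closing marker.
    if _CLOSING in cleaned:
        cleaned = cleaned.rsplit(_CLOSING, 1)[1]
    return cleaned.strip()
```

A reply with no markers passes through unchanged apart from whitespace trimming, so the filter is safe to run on every turn.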
Chess commentary benchmark
The chess commentary benchmark compares 25 LLMs across 35 scenarios (openings / midgame / endgame / terminal / color flips) and informs the default model choice + Settings picker ranking. Heuristic quality score alone is misleading — several models score 99-100 on the grader but fail semantic audit (echo SAN, invert mate attribution, recite the directive). The report documents methodology, decision matrix, and reproduction.
Re-run after any gate or prompt change to catch regressions:
python scripts/bench_chess_commentary.py # full 25-model sweep
python scripts/eval_llm_commentary.py --model gemma-4-e2b # iterate on one model
python scripts/bench_prompt_ablation.py --model gemma-4-e2b # per-briefing-section ablation
python scripts/analyze_bench_results.py # quality × speed Pareto
Packaging an installer
Single-file binaries are produced by .github/workflows/rookapp-desktop.yml for tags matching rook-v* (macOS arm/intel, Windows, Linux AppImage). Locally:
uv pip install -e '.[desktop]' pyinstaller
cat > rookapp_entry.py <<'PY'
from edgevox.apps.chess_robot_qt.main import main
if __name__ == "__main__":
    main()
PY
pyinstaller \
--name RookApp --onefile --windowed \
--hidden-import edgevox.apps.chess_robot_qt \
--collect-submodules edgevox \
rookapp_entry.py
Output lands under dist/. The bundle ships code only; STT / LLM / TTS weights download to the Hugging Face cache on first run. Stockfish must still be present on $PATH at runtime.
Licence notes
- PySide6 — LGPL-3, dynamic-linked (MIT-app compatible)
- qtawesome, pillow — MIT / HPND
- rlottie-python — LGPL-2.1 via ctypes (dynamic-linked)
- Maurizio Monge piece sets — MIT
- Kokoro TTS — MIT
- Stockfish — GPL, out-of-process: we talk UCI over a pipe, never link it, so the app stays MIT.
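For readers unfamiliar with the out-of-process arrangement, the sketch below shows what "talking UCI over a pipe" means in practice: locate the binary on PATH, send the `uci` command, and wait for `uciok`. `find_engine` and `uci_handshake` are hypothetical helpers written for this note, not functions from RookApp.

```python
import shutil
import subprocess


def find_engine(name: str = "stockfish"):
    """Return the engine's absolute path if it is on PATH, else None."""
    return shutil.which(name)


def uci_handshake(cmd, timeout: float = 5.0):
    """Send `uci` then `quit` over stdin and return output lines up to `uciok`."""
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        out, _ = proc.communicate("uci\nquit\n", timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()         # do not leave a hung engine behind
        proc.communicate()  # reap the process before re-raising
        raise
    lines = [line.strip() for line in out.splitlines()]
    if "uciok" not in lines:
        raise RuntimeError("engine did not answer the UCI handshake")
    return lines[: lines.index("uciok") + 1]
```

A typical call would be `uci_handshake([find_engine()])` after checking that `find_engine()` did not return `None`. The engine runs as a separate process throughout, and only text crosses the pipe.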