Chess commentary benchmark

Companion to slm-tool-calling-benchmark.md. The other report asks "which small LLMs can emit a tool call?". This one asks "which small LLMs can narrate a chess position in character, without hallucinating, fast enough to feel live?" — the axis RookApp actually cares about at the speaking layer.

Executive summary

  • Benchmarked 25 generalist and tool-specialist LLMs via scripts/bench_chess_commentary.py across 35 curated chess scenarios (opening / middlegame / endgame / terminal). Stockfish recomputes the eval for every turn, so the directive the model sees matches what RookApp would see in a real game.
  • Heuristic quality score is not sufficient — several models hit 99–100 on the automated grader but fail semantic audit catastrophically (echo SAN, invert attribution on mate, call the opponent's blunder "a solid move"). Only a hand-audit of mate / capture / blunder turns revealed the real ranking.
  • Default picked: gemma-4-e2b (Q4_K_M, ~1.8 GB). Passes 7/8 high-stakes scenarios in the semantic audit, plays to the persona voice, keeps replies short and grounded. Sits at the fastest acceptable point on the quality/speed Pareto frontier.
  • Canned game-end replies (added as a direct outcome of this benchmark) eliminate the single biggest failure mode of 1B-class models — saying "I'll keep playing" after being mated — at zero LLM cost.
  • Qwen3 family was penalised unfairly on first pass because the grader counted empty <think>\n\n</think> wrappers that Qwen3 emits even under the /no_think soft switch. Applying the same strip the real pipeline uses (ThinkTagStripHook) restores Qwen3 to its true quality (~98 vs ~90), still a tier below Gemma on attribution correctness.

1. Methodology

1.1 Scenario corpus

35 hand-authored chess positions; every SAN sequence is replay-validated by tests/chess_robot/test_eval_scenarios_legal.py (pytest-parametrized over scenarios()).

| Category | Count | Examples |
| --- | --- | --- |
| Openings (book positions) | 8 | Sicilian Najdorf, Caro-Kann, French, London, KID, Italian, QGA, Berlin |
| Middlegame tactics | 8 | user hangs bishop, queen trap, fork setup, promotion, en passant, Opera-game sacrifice |
| Blunders / attribution-risk | 5 | user blunders queen, Rook blunders queen, mid-game mistake, trash-talker reaction |
| Terminal positions | 4 | user delivers mate, Rook delivers mate, stalemate, smothered mate |
| Color flips (Rook plays white) | 5 | bishop capture, mate, check, castle, queen blunder |
| Greetings | 2 | opening (white) greeting, user-plays-black greeting |
| Persona cross-checks | 1 | grandmaster + Rook wins material |
| Quiet / keepalive | 2 | routine pawn push, minor piece trade |

Each scenario carries: san_history, eval_cp (recomputed from stockfish at benchmark time, not eyeballed), classification, is_game_over, winner, expected_tone, forbidden_terms (words the reply MUST NOT invent), and a user_task string matching what MoveInterceptHook feeds the LLM in production.
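As a rough illustration of that record shape, here is a minimal sketch of a scenario container. The field names come from the list above; the class name, the `name` field, the `forbidden_hits` helper, and every value in the example instance are assumptions made for illustration, not the harness's actual code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """One benchmark position. Field names follow the report; types are assumed."""
    name: str                      # hypothetical identifier, not named in the report
    san_history: tuple[str, ...]   # moves played so far, in SAN
    eval_cp: int                   # centipawn eval (Stockfish-recomputed in the real harness)
    classification: str            # value vocabulary here is an assumption
    is_game_over: bool
    winner: Optional[str]          # None while the game is live, or on a draw
    expected_tone: str
    forbidden_terms: frozenset[str]
    user_task: str

    def forbidden_hits(self, reply: str) -> list[str]:
        """Return forbidden terms that appear as whole words in the reply."""
        words = set(re.findall(r"[a-z0-9]+", reply.lower()))
        return sorted(t for t in self.forbidden_terms if t.lower() in words)


# Illustrative values only; not taken from the real corpus.
example = Scenario(
    name="example_quiet_turn",
    san_history=("e4", "e5", "Nf3", "Nc6"),
    eval_cp=30,
    classification="quiet",
    is_game_over=False,
    winner=None,
    expected_tone="neutral",
    forbidden_terms=frozenset({"pin", "fork", "skewer"}),
    user_task="I played Nc6.",
)

print(example.forbidden_hits("What a fork, and a pin too!"))
```

Matching on whole words keeps "pinned" from tripping the "pin" rule; whether the real grader is that lenient is not stated in the report.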

1.2 Directive construction

Matches the real pipeline. For each scenario the harness calls the actual CommentaryGateHook._build_ground_truth() with the fake env + session state, producing the same FACTS + SITUATION block RichChessAnalyticsHook injects at BEFORE_LLM in production. (Pre-slim-refactor versions of this report reference a YOUR ROLE / GROUND TRUTH / MOOD CUE / SITUATION shape — consolidated into FACTS + SITUATION per §7.1 below.)

1.3 Grading heuristic

Base score 100, −12 per flag. Flags in scripts/eval_llm_commentary.grade():

  • forbidden term in reply (pin / fork / skewer / made-up-square)
  • length > 40 words (reply budget)
  • reply starts with bare SAN (Nxd5)
  • reply verbatim-quotes a directive bullet (paste)
  • tone mismatch (upbeat while losing, rattled while winning)
  • <silent> sentinel
  • unclosed <think> markers after strip

The <think> strip matches production (ThinkTagStripHook in sanitize.py) so Qwen3-family models aren't double-penalised for emitting empty thinking-mode wrappers.
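The scoring rule can be sketched in a few lines. This is a simplified stand-in, not the contents of scripts/eval_llm_commentary.grade(): the function name, flag labels, and SAN regex are assumptions, and the tone-mismatch flag is left out because it needs a tone classifier the report does not describe.

```python
import re

PENALTY = 12       # points deducted per flag, per the report
WORD_BUDGET = 40   # reply budget in words, per the report

# Assumed pattern for a reply that opens with bare SAN such as "Nxd5" or "O-O".
SAN_OPENER = re.compile(
    r"^\s*(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?(?=\s|$|[.,!?])"
)


def grade_sketch(reply, forbidden_terms=(), directive_bullets=()):
    """Return (score, flags) for a reply that has already been think-stripped."""
    flags = []
    lowered = reply.lower()
    words = set(re.findall(r"[a-z0-9']+", lowered))

    if any(term.lower() in words for term in forbidden_terms):
        flags.append("forbidden_term")
    if len(reply.split()) > WORD_BUDGET:
        flags.append("too_long")
    if SAN_OPENER.match(reply):
        flags.append("san_opener")
    if any(b.strip() and b.strip().lower() in lowered for b in directive_bullets):
        flags.append("directive_paste")
    if reply.strip() == "<silent>":
        flags.append("silent")
    if "<think>" in lowered or "</think>" in lowered:
        flags.append("unclosed_think")

    return max(0, 100 - PENALTY * len(flags)), flags


print(grade_sketch("Nice try, I'll keep playing", forbidden_terms=("pin", "fork")))
print(grade_sketch("Nxd5 and the fork wins", forbidden_terms=("fork",)))
```

The first call scores 100 with no flags, because every check looks only at surface features of the reply; nothing here knows what actually happened on the board.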

1.4 Semantic audit — the step that actually ranks models

The heuristic misses the important failures. A reply like "Nice try, I'll keep playing" passes all flags (no forbidden terms, right length, no SAN opener) but is completely wrong if the user just checkmated. Every model in the top half of the scoreboard was manually audited on 8 high-stakes scenarios for:

  1. Did the reply acknowledge the actual event (capture / check / mate / blunder)?
  2. Pronouns correct — "I" / "my" for Rook's side, "you" / "your" for the user?
  3. No fabricated tactics — no invented pins / forks / pieces / squares beyond the directive?
  4. Persona voice — in character, not a flat chess-report sentence?
  5. Game-over correctness — did the model understand the game ended and on whose side?

2. Scoreboard

Full 25-model run on RTX 3090 (warmed, Q4_K_M GGUFs unless a quant is named; per-reply times scale roughly proportionally on other hardware). Local RTX 3080 Laptop (16 GB) numbers, where measured, sit at ~30–40× the 3090 numbers. For Gemma 4 E2B Q4, the chosen default, that is 2.3 s: just over the 2 s "live" budget, and well inside the 5 s acceptance bar.

2.1 Top of the scoreboard (by heuristic)

| Rank | Model | Heuristic | Per-reply (3090) | Size | Licence | Semantic verdict |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | qwen2.5-1.5b | 100.0 | 0.06 s | 1.0 GB | Apache-2.0 | ❌ 4/8 wrong attributions |
| 1 | hammer-2.1-0.5b | 100.0 | 0.03 s | 0.5 GB | Qwen research | ❌ echoes code blocks |
| 1 | functionary-v3.2 | 100.0 | 0.14 s | 4.9 GB | MIT | ✓ but heavy |
| 2 | llama-3.2-1b | 99.7 | 0.04 s | 0.8 GB | Llama-3 | ❌ 5/8 wrong (game-over + direction) |
| 2 | llama-3.2-3b | 99.7 | 0.07 s | 2.0 GB | Llama-3 | ❌ 5/8 wrong on game-over |
| 2 | gemma-4-e2b Q6_K | 99.7 | 0.08 s | 2.6 GB | Gemma | ✓ 7/8 clean |
| 2 | llama-3.2-1b Q5_K_M | 99.7 | 0.04 s | 0.9 GB | Llama-3 | ❌ same as Q4 |
| 2 | smollm3-3b Q3_K_M | 99.7 | 0.11 s | 1.6 GB | Apache-2.0 | ❌ 3-5/8 wrong |
| 3 | smollm3-3b | 99.3 | 0.09 s | 1.9 GB | Apache-2.0 | ❌ 3/8 wrong (mate direction) |
| 4 | gemma-4-e2b Q4_K_M | 99.0 | 0.07 s | 1.8 GB | Gemma | ✓ 7/8 clean (chosen default) |
| 4 | phi-4-mini | 99.0 | 0.10 s | 2.4 GB | MIT | ✓ mostly clean, borderline size |
| 5 | qwen2.5-3b | 98.6 | 0.09 s | 2.0 GB | Apache-2.0 | ~ |
| 5 | toolace-2-8b | 98.6 | 0.16 s | 4.6 GB | Apache-2.0 | ✓ but very heavy |
| 6 | hermes-3-3b | 98.3 | 0.14 s | 2.0 GB | Llama-3 | ~ wordy, *sigh* action markers |
| 7 | granite-4.0-350m | 97.3 | 0.04 s | 0.2 GB | Apache-2.0 | ❌ recites directive verbatim |

2.2 Heuristic ≥ 99 but semantically broken — the flagship finding

Perfect heuristic score does not imply usable output. Sample failures:

```text
hammer-2.1-0.5b (100.0) on user_hangs_bishop:
  '```\nI played Ba6\nYou replied with Nxa6\n```'           ← code-block echo

qwen2.5-1.5b (100.0) on user_checkmates:
  "I see you're making a bold move with the queen.          ← game is over;
   I'll have to be careful now."                              reply pretends it isn't

llama-3.2-3b (99.7) on rook_checkmates:
  "Nice try, but I think you're trying to sneak one         ← Rook just mated; reply
   past me!"                                                   pretends user is still attacking

granite-4.0-350m (97.3) on rook_blunders_queen:
  "I lost material this turn (-9 points)."                  ← verbatim directive paste
```

2.3 Qwen3 family — corrected after <think> strip

Initial run under-scored Qwen3 because its /no_think soft switch still emits empty <think>\n\n</think> wrappers. With the pipeline's existing ThinkTagStripHook applied:

| Variant | Before strip | After strip | Per-reply | Semantic |
| --- | --- | --- | --- | --- |
| qwen3-1.7b Q4_K_M | 90.4 | 98.3 | 3.0 s | ✓ close to Gemma |
| qwen3-1.7b Q5_K_M | 88.3 | (not re-run) | | |
| qwen3-1.7b Q6_K | 89.4 | (not re-run) | | |
| qwen3-0.6b Q4_K_M | | 90.4 | 3.1 s | ❌ too small |
| qwen3.5-0.8b Q4_K_M | | 95.5 | 5.9 s | ❌ attribution still flipped |
| qwen3.5-2b Q4_K_M | | 98.6 | 7.9 s | ~ surprisingly slow for size |
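The strip itself is small. The sketch below shows one way to drop closed thinking-mode wrappers before grading; it is an assumption about the general approach, not the source of ThinkTagStripHook, and the function name is hypothetical.

```python
import re

# Matches a closed <think>...</think> block, including the empty
# "<think>\n\n</think>" wrapper emitted under the /no_think soft switch.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_think(text: str) -> str:
    """Remove closed think blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


print(repr(strip_think("<think>\n\n</think>\n\nGood move.")))
print(repr(strip_think("<think>still reasoning")))
```

An unclosed `<think>` is deliberately left in place, so a grader that flags leftover markers can still catch a reply that was cut off mid-thought.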

3. Quant sweep

Gemma 4 E2B is the primary candidate; sweep its quants to confirm Q4_K_M is the right cutoff:

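| Quant | Disk | Heuristic | Per-reply | Notes |
| --- | --- | --- | --- | --- |
| Q3_K_M | 1.4 GB | 98.3 | 0.09 s | Mild quality drop |
| IQ4_XS | 1.5 GB | 94.5 | 0.05 s | Drops below the 95 floor |
| Q4_K_M (default) | 1.8 GB | 99.0 | 0.07 s | Best balance |
| Q5_K_M | 2.1 GB | 98.6 | 0.07 s | Negligible vs Q4 |
| Q6_K | 2.6 GB | 99.7 | 0.08 s | +0.7 points, +0.8 GB disk |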

Llama 3.2 1B (no Q3 in the repo):

| Quant | Disk | Heuristic | Per-reply |
| --- | --- | --- | --- |
| Q4_K_M (default) | 0.8 GB | 99.7 | 0.04 s |
| Q5_K_M | 0.9 GB | 99.7 | 0.04 s |
| Q6_K | 1.0 GB | 99.0 | 0.05 s |

SmolLM3 3B:

| Quant | Disk | Heuristic | Per-reply |
| --- | --- | --- | --- |
| Q3_K_M | 1.6 GB | 99.7 | 0.11 s |
| Q4_K_M | 1.9 GB | 99.3 | 0.09 s |
| Q5_K_M | 2.2 GB | 99.3 | 0.09 s |

Takeaways:

  • Q4_K_M is the right cutoff for every tested family. On Gemma, going lower (Q3_K_M / IQ4_XS) costs measurable heuristic quality; SmolLM3 Q3_K_M holds its heuristic score but audits worse (3–5/8 wrong vs 3/8 at Q4_K_M). Going higher (Q5 / Q6) costs disk and load time without buying much.
  • IQ4_XS is a trap for Gemma E2B: it drops below the 95 floor for only 0.3 GB saved over Q4_K_M.

4. Speed / smoothness

Target budget on user hardware:

| Tier | Per-reply | Perceived |
| --- | --- | --- |
| 🟢 live | < 2.0 s | Conversational: user → Rook → TTS loop under 3 s including Kokoro warm-up |
| 🟡 usable | 2–5 s | Noticeable pause |
| 🔴 slow | ≥ 5 s | Breaks conversational illusion |
| ⚠ quality floor | any | Heuristic < 95 → disqualified regardless of speed |
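The tier table maps directly onto a small classifier. The sketch below is one plausible reading of it; the function name and return labels are assumptions, and the quality floor is checked first because it disqualifies a model regardless of speed.

```python
def smoothness_tier(per_reply_s: float, heuristic: float) -> str:
    """Classify one (latency, quality) measurement using the tier table's cutoffs."""
    if heuristic < 95.0:
        return "quality floor"   # disqualified regardless of speed
    if per_reply_s < 2.0:
        return "live"
    if per_reply_s < 5.0:
        return "usable"
    return "slow"


# Measurements quoted elsewhere in this report.
print(smoothness_tier(0.07, 99.0))   # gemma-4-e2b Q4_K_M on the 3090
print(smoothness_tier(3.0, 98.3))    # qwen3-1.7b Q4
print(smoothness_tier(5.9, 95.5))    # qwen3.5-0.8b Q4
print(smoothness_tier(0.05, 94.5))   # gemma-4-e2b IQ4_XS
```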

On RTX 3080 Laptop (16 GB) without GPU offload (CPU fallback), measured per-reply:

  • gemma-4-e2b Q4: 2.3 s, just over the 2.0 s live cutoff, with the margin eaten by cold start
  • qwen3-1.7b Q4: 3.0 s — usable, Apache-2.0 alternative
  • qwen3.5-0.8b Q4: 5.9 s — slower than Qwen3-1.7B despite being smaller (thinking-mode decode even under /no_think)
  • qwen3.5-2b Q4: 7.9 s — dramatically slower per param than Qwen3-1.7B

Canned game-end replies (CommentaryGateHook._canned_game_end) short-circuit the LLM entirely on any mate / stalemate / draw turn, writing a persona-appropriate line ("GG! That was a fun one.", "Mate. Well played.", "You got me. This time.") directly via HookResult.end. Cost: 0 ms. Benefit: the single biggest class of 1B-model attribution failure ("I'll keep playing" after being mated) disappears for free.
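A minimal sketch of that short-circuit follows. The three reply lines are the ones quoted above, but which persona and outcome each belongs to is not stated in this report, so the table layout, the persona key, the fallback rule, and the function names are all illustrative assumptions rather than the contents of CommentaryGateHook._canned_game_end.

```python
from typing import Optional

# Illustrative (persona, outcome) table; outcome is from Rook's point of view.
CANNED = {
    ("grandmaster", "won"): "Mate. Well played.",
    ("grandmaster", "lost"): "You got me. This time.",
}
FALLBACK_LINE = "GG! That was a fun one."   # assumed catch-all, e.g. for draws


def outcome_for_rook(winner: Optional[str], rook_color: str) -> str:
    """Map the winner's colour onto won / lost / draw for Rook."""
    if winner is None:
        return "draw"
    return "won" if winner == rook_color else "lost"


def canned_game_end(persona: str, is_game_over: bool,
                    winner: Optional[str], rook_color: str) -> Optional[str]:
    """Return a canned line on game-over, or None so the LLM handles the turn."""
    if not is_game_over:
        return None
    outcome = outcome_for_rook(winner, rook_color)
    return CANNED.get((persona, outcome), FALLBACK_LINE)


print(canned_game_end("grandmaster", True, "white", "black"))
```

Because the winner is resolved from game state rather than inferred by the model, the reply cannot claim the wrong side won.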

5. Decision matrix

Acceptance bar: quality ≥ 95 (heuristic) AND semantic audit pass on at least 7 of the 8 high-stakes scenarios AND per-reply < 5 s on CPU-fallback hardware.

| Model | Quality | Semantic | Speed | Verdict |
| --- | --- | --- | --- | --- |
| gemma-4-e2b Q4_K_M | ✓ 99.0 | ✓ 7/8 | ✓ 2.3 s | DEFAULT |
| gemma-4-e2b Q6_K | ✓ 99.7 | ✓ 7/8 | | Viable, +0.8 GB disk |
| qwen3-1.7b Q4 | ✓ 98.3 | ~ 6/8 | ✓ 3.0 s | Settings alternative (Apache-2.0) |
| llama-3.2-3b Q4 | ✓ 99.7 | ❌ 5/8 wrong on mate | | Partial fix from canned endings |
| llama-3.2-1b Q4 | ✓ 99.7 | ❌ 3/8 wrong | ✓ fastest | Settings option for low-RAM |
| qwen2.5-1.5b | ✓ 100.0 | ❌ 4/8 wrong | | Rejected: heuristic lies |
| qwen3.5-0.8b | ~ 95.5 | ❌ 4/5 wrong | 🔴 5.9 s | Rejected |
| phi-4-mini | ✓ 99.0 | | | Candidate but 3.8 B, heavier than Gemma E2B |
| hammer-2.1-0.5b | ✓ 100.0 | ❌ code-block echoes | ✓ fastest | Rejected outright |
| granite-4.0-350m | ✓ 97.3 | ❌ directive paste | | Rejected |
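The three-part acceptance bar can be written as a single predicate. This is a sketch under assumptions: the class and function names are hypothetical, and the semantic threshold is left as an explicit argument rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    heuristic: float        # automated quality score, 0-100
    semantic_passes: int    # high-stakes scenarios passed in the hand audit, out of 8
    cpu_fallback_s: float   # per-reply seconds on CPU-fallback hardware


def accepts(c: Candidate, min_semantic_passes: int,
            min_heuristic: float = 95.0, max_per_reply_s: float = 5.0) -> bool:
    """All three conditions must hold: quality, semantic audit, and speed."""
    return (
        c.heuristic >= min_heuristic
        and c.semantic_passes >= min_semantic_passes
        and c.cpu_fallback_s < max_per_reply_s
    )


# Rows taken from the matrix above.
gemma = Candidate("gemma-4-e2b Q4_K_M", 99.0, 7, 2.3)
qwen3 = Candidate("qwen3-1.7b Q4", 98.3, 6, 3.0)

print(accepts(gemma, min_semantic_passes=7))
print(accepts(qwen3, min_semantic_passes=7))
```

Keeping the conditions conjunctive is the point: a perfect heuristic score cannot buy back a failed audit.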

6. Final recommendation

Default: gemma-4-e2b (Q4_K_M preset, ~1.8 GB). Pinned in RookConfig.llm_path and Settings.llm_model. The Settings dialog exposes five options, ranked by star annotation so users see the recommendation without reading this report:

  • ⭐⭐⭐ Gemma 4 E2B — default, best quality (~1.8 GB)
  • ⭐⭐ Qwen3 1.7B — Apache-2.0 alternative (~1.1 GB)
  • ⭐⭐ Llama 3.2 3B — larger, more reliable than 1B (~2.0 GB)
  • ⭐ Llama 3.2 1B — lightest / fastest, some slips (~0.8 GB)
  • ⭐ Qwen2.5 1.5B — Apache-2.0, tiny (~1.0 GB)

7. Improvements informed by this benchmark

Landing alongside the report:

  1. Canned game-end replies (commentary_gate.py:_canned_game_end). Templated per (persona, outcome) where outcome ∈ {won, lost, draw}. The gate fires HookResult.end(line) — the LLM never runs on game-over. Zero latency, zero attribution risk.
  2. Scenario corpus expanded from 9 → 35, every one legality-validated by test_scenario_replays_legally.
  3. Stockfish eval recomputation (recompute_with_stockfish()) replays each scenario through a real engine at benchmark time, so the eval_cp / classification signals match what RookApp sees in a real game. Flagged several scenarios whose original hand-set eval was off by ±200 cp.
  4. <think> strip in eval harness (_extract_text) mirrors the pipeline's ThinkTagStripHook, giving Qwen3 / thinking-mode models a fair comparison.
  5. prompts.py module split so the eval harness and future CLI / server surfaces can share the persona-and-protocol string without dragging Qt into their import graph.
  6. Gate trigger: castling added to the notable-move filter so O-O / O-O-O turns no longer go silent.
  7. Smoothness column (🟢 live / 🟡 usable / 🔴 slow / ⚠ quality floor) in the bench report and scripts/analyze_bench_results.py so a future re-run surfaces the trade-off directly.

7.1 Prompt ablation: slim-briefing refactor

Driven by a fresh sweep (scripts/bench_prompt_ablation.py) on Gemma 4 E2B at 3 personas × 35 scenarios × 3 repeats per variant (315 runs each):

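| Variant | Heuristic | Δ vs baseline | Latency p95 |
| --- | --- | --- | --- |
| baseline (full) | 97.7 | | 0.10 s |
| no_move_desc | 96.3 | −1.4 | 0.10 s |
| no_material | 97.3 | −0.4 | 0.10 s |
| no_situation | 97.3 | −0.4 | 0.10 s |
| no_score | 97.6 | −0.1 | 0.10 s |
| no_role_header | 97.8 | +0.1 | 0.10 s |
| no_classification | 97.9 | +0.2 | 0.10 s |
| no_persona | 98.2 | +0.5 | 0.12 s |
| no_footer | 98.7 | +1.0 | 0.11 s |
| facts_only (no role/sit/footer) | 98.8 | +1.1 | 0.13 s |
| no_tool_guidance | 98.8 | +1.1 | 0.11 s |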

Findings:

  • move_desc is the only load-bearing briefing section — piece-name / from-square / to-square / captured-piece English descriptions drive 1.4 points of quality. Everything else is within noise or a small improvement when removed.
  • role_header and footer duplicate content already in ROOK_TOOL_GUIDANCE. The role header's pronoun discipline and the footer's "no markdown / no SAN / <silent>" rule both appear verbatim in the system-prompt preamble. Keeping both costs ~130 tokens per turn without measurable quality benefit. Dropped.
  • Briefing-only signals consolidated under two headers: FACTS (move descriptions, classification, material, eval) and SITUATION (tone cue). Section labels switched from first-person (MY REACTION TONE) to declarative (SITUATION) because small models will paste first-person instruction phrases verbatim — caught the 1B model writing "You just played Qxf7, and I concede in persona. That was brutal" in the eval harness when a similar leak risk was tested.
  • Re-run on the slim baseline: heuristic score rose from 97.7 → 98.5 with no change to scenarios or model.

Net result: system prompt shrunk ~17 % (813 → 675 tokens for a typical mid-game turn), test coverage retained, pronoun discipline still enforced once in ROOK_TOOL_GUIDANCE. Further candidate: drop ROOK_TOOL_GUIDANCE entirely (+1.1 points, but latency jumped 57 % because the model emits longer replies without the "one sentence" rule). Not landed: the quality/latency trade-off is the wrong direction.
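For readers who want to check the arithmetic, the deltas in the ablation table and the token shrink can be recomputed as below. The scores and token counts are copied from this section; the helper names are hypothetical.

```python
BASELINE = 97.7  # heuristic score of the full briefing

# A few rows from the ablation table.
VARIANTS = {"no_move_desc": 96.3, "no_footer": 98.7, "facts_only": 98.8}


def delta_vs_baseline(score: float, baseline: float = BASELINE) -> float:
    """Score change relative to the baseline, rounded to one decimal."""
    return round(score - baseline, 1)


def shrink_percent(before_tokens: int, after_tokens: int) -> float:
    """Relative reduction in prompt size, in percent."""
    return round(100 * (before_tokens - after_tokens) / before_tokens, 1)


for name, score in VARIANTS.items():
    print(name, delta_vs_baseline(score))

print(shrink_percent(813, 675))
```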

8. Reproduction

```bash
# Full sweep (takes ~45 min on a single 3090; ~2-3 hours on a laptop
# if downloads are cold).
python scripts/bench_chess_commentary.py

# Single-model iteration while tuning prompts.
python scripts/eval_llm_commentary.py --model gemma-4-e2b --temperature 0.3

# Post-run quality × speed analysis from the dumped JSON.
python scripts/analyze_bench_results.py

# Legality sanity-check for the scenario corpus.
pytest tests/chess_robot/test_eval_scenarios_legal.py -v
```

Defaults include the Stockfish eval recomputation, so the binary should be on $PATH (apt install stockfish / brew install stockfish). If it is missing, the harness fails open and falls back to the hand-set eval values.
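The fail-open behaviour might look roughly like the sketch below. It is not the harness's recompute_with_stockfish(): the function name, the `run_engine` callback, and the returned source label are assumptions; only `shutil.which` is a real standard-library call.

```python
import shutil
from typing import Callable


def eval_with_fallback(hand_eval_cp: int,
                       run_engine: Callable[[str], int],
                       binary: str = "stockfish") -> tuple[int, str]:
    """Return (eval_cp, source), preferring the engine and falling back to the hand value.

    run_engine is a hypothetical callback that takes the resolved binary path
    and returns a centipawn eval for the scenario being scored.
    """
    path = shutil.which(binary)
    if path is None:
        return hand_eval_cp, "hand"          # engine not installed: fail open
    try:
        return run_engine(path), "stockfish"
    except Exception:
        # Any engine failure also falls back rather than aborting the sweep.
        return hand_eval_cp, "hand"


# With a binary name that does not exist, the hand-set value is kept.
print(eval_with_fallback(30, lambda p: 99, binary="no-such-engine-binary-xyz"))
```

Returning the source alongside the value lets a report mark which scenarios were scored against hand-set evals, which matters given that some of those were off by ±200 cp.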

9. Future work

  • Re-run the Qwen3 family at the two quants not already covered (1.7B Q5 / Q6) with <think> strip to confirm the corrected scores hold.
  • Benchmark Gemma 4 E4B and Gemma 3 1B once weights finish downloading — might displace Llama 3.2 3B in the mid-tier Settings slot.
  • Build a live semantic grader that fires a second (tiny) LLM to judge attribution correctness on each reply, so the heuristic catches "reply inverts who won" without human audit.
  • Tune max_tokens (currently 80) down to ~60 once the canned game-end diverts the longest turns — saves ~25 % of the per-reply decode ceiling.
  • Persona-specific prompts: the grandmaster voice is currently "not giddy" + clipped; a sharper persona prompt could lift heuristic scores another 1–2 points without a model change.
  • Re-run the 25-model sweep with the slim prompt to confirm rankings hold across the suite — expected but unverified.
  • Evaluate a confidence-based LLM bypass for quiet speakable turns (e.g. minor-piece trade at level eval) — canned small-talk could cover another ~15 % of turns with zero LLM cost.

See also

Offline voice agent framework for robots