Pure RTN mode powered by AutoRound
⭐ If you find it useful, please consider starring the AutoRound project on [GitHub](https://github.com/intel/auto-round)!

The v1.1.0 post had a single "one model wins all" PROD line; v1.8 has per-axis baselines from three live cohorts (May 16-17 '26).
✍️ /consultants coder — per-language routing (mlang v1.0.1, suite hash ddef8095, 6 langs × 13 questions).
| Lang | Primary | Fallback |
|---|---|---|
| c | glm-5.1:cloud | deepseek-v4-pro:cloud |
| cpp | deepseek-v4-flash:cloud | kimi-k2.6:cloud |
| csharp | deepseek-v4-pro:cloud | kimi-k2.6:cloud |
| go | kimi-k2.6:cloud | deepseek-v4-pro:cloud |
| python | glm-5.1:cloud | kimi-k2.6:cloud |
| rust | deepseek-v4-flash:cloud | deepseek-v4-pro:cloud |
No single model wins across languages. Out-of-cohort langs (ts/java/ruby/swift/shell) fall to global default glm-5.1:cloud → kimi-k2.6:cloud.
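A minimal sketch of how the routing table above could be expressed as a lookup, assuming a plain dict keyed by language. The names `ROUTES`, `GLOBAL_DEFAULT`, and `route` are hypothetical and not taken from the consultants skill itself; only the model pairs come from the table.

```python
# Hypothetical sketch: the (primary, fallback) pairs are copied from the
# table in the post; the names ROUTES, GLOBAL_DEFAULT and route() are assumptions.
ROUTES = {
    "c": ("glm-5.1:cloud", "deepseek-v4-pro:cloud"),
    "cpp": ("deepseek-v4-flash:cloud", "kimi-k2.6:cloud"),
    "csharp": ("deepseek-v4-pro:cloud", "kimi-k2.6:cloud"),
    "go": ("kimi-k2.6:cloud", "deepseek-v4-pro:cloud"),
    "python": ("glm-5.1:cloud", "kimi-k2.6:cloud"),
    "rust": ("deepseek-v4-flash:cloud", "deepseek-v4-pro:cloud"),
}

# Out-of-cohort languages (ts, java, ruby, swift, shell, ...) use this pair.
GLOBAL_DEFAULT = ("glm-5.1:cloud", "kimi-k2.6:cloud")


def route(lang: str) -> tuple[str, str]:
    """Return (primary, fallback) for a language, or the global default."""
    return ROUTES.get(lang.lower(), GLOBAL_DEFAULT)


if __name__ == "__main__":
    print(route("go"))
    print(route("ts"))
```

Keeping the default outside the dict means a newly benchmarked language only needs one new entry, and anything unbenchmarked still resolves to a known pair.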
🛠 tool_executor — 48 trials, suite 7921555c:
• gemma4:31b-cloud — 87.5% / Q=5.00 ← winner (tiebreak: quality → wall → call count)
• kimi-k2.6 / deepseek-v4-pro / glm-5.1 — all 87.5% but lost tiebreakers
• gemini-3-flash-preview — 75% (qualifies)
• qwen3-coder-next — 62.5%, DISQUALIFIED. The Python coder winner is NOT a good tool_executor — "best at writing code" ≠ "best at mechanical tool chains for reading code".
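The tiebreak order above (quality → wall → call count) can be sketched as a sort key. This is an illustration only: the `Result` type, the field names, and the demo rows (`model-a` etc.) are hypothetical placeholders, not benchmark data, and the sketch assumes higher quality, lower wall time, and fewer calls are each better.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record shape; field names are assumptions.
    model: str
    pass_rate: float    # fraction of trials passed
    avg_quality: float  # mean quality score
    wall_s: float       # wall-clock seconds
    calls: int          # number of tool calls


def rank_key(r: Result):
    # Pass rate first, then the tiebreak chain from the post:
    # quality (higher wins), wall (lower wins), call count (lower wins).
    return (-r.pass_rate, -r.avg_quality, r.wall_s, r.calls)


def pick_winner(results):
    return min(results, key=rank_key)


if __name__ == "__main__":
    # Placeholder rows, NOT measured values.
    demo = [
        Result("model-a", 0.875, 5.0, 40.0, 12),
        Result("model-b", 0.875, 5.0, 35.0, 14),
        Result("model-c", 0.875, 4.5, 20.0, 9),
    ]
    print(pick_winner(demo).model)  # model-b: ties on quality, wins on wall
```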
⏱ Stall thresholds (M11a, suite c8306c62):
• kimi-k2.6:cloud cold TTFT = 150s → needs stall=390s. The global 300s default would falsely STARTUP_STALL it.
• qwen3-coder-next fastest startup (390ms).
• deepseek-v4-flash slowest p99 wall (255s) → hard_cap raised to 780s.
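One way to hold these numbers is a per-model override map on top of the global default. A hedged sketch: the 300s default, the 390s stall for kimi-k2.6:cloud, and the 780s hard cap for deepseek-v4-flash come from the bullets above; the constant and function names are hypothetical, and no default hard cap is assumed because the post does not state one.

```python
from typing import Optional

DEFAULT_STALL_S = 300  # global default stall threshold named in the post

# Per-model overrides; model keys are written as they appear in the bullets.
STALL_OVERRIDES_S = {"kimi-k2.6:cloud": 390}
HARD_CAP_OVERRIDES_S = {"deepseek-v4-flash": 780}


def stall_threshold_s(model: str) -> int:
    """Stall threshold in seconds, falling back to the global default."""
    return STALL_OVERRIDES_S.get(model, DEFAULT_STALL_S)


def hard_cap_override_s(model: str) -> Optional[int]:
    """Raised hard cap in seconds, or None when no override is recorded."""
    return HARD_CAP_OVERRIDES_S.get(model)


if __name__ == "__main__":
    print(stall_threshold_s("kimi-k2.6:cloud"))
    print(stall_threshold_s("glm-5.1:cloud"))
    print(hard_cap_override_s("deepseek-v4-flash"))
```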
🧮 Rubric: pass_rate ≥ 70% AND avg_quality ≥ 3.5. Baselines append-only at docs/consultants-skill-eval-baselines.md. Re-bench any new model with claude-consultants skill-eval <suite> --live --accept-cost.
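As a sketch, the rubric reduces to a two-condition gate. The thresholds are the ones stated above; the function name `qualifies` is hypothetical and this is not the skill-eval implementation.

```python
MIN_PASS_RATE = 0.70    # pass_rate >= 70%
MIN_AVG_QUALITY = 3.5   # avg_quality >= 3.5


def qualifies(pass_rate: float, avg_quality: float) -> bool:
    """Both conditions must hold; failing either one disqualifies."""
    return pass_rate >= MIN_PASS_RATE and avg_quality >= MIN_AVG_QUALITY


if __name__ == "__main__":
    # 87.5% / Q=5.00 is the winner's reported score from the post.
    print(qualifies(0.875, 5.00))
    # A 62.5% pass rate fails the gate regardless of quality.
    print(qualifies(0.625, 5.00))
```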
🔗 Bench dirs: benchmarks/consultants/results/2026-05-17/
path:line claim checked against an mtime-cached code_graph. Catches fabricated filenames before they launder through critics + synthesizer.

llamafile:// model prefix. Multi-instance LRU, per-label idle reap, sticky CPU fallback. Stack runs

tag. This gives you real control over how hard the model thinks, from skipping the reasoning phase entirely with minimal, all the way up to xtra-Hi for deep, extended thought on hard problems. For most agentic use, auto or low is the move to keep things snappy.

contact@skinnertopia.com

🔄 Follow-up to the M1..M5 study above: re-ran M4 against the updated PR #682 head ("turbo" branch by @Tusm11, HEAD 8d989f6) with rebalanced hyperparams (equal weights w=1/1, density 0.7, closer to the PR's worked example). Same AttnLRP signal as M4-orig, same sources.
▶ Qwen3.5-4B-M4-v2-ex-LRP-turbo
ManniX-ITA/Qwen3.5-4B-M4-v2-ex-LRP-turbo
Q6_K HE / MBPP pass@1, M4-v2 inserted:
Δ M4-v2 vs M4-orig: +4.27 pp HE, +2.80 pp MBPP. M4-v2 takes the MBPP medal of the whole study (overtakes M5) while staying competitive on HumanEval. The turbo code path + rebalanced hyperparams clearly beat the original PR head on this configuration.
Findings refresh: Fisher still leads HE; ex-LRP (turbo) now leads MBPP, narrowly ahead of OMv2+LRP. Both LRP variants land within 1 pp on MBPP — strong signal that LRP-driven sparsification is doing real work for code-gen on small Qwen merges.
Big thanks to @Tusm11 for the supercharged ex-LRP turbo head — multimodal support + Iron-Man stabilization + in-place math are a real upgrade. Posted full results + the 6 patches needed to run it against Qwen3_5ForConditionalGeneration on the PR thread:
https://github.com/arcee-ai/mergekit/pull/682