⚡ FlashHead benchmarks for Llama 3.2, Gemma 3, and Qwen3 are now on embedl/Edge-Inference-Benchmarks! These are some of the models used in the FlashHead paper - now easier to explore and compare interactively.
⚡ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin. flash-head replaces the dense LM head with a two-stage retrieval pipeline - up to 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:
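A sketch of what the plugin-based workflow might look like. The package name on PyPI and the auto-registration behavior are assumptions based on vLLM's general plugin mechanism (plugins registered via entry points load automatically); check the flash-head repository for the exact commands:

```shell
# Hypothetical: install the plugin into the same environment as vLLM
pip install flash-head

# Hypothetical: vLLM discovers the plugin via its entry point, so serving
# a supported model (e.g. one from the benchmarks) needs no extra flags
vllm serve meta-llama/Llama-3.2-1B-Instruct
```

Compared with the previous custom-Docker-image setup, a plugin keeps your existing vLLM install and deployment scripts unchanged.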