Confucius3-Math-DFlash (draft model)

A DFlash block-diffusion speculative-decoding draft model for netease-youdao/Confucius3-Math. Use it as the --speculative-config model to accelerate Confucius3-Math inference (especially single-stream / low-latency math reasoning).

  • Target model: netease-youdao/Confucius3-Math (Qwen2 arch, 48 layers, DeepSeek-R1-distill thinking format)
  • Draft: 5-layer DFlashDraftModel, block size 16, ~1.5B params, taps target hidden states from layers [1,12,23,34,45]
  • Trained with: SpecForge, D-PACE loss, 6 epochs

Results (acceptance length = mean tokens accepted per draft+verify step, thinking mode)

dataset accept length draft accept rate tok/s (single stream)
GSM8K 5.47 30% 493
MATH-500 5.79 32% 526

Higher acceptance ⇒ more tokens emitted per target forward ⇒ larger speedup. Profiled on 1×H200, vLLM 0.22, temperature 0.

Usage (vLLM)

vllm serve netease-youdao/Confucius3-Math \
  --speculative-config '{"method": "dflash", "model": "noctuashap/Confucius3-Math-DFlash", "num_speculative_tokens": 15}' \
  --trust-remote-code

DFlash is supported in vLLM ≥ 0.20.1. --trust-remote-code is required (the draft is a custom DFlashDraftModel, included as dflash.py).

Training data

~148k math-leaning prompts (NuminaMath / MATH / GSM8K / OpenMathReasoning + some code/reasoning/general), regenerated by Confucius3-Math itself (thinking traces kept inline) so the draft matches the target's own output distribution. No correctness filtering (distribution matching, not correctness).

Built with Claude Code.

Downloads last month
16
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for noctuashap/Confucius3-Math-DFlash