Qwen3.5-REAP-20B-A3B-W4A16
Qwen3.5-REAP-20B-A3B-W4A16 is an INT4 weight-quantized version of atbender/Qwen3.5-REAP-20B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen/Qwen3.5-35B-A3B.
35B → 20B (REAP 50% prune) → ~11 GB (W4A16) | ~3B active per token
Model Specifications
| Property | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | atbender/Qwen3.5-REAP-20B-A3B | Qwen3.5-REAP-20B-A3B-W4A16 |
| Total Parameters | ~35B | ~20B | ~20B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 256 | 128 | 128 |
| Experts Routed per Token | 8 | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer | 1 per layer |
| Hidden Layers | 40 | 40 | 40 |
| Layer Types | 30 linear + 10 full attn | 30 linear + 10 full attn | 30 linear + 10 full attn |
| Hidden Size | 2048 | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 | 512 |
| Context Length | 262,144 | 262,144 | 262,144 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations) |
| Disk Size | ~67 GB | ~35 GB | ~11 GB |
Important: dtype Must Be bfloat16
This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use --dtype bfloat16 (or torch_dtype=torch.bfloat16). The model will produce garbage output with float16.
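A quick way to see the headroom difference (plain PyTorch, no model needed):

import torch

# float16 saturates at 65504, which GDN intermediates can exceed (producing inf/NaN)
print(torch.finfo(torch.float16).max)   # 65504.0
# bfloat16 keeps float32's exponent range at reduced mantissa precision
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38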
How to Use
Requirements
- `transformers` >= 5.x (from the `main` branch; the `qwen3_5_moe` model type is not in any released version yet)
- `torch` >= 2.4 with CUDA support
- ~11 GB disk + ~14 GB VRAM; fits on any 24 GB+ GPU
Install transformers from main:
pip install git+https://github.com/huggingface/transformers.git@main
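You can verify the install picked up a development build that includes the new model type:

python -c "import transformers; print(transformers.__version__)"  # should print a 5.x dev version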
With transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "atbender/Qwen3.5-REAP-20B-A3B-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required: float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
With vLLM
Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly). Requires `--enforce-eager` and `--dtype bfloat16`.
vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
    --dtype bfloat16 \
    --quantization gptq \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096
| Parameter | Value | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--quantization gptq` | Required | AutoRound uses GPTQ-compatible packing format |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| `--trust-remote-code` | Required | Qwen3.5 MoE uses custom model code |
At ~11 GB, this model fits comfortably on a single 24 GB GPU with room for KV cache:
vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
    --dtype bfloat16 \
    --quantization gptq \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
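Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch (assumes the default port 8000 and no API key configured):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="atbender/Qwen3.5-REAP-20B-A3B-W4A16",
    messages=[{"role": "user", "content": "Summarize what REAP pruning does."}],
    max_tokens=256,
)
print(response.choices[0].message.content)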
Docker
All pipeline steps were run using this Docker image:
FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3
RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main \
    auto-round
Note: Both `transformers` and `peft` must be installed from `main`; the released versions don't support `qwen3_5_moe` and have a `HybridCache` import error, respectively.
Compression Pipeline
Stage 1: REAP Expert Pruning (35B → 20B)
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to score each expert's saliency; the lowest-scoring 50% of MoE experts were pruned globally (256 → 128 per layer, across all 40 layers).
Saliency score per expert:
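The formula below is a sketch consistent with the description above; exact normalization and aggregation details follow the REAP paper:

$$S_j = \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} g_j(x)\,\lVert E_j(x) \rVert_2$$

where $\mathcal{X}_j$ is the set of calibration tokens routed to expert $j$, $g_j(x)$ is the router gate weight assigned to expert $j$ for token $x$, and $E_j(x)$ is the expert's output activation. Experts with the lowest scores are pruned.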
REAP calibration data:
| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |
Changes made during pruning:
- 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
- Fused expert tensors (`gate_up_proj`, `down_proj`) sliced along the expert dimension (see the sketch below)
- Router gate weights pruned accordingly (only rows for retained experts kept)
- Routing mechanism (`num_experts_per_tok = 8`) unchanged
- Shared expert and shared expert gate untouched
- Attention layers (both linear and full), embeddings, and all non-MoE components untouched
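A minimal sketch of that slicing step, for illustration only; the module and attribute names (`mlp.experts.gate_up_proj`, `mlp.gate`) are assumptions about the Qwen3.5 MoE layout, not the REAP repository's code:

import torch
import torch.nn as nn

def prune_moe_layer(layer: nn.Module, kept: torch.Tensor) -> None:
    # `kept` holds the indices of the 128 highest-saliency experts for this layer
    experts = layer.mlp.experts
    # Fused expert tensors are stacked along dim 0: [num_experts, ...]
    experts.gate_up_proj = nn.Parameter(experts.gate_up_proj.data[kept])
    experts.down_proj = nn.Parameter(experts.down_proj.data[kept])
    # Keep only the router rows for retained experts so gate logits stay aligned
    layer.mlp.gate.weight = nn.Parameter(layer.mlp.gate.weight.data[kept])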
Stage 2: AutoRound W4A16 Quantization (35 GB → 11 GB)
AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting rounding directions to minimize output error across calibration samples.
| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4 symmetric) |
| Group size | 128 |
| Format | auto_round:auto_gptq (safetensors) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized sub-layers | 264-265 per block (40 blocks) |
| Skipped (FP16) | `shared_expert_gate` (40 layers) |
| Skipped (BF16) | `mlp.gate` (MoE router, 40 layers); kept unquantized to preserve routing precision |
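The exported checkpoint records these choices in its config.json; one way to inspect them (field names depend on AutoRound's GPTQ-compatible export, so treat this as a sketch):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B-W4A16", trust_remote_code=True
)
print(cfg.quantization_config)  # expect bits=4, group_size=128, symmetric scheme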
How It Was Made
The Fun Part
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    trust_remote_code=True,
)

# Keep shared_expert_gate and the MoE router at FP16
layer_config = {}
for name, module in model.named_modules():
    if "shared_expert_gate" in name:
        layer_config[name] = {"bits": 16, "group_size": -1}
    if name.endswith(".mlp.gate"):
        layer_config[name] = {"bits": 16, "group_size": -1}

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
    layer_config=layer_config,
)
ar.quantize_and_save("./Qwen3.5-REAP-20B-A3B-W4A16", format="auto_round")
The Less Fun Part (Patches Required)
These patches are needed for transformers 5.x + AutoRound compatibility:
import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x; AutoRound still imports it
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils
elif not hasattr(transformers.pytorch_utils, 'Conv1D'):
    transformers.pytorch_utils.Conv1D = torch.nn.Linear

# Patch 2: Qwen3.5 is misdetected as multimodal (has vision_config in base);
# forcing text-only calibration excludes the vision encoder from quantization
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound
Known Issues and Fixes
Issues found and fixed in this repository
| Issue | Root Cause | Fix Applied |
|---|---|---|
| `qwen3_5_moe` model type not recognized | Model type not in released transformers (4.57.x) | Install transformers from `main` branch |
| `HybridCache` import error | peft incompatible with transformers 5.x | Install peft from `main` branch |
| `Conv1D` import failure in AutoRound | `transformers.pytorch_utils.Conv1D` removed in 5.x | Inject `Conv1D` shim before importing AutoRound |
| Qwen3.5 detected as multimodal | `is_mllm_model` returns True due to `vision_config` in base | Override `is_mllm_model` to return False |
| `shared_expert_gate` quantized | Scalar gating layer needs FP16 precision | Added to `layer_config` with `bits=16` |
| MoE router quantized (broken routing) | `mlp.gate` (router weights) must stay unquantized | Added to `layer_config` with `bits=16` |
| NaN output with float16 | GDN linear attention overflows fp16 range | Use `torch_dtype=torch.bfloat16` always |
Weight format notes
- Expert weights are stored in per-expert format (`experts.gate_up_proj.{id}.qweight`); transformers handles unfusing automatically during loading
- Do NOT fuse expert tensors into a stacked `[num_experts, ...]` format; transformers' quantized loading expects per-expert keys
- This is the CausalLM variant (`Qwen3_5MoeForCausalLM`), so layer paths are `model.layers.*`, not `model.language_model.layers.*` (which applies to the multimodal `ConditionalGeneration` variant)
Architecture Notes
Qwen3.5-35B-A3B uses a hybrid attention architecture:
- 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
- 10 full attention layers (standard multi-head attention with RoPE), one at every 4th layer
- Fused expert format: experts are stored as `Qwen3_5MoeExperts` with stacked tensors in BF16, then unfused to per-expert quantized tensors after AutoRound
- Shared expert: each MoE layer has a shared expert (always active) plus a shared expert gate
- Custom router: `Qwen3_5MoeTopKRouter` returns `(softmax_probs, normalized_topk_scores, topk_indices)`
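A quick sanity check of the hybrid layout (the exact placement convention is an assumption; only the 30/10 split is stated above):

# 40 layers, full attention at every 4th position (0-indexed placement assumed)
layer_types = ["full" if i % 4 == 3 else "linear" for i in range(40)]
assert layer_types.count("full") == 10
assert layer_types.count("linear") == 30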
Hardware & Runtime
| Stage | Time | Hardware |
|---|---|---|
| REAP saliency collection (256 samples) | ~8 min | 1Γ NVIDIA RTX PRO 6000 Blackwell 98GB |
| REAP expert pruning | <1 sec | CPU |
| AutoRound quantization (40 blocks) | ~27 min | 1Γ NVIDIA RTX PRO 6000 Blackwell 98GB |
| Total pipeline | ~37 min | |
Resource Usage (quantization stage):
- Peak VRAM: 35.82 GB
- Peak RAM: ~61 GB
Intended Use
- Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
- Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
- Local / single-GPU deployment of a compressed Qwen3.5 MoE variant (~11 GB fits on any 24 GB+ GPU)
Limitations
- No post-pruning fine-tuning: raw prune + quantize, so some quality degradation is expected
- Aggressive compression: 50% expert removal + 4-bit quantization is significant
- Calibration bias: both REAP and AutoRound used general text (pile-10k); code-heavy or domain-specific tasks may be disproportionately affected
- Not benchmarked: no formal evals run yet; contributions welcome
- Text-only: this is the CausalLM variant, not the multimodal ConditionalGeneration model
- Requires transformers from `main`: the `qwen3_5_moe` model type is not in any released version yet
Citation
@article{lasby2025reap,
title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
author={Lasby, Mike and others},
year={2025},
url={https://github.com/CerebrasResearch/reap}
}
@misc{autoround2024,
title={AutoRound: Advanced Weight Quantization},
author={Intel Corporation},
year={2024},
howpublished={\url{https://github.com/intel/auto-round}}
}
Acknowledgments
- Qwen team for the Qwen3.5-35B-A3B base model
- Cerebras Research for the REAP pruning method
- Intel for the AutoRound quantization framework
License
Apache License 2.0, the same license as the base Qwen3.5-35B-A3B model.