Qwen3.5-REAP-20B-A3B-W4A16
Qwen3.5-REAP-20B-A3B-W4A16 is an INT4 weight-quantized version of atbender/Qwen3.5-REAP-20B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen/Qwen3.5-35B-A3B.
35B → 20B (REAP 50% prune) → ~11 GB (W4A16) | ~3B active per token
Model Specifications
| Property | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | atbender/Qwen3.5-REAP-20B-A3B | Qwen3.5-REAP-20B-A3B-W4A16 |
| Total Parameters | ~35B | ~20B | ~20B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 256 | 128 | 128 |
| Experts Routed per Token | 8 | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer | 1 per layer |
| Hidden Layers | 40 | 40 | 40 |
| Layer Types | 30 linear + 10 full attn | 30 linear + 10 full attn | 30 linear + 10 full attn |
| Hidden Size | 2048 | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 | 512 |
| Context Length | 262,144 | 262,144 | 262,144 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations) |
| Disk Size | ~67 GB | ~35 GB | ~11 GB |
Important: dtype Must Be bfloat16
This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use --dtype bfloat16 (or torch_dtype=torch.bfloat16). The model will produce garbage output with float16.
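A quick way to see the headroom difference (plain PyTorch, no model needed):

import torch

# float16 saturates at 65504, which GDN intermediates can exceed (producing inf/NaN)
print(torch.finfo(torch.float16).max)   # 65504.0
# bfloat16 keeps float32's exponent range at reduced mantissa precision
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38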
How to Use
Requirements
- `transformers` >= 5.x (from the `main` branch; the `qwen3_5_moe` model type is not in any released version yet)
- `torch` >= 2.4 with CUDA support
- ~11 GB disk + ~14 GB VRAM; fits on any 24 GB+ GPU
Install transformers from main:
pip install git+https://github.com/huggingface/transformers.git@main
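You can verify the install picked up a development build that includes the new model type:

python -c "import transformers; print(transformers.__version__)"  # should print a 5.x dev version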
With transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "atbender/Qwen3.5-REAP-20B-A3B-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required: float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
With vLLM
Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly). Requires `--enforce-eager` and `--dtype bfloat16`.
vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
    --dtype bfloat16 \
    --quantization gptq \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len 4096
| Parameter | Value | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--quantization gptq` | Required | AutoRound uses GPTQ-compatible packing format |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| `--trust-remote-code` | Required | Qwen3.5 MoE uses custom model code |
At ~11 GB, this model fits comfortably on a single 24 GB GPU with room for KV cache:
vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
    --dtype bfloat16 \
    --quantization gptq \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
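Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch (assumes the default port 8000 and no API key configured):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="atbender/Qwen3.5-REAP-20B-A3B-W4A16",
    messages=[{"role": "user", "content": "Summarize what REAP pruning does."}],
    max_tokens=256,
)
print(response.choices[0].message.content)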
Docker
All pipeline steps were run using this Docker image:
FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3
RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main \
    auto-round
Note: Both `transformers` and `peft` must be installed from `main`; the released versions don't support `qwen3_5_moe` and have a `HybridCache` import error, respectively.
Compression Pipeline
Stage 1: REAP Expert Pruning (35B → 20B)
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to score each expert's saliency; the lowest-scoring 50% of MoE experts were pruned globally (256 → 128 per layer, across all 40 layers).
Saliency score per expert:
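The formula below is a sketch consistent with the description above; exact normalization and aggregation details follow the REAP paper:

$$S_j = \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} g_j(x)\,\lVert E_j(x) \rVert_2$$

where $\mathcal{X}_j$ is the set of calibration tokens routed to expert $j$, $g_j(x)$ is the router gate weight assigned to expert $j$ for token $x$, and $E_j(x)$ is the expert's output activation. Experts with the lowest scores are pruned.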
REAP calibration data:
| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |
Changes made during pruning:
- 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
- Fused expert tensors (`gate_up_proj`, `down_proj`) sliced along the expert dimension (see the sketch below)
- Router gate weights pruned accordingly (only rows for retained experts kept)
- Routing mechanism (`num_experts_per_tok = 8`) unchanged
- Shared expert and shared expert gate untouched
- Attention layers (both linear and full), embeddings, and all non-MoE components untouched
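A minimal sketch of that slicing step, for illustration only; the module and attribute names (`mlp.experts.gate_up_proj`, `mlp.gate`) are assumptions about the Qwen3.5 MoE layout, not the REAP repository's code:

import torch
import torch.nn as nn

def prune_moe_layer(layer: nn.Module, kept: torch.Tensor) -> None:
    # `kept` holds the indices of the 128 highest-saliency experts for this layer
    experts = layer.mlp.experts
    # Fused expert tensors are stacked along dim 0: [num_experts, ...]
    experts.gate_up_proj = nn.Parameter(experts.gate_up_proj.data[kept])
    experts.down_proj = nn.Parameter(experts.down_proj.data[kept])
    # Keep only the router rows for retained experts so gate logits stay aligned
    layer.mlp.gate.weight = nn.Parameter(layer.mlp.gate.weight.data[kept])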
Stage 2: AutoRound W4A16 Quantization (35 GB → 11 GB)
AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting rounding directions to minimize output error across calibration samples.
| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4 symmetric) |
| Group size | 128 |
| Format | auto_round:auto_gptq (safetensors) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized sub-layers | 264-265 per block (40 blocks) |
| Skipped (FP16) | `shared_expert_gate` (40 layers) |
| Skipped (BF16) | `mlp.gate` (MoE router, 40 layers); kept unquantized to preserve routing precision |
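The exported checkpoint records these choices in its config.json; one way to inspect them (field names depend on AutoRound's GPTQ-compatible export, so treat this as a sketch):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B-W4A16", trust_remote_code=True
)
print(cfg.quantization_config)  # expect bits=4, group_size=128, symmetric scheme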
How It Was Made
The Fun Part
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    trust_remote_code=True,
)

# Keep shared_expert_gate and the MoE router at FP16
layer_config = {}
for name, module in model.named_modules():
    if "shared_expert_gate" in name:
        layer_config[name] = {"bits": 16, "group_size": -1}
    if name.endswith(".mlp.gate"):
        layer_config[name] = {"bits": 16, "group_size": -1}

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
    layer_config=layer_config,
)
ar.quantize_and_save("./Qwen3.5-REAP-20B-A3B-W4A16", format="auto_round")
The Less Fun Part (Patches Required)
These patches are needed for transformers 5.x + AutoRound compatibility:
import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x; AutoRound still imports it
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils
elif not hasattr(transformers.pytorch_utils, 'Conv1D'):
    transformers.pytorch_utils.Conv1D = torch.nn.Linear

# Patch 2: Qwen3.5 is misdetected as multimodal (has vision_config in base);
# forcing text-only calibration excludes the vision encoder from quantization
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound
Known Issues and Fixes
Issues found and fixed in this repository
| Issue | Root Cause | Fix Applied |
|---|---|---|
| `qwen3_5_moe` model type not recognized | Model type not in released transformers (4.57.x) | Install transformers from `main` branch |
| `HybridCache` import error | peft incompatible with transformers 5.x | Install peft from `main` branch |
| `Conv1D` import failure in AutoRound | `transformers.pytorch_utils.Conv1D` removed in 5.x | Inject `Conv1D` shim before importing AutoRound |
| Qwen3.5 detected as multimodal | `is_mllm_model` returns True due to `vision_config` in base | Override `is_mllm_model` to return False |
| `shared_expert_gate` quantized | Scalar gating layer needs FP16 precision | Added to `layer_config` with `bits=16` |
| MoE router quantized (broken routing) | `mlp.gate` (router weights) must stay unquantized | Added to `layer_config` with `bits=16` |
| NaN output with float16 | GDN linear attention overflows fp16 range | Use `torch_dtype=torch.bfloat16` always |
Weight format notes
- Expert weights are stored in per-expert format (`experts.gate_up_proj.{id}.qweight`); transformers handles unfusing automatically during loading
- Do NOT fuse expert tensors into a stacked `[num_experts, ...]` format; transformers' quantized loading expects per-expert keys
- This is the CausalLM variant (`Qwen3_5MoeForCausalLM`), so layer paths are `model.layers.*`, not `model.language_model.layers.*` (which applies to the multimodal `ConditionalGeneration` variant)
Architecture Notes
Qwen3.5-35B-A3B uses a hybrid attention architecture:
- 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
- 10 full attention layers (standard multi-head attention with RoPE), one at every 4th layer
- Fused expert format: experts are stored as `Qwen3_5MoeExperts` with stacked tensors in BF16, then unfused to per-expert quantized tensors after AutoRound
- Shared expert: each MoE layer has a shared expert (always active) plus a shared expert gate
- Custom router: `Qwen3_5MoeTopKRouter` returns `(softmax_probs, normalized_topk_scores, topk_indices)`
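A quick sanity check of the hybrid layout (the exact placement convention is an assumption; only the 30/10 split is stated above):

# 40 layers, full attention at every 4th position (0-indexed placement assumed)
layer_types = ["full" if i % 4 == 3 else "linear" for i in range(40)]
assert layer_types.count("full") == 10
assert layer_types.count("linear") == 30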
Hardware & Runtime
| Stage | Time | Hardware |
|---|---|---|
| REAP saliency collection (256 samples) | ~8 min | 1Γ NVIDIA RTX PRO 6000 Blackwell 98GB |
| REAP expert pruning | <1 sec | CPU |
| AutoRound quantization (40 blocks) | ~27 min | 1Γ NVIDIA RTX PRO 6000 Blackwell 98GB |
| Total pipeline | ~37 min | |
Resource Usage (quantization stage):
- Peak VRAM: 35.82 GB
- Peak RAM: ~61 GB
Intended Use
- Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
- Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
- Local / single-GPU deployment of a compressed Qwen3.5 MoE variant (~11 GB fits on any 24 GB+ GPU)
Limitations
- No post-pruning fine-tuning: raw prune + quantize, so some quality degradation is expected
- Aggressive compression: 50% expert removal + 4-bit quantization is significant
- Calibration bias: both REAP and AutoRound used general text (pile-10k); code-heavy or domain-specific tasks may be disproportionately affected
- Not benchmarked: no formal evals run yet; contributions welcome
- Text-only: this is the CausalLM variant, not the multimodal ConditionalGeneration model
- Requires transformers from `main`: the `qwen3_5_moe` model type is not in any released version yet
Citation
@article{lasby2025reap,
title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
author={Lasby, Mike and others},
year={2025},
url={https://github.com/CerebrasResearch/reap}
}
@misc{autoround2024,
title={AutoRound: Advanced Weight Quantization},
author={Intel Corporation},
year={2024},
howpublished={\url{https://github.com/intel/auto-round}}
}
Acknowledgments
- Qwen team for the Qwen3.5-35B-A3B base model
- Cerebras Research for the REAP pruning method
- Intel for the AutoRound quantization framework
License
Apache License 2.0, the same license as the base Qwen3.5-35B-A3B model.