Qwen3.5-REAP-20B-A3B-W4A16

Qwen3.5-REAP-20B-A3B-W4A16 is an INT4 weight-quantized version of atbender/Qwen3.5-REAP-20B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen/Qwen3.5-35B-A3B.

35B → 20B (REAP 50% prune) → 11 GB (W4A16) | ~3B active per token

Model Specifications

| Property | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | atbender/Qwen3.5-REAP-20B-A3B | Qwen3.5-REAP-20B-A3B-W4A16 |
| Total Parameters | ~35B | ~20B | ~20B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 256 | 128 | 128 |
| Experts Routed per Token | 8 | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer | 1 per layer |
| Hidden Layers | 40 | 40 | 40 |
| Layer Types | 30 linear + 10 full attn | 30 linear + 10 full attn | 30 linear + 10 full attn |
| Hidden Size | 2048 | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 | 512 |
| Context Length | 262,144 | 262,144 | 262,144 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, BF16 activations) |
| Disk Size | ~67 GB | ~35 GB | ~11 GB |

Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use --dtype bfloat16 (or torch_dtype=torch.bfloat16). The model will produce garbage output with float16.
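
A quick standalone illustration of the failure mode (plain PyTorch, not model code):

import torch

# float16's largest finite value is 65504, so bigger intermediates overflow to inf
# (and turn into NaN downstream); bfloat16 shares float32's exponent range and
# keeps the value finite, just with reduced mantissa precision.
x = torch.tensor([70000.0])
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # finite, rounded to a nearby representable value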


How to Use

Requirements

  • transformers >= 5.x (from main branch): the qwen3_5_moe model type is not in any released version yet
  • torch >= 2.4 with CUDA support
  • ~11 GB disk + ~14 GB VRAM; fits on any 24 GB+ GPU

Install transformers from main:

pip install git+https://github.com/huggingface/transformers.git@main
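
To check that the installed build actually recognizes the architecture, loading just the config is enough (a quick sanity check; it fetches only the config file, not the weights):

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B-W4A16",
    trust_remote_code=True,
)
print(config.model_type)  # expected: qwen3_5_moe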

With transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-20B-A3B-W4A16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required: float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

With vLLM

Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly). Requires --enforce-eager and --dtype bfloat16.

vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
  --dtype bfloat16 \
  --quantization gptq \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 4096

| Parameter | Requirement | Why |
|---|---|---|
| --dtype bfloat16 | Required | GDN linear attention overflows float16 |
| --quantization gptq | Required | AutoRound uses a GPTQ-compatible packing format |
| --enforce-eager | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| --trust-remote-code | Required | Qwen3.5 MoE uses custom model code |

At ~11 GB, this model fits comfortably on a single 24 GB GPU with room for KV cache:

vllm serve atbender/Qwen3.5-REAP-20B-A3B-W4A16 \
  --dtype bfloat16 \
  --quantization gptq \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
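
Once the server is running it exposes an OpenAI-compatible API. A minimal client sketch using the openai Python package, assuming vLLM's default endpoint at http://localhost:8000/v1 and no --api-key set:

from openai import OpenAI

# The api_key value is a placeholder; vLLM ignores it unless --api-key is configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="atbender/Qwen3.5-REAP-20B-A3B-W4A16",
    messages=[{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)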

Docker

All pipeline steps were run using this Docker image:

FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3

RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main \
    auto-round

Note: both transformers and peft must be installed from main. The released transformers does not support qwen3_5_moe, and the released peft has a HybridCache import error with transformers 5.x.


Compression Pipeline

Stage 1: REAP Expert Pruning (35B → 20B)

REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to decide which experts to prune. Here, 50% of the MoE experts were pruned globally (256 → 128 per layer, across all 40 layers).

Saliency score per expert:

$$S_j = \frac{1}{|T_j|} \sum_{t \in T_j} g_{t,j} \cdot \|e_{t,j}\|_2$$

where T_j is the set of calibration tokens routed to expert j, g_{t,j} is the router gate weight for token t on expert j, and e_{t,j} is expert j's output for token t.
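
A minimal sketch of how this score can be accumulated over a calibration batch. This is illustrative only; the tensor names and shapes are assumptions, not the actual REAP implementation:

import torch

def reap_saliency(gate_probs, expert_outputs, topk_idx, num_experts):
    """Accumulate S_j = mean over routed tokens t of g_{t,j} * ||e_{t,j}||_2.

    gate_probs:     [tokens, num_experts] router probabilities g_{t,j}
    expert_outputs: [tokens, k, hidden]   outputs e_{t,j} of the k routed experts per token
    topk_idx:       [tokens, k] (long)    indices of the routed experts
    """
    norms = expert_outputs.norm(dim=-1)            # [tokens, k] = ||e_{t,j}||_2
    gates = torch.gather(gate_probs, 1, topk_idx)  # [tokens, k] = g_{t,j}
    contrib = (gates * norms).reshape(-1)

    scores = torch.zeros(num_experts).scatter_add_(0, topk_idx.reshape(-1), contrib)
    counts = torch.zeros(num_experts).scatter_add_(0, topk_idx.reshape(-1), torch.ones_like(contrib))
    return scores / counts.clamp(min=1)            # S_j; prune the lowest-scoring 50%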

REAP calibration data:

| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |

Changes made during pruning:

  • 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
  • Fused expert tensors (gate_up_proj, down_proj) sliced along the expert dimension (see the sketch after this list)
  • Router gate weights pruned accordingly (only retained expert rows kept)
  • Routing mechanism (num_experts_per_tok = 8) unchanged
  • Shared expert and shared expert gate untouched
  • Attention layers (both linear and full), embeddings, and all non-MoE components untouched
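
The slicing step on a single layer can be sketched as follows, assuming stacked expert tensors of shape [num_experts, ...] and a router weight of shape [num_experts, hidden]; the names are illustrative, not the REAP code:

import torch

def prune_layer_experts(gate_up_proj, down_proj, router_weight, saliency, keep):
    """Keep the `keep` highest-saliency experts; slice fused tensors and router rows to match."""
    keep_idx = torch.topk(saliency, keep).indices.sort().values  # preserve original expert order
    return (
        gate_up_proj[keep_idx],   # fused gate/up projections for retained experts
        down_proj[keep_idx],      # down projections for retained experts
        router_weight[keep_idx],  # router rows for retained experts
        keep_idx,                 # mapping from new expert id to old expert id
    )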

Stage 2: AutoRound W4A16 Quantization (35 GB → 11 GB)

AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting rounding directions to minimize output error across calibration samples.

| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4 symmetric) |
| Group size | 128 |
| Format | auto_round:auto_gptq (safetensors) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized sub-layers | 264-265 per block (40 blocks) |
| Skipped (FP16) | shared_expert_gate (40 layers) |
| Skipped (BF16) | mlp.gate (MoE router, 40 layers): kept unquantized to preserve routing precision |

How It Was Made

The Fun Part

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3.5-REAP-20B-A3B",
    trust_remote_code=True,
)

# Keep shared_expert_gate and MoE router at FP16
layer_config = {}
for name, module in model.named_modules():
    if "shared_expert_gate" in name:
        layer_config[name] = {"bits": 16, "group_size": -1}
    if name.endswith(".mlp.gate"):
        layer_config[name] = {"bits": 16, "group_size": -1}

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
    layer_config=layer_config,
)
ar.quantize_and_save("./Qwen3.5-REAP-20B-A3B-W4A16", format="auto_round")

The Less Fun Part (Patches Required)

These patches are needed for transformers 5.x + AutoRound compatibility:

import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x; AutoRound still imports it
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils
elif not hasattr(transformers.pytorch_utils, 'Conv1D'):
    transformers.pytorch_utils.Conv1D = torch.nn.Linear

# Patch 2: Qwen3.5 misdetected as multimodal (has vision_config in base)
# Forces text-only calibration, excluding vision encoder from quantization
import auto_round.utils
import auto_round.autoround

auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound

Known Issues and Fixes

Issues found and fixed in this repository

| Issue | Root Cause | Fix Applied |
|---|---|---|
| qwen3_5_moe model type not recognized | Model type not in released transformers (4.57.x) | Install transformers from main branch |
| HybridCache import error | peft incompatible with transformers 5.x | Install peft from main branch |
| Conv1D import failure in AutoRound | transformers.pytorch_utils.Conv1D removed in 5.x | Inject Conv1D shim before importing AutoRound |
| Qwen3.5 detected as multimodal | is_mllm_model returns True due to vision_config in base | Override is_mllm_model to return False |
| shared_expert_gate quantized | Scalar gating layer needs FP16 precision | Added to layer_config with bits=16 |
| MoE router quantized (broken routing) | mlp.gate (router weights) must stay unquantized | Added to layer_config with bits=16 |
| NaN output with float16 | GDN linear attention activations exceed the fp16 range | Always use torch_dtype=torch.bfloat16 |

Weight format notes

  • Expert weights are stored in per-expert format (experts.gate_up_proj.{id}.qweight); transformers handles unfusing automatically during loading (a key-inspection sketch follows this list)
  • Do NOT fuse expert tensors into stacked [num_experts, ...] format; transformers' quantized loading expects per-expert keys
  • This is the CausalLM variant (Qwen3_5MoeForCausalLM), so layer paths are model.layers.*, not model.language_model.layers.* (which applies to the multimodal ConditionalGeneration variant)
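
One small way to verify the key layout is to read the safetensors index (a sketch; it assumes the huggingface_hub package and a sharded checkpoint that ships a model.safetensors.index.json; adjust if the repo stores a single model.safetensors file):

import json
from huggingface_hub import hf_hub_download

# Download only the index file and list the recorded tensor names.
index_path = hf_hub_download(
    "atbender/Qwen3.5-REAP-20B-A3B-W4A16",
    "model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

expert_keys = [k for k in weight_map if ".experts." in k and k.endswith(".qweight")]
print(len(expert_keys), expert_keys[:2])
# Expect per-expert keys such as "model.layers.0.mlp.experts.gate_up_proj.0.qweight",
# not stacked [num_experts, ...] tensors.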

Architecture Notes

Qwen3.5-35B-A3B uses a hybrid attention architecture:

  • 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
  • 10 full attention layers (standard multi-head attention with RoPE) at every 4th layer (see the index sketch after this list)
  • Fused expert format: Experts stored as Qwen3_5MoeExperts with stacked tensors in BF16; unfused to per-expert quantized tensors after AutoRound
  • Shared expert: Each MoE layer has a shared expert (always active) + shared expert gate
  • Custom router: Qwen3_5MoeTopKRouter returns (softmax_probs, normalized_topk_scores, topk_indices)
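
A tiny sketch of what that layout implies for layer indices (the exact placement of the full-attention layers is an assumption, not read from the config):

# Hypothetical pattern: full attention at every 4th layer (indices 3, 7, ..., 39),
# GDN linear attention elsewhere; 40 layers total = 30 linear + 10 full.
layer_types = ["full_attention" if (i + 1) % 4 == 0 else "linear_attention" for i in range(40)]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))  # 30 10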

Hardware & Runtime

| Stage | Time | Hardware |
|---|---|---|
| REAP saliency collection (256 samples) | ~8 min | 1× NVIDIA RTX PRO 6000 Blackwell 98GB |
| REAP expert pruning | <1 sec | CPU |
| AutoRound quantization (40 blocks) | ~27 min | 1× NVIDIA RTX PRO 6000 Blackwell 98GB |
| Total pipeline | ~37 min | |

Resource Usage (quantization stage):

  • Peak VRAM: 35.82 GB
  • Peak RAM: ~61 GB

Intended Use

  • Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
  • Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
  • Local / single-GPU deployment of a compressed Qwen3.5 MoE variant (~11 GB fits on any 24 GB+ GPU)

Limitations

  • No post-pruning fine-tuning: raw prune + quantize, quality degradation expected
  • Aggressive compression: 50% expert removal + 4-bit quantization is significant
  • Calibration bias: both REAP and AutoRound used general text (pile-10k); code-heavy or domain-specific tasks may be disproportionately affected
  • Not benchmarked: no formal evals run yet; contributions welcome
  • Text-only: this is the CausalLM variant, not the multimodal ConditionalGeneration model
  • Requires transformers from main: the qwen3_5_moe model type is not in any released version yet

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}

Acknowledgments

  • Qwen team: Qwen3.5-35B-A3B base model
  • Cerebras Research: REAP pruning method
  • Intel: AutoRound quantization framework

License

Apache License 2.0, same as the base Qwen3.5-35B-A3B model.
