GLM-5.2-Int4-Int8Mix generates only "!!!!!" on H100 8×80GB with vLLM 0.23.0/0.24.0 — IndexShare sparse indexer K-cache never populated during prefill
Hi,
We are running QuantTrio/GLM-5.2-Int4-Int8Mix on a self-hosted H100 cluster and consistently get degenerate outputs (all exclamation marks !!!!!!)
regardless of the prompt. After extensive debugging, we traced the root cause to the IndexShare sparse indexer K-cache not being populated during the
prefill phase in vLLM 0.23.0 and 0.24.0. We'd like to ask whether the official team has a known fix or workaround.
Environment
┌────────────────────┬──────────────────────────────────┐
│ Component │ Version │
├────────────────────┼──────────────────────────────────┤
│ GPU │ 8× NVIDIA H100 SXM5 80GB (sm_90) │
├────────────────────┼──────────────────────────────────┤
│ CUDA │ 13.0 │
├────────────────────┼──────────────────────────────────┤
│ PyTorch │ 2.11.0+cu130 │
├────────────────────┼──────────────────────────────────┤
│ vLLM │ 0.23.0 / 0.24.0 (both tested) │
├────────────────────┼──────────────────────────────────┤
│ transformers │ 5.12.1 │
├────────────────────┼──────────────────────────────────┤
│ compressed-tensors │ 0.17.0 │
├────────────────────┼──────────────────────────────────┤
│ Python │ 3.11.11 │
└────────────────────┴──────────────────────────────────┘
Launch Command
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export LD_LIBRARY_PATH=/root/miniconda3/envs/vllm_glm/lib/python3.11/site-packages/nvidia/cu13/lib:${LD_LIBRARY_PATH}
vllm serve /path/to/GLM-5.2-Int4-Int8Mix \
  --host 0.0.0.0 --port 8000 \
  --served-model-name GLM5.2 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice
Symptom
Every generation, regardless of prompt or temperature, produces only repeated ! characters:
content: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
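For anyone trying to reproduce this, per-token log-probabilities can be requested through the OpenAI-compatible endpoint. The sketch below is only an illustration: it assumes the server from the launch command above (port 8000, served model name GLM5.2) and the standard `logprobs` / `top_logprobs` chat-completions fields. The helper names are ours and are not part of vLLM.

```python
import json
import urllib.request


def build_logprob_request(prompt, model="GLM5.2", top_logprobs=5, max_tokens=8):
    """Build a chat-completions payload that asks for per-token logprobs."""
    return {
        "model": model,  # assumption: matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "logprobs": True,
        "top_logprobs": top_logprobs,
    }


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to the server and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def top_candidates(response):
    """Yield (token, logprob) for every alternative at every generated position."""
    for entry in response["choices"][0]["logprobs"]["content"]:
        for cand in entry["top_logprobs"]:
            yield cand["token"], cand["logprob"]


# Usage against a running server (not executed here):
#   for tok, lp in top_candidates(send(build_logprob_request("hi"))):
#       print(f"tok: {tok!r} logprob: {lp:.3f}")
```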
Inspecting logprobs reveals the logit distribution is perfectly uniform across the entire 154,880-token vocabulary:
tok: '!' logprob: -11.950 (= -ln(154880) → uniform distribution)
tok: '#' logprob: -11.950
tok: '"' logprob: -11.950
...
This indicates that the final hidden states entering lm_head are effectively zero: every logit is identical, so the softmax collapses to a uniform distribution and
argmax always selects token 0 (!).
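The arithmetic can be checked in a few lines. This is a minimal sketch in plain Python, not vLLM code: it assumes all-zero logits as a stand-in for the collapsed lm_head output, and a first-index tie-break for argmax (which is how Python's `max` behaves; GPU kernels may differ).

```python
import math

VOCAB_SIZE = 154_880  # vocabulary size quoted above


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Log-probability of any single token under a uniform distribution.
uniform_logprob = -math.log(VOCAB_SIZE)

# Assumption: identical (here zero) logits model the collapsed hidden state.
logits = [0.0] * VOCAB_SIZE
logprobs = log_softmax(logits)

# max() returns the first maximal element, so ties resolve to index 0.
greedy_token = max(range(VOCAB_SIZE), key=logprobs.__getitem__)

print(f"uniform logprob: {uniform_logprob:.3f}")
print(f"greedy token id: {greedy_token}")
```

Every entry of `logprobs` equals `uniform_logprob`, which matches the -11.950 seen for each token in the server output.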
Root Cause Investigation
We added instrumentation at multiple points and traced the failure to the IndexShare sparse indexer K-cache in GLM-5.2's DSA (DeepSeek Sparse Attention)
architecture:
- During decode, topk_indices are almost entirely -1 (invalid).
  Decode step after a 3-token prefill ("hi"):
  [TOPK] valid=3/2048 sample=[0, -1, -1]
  Only 3 of the 2048 topk slots are valid, matching the 3 prefill tokens. The FlashMLASparse attention kernel then attends to only 1–3 positions,
  producing near-zero outputs and causing the hidden state collapse.
- The indexer K-cache is never written during prefill.
  In vLLM's execution flow for sparse MLA models, MultiHeadLatentAttentionWrapper.forward() is not called during the prefill phase. The prefill path
  goes through a separate prefill_backend that bypasses the Indexer.forward() call entirely. As a result, the indexer K-cache remains empty after
  prefill, and decode-time topk selection has no valid KV history to choose from.
  We confirmed this by logging inside Indexer.forward():
  - It is called during CUDA graph warmup (batch sizes 4, 8192, etc.)
  - It is called during decode (as expected, inside the CUDA graph)
  - It is never called during actual prefill inference
- The topk_length fix (PR #36616, included in v0.24.0) is insufficient.
  Even after applying the topk_length fix, which tells the kernel to ignore -1 indices, the output remains degenerate because the K-cache itself is
  empty: there are no valid KVs to attend to in the first place.
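The valid-slot count in the [TOPK] line can be computed along the following lines. This is a simplified sketch on plain Python lists, not the instrumentation actually used inside vLLM: the helper names are hypothetical, and -1 as the invalid marker is taken from the log above.

```python
def count_valid(topk_indices, invalid=-1):
    """Count the top-k slots that point at a real KV position."""
    return sum(1 for idx in topk_indices if idx != invalid)


def summarize(topk_indices, sample=3):
    """Format one row of top-k indices in the style of the log line above."""
    valid = count_valid(topk_indices)
    head = list(topk_indices[:sample])
    return f"[TOPK] valid={valid}/{len(topk_indices)} sample={head}"


# Illustrative row: 3 valid positions, the remaining 2045 slots padded with -1.
row = [0, 1, 2] + [-1] * 2045
print(summarize(row))
```

On a healthy run with a long prompt, most of the 2048 slots would be expected to hold valid positions, so a low valid count relative to the prompt length is a quick signal that the indexer cache is not being filled.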
What We Have Tried
- vLLM 0.23.0 and 0.24.0 (both include PR #45895 and PR #36616) → same result
- Removing --kv-cache-dtype fp8 (BF16 KV cache) → same result
- Removing --speculative-config (no MTP) → same result
- Removing --enable-expert-parallel → same result
- Adding --enforce-eager → same result
- All model files verified intact (138 files, 378 GiB, all sizes match HuggingFace repo)
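A size check like the one in the last item can be sketched as below. It is an illustration only: `expected_sizes` is a hypothetical manifest mapping relative file paths to byte sizes, which would have to be collected from the repository listing separately, and the function only compares sizes, not checksums.

```python
from pathlib import Path


def find_size_mismatches(model_dir, expected_sizes):
    """Return {relative_path: (expected, actual)} for missing or wrong-sized files.

    `actual` is None when the file does not exist locally.
    """
    root = Path(model_dir)
    mismatches = {}
    for rel_path, expected in expected_sizes.items():
        path = root / rel_path
        actual = path.stat().st_size if path.is_file() else None
        if actual != expected:
            mismatches[rel_path] = (expected, actual)
    return mismatches


# Usage (the manifest contents are placeholders, not real file sizes):
#   bad = find_size_mismatches("/path/to/GLM-5.2-Int4-Int8Mix", expected_sizes)
#   print("all files match" if not bad else bad)
```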
Question
The vLLM recipes page mentions using the dedicated Docker image vllm/vllm-openai:glm52 for the best experience with GLM-5.2. We believe this
image contains additional fixes for the prefill indexer K-cache population issue that are not yet in the PyPI releases.
Could you clarify:
- Is this a known issue with GLM-5.2 on H100 (sm_90) in vLLM ≤ 0.24.0?
- Does the vllm/vllm-openai:glm52 Docker image contain a fix for the prefill-phase indexer K-cache not being written?
- If so, could you point us to the specific commit or PR that resolves the prefill K-cache population for the IndexShare sparse attention architecture?
Thank you!
We tested this model on H200 GPUs. The main software versions we used are:
vllm==0.23.0
transformers==5.12.1
I have two suggestions:
- Add --moe-backend triton to your vLLM launch command.
- Create a fresh Python 3.12 virtual environment and reinstall the dependencies before trying again.
Looking forward to hearing about your test results. If you run into any other issues, please let me know, and I'll be happy to help.