Instructions to use QuantTrio/GLM-5.2-Int4-Int8Mix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-5.2-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-5.2-Int4-Int8Mix")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-5.2-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-5.2-Int4-Int8Mix"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-5.2-Int4-Int8Mix

SGLang

How to use QuantTrio/GLM-5.2-Int4-Int8Mix with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-5.2-Int4-Int8Mix" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-5.2-Int4-Int8Mix" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5.2-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-5.2-Int4-Int8Mix with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-5.2-Int4-Int8Mix
```

GLM-5.2-Int4-Int8Mix generates only "!!!!!" on H100 8×80GB with vLLM 0.23.0/0.24.0 — IndexShare sparse indexer K-cache never populated during prefill

by kinggenguo - opened about 15 hours ago

Discussion

kinggenguo

about 15 hours ago

Hi,

We are running QuantTrio/GLM-5.2-Int4-Int8Mix on a self-hosted H100 cluster and consistently get degenerate outputs (all exclamation marks !!!!!!)
regardless of the prompt. After extensive debugging, we traced the root cause to the IndexShare sparse indexer K-cache not being populated during the
prefill phase in vLLM 0.23.0 and 0.24.0. We'd like to ask whether the official team has a known fix or workaround.

Environment

┌────────────────────┬──────────────────────────────────┐
│ Component          │ Version                          │
├────────────────────┼──────────────────────────────────┤
│ GPU                │ 8× NVIDIA H100 SXM5 80GB (sm_90) │
├────────────────────┼──────────────────────────────────┤
│ CUDA               │ 13.0                             │
├────────────────────┼──────────────────────────────────┤
│ PyTorch            │ 2.11.0+cu130                     │
├────────────────────┼──────────────────────────────────┤
│ vLLM               │ 0.23.0 / 0.24.0 (both tested)    │
├────────────────────┼──────────────────────────────────┤
│ transformers       │ 5.12.1                           │
├────────────────────┼──────────────────────────────────┤
│ compressed-tensors │ 0.17.0                           │
├────────────────────┼──────────────────────────────────┤
│ Python             │ 3.11.11                          │
└────────────────────┴──────────────────────────────────┘

Launch Command

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export LD_LIBRARY_PATH=/root/miniconda3/envs/vllm_glm/lib/python3.11/site-packages/nvidia/cu13/lib:${LD_LIBRARY_PATH}

vllm serve /path/to/GLM-5.2-Int4-Int8Mix \
    --host 0.0.0.0 --port 8000 \
    --served-model-name GLM5.2 \
    --trust-remote-code \
    --dtype bfloat16 \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --max-num-seqs 32 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

Symptom

Every generation, regardless of prompt or temperature, produces only repeated ! characters:

content: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
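
One way to look past the text output is to request per-token log-probabilities from the OpenAI-compatible endpoint. The following is a minimal sketch, assuming the server from the launch command above is reachable at http://localhost:8000 and serves the model as GLM5.2. The helper names are hypothetical; `logprobs` and `top_logprobs` are the standard chat-completion request fields.

```python
import json
import math
import urllib.request


def build_logprob_request(prompt, model="GLM5.2", top_logprobs=5):
    """Build a chat-completion payload that asks for per-token logprobs."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
        "temperature": 0,
        "logprobs": True,
        "top_logprobs": top_logprobs,
    }


def first_token_logprobs(response):
    """Return the top-logprob values reported for the first generated token."""
    first = response["choices"][0]["logprobs"]["content"][0]
    return [entry["logprob"] for entry in first["top_logprobs"]]


def looks_uniform(logprobs, vocab_size, tol=1e-2):
    """True if every reported logprob is close to -ln(vocab_size)."""
    target = -math.log(vocab_size)
    return bool(logprobs) and all(abs(lp - target) < tol for lp in logprobs)


def query(prompt, url="http://localhost:8000/v1/chat/completions"):
    """Send the request; only call this when a server is actually running."""
    data = json.dumps(build_logprob_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a running server, `looks_uniform(first_token_logprobs(query("hi")), 154880)` returning `True` would match the symptom described in this post.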

Inspecting logprobs reveals the logit distribution is perfectly uniform across the entire 154,880-token vocabulary:

tok: '!' logprob: -11.950 (= -ln(154880) → uniform distribution)
tok: '#' logprob: -11.950
tok: '"' logprob: -11.950
...

This confirms that the model's final hidden states entering lm_head are effectively zero, causing the softmax to collapse to a uniform distribution and
argmax to always select token 0 (!).
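
The arithmetic can be checked with a short, dependency-free sketch. It assumes all logits are identical (as they would be if the hidden states were zero and lm_head had no bias) and that argmax breaks ties by picking the lowest index; both are assumptions about the serving stack rather than facts established in this thread.

```python
import math

VOCAB_SIZE = 154_880  # vocabulary size quoted in the post


def log_softmax(logits):
    """Numerically stable log-softmax over a plain Python list."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


# Assumption: zero hidden states and no lm_head bias give identical logits.
logits = [0.0] * VOCAB_SIZE
logprobs = log_softmax(logits)

uniform_logprob = -math.log(VOCAB_SIZE)  # about -11.950

# max() returns the first maximal element, so a full tie resolves to index 0.
argmax_token = max(range(VOCAB_SIZE), key=logits.__getitem__)

print(f"logprob per token: {logprobs[0]:.3f}")
print(f"argmax token id:   {argmax_token}")
```

Every entry equals -ln(154880), which rounds to the -11.950 seen in the logprobs, and the tie resolves to token id 0.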

Root Cause Investigation

We added instrumentation at multiple points and traced the failure to the IndexShare sparse indexer K-cache in GLM-5.2's DSA (DeepSeek Sparse Attention)
architecture:

During decode, topk_indices are almost entirely -1 (invalid):

decode step with a 3-token prefill ("hi"):

[TOPK] valid=3/2048 sample=[0, -1, -1]

Only 3 out of 2048 topk slots are valid, matching the 3 prefill tokens. The FlashMLASparse attention kernel then attends to only 1–3 positions, producing
near-zero outputs and causing the hidden state collapse.
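
A summary line of this kind can be produced with a small helper. This is a sketch only: `summarize_topk` is a hypothetical name, the input is a plain list rather than a tensor, and the example row is made up to have three valid entries followed by -1 padding.

```python
def summarize_topk(topk_indices, sample_size=3):
    """Count entries that are not the -1 padding value and show a short sample."""
    valid = sum(1 for idx in topk_indices if idx != -1)
    sample = list(topk_indices[:sample_size])
    return f"[TOPK] valid={valid}/{len(topk_indices)} sample={sample}"


# Hypothetical row: 3 valid positions, the remaining 2045 slots padded with -1.
row = [0, 1, 2] + [-1] * 2045
summary = summarize_topk(row)
print(summary)
```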

The indexer K-cache is never written during prefill.

In vLLM's execution flow for sparse MLA models, MultiHeadLatentAttentionWrapper.forward() is not called during the prefill phase. The prefill path goes
through a separate prefill_backend that bypasses the Indexer.forward() call entirely. As a result, the indexer K-cache remains empty after prefill, and
decode-time topk selection has no valid KV history to choose from.

We confirmed this by logging inside Indexer.forward():

- It is called during CUDA graph warmup (batch sizes 4, 8192, etc.)
- It is called during decode (as expected, inside the CUDA graph)
- It is never called during actual prefill inference

The topk_length fix (PR #36616, included in v0.24.0) is insufficient.

Even after applying the topk_length fix, which tells the kernel to ignore -1 indices, the output remains degenerate because the K-cache itself is empty:
there are no valid KVs to attend to in the first place.

What We Have Tried

- vLLM 0.23.0 and 0.24.0 (both include PR #45895 and PR #36616) → same result
- Removing --kv-cache-dtype fp8 (BF16 KV cache) → same result
- Removing --speculative-config (no MTP) → same result
- Removing --enable-expert-parallel → same result
- Adding --enforce-eager → same result
- All model files verified intact (138 files, 378 GiB, all sizes match the Hugging Face repo)
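
A size-based integrity check of the kind mentioned in the last item could look like the sketch below. `find_size_mismatches` is a hypothetical helper, and `expected_sizes` is assumed to have been collected separately from the repository's file listing. Matching sizes do not rule out corruption that preserves file length, so checksums would be a stronger test.

```python
from pathlib import Path


def find_size_mismatches(local_dir, expected_sizes):
    """Return {relative_path: (expected, actual)} for files whose size differs.

    `actual` is None when the file is missing locally.
    """
    root = Path(local_dir)
    mismatches = {}
    for rel_path, expected in expected_sizes.items():
        path = root / rel_path
        actual = path.stat().st_size if path.is_file() else None
        if actual != expected:
            mismatches[rel_path] = (expected, actual)
    return mismatches
```

An empty result means every listed file exists locally with the expected byte count.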

Question

The vLLM recipes page recommends the dedicated Docker image vllm/vllm-openai:glm52 for the best experience with GLM-5.2. We believe this
image contains additional fixes for the prefill indexer K-cache population issue that are not yet in the PyPI releases.

Could you clarify:

1. Is this a known issue with GLM-5.2 on H100 (sm_90) in vLLM ≤ 0.24.0?
2. Does the vllm/vllm-openai:glm52 Docker image contain a fix for the prefill-phase indexer K-cache not being written?
3. If so, could you point us to the specific commit or PR that resolves the prefill K-cache population for the IndexShare sparse attention architecture?

Thank you!

JunHowie

QuantTrio org about 12 hours ago

We tested this model on H200 GPUs. The main software versions we used are:

vllm==0.23.0
transformers==5.12.1

I have two suggestions:

1. Add --moe-backend triton to your vLLM launch command.
2. Create a fresh Python 3.12 virtual environment and reinstall the dependencies before trying again.

Looking forward to hearing about your test results. If you run into any other issues, please let me know, and I'll be happy to help.
