UI-UX

UI-UX is a multimodal LLM optimized for diagnosing UI/UX defects in mobile interfaces. Built on Qwen3.5 and trained with task-aware GRPO reinforcement learning, it scores 79.63% on UXBench, ahead of Claude-4.5-Sonnet (65.50%) and GPT-4o-class models.

Highlights

State-of-the-art UX reasoning — 79.63% on UXBench
UXBench — first vision-language benchmark for UX defect diagnosis, 2,000 samples across 8 tasks
8 diagnostic tasks covering Usability, Efficiency, and Trustworthiness
Thinking mode — chain-of-thought reasoning via <think>...</think> tokens, with reward-based mitigation of overthinking

Model Details

Property	Value
Base model	Qwen3.5
Architecture	Qwen3_5ForConditionalGeneration (hybrid linear + full attention)
Training method	GRPO with Asymmetric Transition Reward
Context length	16K tokens
Precision	bfloat16
Task	Image-text-to-text (UX defect diagnosis)

Training data: Multi-task UX datasets — BubbleBtn, BubbleNum, MiniPgm, ModelClose, MultiModel, TextCon, TextOcc, MarkerCons
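
For readers unfamiliar with the training method named above, the following is a minimal sketch of the group-relative advantage at the core of standard GRPO: several responses are sampled for one prompt, and each response's reward is normalized against the group's mean and standard deviation. It does not implement the Asymmetric Transition Reward, which this card does not specify. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. The name, `eps`, and the population standard deviation are
    illustrative assumptions, not details taken from UI-UX training.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: two correct (reward 1.0) and two incorrect (reward 0.0) responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.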

Quick Start

Installation

pip install transformers accelerate

Inference with Transformers

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "afx-team/UI-UX", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("afx-team/UI-UX")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": """Core Task: Evaluate whether the 'pop-up' offers users an explicit control for closing it.
Options:
A. No modal pop-up is present
B. The modal pop-up lacks an explicit close control
C. The modal pop-up has an explicit close control
Output Format: $\\boxed{X}$ (where X is one of A-C)."""},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

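# Generate, then decode only the newly produced tokens (the prompt is sliced off).
output = model.generate(**inputs, max_new_tokens=8192)
prompt_len = inputs["input_ids"].shape[1]
response = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)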
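
The prompt above asks for the answer as `$\boxed{X}$`, and thinking mode may place reasoning inside `<think>...</think>` before it. A small parser along the following lines could recover the option letter from the decoded text. This is a sketch, not part of the released code: `extract_choice` is a hypothetical helper, and it assumes the `<think>` tags survive decoding and that the last boxed letter outside the reasoning is the final answer.

```python
import re
from typing import Optional

# Hypothetical helper patterns; not part of the UI-UX release.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def extract_choice(response: str, valid: str = "ABC") -> Optional[str]:
    """Return the final boxed option letter, or None if none is found."""
    # Remove complete reasoning blocks so letters boxed mid-thought are ignored.
    answer_part = THINK_RE.sub("", response)
    # If only the closing tag is present (for example, the opening tag was
    # part of the prompt), keep just the text after it.
    if "</think>" in answer_part:
        answer_part = answer_part.rsplit("</think>", 1)[1]
    matches = BOXED_RE.findall(answer_part)
    if not matches:
        return None
    choice = matches[-1]
    return choice if choice in valid else None
```

For the three-option prompt above, `extract_choice(response)` would return "A", "B", or "C", or `None` when the output does not follow the requested format.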

Inference with vLLM (Recommended)

vllm serve afx-team/UI-UX --port 8000 --max-model-len 16384 \
    --dtype bfloat16 --reasoning-parser deepseek_r1

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="afx-team/UI-UX",
    messages=[{"role": "user", "content": [...]}],
    max_tokens=8192
)
print(response.choices[0].message.content)
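
The `content` list is elided in the request above. With an OpenAI-compatible endpoint, an image is commonly passed as an `image_url` part, either a reachable URL or a base64 data URL. The sketch below builds such a message from raw image bytes. `build_user_message` is a hypothetical helper, and the fallback MIME type is an assumption.

```python
import base64
import mimetypes


def build_user_message(image_bytes: bytes, filename: str, prompt: str) -> dict:
    """Build one OpenAI-style user message with an inline image and a prompt.

    Hypothetical helper for illustration; the image is embedded as a base64
    data URL so the server does not need access to the local file system.
    """
    # Guess the MIME type from the file name; fall back to PNG (an assumption).
    mime = mimetypes.guess_type(filename)[0] or "image/png"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }
```

A call such as `build_user_message(open("screenshot.png", "rb").read(), "screenshot.png", prompt)` yields a dict that can be placed in the `messages` list of the request.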

UXBench

UXBench is the first vision-language benchmark for UX defect diagnosis, containing 2,000 VQA samples across 8 tasks in 3 dimensions:

Dimension	Tasks	Description
Usability	BubbleOcclT, BubbleOcclBtn, PopupNoClose	Text/button occlusion, missing close controls
Efficiency	PopupBlock, PopupStack	Popup blocking interactions, dialog stacking
Trustworthiness	MismatchBadge, MismatchContent, MismatchFunc	Inconsistencies in badges, content, and ads
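
When reporting results per dimension, the task grouping in the table can be encoded directly. The sketch below computes accuracy per dimension from `(task, predicted, gold)` records. The record layout and the pooling of samples within a dimension are assumptions, since this card does not describe the official UXBench scoring script.

```python
from collections import defaultdict

# Task grouping copied from the table above.
DIMENSIONS = {
    "Usability": ["BubbleOcclT", "BubbleOcclBtn", "PopupNoClose"],
    "Efficiency": ["PopupBlock", "PopupStack"],
    "Trustworthiness": ["MismatchBadge", "MismatchContent", "MismatchFunc"],
}
TASK_TO_DIMENSION = {
    task: dim for dim, tasks in DIMENSIONS.items() for task in tasks
}


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for an iterable of (task, predicted, gold).

    Samples are pooled within each dimension (an assumption; the official
    benchmark may average per task instead).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        dim = TASK_TO_DIMENSION[task]
        total[dim] += 1
        correct[dim] += int(predicted == gold)
    return {dim: correct[dim] / total[dim] for dim in total}
```

Dimensions with no records are omitted from the result rather than reported as zero.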

Citation

@inproceedings{uiux2026,
  title={Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach},
  author={Mao, Ruichao and Fang, Zhou and Guo, Teng and Yang, Hao and Li, Yaping and Peng, Shaohua and Huang, Maji and Lin, Xiaoyu and Liu, Shuoyang and Li, Xuepeng and Zhang, Yuyu and Rao, Hai},
  booktitle={CVPR Findings},
  year={2026}
}

License

This project is licensed under the MIT License.

Acknowledgements

UI-UX is built upon Qwen3.5 and trained with ms-swift.

Model size: 5B params (Safetensors, tensor type BF16)