Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng,
Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao
Ant Group
User experience (UX)—centered on usability, perceived consistency, and functional clarity—is fundamental to real-world user interfaces (UI). While multimodal large language models (MLLMs) have advanced UI tasks such as visual element grounding, GUI agents, and design-to-code generation, their capacity to reason about UX from UI screenshots remains underexplored: UX defects often arise from misalignments between design conventions and user mental models rather than from visible layout errors. We introduce UXBench, the first vision–language benchmark for UX defect diagnosis, and UI-UX, a reinforcement-learning–enhanced MLLM that performs fine-grained, experience-centered reasoning. UI-UX attains 79.63% accuracy on UXBench—surpassing Claude-4.5-Sonnet (65.50%) at the time of publication, and still leading the latest Claude-Opus-4.8 (72.90%) some six months later—while preserving strong cross-task generalization and low inference latency.
- The UI-* Series
- Highlights
- UXBench
- Results
- Approach
- Quick Start
- Repository Structure
- Citation
- License
UI-UX is part of a family of multimodal models for user interfaces developed at Ant Group:
| Project | Focus | Links |
|---|---|---|
| UI-UX (this repo) | UX defect diagnosis and experience-centered reasoning; introduces the UXBench benchmark | Paper · Model · Demo |
| UI-UG | A unified MLLM for UI Understanding and Generation: referring, grounding, captioning, and generation | Paper · Repo |
- UXBench — the first vision–language benchmark for UX defect diagnosis: 2,000 VQA samples across 8 tasks and 3 dimensions (Usability, Efficiency, Trustworthiness).
- 8 fine-grained diagnostic tasks — each requires causal reasoning rather than keyword matching; labels are validated through two rounds of review by four senior UX specialists.
- Task-adaptive reward routing + asymmetric transition reward — GRPO-based reinforcement learning that strengthens UX reasoning while suppressing overthinking and reducing inference latency.
- State-of-the-art results — UI-UX reaches 79.63% on UXBench, surpassing Claude-4.5-Sonnet (65.50%) and remaining ahead of the latest frontier MLLMs.
Figure 1. UXBench spans 8 UX defect diagnosis tasks across 3 dimensions.
UXBench is the first vision–language benchmark for UX defect diagnosis, comprising 2,000 VQA samples built from real mobile UI screenshots and organized into 8 tasks across 3 dimensions. Each sample is posed as a two- or three-choice question requiring causal reasoning that maps visual evidence to design principles, rather than keyword matching. Labels are produced via large-scale MLLM-assisted annotation followed by two rounds of expert validation by four senior UX specialists, ensuring high label consistency.
| Dimension | Task | Defect under diagnosis |
|---|---|---|
| Usability | BubbleOcclT | Floating overlay occludes page text |
| Usability | BubbleOcclBtn | Floating overlay blocks a clickable element |
| Efficiency | PopupNoClose | Modal popup lacks an explicit close control |
| Efficiency | PopupBlock | Popup obstructs the native / mini-program close button |
| Efficiency | PopupStack | Multiple modal popups presented simultaneously |
| Trustworthiness | MismatchBadge | Badge content inconsistent with the landing page |
| Trustworthiness | MismatchContent | Service name inconsistent with page text |
| Trustworthiness | MismatchFunc | Advertised description inconsistent with actual functionality |
Each task expects a single-letter answer in the format
$\boxed{X}$, enabling deterministic, exact-match scoring.
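As a rough illustration of what exact-match scoring on the boxed letter could look like, the sketch below extracts the answer and aggregates accuracy per dimension. The task-to-dimension mapping comes from the table above; the `(task, response, gold_letter)` record layout and the choice to read the last `\boxed{...}` in a response are assumptions, not the repository's evaluation code.

```python
import re
from collections import defaultdict

# Task -> dimension mapping, taken from the UXBench task table.
TASK_DIMENSION = {
    "BubbleOcclT": "Usability",
    "BubbleOcclBtn": "Usability",
    "PopupNoClose": "Efficiency",
    "PopupBlock": "Efficiency",
    "PopupStack": "Efficiency",
    "MismatchBadge": "Trustworthiness",
    "MismatchContent": "Trustworthiness",
    "MismatchFunc": "Trustworthiness",
}

# Captures the single letter inside \boxed{...}; surrounding $ signs are ignored.
BOXED = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def extract_answer(response):
    """Return the letter from the last boxed answer in a response, or None."""
    matches = BOXED.findall(response)
    return matches[-1] if matches else None


def score(records):
    """Exact-match accuracy per dimension and overall.

    `records` is a hypothetical list of (task, response, gold_letter) tuples.
    A response without a boxed letter counts as incorrect.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, response, gold in records:
        correct = extract_answer(response) == gold
        for key in (TASK_DIMENSION[task], "Overall"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}
```

Reading the last boxed letter means intermediate guesses inside a reasoning trace do not affect the score; a stricter scorer could instead reject responses containing more than one boxed answer.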
With only 4B parameters, UI-UX achieves the highest accuracy on UXBench, outperforming both instruction-tuned and reasoning MLLMs, including models roughly 60 times larger (235B) and leading proprietary systems. Notably, UI-UX improves its same-scale base model, Qwen3-VL-4B-Thinking, from 52.54% to 79.63%, a gain of just over 27 points attributable to the proposed reinforcement-learning framework.
Figure 2. Accuracy comparison across models on UXBench (2,000 samples, 8 tasks).
As reported in the paper, UI-UX reaches 79.63%, surpassing Claude-4.5-Sonnet (65.50%). Entries marked † were evaluated post-publication (2026); UI-UX remains ahead of the latest frontier models, including Claude-Opus-4.8 (72.90%).
UI-UX is a reinforcement-learning–based enhancement framework that fine-tunes a multimodal foundation model end-to-end with the GRPO algorithm, without manual preference annotation. Two mechanisms drive its UX reasoning ability: task-adaptive reward routing, which selects an appropriate reward signal per task type, and the asymmetric transition reward, which curbs redundant reasoning to reduce latency.

Figure 3. Training pipeline: data collection → labeling → curated training sets → GRPO with task-adaptive reward routing and the asymmetric transition reward.
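For readers new to GRPO, the core idea is that several responses are sampled for the same prompt and each one is scored relative to its own group, so no learned value model or preference annotation is required. The sketch below shows the textbook group-relative advantage only; it is not the SAPO-loss variant this project trains with, and the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.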
| Component | Detail |
|---|---|
| Base model | Qwen3-VL-4B-Thinking (paper); this release upgrades the base to Qwen3.5-4B |
| Optimization | GRPO (SAPO loss) with task-adaptive reward routing |
| Reward design | Accuracy / ROUGE-L / grounding rewards + asymmetric transition reward |
| Parameters | 4B |
| Context length | 16K tokens |
| Reasoning | Chain-of-thought via <think>...</think>, with reward-based overthinking mitigation |
See cookbook/ for full training configurations and reward-function
implementations.
Install dependencies:

```bash
pip install -r requirements.txt
```

Run inference with Transformers:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "afx-team/UI-UX", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("afx-team/UI-UX")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": """Core Task: Evaluate whether the 'pop-up' offers users an explicit control for closing it.
Options:
A. No modal pop-up is present
B. The modal pop-up lacks an explicit close control
C. The modal pop-up has an explicit close control
Output Format: $\\boxed{X}$ (where X is one of A-C)."""},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
```

Serve with vLLM:

```bash
vllm serve afx-team/UI-UX --port 8000 --max-model-len 16384 \
    --dtype bfloat16 --enable-reasoning --reasoning-parser deepseek_r1
```

Query the server through its OpenAI-compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="afx-team/UI-UX",
    messages=[{"role": "user", "content": [...]}],
    max_tokens=8192
)
print(response.choices[0].message.content)
```

See cookbook/ for complete examples, including batch evaluation on UXBench.
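Two small helpers can be useful around the calls above. This is a sketch under stated assumptions, not code from the cookbook. `build_user_message` (a hypothetical name) packs a local screenshot into the image-plus-text message layout accepted by OpenAI-compatible servers, using a base64 data URL. `split_reasoning` (also hypothetical) separates the chain of thought from the final answer in raw decoded text; it assumes the reasoning ends with a closing `</think>` tag and tolerates a missing opening tag.

```python
import base64
import mimetypes


def build_user_message(image_path, question):
    """Build one OpenAI-style user message with an inline image and a text prompt."""
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": question},
        ],
    }


def split_reasoning(text):
    """Split raw generated text into (reasoning, answer) at the last </think> tag.

    If no closing tag is present, the whole text is treated as the answer.
    """
    reasoning, sep, answer = text.rpartition("</think>")
    if not sep:
        return "", text.strip()
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()
```

The message returned by `build_user_message` can be passed as a single element of the `messages` list, and the answer part returned by `split_reasoning` is the string to search for the boxed letter.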
```
UI-UX/
├── assets/                  # Logo, figures
├── cookbook/                # Usage examples
│   ├── inference_transformers.py
│   ├── inference_vllm.py
│   ├── eval_uxbench.py
│   ├── serve_vllm.sh
│   └── README.md
├── .github/                 # Issue / PR templates
├── app.py                   # Gradio demo
├── README.md
├── requirements.txt
├── LEGAL.md
└── LICENSE
```
If you find UXBench or UI-UX useful in your research, please cite:
```bibtex
@inproceedings{uiux2026,
  title={Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach},
  author={Mao, Ruichao and Fang, Zhou and Guo, Teng and Yang, Hao and Li, Yaping and Peng, Shaohua and Huang, Maji and Lin, Xiaoyu and Liu, Shuoyang and Li, Xuepeng and Zhang, Yuyu and Rao, Hai},
  booktitle={CVPR Findings},
  year={2026}
}
```

This project is released under the MIT License. Please review LEGAL.md for additional terms governing the model and benchmark.
UI-UX builds upon Qwen3-VL and is trained with ms-swift. We thank the open-source community for these foundations.
