UI-UX

Reasoning for Mobile User Experience with Multimodal LLMs:
Task, Benchmark, and Approach


Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng,
Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao
Ant Group

Abstract

User experience (UX)—centered on usability, perceived consistency, and functional clarity—is fundamental to real-world user interfaces (UI). While multimodal large language models (MLLMs) have advanced UI tasks such as visual element grounding, GUI agents, and design-to-code generation, their capacity to reason about UX from UI screenshots remains underexplored: UX defects often arise from misalignments between design conventions and user mental models rather than from visible layout errors. We introduce UXBench, the first vision–language benchmark for UX defect diagnosis, and UI-UX, a reinforcement-learning–enhanced MLLM that performs fine-grained, experience-centered reasoning. UI-UX attains 79.63% accuracy on UXBench—surpassing Claude-4.5-Sonnet (65.50%) at the time of publication, and still leading the latest Claude-Opus-4.8 (72.90%) some six months later—while preserving strong cross-task generalization and low inference latency.

The UI-* Series

UI-UX is part of a family of multimodal models for user interfaces developed at Ant Group:

Project	Focus	Links
UI-UX (this repo)	UX defect diagnosis and experience-centered reasoning; introduces the UXBench benchmark	Paper · Model · Demo
UI-UG	A unified MLLM for UI Understanding and Generation: referring, grounding, captioning, and generation	Paper · Repo

Highlights

UXBench — the first vision–language benchmark for UX defect diagnosis: 2,000 VQA samples across 8 tasks and 3 dimensions (Usability, Efficiency, Trustworthiness).
8 fine-grained diagnostic tasks — each requires causal reasoning rather than keyword matching; labels are validated through two rounds of review by four senior UX specialists.
Task-adaptive reward routing + asymmetric transition reward — GRPO-based reinforcement learning that strengthens UX reasoning while suppressing overthinking and reducing inference latency.
State-of-the-art results — UI-UX reaches 79.63% on UXBench, surpassing Claude-4.5-Sonnet (65.50%) and remaining ahead of the latest frontier MLLMs.

Figure 1. UXBench spans 8 UX defect diagnosis tasks across 3 dimensions.

UXBench

UXBench is the first vision–language benchmark for UX defect diagnosis, comprising 2,000 VQA samples built from real mobile UI screenshots and organized into 8 tasks across 3 dimensions. Each sample is posed as a two- or three-choice question requiring causal reasoning that maps visual evidence to design principles, rather than keyword matching. Labels are produced via large-scale MLLM-assisted annotation followed by two rounds of expert validation by four senior UX specialists, ensuring high label consistency.

Dimension	Task	Defect under diagnosis
Usability	`BubbleOcclT`	Floating overlay occludes page text
Usability	`BubbleOcclBtn`	Floating overlay blocks a clickable element
Efficiency	`PopupNoClose`	Modal popup lacks an explicit close control
Efficiency	`PopupBlock`	Popup obstructs the native / mini-program close button
Efficiency	`PopupStack`	Multiple modal popups presented simultaneously
Trustworthiness	`MismatchBadge`	Badge content inconsistent with the landing page
Trustworthiness	`MismatchContent`	Service name inconsistent with page text
Trustworthiness	`MismatchFunc`	Advertised description inconsistent with actual functionality
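
The task table above maps directly onto a lookup that can roll per-sample results up into per-dimension accuracy. The sketch below is illustrative only: the task and dimension names come from the table, while the `(task_name, is_correct)` record shape and the function name `accuracy_by_dimension` are assumptions, not the format used by the repository's evaluation script.

```python
from collections import defaultdict

# Task -> dimension mapping, copied from the table above.
TASK_DIMENSION = {
    "BubbleOcclT": "Usability",
    "BubbleOcclBtn": "Usability",
    "PopupNoClose": "Efficiency",
    "PopupBlock": "Efficiency",
    "PopupStack": "Efficiency",
    "MismatchBadge": "Trustworthiness",
    "MismatchContent": "Trustworthiness",
    "MismatchFunc": "Trustworthiness",
}


def accuracy_by_dimension(results):
    """Aggregate (task_name, is_correct) pairs into accuracy per dimension.

    The pair-based record shape is a hypothetical choice for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, is_correct in results:
        dimension = TASK_DIMENSION[task]
        totals[dimension] += 1
        hits[dimension] += int(bool(is_correct))
    return {dimension: hits[dimension] / totals[dimension] for dimension in totals}
```

Dimensions with no samples are simply absent from the returned dictionary, which avoids a division by zero.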

Each task expects a single-letter answer in the format $\boxed{X}$, enabling deterministic, exact-match scoring.
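
Exact-match scoring on the boxed letter can be sketched as follows. This is a minimal illustration, not the repository's scorer: the helper names `extract_choice` and `exact_match` are hypothetical, and taking the last boxed letter when a response contains several is an assumption made here.

```python
import re

# Matches \boxed{A} (with or without surrounding $...$) and captures the letter.
_BOXED = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def extract_choice(response):
    """Return the last boxed capital letter in the response, or None if absent."""
    matches = _BOXED.findall(response)
    return matches[-1] if matches else None


def exact_match(response, gold):
    """True only when the extracted letter equals the gold label."""
    return extract_choice(response) == gold
```

A response with no boxed letter yields `None` and is therefore scored as incorrect rather than raising an error.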

Results

With only 4B parameters, UI-UX achieves the highest accuracy on UXBench, outperforming both instruction-tuned and reasoning MLLMs, including models of up to 235B parameters (roughly 60 times larger) and leading proprietary systems. Notably, UI-UX improves its same-scale base model, Qwen3-VL-4B-Thinking, from 52.54% to 79.63%, a gain of 27.09 points attributable to the proposed reinforcement-learning framework.

Figure 2. Accuracy comparison across models on UXBench (2,000 samples, 8 tasks).

As reported in the paper, UI-UX reaches 79.63%, surpassing Claude-4.5-Sonnet (65.50%). Entries marked † were evaluated post-publication (2026); UI-UX remains ahead of the latest frontier models, including Claude-Opus-4.8 (72.90%).

Approach

UI-UX is a reinforcement-learning–based enhancement framework that fine-tunes a multimodal foundation model end-to-end with the GRPO algorithm, without manual preference annotation. Two mechanisms drive its UX reasoning ability: task-adaptive reward routing, which selects an appropriate reward signal per task type, and the asymmetric transition reward, which curbs redundant reasoning to reduce latency.
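
For readers unfamiliar with GRPO, its core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value model or preference annotation is needed. The sketch below shows only that standard group-relative advantage; it does not reproduce the SAPO loss or either reward mechanism used by UI-UX. The function name, the `eps` value, and the use of the population standard deviation are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize the rewards of one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than their siblings get positive values and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(reward - mean) / (std + eps) for reward in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.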

Figure 3. Training pipeline: data collection → labeling → curated training sets → GRPO with task-adaptive reward routing and the asymmetric transition reward.

Component	Detail
Base model	Qwen3-VL-4B-Thinking (paper); this release upgrades the base to Qwen3.5-4B
Optimization	GRPO (SAPO loss) with task-adaptive reward routing
Reward design	Accuracy / ROUGE-L / grounding rewards + asymmetric transition reward
Parameters	4B
Context length	16K tokens
Reasoning	Chain-of-thought via `<think>...</think>`, with reward-based overthinking mitigation

See cookbook/ for full training configurations and reward-function implementations.
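
As a rough picture of task-adaptive reward routing, the three reward families listed in the table (accuracy, ROUGE-L, grounding) can sit behind a dispatch table keyed by task type. Everything below is a sketch under stated assumptions rather than the released implementation: the task-type keys (`"choice"`, `"text"`, `"grounding"`), the function names, the whitespace tokenization, the F1 form of ROUGE-L, and IoU as the grounding score are all illustrative choices. The asymmetric transition reward is not shown because its definition is not given in this README.

```python
def accuracy_reward(prediction, gold):
    """1.0 for an exact string match after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def rouge_l_reward(prediction, reference):
    """ROUGE-L as an F1 score over whitespace tokens (LCS-based)."""
    a, b = prediction.split(), reference.split()
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for token in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if token == other else max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def _area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def grounding_reward(pred_box, gold_box):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = _area(pred_box) + _area(gold_box) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical task-type keys; the real routing keys may differ.
REWARD_ROUTES = {
    "choice": accuracy_reward,
    "text": rouge_l_reward,
    "grounding": grounding_reward,
}


def route_reward(task_type, prediction, reference):
    """Pick the reward function registered for this task type and apply it."""
    if task_type not in REWARD_ROUTES:
        raise ValueError(f"no reward registered for task type {task_type!r}")
    return REWARD_ROUTES[task_type](prediction, reference)
```

Keeping the routes in a plain dictionary means a new task type only needs one new entry, and an unregistered type fails loudly instead of silently receiving a zero reward.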

Quick Start

Installation

pip install -r requirements.txt

Inference with Transformers

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "afx-team/UI-UX", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("afx-team/UI-UX")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": """Core Task: Evaluate whether the 'pop-up' offers users an explicit control for closing it.
Options:
A. No modal pop-up is present
B. The modal pop-up lacks an explicit close control
C. The modal pop-up has an explicit close control
Output Format: $\\boxed{X}$ (where X is one of A-C)."""},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=8192)
response = processor.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
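
Because the model reasons inside `<think>...</think>` before answering, it can be convenient to separate the reasoning trace from the final answer. The helper below is a minimal sketch with a hypothetical name, `split_reasoning`. It assumes the tags survive decoding as plain text, and it tolerates a missing opening tag, which can happen when a chat template places `<think>` in the prompt rather than in the generated text.

```python
def split_reasoning(response):
    """Split a decoded response into (reasoning, answer).

    Everything before the last </think> is treated as reasoning; the rest is
    the answer. Without a closing tag, the whole response is the answer.
    """
    close_tag = "</think>"
    if close_tag not in response:
        return "", response.strip()
    reasoning, _, answer = response.rpartition(close_tag)
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()
```

For example, `reasoning, answer = split_reasoning(response)` leaves only the boxed letter and any trailing text in `answer`.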

Inference with vLLM (Recommended)

vllm serve afx-team/UI-UX --port 8000 --max-model-len 16384 \
    --dtype bfloat16 --enable-reasoning --reasoning-parser deepseek_r1

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="afx-team/UI-UX",
    messages=[{"role": "user", "content": [...]}],
    max_tokens=8192
)
print(response.choices[0].message.content)
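
The `content` list in the request above is left elided. One common way to fill it for an OpenAI-compatible endpoint is to send the screenshot as a base64 data URL alongside the prompt text. The helper below is a sketch of that approach; the name `build_user_content` and the PNG fallback MIME type are assumptions of this example, not part of the repository.

```python
import base64
import mimetypes


def build_user_content(image_path, prompt):
    """Build an OpenAI-style content list: one inline image plus one text part."""
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or "image/png"  # fallback when the extension is unknown (assumption)
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return [
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        {"type": "text", "text": prompt},
    ]
```

The result can be passed as `messages=[{"role": "user", "content": build_user_content("screenshot.png", prompt)}]`, where `prompt` is the same task text used in the Transformers example.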

See cookbook/ for complete examples, including batch evaluation on UXBench.

Repository Structure

UI-UX/
├── assets/                 # Logo, figures
├── cookbook/               # Usage examples
│   ├── inference_transformers.py
│   ├── inference_vllm.py
│   ├── eval_uxbench.py
│   ├── serve_vllm.sh
│   └── README.md
├── .github/                # Issue / PR templates
├── app.py                  # Gradio demo
├── README.md
├── requirements.txt
├── LEGAL.md
└── LICENSE

Citation

If you find UXBench or UI-UX useful in your research, please cite:

@inproceedings{uiux2026,
  title={Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach},
  author={Mao, Ruichao and Fang, Zhou and Guo, Teng and Yang, Hao and Li, Yaping and Peng, Shaohua and Huang, Maji and Lin, Xiaoyu and Liu, Shuoyang and Li, Xuepeng and Zhang, Yuyu and Rao, Hai},
  booktitle={CVPR Findings},
  year={2026}
}

License

This project is released under the MIT License. Please review LEGAL.md for additional terms governing the model and benchmark.

Acknowledgements

UI-UX builds upon Qwen3-VL and is trained with ms-swift. We thank the open-source community for these foundations.
