UI-UX

Reasoning for Mobile User Experience with Multimodal LLMs:
Task, Benchmark, and Approach


Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng,
Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

Ant Group

CVPR 2026 Findings · Paper · Model · Demo · License


Abstract

User experience (UX)—centered on usability, perceived consistency, and functional clarity—is fundamental to real-world user interfaces (UI). While multimodal large language models (MLLMs) have advanced UI tasks such as visual element grounding, GUI agents, and design-to-code generation, their capacity to reason about UX from UI screenshots remains underexplored: UX defects often arise from misalignments between design conventions and user mental models rather than from visible layout errors. We introduce UXBench, the first vision–language benchmark for UX defect diagnosis, and UI-UX, a reinforcement-learning–enhanced MLLM that performs fine-grained, experience-centered reasoning. UI-UX attains 79.63% accuracy on UXBench—surpassing Claude-4.5-Sonnet (65.50%) at the time of publication, and still leading the latest Claude-Opus-4.8 (72.90%) some six months later—while preserving strong cross-task generalization and low inference latency.


The UI-* Series

UI-UX is part of a family of multimodal models for user interfaces developed at Ant Group:

| Project | Focus | Links |
| --- | --- | --- |
| UI-UX (this repo) | UX defect diagnosis and experience-centered reasoning; introduces the UXBench benchmark | Paper · Model · Demo |
| UI-UG | A unified MLLM for UI Understanding and Generation: referring, grounding, captioning, and generation | Paper · Repo |

Highlights

  • UXBench: the first vision–language benchmark for UX defect diagnosis, with 2,000 VQA samples across 8 tasks and 3 dimensions (Usability, Efficiency, Trustworthiness).
  • 8 fine-grained diagnostic tasks: each requires causal reasoning rather than keyword matching; labels are validated through two rounds of review by four senior UX specialists.
  • Task-adaptive reward routing + asymmetric transition reward: GRPO-based reinforcement learning that strengthens UX reasoning while suppressing overthinking and reducing inference latency.
  • State-of-the-art results: UI-UX reaches 79.63% on UXBench, surpassing Claude-4.5-Sonnet (65.50%) and remaining ahead of Claude-Opus-4.8 (72.90%), which was evaluated after publication.

UXBench 8 Tasks
Figure 1. UXBench spans 8 UX defect diagnosis tasks across 3 dimensions.


UXBench

UXBench is the first vision–language benchmark for UX defect diagnosis, comprising 2,000 VQA samples built from real mobile UI screenshots and organized into 8 tasks across 3 dimensions. Each sample is posed as a two- or three-choice question requiring causal reasoning that maps visual evidence to design principles, rather than keyword matching. Labels are produced via large-scale MLLM-assisted annotation followed by two rounds of expert validation by four senior UX specialists, ensuring high label consistency.

| Dimension | Task | Defect under diagnosis |
| --- | --- | --- |
| Usability | BubbleOcclT | Floating overlay occludes page text |
| Usability | BubbleOcclBtn | Floating overlay blocks a clickable element |
| Efficiency | PopupNoClose | Modal popup lacks an explicit close control |
| Efficiency | PopupBlock | Popup obstructs the native / mini-program close button |
| Efficiency | PopupStack | Multiple modal popups presented simultaneously |
| Trustworthiness | MismatchBadge | Badge content inconsistent with the landing page |
| Trustworthiness | MismatchContent | Service name inconsistent with page text |
| Trustworthiness | MismatchFunc | Advertised description inconsistent with actual functionality |

Each task expects a single-letter answer in the format $\boxed{X}$, enabling deterministic, exact-match scoring.
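
Because every answer reduces to one boxed letter, scoring needs little more than a regular expression. The sketch below is one possible implementation, not the repository's eval_uxbench.py; the `(task, response, gold)` tuple layout and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Optional

# Matches \boxed{X} where X is a single option letter.
BOXED = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")


def extract_choice(response: str) -> Optional[str]:
    """Return the last boxed option letter in a response, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].upper() if matches else None


def score(records):
    """Compute overall and per-task exact-match accuracy.

    `records` is an iterable of (task, response, gold_letter) tuples;
    this layout is an assumption made for the example.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for task, response, gold in records:
        totals[task] += 1
        hits[task] += int(extract_choice(response) == gold.upper())
    per_task = {task: hits[task] / totals[task] for task in totals}
    # Guard against an empty record list.
    overall = sum(hits.values()) / max(sum(totals.values()), 1)
    return overall, per_task
```

Taking the last boxed letter is a deliberate choice here: a reasoning trace may mention candidate options before it commits to a final answer. A response with no boxed letter simply counts as incorrect.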


Results

With only 4B parameters, UI-UX achieves the highest accuracy on UXBench, outperforming both instruction-tuned and reasoning MLLMs, including models nearly 60× larger (235B) and leading proprietary systems. Notably, UI-UX lifts its same-scale base model, Qwen3-VL-4B-Thinking, from 52.54% to 79.63%, a gain of 27.09 percentage points attributable to the proposed reinforcement-learning framework.
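
The headline numbers in this paragraph can be checked with a few lines of arithmetic. The snippet uses only figures quoted in this README; the variable names are illustrative.

```python
# Accuracies on UXBench, in percent, as quoted in this README.
base_acc = 52.54      # same-scale base model
uiux_acc = 79.63      # UI-UX
claude_acc = 65.50    # Claude-4.5-Sonnet

# Gains are differences of percentages, so they are percentage points.
gain_over_base = round(uiux_acc - base_acc, 2)
gain_over_claude = round(uiux_acc - claude_acc, 2)

# Parameter-count ratio between the 235B model and the 4B UI-UX.
size_ratio = 235 / 4

print(gain_over_base, gain_over_claude, size_ratio)  # 27.09 14.13 58.75
```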

Performance comparison on UXBench
Figure 2. Accuracy comparison across models on UXBench (2,000 samples, 8 tasks).

As reported in the paper, UI-UX reaches 79.63%, surpassing Claude-4.5-Sonnet (65.50%). Marked entries in Figure 2 were evaluated after publication, in 2026; UI-UX remains ahead of these newer frontier models, including Claude-Opus-4.8 (72.90%).


Approach

UI-UX is a reinforcement-learning–based enhancement framework that fine-tunes a multimodal foundation model end-to-end with the GRPO algorithm, without manual preference annotation. Two mechanisms drive its UX reasoning ability: task-adaptive reward routing, which selects an appropriate reward signal per task type, and the asymmetric transition reward, which curbs redundant reasoning to reduce latency.
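
GRPO scores each sampled response relative to the other responses drawn for the same prompt, which is why it needs neither a learned value model nor preference labels. A minimal sketch of that group-relative normalization follows, assuming the standard GRPO formulation. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, and the SAPO loss listed for this release is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is credited only for being better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason the choice of reward per task type matters.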

Training Pipeline
Figure 3. Training pipeline: data collection → labeling → curated training sets → GRPO with task-adaptive reward routing and the asymmetric transition reward.

| Component | Detail |
| --- | --- |
| Base model | Qwen3-VL-4B-Thinking (paper); this release upgrades the base to Qwen3.5-4B |
| Optimization | GRPO (SAPO loss) with task-adaptive reward routing |
| Reward design | Accuracy / ROUGE-L / grounding rewards + asymmetric transition reward |
| Parameters | 4B |
| Context length | 16K tokens |
| Reasoning | Chain-of-thought via `<think>...</think>`, with reward-based overthinking mitigation |

See cookbook/ for full training configurations and reward-function implementations.
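
As a rough illustration of what routing between accuracy, ROUGE-L, and grounding rewards can look like, here is a self-contained sketch. It is not the cookbook implementation: the task-type keys (`choice`, `text`, `grounding`), the function names, the whitespace tokenization, the F1 form of ROUGE-L, and the use of IoU as the grounding reward are all assumptions. The asymmetric transition reward is not shown, because this README does not define it.

```python
import re


def accuracy_reward(pred: str, gold: str) -> float:
    """1.0 if the last boxed letter in `pred` equals `gold`, else 0.0."""
    found = re.findall(r"\\boxed\{\s*([A-Za-z])\s*\}", pred)
    return float(bool(found) and found[-1].upper() == gold.upper())


def rouge_l_reward(pred: str, gold: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    a, b = pred.split(), gold.split()
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def iou_reward(pred, gold) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(pred[2], gold[2]) - max(pred[0], gold[0]))
    ih = max(0.0, min(pred[3], gold[3]) - max(pred[1], gold[1]))
    inter = iw * ih
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(pred) + area(gold) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical routing table: task type -> reward function.
REWARD_ROUTES = {
    "choice": accuracy_reward,
    "text": rouge_l_reward,
    "grounding": iou_reward,
}


def routed_reward(task_type: str, pred, gold) -> float:
    """Dispatch to the reward function registered for this task type."""
    return REWARD_ROUTES[task_type](pred, gold)
```

A dictionary dispatch keeps the reward for each task type isolated, so a new task type can be added without touching the existing reward functions.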


Quick Start

Installation

pip install -r requirements.txt

Inference with Transformers

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "afx-team/UI-UX", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("afx-team/UI-UX")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": """Core Task: Evaluate whether the 'pop-up' offers users an explicit control for closing it.
Options:
A. No modal pop-up is present
B. The modal pop-up lacks an explicit close control
C. The modal pop-up has an explicit close control
Output Format: $\\boxed{X}$ (where X is one of A-C)."""},
        ],
    }
]

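# Tokenize the chat (text + image) and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# Leave room for the <think>...</think> reasoning trace before the boxed answer.
output = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
response = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)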
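
The decoded response typically contains a reasoning trace followed by the boxed answer. A small helper, sketched here under assumptions, can separate the two. It assumes the trace ends with `</think>` and tolerates a missing opening tag, since some chat templates emit `<think>` as part of the prompt. The helper name and its return layout are hypothetical.

```python
import re


def split_response(response: str):
    """Split a decoded response into (reasoning, answer_text, choice).

    `choice` is the last boxed option letter in the answer text, or None.
    """
    head, sep, tail = response.rpartition("</think>")
    if sep:
        # Drop the opening tag if the model emitted it.
        reasoning = head.replace("<think>", "", 1).strip()
        answer_text = tail.strip()
    else:
        # No reasoning trace found: treat everything as the answer.
        reasoning, answer_text = "", response.strip()
    boxed = re.findall(r"\\boxed\{\s*([A-Za-z])\s*\}", answer_text)
    choice = boxed[-1].upper() if boxed else None
    return reasoning, answer_text, choice
```

Searching for the boxed letter only after the closing tag avoids picking up option letters that the model merely considers while reasoning.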

Inference with vLLM (Recommended)

# Start an OpenAI-compatible server (shell)
vllm serve afx-team/UI-UX --port 8000 --max-model-len 16384 \
    --dtype bfloat16 --enable-reasoning --reasoning-parser deepseek_r1

# Query the server (Python)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="afx-team/UI-UX",
    messages=[{"role": "user", "content": [...]}],
    max_tokens=8192
)
print(response.choices[0].message.content)

See cookbook/ for complete examples, including batch evaluation on UXBench.
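
The `content` list in the vLLM request above is elided. One plausible way to build it is sketched below, assuming the server accepts OpenAI-style `image_url` parts carrying base64 data URLs. `build_content` is a hypothetical helper, not part of this repository, and the cookbook scripts may construct the request differently.

```python
import base64
import mimetypes
from pathlib import Path


def build_content(image_path: str, question: str) -> list:
    """Build an OpenAI-style multimodal content list from a local image.

    The image is inlined as a base64 data URL so that the server does not
    need access to the local file system.
    """
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return [
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        {"type": "text", "text": question},
    ]
```

The returned list can be passed as the `content` value of the user message. The question text should follow the same task prompt and output format as the Transformers example.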


Repository Structure

UI-UX/
├── assets/                 # Logo, figures
├── cookbook/               # Usage examples
│   ├── inference_transformers.py
│   ├── inference_vllm.py
│   ├── eval_uxbench.py
│   ├── serve_vllm.sh
│   └── README.md
├── .github/                # Issue / PR templates
├── app.py                  # Gradio demo
├── README.md
├── requirements.txt
├── LEGAL.md
└── LICENSE

Citation

If you find UXBench or UI-UX useful in your research, please cite:

@inproceedings{uiux2026,
  title={Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach},
  author={Mao, Ruichao and Fang, Zhou and Guo, Teng and Yang, Hao and Li, Yaping and Peng, Shaohua and Huang, Maji and Lin, Xiaoyu and Liu, Shuoyang and Li, Xuepeng and Zhang, Yuyu and Rao, Hai},
  booktitle={CVPR Findings},
  year={2026}
}

License

This project is released under the MIT License. Please review LEGAL.md for additional terms governing the model and benchmark.


Acknowledgements

UI-UX builds upon Qwen3-VL and is trained with ms-swift. We thank the open-source community for these foundations.
