Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

📅 2026-04-28

🤖 AI Summary

This work addresses a gap in how vision-language models (VLMs) are evaluated on user interfaces: prior studies have focused primarily on static screenshots, leaving it unclear how well these models interpret the functional animations prevalent in modern interfaces. To bridge this gap, the study introduces AniMINT, a dataset of 300 densely annotated UI animation videos, and evaluates state-of-the-art VLMs on three abilities: perceiving animation effects, identifying animation purposes, and interpreting animation meaning. It then uses Motion, Context, and Perceptual Cues (MCPC) to probe the factors that affect model performance. The findings show that VLMs reliably detect primitive motion but remain inconsistent on high-level animation interpretation, with substantial gaps relative to human performance, which points to key bottlenecks and directions for future improvement.

📝 Abstract

AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.
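
The comparison the abstract describes (model performance on three abilities, measured against human performance) could be tabulated roughly as in the sketch below. This is only an illustration under stated assumptions: the task names, the record format, and the helper functions `accuracy_by_task` and `gap_to_human` are hypothetical, and the toy records are made up. None of it is taken from the paper's actual evaluation code or results.

```python
from collections import defaultdict

# Hypothetical task labels mirroring the three abilities named in the abstract.
TASKS = ("effect_perception", "purpose_identification", "meaning_interpretation")


def accuracy_by_task(records):
    """Return {task: fraction correct} for an iterable of (task, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, is_correct in records:
        totals[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / totals[task] for task in totals}


def gap_to_human(model_acc, human_acc):
    """Per-task difference (human minus model) for tasks present in both dicts."""
    return {t: human_acc[t] - model_acc[t] for t in model_acc if t in human_acc}


# Made-up toy records for illustration only; NOT results from the paper.
model_records = [
    ("effect_perception", True), ("effect_perception", True),
    ("purpose_identification", True), ("purpose_identification", False),
    ("meaning_interpretation", False), ("meaning_interpretation", False),
]
# Assumed human baseline: every toy item answered correctly.
human_records = [(task, True) for task, _ in model_records]

model_acc = accuracy_by_task(model_records)
human_acc = accuracy_by_task(human_records)
gaps = gap_to_human(model_acc, human_acc)

for task in TASKS:
    print(f"{task}: model={model_acc[task]:.2f} gap={gaps[task]:.2f}")
```

Reporting the gap per task rather than as a single aggregate keeps low-level perception and high-level interpretation separate, which is the distinction the abstract draws.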

Problem

Research questions and friction points this paper is trying to address.

UI animation

Vision Language Models

dynamic interface understanding

animation interpretation

human-computer interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

UI animation understanding

Vision Language Models

AniMINT dataset