🤖 AI Summary
Resource-constrained edge devices face challenges in accurately understanding user intent from UI interaction traces, while simultaneously ensuring privacy preservation and real-time responsiveness.
Method: This paper proposes a two-stage decomposed architecture: (1) generating structured sequential summaries of interaction behaviors, followed by (2) lightweight intent inference based on these summaries. The approach integrates context-aggregated enhancement and task-adaptive fine-tuning to strengthen semantic modeling capabilities of small models.
Contribution/Results: Experimental results demonstrate that, under identical privacy guarantees and low-latency constraints, the proposed method achieves higher intent recognition accuracy than state-of-the-art large multimodal language models. It establishes an efficient, privacy-aware, and real-time interaction understanding paradigm for on-device intelligent agents.
📝 Abstract
Understanding user intents from UI interaction trajectories remains a challenging yet crucial frontier in intelligent agent development. While massive, datacenter-based multimodal large language models (MLLMs) have the capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address this limitation with a novel decomposed approach: first, we perform structured interaction summarization, capturing the key information from each user action; second, we perform intent extraction with a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
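The two-stage decomposition described above can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the paper's implementation: the `UIAction` structure, the summary template, and the keyword heuristic in `infer_intent` are all hypothetical stand-ins for the small on-device models the paper actually fine-tunes.

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """One raw UI interaction (hypothetical schema for illustration)."""
    element: str     # UI element acted on, e.g. "search_bar"
    verb: str        # interaction type, e.g. "tap", "type"
    value: str = ""  # optional payload, e.g. typed text

def summarize_action(action: UIAction) -> str:
    """Stage 1: compress one raw action into a structured textual summary.
    In the paper a small model produces this; a template stands in here."""
    if action.value:
        return f"{action.verb}({action.element}: {action.value!r})"
    return f"{action.verb}({action.element})"

def infer_intent(summaries: list[str]) -> str:
    """Stage 2: lightweight intent inference over the aggregated summaries.
    A fine-tuned small model would consume the joined summaries; this
    keyword heuristic is only a placeholder."""
    trace = " -> ".join(summaries)
    if "search" in trace:
        return "search"
    if "checkout" in trace or "buy" in trace:
        return "purchase"
    return "browse"

# Example trajectory: the raw actions never leave the device; only the
# compact stage-1 summaries are passed to the stage-2 intent model.
trace = [
    UIAction("search_bar", "tap"),
    UIAction("search_bar", "type", "wireless earbuds"),
    UIAction("result_3", "tap"),
]
summaries = [summarize_action(a) for a in trace]
print(infer_intent(summaries))  # -> search
```

The key design point the sketch mirrors is that stage 2 operates only on aggregated summaries, which keeps the second model's input short and structured, enabling the low-latency, on-device inference the paper targets.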