🤖 AI Summary
This work addresses the challenge of transforming continuous desktop interaction streams into structured personal logs efficiently and privately on-device, where limited computational resources and cross-task context contamination are the main obstacles. The authors propose FOCAL, a privacy-first multi-agent system with a unified filter-plan-log architecture, in which a lightweight Filter Agent, a text-only Brain Agent, a Record Agent, and a task-isolated Memory Agent collaborate to reduce reliance on vision-language models (VLMs). The authors present it as pioneering efficient on-device summarization of instruction-free desktop streams. On DesktopBench, it reduces total token consumption by 60.4% and VLM calls by 72.3% relative to a baseline, while improving Key Information Recall (KIR) from 0.38 to 0.61. It also remains robust under task interruption, maintaining a task accuracy of 0.81 and a KIR of 0.80.
📝 Abstract
Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under $A{\to}B{\to}A$ task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.
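The cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Frame` schema, the duplicate-based filter, the keyword-based task attribution, the "empty text means vision is needed" rule, and the stubbed VLM call are all assumptions made for illustration. It only shows the control flow of a filter-plan-log pipeline in which the costly visual step runs selectively and each task keeps its own memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One captured desktop event (hypothetical schema, not from the paper)."""
    window_title: str
    ocr_text: str


def filter_agent(frames):
    """Toy noise filter: drop a frame identical to the one just before it."""
    kept, previous = [], None
    for frame in frames:
        if frame != previous:
            kept.append(frame)
        previous = frame
    return kept


def brain_agent(frame, task_keywords):
    """Text-only attribution (assumed rule): match a keyword in the window title.

    Returns (task_id, needs_vision). Vision is requested only when no text
    is available for the frame.
    """
    title = frame.window_title.lower()
    task_id = next(
        (task for task, keyword in task_keywords.items() if keyword in title),
        "unassigned",
    )
    return task_id, not frame.ocr_text.strip()


class RecordAgent:
    """Counts calls to a stubbed VLM so the selective use is visible."""

    def __init__(self):
        self.vlm_calls = 0

    def describe(self, frame, needs_vision):
        if needs_vision:
            self.vlm_calls += 1  # stand-in for a real VLM invocation
            return f"[vlm] visual description of '{frame.window_title}'"
        return frame.ocr_text


class MemoryAgent:
    """Task-isolated memory: each task id owns a separate log."""

    def __init__(self):
        self.logs = defaultdict(list)

    def append(self, task_id, note):
        self.logs[task_id].append(note)


def run_pipeline(frames, task_keywords):
    recorder, memory = RecordAgent(), MemoryAgent()
    for frame in filter_agent(frames):
        task_id, needs_vision = brain_agent(frame, task_keywords)
        memory.append(task_id, recorder.describe(frame, needs_vision))
    return dict(memory.logs), recorder.vlm_calls


# An A -> B -> A interruption: report, then email, then back to the report.
stream = [
    Frame("report.docx - Editor", "Draft intro"),
    Frame("report.docx - Editor", "Draft intro"),  # duplicate, filtered out
    Frame("Inbox - Mail", "Reply to Sam"),
    Frame("report.docx - Editor", ""),  # no text, so the VLM stub is called
]
logs, vlm_calls = run_pipeline(stream, {"report": "report.docx", "email": "inbox"})
```

In this toy run the report task's log resumes after the email interruption without absorbing the email note, and only one of the four frames reaches the stubbed VLM. The real agents presumably use learned models rather than these hand-written rules.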