🤖 AI Summary
This work addresses the latency and redundant computation incurred by conventional proactive agents, which render user activity as text and call a large language model (LLM) on every event to decide whether to act. The authors instead model user activity as a structured temporal event stream of (actor, verb, object, timestamp) tuples and use a lightweight Temporal Graph Learning (TGL) model to process the graph data that the operating system already maintains. A single forward pass yields a per-event trigger probability and a per-entity routing score, and the LLM is invoked only when the trigger fires, to generate the user-facing response. This removes the "structure-to-text-and-back" round-trip. In experiments, TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0) and, in trigger-architecture comparisons, gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4–7× and 12–83× faster, respectively, than the single-forward LLM-as-trigger configurations tested, with a BF16 resident footprint of approximately 220 MiB that permits on-device deployment.
📝 Abstract
Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4–7× and 12–83× faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
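The control flow described in the abstract can be illustrated with a minimal sketch: an encoder scores each event, and the LLM is called only if the trigger probability crosses a threshold. This is not the authors' implementation. The names `Event`, `Handoff`, `process_event`, `toy_encoder`, and `toy_llm`, the 0.5 threshold, and the `top_k` cutoff are all assumptions made for illustration, and the encoder here is a hand-written rule standing in for a trained TGL model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One (actor, verb, object, timestamp) tuple from the activity stream."""
    actor: str
    verb: str
    obj: str
    timestamp: float


@dataclass
class Handoff:
    """Hypothetical structured payload passed to the downstream LLM agent."""
    event: Event
    trigger_prob: float
    top_entities: List[str]


# Assumed encoder interface: one call returns a trigger probability for the
# event and a routing score for each candidate entity.
Encoder = Callable[[Event], Tuple[float, Dict[str, float]]]
Agent = Callable[[Handoff], str]


def process_event(
    event: Event,
    encoder: Encoder,
    llm: Agent,
    threshold: float = 0.5,  # assumed value; the abstract gives no threshold
    top_k: int = 2,          # assumed number of entities to route
) -> Optional[str]:
    """Run the cheap encoder on every event; call the LLM only on a trigger."""
    trigger_prob, routing = encoder(event)
    if trigger_prob < threshold:
        return None  # no LLM call for this event
    ranked = sorted(routing, key=routing.get, reverse=True)[:top_k]
    return llm(Handoff(event, trigger_prob, ranked))


def toy_encoder(event: Event) -> Tuple[float, Dict[str, float]]:
    """Stand-in for the TGL model: a fixed rule, not a learned network."""
    prob = 0.9 if event.verb == "received_invite" else 0.1
    return prob, {event.actor: 0.2, event.obj: 0.8}


llm_calls: List[Handoff] = []  # records each handoff so LLM calls can be counted


def toy_llm(handoff: Handoff) -> str:
    """Stand-in for the downstream agent that writes the user-facing sentence."""
    llm_calls.append(handoff)
    return f"Heads up about {handoff.top_entities[0]}."
```

In this sketch the per-event cost is the encoder call alone unless the trigger fires, which is the property the latency figures in the abstract rely on. A real system would replace `toy_encoder` with a model that updates and reads a temporal graph rather than inspecting a single event in isolation.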