🤖 AI Summary
This work addresses a limitation of existing vision-language models in remote-control TV interaction: without awareness of interface topology, they struggle with long-horizon, focus-oriented navigation tasks. To this end, we present the first systematic modeling of remote TV interaction through TVWorld, a graph-structured, offline abstraction of TV navigation, accompanied by two benchmarks: TVWorld-N (topology-aware navigation) and TVWorld-G (focus-aware grounding). We further introduce a Topology-Aware Training framework that integrates large vision-language models with graph-based abstractions and a focus-aware grounding mechanism to train agents with explicit topological reasoning capabilities. Our proposed model, TVTheseus, achieves a 68.3% success rate on TVWorld-N, outperforming closed-source baselines such as Gemini 3 Flash and establishing a new state of the art.
📝 Abstract
Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction, which is common in everyday TV usage, remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses provide further insights into the development of effective TV-use agents.
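To illustrate what an offline graph-based abstraction of remote-control navigation could look like, here is a minimal Python sketch. It is not the TVWorld format or API: the state names, the key set, the toy graph, and the helper functions `step`, `shortest_key_sequence`, and `run_episode` are all hypothetical and chosen only for illustration. Each node stands for a screen together with its focused element, and each edge for a single key press.

```python
from collections import deque

# Hypothetical toy graph: each node is a "screen/focused element" state and
# each edge is one remote-control key press. This is an illustrative
# assumption, not the actual TVWorld data format.
TOY_GRAPH = {
    "home/apps": {"RIGHT": "home/settings", "OK": "apps/grid"},
    "home/settings": {"LEFT": "home/apps", "OK": "settings/network"},
    "settings/network": {"DOWN": "settings/display", "BACK": "home/settings"},
    "settings/display": {"UP": "settings/network", "BACK": "home/settings"},
    "apps/grid": {"BACK": "home/apps"},
}


def step(state, key):
    """Return the next state; a key with no outgoing edge leaves the focus unchanged."""
    return TOY_GRAPH[state].get(key, state)


def shortest_key_sequence(start, goal):
    """Breadth-first search for a shortest key sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, keys = queue.popleft()
        if state == goal:
            return keys
        for key, nxt in TOY_GRAPH[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, keys + [key]))
    return None  # goal is unreachable from start


def run_episode(start, goal, keys):
    """Replay a sequence of key presses offline and report whether the goal was reached."""
    state = start
    for key in keys:
        state = step(state, key)
    return state == goal


if __name__ == "__main__":
    plan = shortest_key_sequence("home/apps", "settings/display")
    print(plan, run_episode("home/apps", "settings/display", plan))
```

Because every transition is a lookup in a fixed graph, an episode can be replayed without a physical TV, which is the sense in which such an abstraction is reproducible and deployment-free. The sketch also shows why topology matters: reaching a distant element requires a multi-step key sequence, since the focus cannot jump to it directly as it could with point-and-click.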