SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

📅 2025-11-20
🤖 AI Summary
Existing embodied agents struggle to reliably operate tangible control interfaces (e.g., switches, appliance panels) in long-horizon real-world scenarios, primarily due to deficiencies in commonsense reasoning, spatiotemporal causal modeling (e.g., delayed responses), handling of partial observability, and result verification. To address this, we introduce SWITCH-Basic, the first multitask benchmark dedicated to such interfaces, built on egocentric RGB videos spanning 98 real-world devices and 351 tasks. It systematically evaluates five core capabilities: task-aware visual question answering, semantic UI grounding, action generation, state-transition prediction, and result verification. Experiments reveal that leading commercial and open-source multimodal large language models perform inconsistently, relying heavily on textual cues while neglecting dynamic visual evidence. We publicly release the dataset, code, and a held-out test set to advance research in embodied causal reasoning and safety-critical verification.

📝 Abstract
Autonomous intelligence requires not only perception and reasoning but, critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand not only commonsense and physics reasoning but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.
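For concreteness, the sketch below shows what a per-capability evaluation loop over SWITCH-Basic tasks might look like. This is a minimal illustration under stated assumptions, not the benchmark's actual API: the JSON manifest layout, the field names (`capability`, `video_path`, `prompt`, `answer`), and the `model` callable are all hypothetical; the real harness is in the linked repository.

```python
# Hypothetical evaluation loop for SWITCH-Basic. All field names and the
# model interface are illustrative assumptions, not the benchmark's real API.
import json
from pathlib import Path

# The five abilities named in the abstract.
CAPABILITIES = (
    "task_aware_vqa",
    "semantic_ui_grounding",
    "action_generation",
    "state_transition_prediction",
    "result_verification",
)

def evaluate(model, manifest_path: str) -> dict:
    """Score `model` on each task, bucketing accuracy by capability.

    `model` is assumed to be a callable taking an egocentric video path and
    a text prompt and returning a string answer (adapt to your LMM client).
    """
    correct = {c: 0 for c in CAPABILITIES}
    total = {c: 0 for c in CAPABILITIES}
    tasks = json.loads(Path(manifest_path).read_text())  # assumed: JSON list
    for task in tasks:
        cap = task["capability"]  # assumed field naming one of CAPABILITIES
        if cap not in correct:
            continue
        answer = model(video=task["video_path"], prompt=task["prompt"])
        total[cap] += 1
        correct[cap] += int(answer.strip() == str(task["answer"]).strip())
    # Report per-capability accuracy rather than one macro average, so a
    # collapse on e.g. result verification cannot hide behind strong VQA.
    return {c: correct[c] / total[c] for c in CAPABILITIES if total[c]}
```

Reporting the five accuracies separately, rather than a single macro average, is one direct way to surface the failure-masking the abstract warns about.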
Problem

Research questions and friction points this paper is trying to address.

Benchmarking AI interaction with tangible control interfaces in real environments
Addressing gaps in grounding, partial observability, and outcome verification
Evaluating multimodal reasoning for physical device manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark tests embodied AI interaction with tangible control interfaces
Evaluates five abilities, including VQA and action generation (see the sketch after this list)
Uses egocentric RGB video input across 98 diverse real devices
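As a toy illustration of the abstract's caveat that high aggregate scores can mask capability-level failures, consider the made-up numbers below; they are invented for this sketch, not results from the paper.

```python
# Made-up per-capability scores; not results reported in the paper.
scores = {
    "task_aware_vqa": 0.90,
    "semantic_ui_grounding": 0.85,
    "action_generation": 0.80,
    "state_transition_prediction": 0.75,
    "result_verification": 0.20,  # near-total failure on verification
}

aggregate = sum(scores.values()) / len(scores)
print(f"aggregate accuracy: {aggregate:.2f}")  # 0.70 looks respectable...
worst = min(scores, key=scores.get)
print(f"weakest capability: {worst} = {scores[worst]:.2f}")  # ...but hides 0.20
```

A model with this profile would fail precisely on the safety-critical verification step while posting a passable headline number, which is why per-capability reporting matters for this benchmark.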