🤖 AI Summary
This paper introduces the problem of Human-Centered Open-world Task Discovery (HOTD): automatically identifying tasks that substantially reduce human cognitive and operational burden across multiple plausible future scenarios, under conditions of highly concurrent and dynamically evolving human intentions. To address this, we establish HOTD-Bench—the first large-scale benchmark for HOTD—comprising 2,000+ real-world videos and a rigorous simulation-based evaluation protocol. We further propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes complex task discovery into scalable, tree-structured reasoning and supports plug-and-play integration of mainstream large multimodal models (LMMs). Experiments demonstrate that CMAST significantly outperforms existing LMMs on HOTD-Bench, marking the first systematic advancement in human-intention-driven, embodied task discovery research.
📝 Abstract
Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic. In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose an HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.