🤖 AI Summary
This work addresses the limited generalization of multimodal large language models in out-of-distribution scenarios when performing complex embodied tasks. To this end, the authors propose the Verifier-Guided Action Selection (VeGAS) framework, which generates multiple candidate actions during inference and employs a policy-independent generative verifier to select the optimal one, substantially enhancing decision robustness. Key innovations include a large language model–based automatic data synthesis mechanism to construct a diverse curriculum of failure cases for verifier training, as well as the integration of chain-of-thought reasoning with ensemble sampling. Experimental results demonstrate that VeGAS achieves strong performance on the Habitat and ALFRED benchmarks, yielding up to a 36% relative improvement over strong baselines on the most challenging long-horizon, multi-object tasks.
📝 Abstract
Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet they remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an off-the-shelf MLLM as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
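The inference-time procedure described in the abstract (sample several candidate actions, then let a separate verifier choose among them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_action`, `make_toy_policy`, and `toy_verifier` are hypothetical names, the policy and verifier are simple stand-in functions rather than MLLMs, and picking the highest-scoring candidate is an assumption about how the verifier's output is used.

```python
from itertools import cycle
from typing import Callable, Dict, List

Observation = Dict[str, str]


def select_action(
    observation: Observation,
    sample_action: Callable[[Observation], str],
    score_action: Callable[[Observation, str], float],
    num_candidates: int = 5,
) -> str:
    """Sample candidate actions from the policy and return the one the verifier scores highest."""
    # Draw an ensemble of candidates; the policy itself is left unmodified.
    candidates: List[str] = [sample_action(observation) for _ in range(num_candidates)]
    # Drop duplicates while keeping first-seen order, so each action is scored once.
    unique = list(dict.fromkeys(candidates))
    # Assumption: the verifier returns a scalar score, and higher means more reliable.
    scores = [score_action(observation, action) for action in unique]
    best_index = max(range(len(unique)), key=scores.__getitem__)
    return unique[best_index]


def make_toy_policy() -> Callable[[Observation], str]:
    """Hypothetical stand-in for a stochastic MLLM policy: cycles through fixed actions."""
    actions = cycle(["open(fridge)", "navigate(table)", "pick_up(mug)"])
    return lambda observation: next(actions)


def toy_verifier(observation: Observation, action: str) -> float:
    """Hypothetical stand-in for a trained verifier: rewards actions that mention the goal object."""
    return 1.0 if observation["goal_object"] in action else 0.0


if __name__ == "__main__":
    obs = {"goal_object": "mug"}
    print(select_action(obs, make_toy_policy(), toy_verifier, num_candidates=3))
```

In this toy setup, a single sample would commit to whatever the policy decodes first, while sampling three candidates lets the verifier recover the goal-relevant action. In the framework the abstract describes, the scoring function would be a generative verifier trained on synthesized failure cases; the abstract reports that an off-the-shelf MLLM in that role gave no improvement.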