🤖 AI Summary
This work addresses the poor robustness and high-cost failures of vision-language models (VLMs) when employed as black-box symbolic planners in closed-loop robotic tasks. We introduce, for the first time, a control-theoretic formulation of VLM-based dynamic planning, proposing a synergistic optimization framework that jointly leverages control horizon and warm-starting to explicitly embed symbolic planning within a feedback control loop. Comprehensive experiments demonstrate that our approach significantly improves planning success rate (+27.3%) and real-time performance (39% reduction in reasoning steps) on complex robotic tasks, while enhancing robustness against observation noise and actuation errors. The method establishes an interpretable, analyzable paradigm for reliably deploying VLMs in safety-critical, high-level robotic planning.
📝 Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have been widely used for embodied symbolic planning. Yet, how to effectively use these models for closed-loop symbolic planning remains largely unexplored. Because they operate as black boxes, LLMs and VLMs can produce unpredictable or costly errors, making their use in high-level robotic planning especially challenging. In this work, we investigate how to use VLMs as closed-loop symbolic planners for robotic applications from a control-theoretic perspective. Concretely, we study how the control horizon and warm-starting impact the performance of VLM symbolic planners. We design and conduct controlled experiments to gain insights that are broadly applicable to utilizing VLMs as closed-loop symbolic planners, and we discuss recommendations that can help improve the performance of VLM symbolic planners.