🤖 AI Summary
Robots face three key challenges in lifelong learning: slow adaptation to new tasks, planning that lacks geometric and physical grounding, and unstable outputs from large language models (LLMs) and vision-language models (VLMs). To address these, we propose a vision-grounded replanning framework integrated with a reusable skill memory module. Our approach leverages VLMs to establish visual grounding of scene geometry and object physical properties, while employing LLMs for knowledge-informed high-level planning. A state-feedback-driven dynamic replanning mechanism enables robust recovery from execution failures, and a skill memory module consolidates successful experiences for cross-task transfer. Evaluated on LIBERO, RLBench, and real robotic platforms, our method significantly improves task success rates and generalization over state-of-the-art baselines, establishing a reliable, adaptive, and sustainable closed-loop autonomous learning system for robots.
📝 Abstract
Robots trained via Reinforcement Learning (RL) or Imitation Learning (IL) often adapt slowly to new tasks, whereas recent Large Language Models (LLMs) and Vision-Language Models (VLMs) promise knowledge-rich planning from minimal data. Deploying LLMs/VLMs for motion planning, however, faces two key obstacles: (i) symbolic plans are rarely grounded in scene geometry and object physics, and (ii) model outputs can vary across identical prompts, undermining execution reliability. We propose ViReSkill, a framework that pairs vision-grounded replanning with a skill memory for accumulation and reuse. When a failure occurs, the replanner generates a new action sequence conditioned on the current scene, tailored to the observed state. On success, the executed plan is stored as a reusable skill and replayed in future encounters without additional calls to LLMs/VLMs. This feedback loop enables autonomous continual learning: each attempt immediately expands the skill set and stabilizes subsequent executions. We evaluate ViReSkill on the LIBERO and RLBench simulation benchmarks as well as on a physical robot. Across all settings, it consistently outperforms conventional baselines in task success rate, demonstrating robust sim-to-real generalization.
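The replay-or-replan feedback loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the names `SkillMemory`, `plan_with_vlm`, and `execute` are hypothetical stand-ins for the skill store, the LLM/VLM planner, and robot execution.

```python
# Hypothetical sketch of a replan-or-replay loop with skill memory.
# SkillMemory, plan_with_vlm, and execute are illustrative stand-ins,
# not ViReSkill's actual API.

class SkillMemory:
    """Stores successful plans keyed by task description for later reuse."""
    def __init__(self):
        self._skills = {}

    def lookup(self, task):
        return self._skills.get(task)

    def store(self, task, plan):
        self._skills[task] = plan

def plan_with_vlm(task, scene):
    # Stand-in for the LLM/VLM planner: returns an action sequence
    # conditioned on the current scene observation.
    return [f"{task}:step{i}" for i in range(3)]

def execute(plan):
    # Stand-in for robot execution; returns True on success.
    return True

def run_task(task, scene, memory, max_attempts=3):
    plan = memory.lookup(task)              # replay a stored skill if one exists
    for _ in range(max_attempts):
        if plan is None:
            plan = plan_with_vlm(task, scene)   # otherwise replan from the scene
        if execute(plan):
            memory.store(task, plan)        # consolidate success into skill memory
            return plan
        plan = None                         # failure: discard and force replanning
    return None
```

A stored skill is replayed on later encounters of the same task without further planner calls, which is what stabilizes repeated executions in the loop above.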