🤖 AI Summary
This work addresses scalability and out-of-distribution generalization bottlenecks in open-vocabulary physical skill learning for simulated agents, without relying on hand-crafted rewards or task-specific demonstrations. The proposed self-optimizing framework features: (1) an LLM–VLM closed loop, in which the LLM generates physically grounded constraints and the VLM evaluates motion semantics; and (2) a lightweight Pose2CLIP mapper that bridges the domain gap between simulated pose representations and visual-semantic embeddings. The method integrates large language models, vision-language models, reinforcement learning, and cross-modal feature alignment. Across diverse morphologies and learning paradigms, it achieves a 22.2% improvement in motion naturalness, a 25.7% increase in task success rate, and an 8.4× training speedup, while significantly enhancing zero-shot task transfer and out-of-distribution generalization.
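The closed loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual API: the function names, the constraint form, and the feedback fields are all hypothetical stand-ins for the LLM call, the RL training loop, and the VLM evaluator.

```python
# Hypothetical sketch of an LLM-VLM self-optimizing reward loop.
# All names and signatures below are illustrative assumptions, not GROVE's code.

def llm_generate_constraints(task, feedback=None):
    """Stand-in for an LLM call returning physical constraint functions.

    E.g. for "jump": penalize pelvis height below a target apex. VLM feedback
    from the previous iteration nudges the constraint threshold.
    """
    target = 1.2 if feedback is None else 1.2 + 0.1 * feedback["shortfall"]
    return [lambda pose: min(pose["pelvis_height"] - target, 0.0)]

def train_policy(constraints, task):
    """Stand-in for RL training against the constraint reward.

    Returns summary statistics of the best rollout (dummy values here).
    """
    return {"apex": 1.15, "pelvis_height": 1.15}

def vlm_score_motion(motion, task):
    """Stand-in for a VLM that rates semantic match and flags deficiencies."""
    return {"semantic": 0.8, "shortfall": max(0.0, 1.2 - motion["apex"])}

def grove_style_loop(task, iterations=3):
    """Iterate: LLM proposes constraints -> policy trains -> VLM refines."""
    feedback = None
    for _ in range(iterations):
        constraints = llm_generate_constraints(task, feedback)
        motion = train_policy(constraints, task)
        feedback = vlm_score_motion(motion, task)
    return motion, feedback
```

The key design point is that neither model alone closes the loop: the LLM supplies precise but possibly miscalibrated constraints, and the VLM's semantic score grounds them against the motion actually produced.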
📝 Abstract
Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models (LLMs) and Vision-Language Models (VLMs) provide complementary guidance: LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4× faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.
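The Pose2CLIP idea can be made concrete with a small sketch. Everything here is an assumption for illustration: the mapper architecture (a two-layer MLP), the dimensions (a 69-D pose vector, a 512-D embedding matching CLIP ViT-B), and the reward (cosine similarity against a text embedding) are plausible instantiations, not the paper's confirmed design.

```python
import numpy as np

# Minimal sketch (assumed architecture/dimensions): map a simulator pose
# vector straight into CLIP's embedding space, so rewards can compare poses
# to text prompts without rendering any frames.

rng = np.random.default_rng(0)
POSE_DIM, HIDDEN, CLIP_DIM = 69, 256, 512  # e.g. 23 joints x 3 DoF -> CLIP-B width

# Randomly initialized weights stand in for a trained mapper; in practice the
# MLP would be fit to match CLIP image embeddings of rendered poses.
W1 = rng.standard_normal((POSE_DIM, HIDDEN)) * 0.05
W2 = rng.standard_normal((HIDDEN, CLIP_DIM)) * 0.05

def pose2clip(pose):
    """Project a pose vector into a unit-normalized semantic feature space."""
    h = np.maximum(pose @ W1, 0.0)   # ReLU hidden layer
    z = h @ W2
    return z / np.linalg.norm(z)

def semantic_reward(pose, text_embedding):
    """Cosine similarity between the mapped pose and a text prompt embedding."""
    t = text_embedding / np.linalg.norm(text_embedding)
    return float(pose2clip(pose) @ t)
```

Because the mapper is a tiny MLP, evaluating the semantic reward costs one matrix multiply per step instead of a full render plus CLIP image-encoder pass, which is consistent with the reported training speedup.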