HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current embodied agents lack adaptive runtime monitoring and online error correction capabilities for complex tasks. To address this, we propose a closed-loop programming framework that integrates code generation, object-centric geometric localization, vision-language model (VLM)-driven perceptual monitoring, and structured execution tracing. The framework establishes a hybrid dual-feedback mechanism, combining program execution traces with VLM-derived perceptual feedback, to enable iterative program synthesis and dynamic error recovery without dense human supervision. Experiments demonstrate substantial improvements in the robustness and sample efficiency of robotic manipulation policies. The framework thus offers a scalable paradigm for integrating multimodal reasoning into autonomous decision-making systems: by unifying symbolic program execution with grounded perceptual feedback, it advances trustworthy closed-loop control for embodied intelligence.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair code during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual-feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
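The abstract's synthesize → execute → monitor → repair cycle can be sketched as a small control loop. This is a minimal illustrative sketch, not the paper's actual implementation: every function and data structure below (`synthesize_program`, `execute_in_sim`, `vlm_feedback`, `repair`) is a hypothetical stand-in for what the real system delegates to an LLM code generator, a physics simulator, and a VLM monitor.

```python
# Hedged sketch of a closed-loop "synthesize -> execute -> monitor -> repair"
# cycle, in the spirit of HyCodePolicy. All names are illustrative assumptions.

def synthesize_program(subgoals):
    """Stand-in for LLM code synthesis: one action per subgoal."""
    return [{"subgoal": g, "action": f"do_{g}"} for g in subgoals]

def execute_in_sim(program):
    """Stand-in for simulated execution with a structured trace.
    Here, any action containing 'misgrasp' is treated as a failure."""
    trace = []
    for i, step in enumerate(program):
        ok = "misgrasp" not in step["action"]
        trace.append({"step": i, "action": step["action"], "ok": ok})
        if not ok:
            break  # execution halts at the first failed program-level event
    return trace

def vlm_feedback(trace):
    """Stand-in for VLM checkpoint monitoring: localize the first failure
    and attribute a perceptual failure reason. Returns None on success."""
    for event in trace:
        if not event["ok"]:
            return {"failed_step": event["step"], "reason": "object not grasped"}
    return None

def repair(program, feedback):
    """Stand-in for LLM-driven repair guided by the fused feedback."""
    step = program[feedback["failed_step"]]
    step["action"] = step["action"].replace("misgrasp", "grasp")
    return program

def closed_loop(subgoals, max_iters=3):
    program = synthesize_program(subgoals)
    program[1]["action"] = "do_misgrasp_cube"  # seed a failure to repair
    for _ in range(max_iters):
        trace = execute_in_sim(program)
        feedback = vlm_feedback(trace)
        if feedback is None:
            return program, True  # all checkpoints passed
        program = repair(program, feedback)
    return program, False

program, success = closed_loop(["reach_cube", "grasp_cube", "place_cube"])
print(success)  # True: the seeded failure is repaired in one iteration
```

The key structural idea the sketch preserves is the dual feedback: the execution trace tells the repairer *where* the program failed at the event level, while the (here simulated) VLM feedback supplies *why* from perception, and both are fused before repair.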
Problem

Research questions and friction points this paper is trying to address.

Adaptive monitoring and code repair in embodied agents
Integration of code synthesis and perceptual grounding
Self-correcting program synthesis with minimal supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid language-based control framework
Vision-language model for monitoring
Dual feedback for self-correcting programs