🤖 AI Summary
This work addresses a problem in multi-agent reinforcement learning: when natural language instructions interrupt long-horizon tasks, value estimates become inconsistent at the point where the instruction switches. The authors propose MAVIC, which decouples value estimation across instruction contexts at the macro-action level. Rather than relying on reward shaping, MAVIC corrects the Bellman backup target at instruction-switching boundaries, restoring the continuation value of the original task while aligning with the new instruction objective. The method is implemented in an actor-critic architecture and supported by theoretical analysis. Experiments in increasingly complex cooperative environments show that MAVIC achieves high instruction compliance while preserving base task performance.
📝 Abstract
Multi-agent reinforcement learning (MARL) systems deployed in real-world settings may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode: Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which adjusts Bellman backups at instruction boundaries, correcting for the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
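The abstract does not state the update rule, so the following is only a minimal sketch of the failure mode it describes, under one simplified reading of "restoring the continuation value under the current objective". It contrasts a macro-action TD target that bootstraps from the value under whichever instruction is active at the next decision point with one that bootstraps under the instruction the macro-action was started with. The function names, arguments, and the correction itself are illustrative assumptions, not the authors' method.

```python
from typing import Sequence


def macro_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of the rewards collected while one macro-action ran."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def td_target(
    rewards: Sequence[float],
    gamma: float,
    v_next_same_instr: float,
    v_next_new_instr: float,
    switched: bool,
    corrected: bool,
) -> float:
    """Bootstrapped target for a macro-action that lasted len(rewards) steps.

    Hypothetical inputs (assumptions for illustration only):
      v_next_same_instr: critic value of the next state under the instruction
                         that was active when the macro-action started.
      v_next_new_instr:  critic value of the next state under the instruction
                         that interrupted the macro-action.
    """
    tau = len(rewards)
    g = macro_return(rewards, gamma)
    if switched and not corrected:
        # Naive backup: the target for the old instruction context is built
        # from a value estimated under the new one, coupling the two contexts.
        bootstrap = v_next_new_instr
    else:
        # Assumed correction: keep bootstrapping under the instruction the
        # macro-action was started with, so its continuation value survives.
        bootstrap = v_next_same_instr
    return g + (gamma ** tau) * bootstrap


if __name__ == "__main__":
    naive = td_target([1.0, 1.0], 0.5, 4.0, 0.0, switched=True, corrected=False)
    fixed = td_target([1.0, 1.0], 0.5, 4.0, 0.0, switched=True, corrected=True)
    print(naive, fixed)
```

With these toy numbers the naive target is 1.5 and the corrected one is 2.5: the only difference is which value the backup bootstraps from, and the rewards are untouched. That is the sense in which such a fix acts on the bootstrapping target rather than on reward shaping.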