🤖 AI Summary
In multi-view visual reinforcement learning, methods face a fundamental trade-off among sample efficiency, deployment overhead, and policy robustness. To address this, we propose the Merge And Disentanglement (MAD) framework: it enhances representation capacity via multi-view feature fusion while explicitly disentangling view-specific features to enable lightweight single-view deployment. MAD further integrates disentangled representation learning with visual-servoing-inspired Q-learning, jointly improving sample efficiency and policy generalization. Experiments on Meta-World and ManiSkill3 show substantial gains, including an average task success rate improvement of +12.7% and up to 1.8× higher sample efficiency over prior state-of-the-art methods, along with strong robustness and generalization to unseen scenarios and tasks.
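The merge-then-disentangle idea described above can be illustrated with a minimal sketch. This is not the actual MAD implementation (which uses image encoders and RL-specific losses; see the project code): all names, sizes, and the averaging fusion below are illustrative assumptions. The key structure is that each view is encoded separately, features are merged for training, and an alignment term pulls single-view features toward the merged feature so one camera suffices at deployment.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(view, W):
    # Per-view encoder; a linear map with tanh stands in for a CNN here.
    return np.tanh(view @ W)

def merge(features):
    # Merge multi-view features; averaging is one simple fusion choice.
    return np.mean(features, axis=0)

# Two camera views of the same scene (flattened images, hypothetical sizes).
view_a = rng.normal(size=(1, 64))
view_b = rng.normal(size=(1, 64))
W = rng.normal(size=(64, 32)) * 0.1  # shared encoder weights

z_a, z_b = encode(view_a, W), encode(view_b, W)
z_merged = merge([z_a, z_b])

# Alignment objective (sketch): pull each single-view feature toward the
# merged feature, so a policy trained on merged features can later run
# from a single camera with minimal degradation.
align_loss = np.mean((z_a - z_merged) ** 2) + np.mean((z_b - z_merged) ** 2)
```

In this toy version the merged feature is just the per-view mean, so minimizing `align_loss` drives the two view encodings toward agreement; the paper's actual disentanglement losses are more involved.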
📝 Abstract
Vision is well known for its use in manipulation, especially through visual servoing. To make it robust, multiple cameras are needed to expand the field of view, which is computationally challenging. Merging multiple views and using Q-learning allow the design of more effective representations and the optimization of sample efficiency, but such a solution can be expensive to deploy. To mitigate this, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach on Meta-World and ManiSkill3. For the project website and code, see https://aalmuzairee.github.io/mad
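The Q-learning component the abstract refers to can be sketched in its simplest form: a temporal-difference update on a Q-function defined over encoder features. This is a generic TD(0) sketch, not MAD's actual actor-critic objective; the linear Q-head, sizes, and values below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 8, 3
W = np.zeros((n_features, n_actions))  # linear Q-head over encoder features
gamma, lr = 0.99, 0.1                  # discount and learning rate

# One transition: current/next features, action taken, reward received.
z, z_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, r = 1, 0.5

# TD(0) target and error for Q(z, a) = (z @ W)[a].
td_target = r + gamma * np.max(z_next @ W)
td_error = td_target - (z @ W)[a]

# Gradient step on the squared TD error (only the taken action's column moves).
W[:, a] += lr * td_error * z
```

In practice the features `z` would come from the merged multi-view encoder, so representation learning and Q-learning improve sample efficiency jointly.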