🤖 AI Summary
To address limitations in multi-task coordination, generalization to unseen environments, and spatial memory for robotic manipulation in dynamic open-world settings, this paper proposes SAM2Act, a vision-based policy framework that integrates a vision foundation model into a multi-view robotic transformer, and its enhanced variant, SAM2Act+, which adds an explicit spatial memory mechanism. Methodologically, the authors introduce a multi-view transformer with multi-resolution upsampling, together with a SAM2-inspired memory bank and an encoder-attention memory module. They further establish MemoryBench, a benchmark dedicated to evaluating spatial memory and action recall in robotic manipulation. Experiments show that SAM2Act achieves a mean success rate of 86.8% across 18 RLBench tasks and exhibits strong robustness, with only a 4.3% performance gap under environmental perturbations on The Colosseum benchmark; SAM2Act+ significantly outperforms existing methods on MemoryBench, demonstrating superior generalization and memory fidelity.
📝 Abstract
Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems. Project page: https://sam2act.github.io/
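To make the memory-bank-plus-attention idea concrete, here is a minimal NumPy sketch of the general pattern: a fixed-capacity bank stores past observation features, and the current query reads from it via attention. This is an illustrative toy, not SAM2Act+'s actual architecture; the class name, capacity, and feature dimension are all invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MemoryBank:
    """Toy FIFO bank of past observation features (hypothetical, for illustration)."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.dim = dim
        self.slots = []

    def write(self, feat):
        # Append the newest feature; evict the oldest when over capacity.
        self.slots.append(feat)
        if len(self.slots) > self.capacity:
            self.slots.pop(0)

    def read(self, query):
        # Attention readout: the query attends over all stored features.
        if not self.slots:
            return query
        mem = np.stack(self.slots)                 # (n, dim)
        scores = mem @ query / np.sqrt(self.dim)   # (n,) scaled dot-product
        weights = softmax(scores, axis=0)
        return weights @ mem                       # (dim,) weighted memory summary

bank = MemoryBank(capacity=4, dim=8)
rng = np.random.default_rng(0)
for _ in range(6):
    bank.write(rng.standard_normal(8))
readout = bank.read(rng.standard_normal(8))
```

In a memory-conditioned policy along these lines, the readout vector would be fused with the current observation's features before action prediction, letting the policy recall where past interactions occurred.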