🤖 AI Summary
This work addresses the challenge of maintaining geometric consistency in video object segmentation under large viewpoint changes, a setting where existing methods often rely on additional geometric priors such as camera poses or depth maps. For the first time, we integrate implicit 3D-aware features extracted by MUSt3R into the SAM2 architecture without requiring any explicit geometric prior. Our approach leverages a lightweight multi-level feature fusion module and a field-of-view-aware frame sampling strategy to achieve geometrically consistent segmentation from RGB-only inputs. Evaluated on large-baseline motion datasets including ScanNet++ and Replica, the proposed method significantly outperforms SAM2 and current video object segmentation approaches, achieving 90.6% IoU and 71.7% Positive IoU on the ScanNet++ Selected Subset.
📝 Abstract
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes because they rely on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view-aware sampling strategy that ensures sampled frames observe spatially consistent object regions, enabling reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by 15.9 and 30.4 points, respectively. Project page: https://jayisaking.github.io/3AM-Page/
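For readers unfamiliar with the reported metric, the following is a minimal sketch of how per-frame mask IoU can be computed and averaged over a video. It is not the authors' evaluation code. The function names `mask_iou` and `mean_iou`, the `positive_only` flag, and the convention that two empty masks score 1.0 are all assumptions made for illustration. In particular, `positive_only=True` reflects one plausible reading of "Positive IoU" (averaging only over frames where the ground-truth object is visible); the abstract does not define the term.

```python
from typing import List, Sequence

# A 2D binary mask: 1 (or True) marks an object pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: if both masks are empty, count it as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: List[Mask], gts: List[Mask], positive_only: bool = False) -> float:
    """Average per-frame IoU over a sequence of frames.

    With positive_only=True, frames whose ground-truth mask is empty are
    skipped. This is a hypothetical interpretation, not the paper's definition.
    """
    scores = []
    for pred, gt in zip(preds, gts):
        if positive_only and not any(any(row) for row in gt):
            continue
        scores.append(mask_iou(pred, gt))
    return sum(scores) / len(scores) if scores else 0.0


# Two toy 2x2 frames: the object is visible in the first and absent in the second.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
print(mean_iou(preds, gts))
print(mean_iou(preds, gts, positive_only=True))
```

Under the assumed empty-mask convention, frames where the object is correctly predicted as absent raise the plain average, which is one reason a metric restricted to object-visible frames can be substantially lower than the overall IoU.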