Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited robustness of mobile robot manipulation in open-world conditions, where changes in the onboard camera viewpoint cause visual scale variation and visual disturbances. The authors propose a stereo multistage spatial attention-based deep predictive learning method. The approach extracts task-relevant spatial attention points from stereo images and integrates them with robot state information through a hierarchical recurrent architecture for real-time closed-loop action prediction. On four real-world mobile manipulation tasks with randomized initial positions and visual disturbances, combining structured stereo spatial attention with predictive temporal modeling yields higher robustness and task success rates than representative imitation learning and vision-language-action baselines under identical control settings.
📝 Abstract
Robots operating in open, unstructured real-world environments must rely on onboard visual perception while autonomously moving across different locations. Continuous changes in onboard camera viewpoints cause significant visual scale variations in target objects, affecting vision-based motion generation. In this work, we present a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation. The proposed method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction. We evaluate the system on four real-world mobile manipulation tasks using a mobile manipulator, including rigid placement, articulated object manipulation, and deformable object interaction. Experiments under randomized initial positions and visual disturbance conditions demonstrate improved robustness and task success rates compared to representative imitation learning and vision-language-action baselines under identical control settings. The results indicate that structured stereo spatial attention combined with predictive temporal modeling provides an effective solution within the evaluated mobile manipulation scenarios.
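The abstract does not spell out how attention points are computed, so the following is only an illustrative sketch. It assumes a spatial soft-argmax, a common way to turn a 2D feature map into a single (x, y) point, and it assumes the left and right points are simply concatenated with the robot state before being passed to a recurrent model. The names `soft_argmax_2d`, `build_policy_input`, and the `temperature` parameter are hypothetical and not taken from the paper.

```python
import math


def soft_argmax_2d(feature_map, temperature=1.0):
    """Softmax-weighted (x, y) position of a 2D map, normalized to [0, 1].

    Assumption: this stands in for the paper's attention-point extraction,
    whose exact form is not given in the abstract.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [
        [math.exp((value - peak) / temperature) for value in row]
        for row in feature_map
    ]
    total = sum(sum(row) for row in weights)
    # Guard against division by zero for single-row or single-column maps.
    x_scale = max(width - 1, 1)
    y_scale = max(height - 1, 1)
    x = sum(w * col / x_scale for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r / y_scale for r, row in enumerate(weights) for w in row) / total
    return x, y


def build_policy_input(left_maps, right_maps, robot_state, temperature=0.1):
    """Concatenate left/right attention points with the robot state.

    Hypothetical helper: the resulting flat vector is what a recurrent
    model could consume at each control step.
    """
    features = []
    for feature_map in list(left_maps) + list(right_maps):
        features.extend(soft_argmax_2d(feature_map, temperature))
    return features + list(robot_state)


if __name__ == "__main__":
    # A 3x3 map with one strong activation at row 1, column 2.
    left = [[0.0, 0.0, 0.0], [0.0, 0.0, 10.0], [0.0, 0.0, 0.0]]
    # The same activation shifted one column left in the other view.
    right = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
    print(build_policy_input([left], [right], [0.1, 0.2]))
```

In this sketch the horizontal offset between the left and right points is the only place where stereo disparity, and therefore a cue to object distance and apparent scale, enters the input vector. Whether the paper uses that cue in the same way is not stated in the abstract.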
Problem

Research questions and friction points this paper is trying to address.

visual scale variation
mobile manipulation
visual disturbances
real-time perception
onboard visual perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

stereo spatial attention
multistage attention
real-time mobile manipulation
visual scale variation
predictive learning
Xianbo Cai
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan
Hideyuki Ichiwara
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan; SB Intuitions Corp., Tokyo, Japan
Hyogo Hiruma
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan; Research and Development Group, Hitachi, Ltd., Ibaraki, Japan
Masaki Yoshikawa
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan
Hiroshi Ito
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan; Research and Development Group, Hitachi, Ltd., Ibaraki, Japan
Tetsuya Ogata
Professor, Waseda University / Joint-appointed Fellow, AIST / Visiting Professor, NII
Deep Predictive Learning
Physical AI
Developmental Robotics