🤖 AI Summary
This work addresses three challenges in open-world quadrupedal loco-manipulation: dependence on external motion capture, high visual latency, and low sample efficiency. It proposes a fully onboard, egocentric vision-based pick-and-place system. The approach uses Sigma Points, a lightweight geometric representation, to achieve native sim-to-real alignment, and integrates an egocentric Kalman filter for high-frequency state estimation, bridging the gap between slow perception (5 Hz with 200 ms latency) and fast control. To improve learning efficiency and robustness, the method introduces an active sampling curriculum guided by Hint Poses, combined with a temporal encoding strategy. Relying solely on an open-vocabulary object detector, the system achieves performance comparable to expert human teleoperation across diverse dynamic manipulation tasks.
📝 Abstract
Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, egocentric vision-based pick-and-place framework. At the core of SigLoMa are Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an egocentric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5 Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
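To make the idea of bridging slow perception and fast control concrete, here is a minimal sketch of one common way to fuse delayed, low-rate detections into a high-rate estimate. It is not SigLoMa's filter: the abstract does not specify the state, motion model, or latency handling, so the one-dimensional constant-velocity model, the 50 Hz control rate, the noise values, and the names `DelayedKalman1D` and `track_constant_velocity_target` are all assumptions chosen for illustration. Only the 5 Hz detection rate and the 200 ms latency come from the text. The filter predicts on every control tick; when a stale detection arrives, it rewinds to the estimate from 200 ms earlier, applies the correction there, and replays the predictions up to the present.

```python
from collections import deque


class DelayedKalman1D:
    """Constant-velocity Kalman filter for one coordinate of a target.

    Hypothetical illustration: predicts every control tick and folds in
    detections that describe the target `latency` seconds in the past.
    """

    def __init__(self, dt=0.02, latency=0.2, q=1.0, r=1e-4):
        self.dt, self.q, self.r = dt, q, r          # tick, process and measurement noise
        self.delay_steps = round(latency / dt)      # 200 ms / 20 ms = 10 ticks
        self.x = [0.0, 0.0]                         # [position, velocity]
        self.P = [1.0, 0.0, 1.0]                    # covariance entries p00, p01, p11
        # Snapshots of (x, P) from `delay_steps` ticks ago up to now.
        self.history = deque([(self.x, self.P)], maxlen=self.delay_steps + 1)

    def _predict(self, x, P):
        dt, q = self.dt, self.q
        (pos, vel), (p00, p01, p11) = x, P
        x_new = [pos + vel * dt, vel]
        # F P F^T + Q for F = [[1, dt], [0, 1]] with white-noise acceleration.
        P_new = [p00 + 2 * dt * p01 + dt * dt * p11 + q * dt ** 3 / 3,
                 p01 + dt * p11 + q * dt ** 2 / 2,
                 p11 + q * dt]
        return x_new, P_new

    def _update(self, x, P, z):
        (pos, vel), (p00, p01, p11) = x, P
        s = p00 + self.r                            # innovation variance
        k0, k1 = p00 / s, p01 / s                   # Kalman gain
        y = z - pos                                 # innovation
        return ([pos + k0 * y, vel + k1 * y],
                [(1 - k0) * p00, (1 - k0) * p01, p11 - k1 * p01])

    def step(self, z=None):
        """Advance one control tick; `z` is a stale detection or None."""
        self.x, self.P = self._predict(self.x, self.P)
        self.history.append((self.x, self.P))
        if z is not None and len(self.history) == self.history.maxlen:
            # Rewind to the snapshot taken when the image was captured.
            x, P = self._update(*self.history[0], z)
            replay = [(x, P)]
            for _ in range(self.delay_steps):       # replay up to the present
                x, P = self._predict(x, P)
                replay.append((x, P))
            self.history = deque(replay, maxlen=self.delay_steps + 1)
            self.x, self.P = x, P
        return self.x[0]


def track_constant_velocity_target(speed=0.5, seconds=5.0):
    """Simulate a 50 Hz loop fed by a 5 Hz detector with 200 ms latency."""
    kf = DelayedKalman1D()
    ticks = round(seconds / kf.dt)
    estimate = 0.0
    for k in range(1, ticks + 1):
        z = None
        if k % 10 == 0:                             # one detection every 10 ticks (5 Hz)
            z = speed * (k - kf.delay_steps) * kf.dt  # where the target was 200 ms ago
        estimate = kf.step(z)
    return estimate, speed * ticks * kf.dt          # (filter estimate, true position)
```

Calling `track_constant_velocity_target()` returns the filter's final estimate alongside the true position, which makes it easy to compare against a naive tracker that simply holds the last detection and therefore lags the target by roughly speed times latency. A real egocentric filter would also have to account for the robot's own motion in the prediction step, which this sketch leaves out.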