Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

📅 2025-12-10
🤖 AI Summary
Existing see-through-skin (STS) sensors lack simultaneous multimodal perception and suffer from unreliable tactile tracking, and integrating their signals into learning-based manipulation pipelines remains an open challenge. This work introduces TacThru, an STS sensor for simultaneous tactile-visual perception, and TacThru-UMI, an imitation learning framework that uses these signals for manipulation. TacThru combines a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, enabling simultaneous visual perception and robust tactile signal extraction. TacThru-UMI integrates these signals through a Transformer-based Diffusion Policy. Evaluated on five real-world manipulation tasks, the system achieves an average success rate of 85.5%, outperforming the alternating tactile-visual (66.3%) and vision-only (55.4%) baselines. It performs particularly well at contact detection with thin and soft objects and at precision manipulation requiring multimodal coordination.

📝 Abstract
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
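To put the reported averages in perspective, the margins over each baseline can be worked out directly. The snippet below is a minimal sketch using only the three average success rates quoted above; the dictionary keys and the `margins` helper are illustrative names, and no per-task numbers are assumed.

```python
# Average success rates (%) as reported in the abstract.
success = {
    "TacThru-UMI": 85.5,
    "alternating tactile-visual": 66.3,
    "vision-only": 55.4,
}


def margins(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in %)."""
    abs_gain = round(ours - baseline, 1)
    rel_gain = round(100.0 * (ours - baseline) / baseline, 1)
    return abs_gain, rel_gain


for name in ("alternating tactile-visual", "vision-only"):
    abs_gain, rel_gain = margins(success["TacThru-UMI"], success[name])
    print(f"vs {name}: +{abs_gain} points ({rel_gain}% relative)")
```

The gap is 19.2 percentage points over the alternating baseline and 30.1 points over vision-only. These are averages over five tasks, so they say nothing about how the gain is distributed across individual tasks.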
Problem

Research questions and friction points this paper is trying to address.

Simultaneous tactile-visual perception lacking in current sensors
Unreliable tactile tracking in existing see-through-skin designs
Integrating multimodal signals into learning-based manipulation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneous tactile-visual perception sensor with robust tracking
Transformer-based diffusion policy for multimodal imitation learning
Integration enables precise manipulation in complex real-world tasks
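
As a rough illustration of what a diffusion policy conditioned on both modalities does at inference time, the sketch below runs a DDPM-style reverse sampling loop over a flat action vector, conditioned on concatenated visual and tactile features. It is not the authors' architecture: `toy_noise_predictor` is a hypothetical, untrained stand-in for the Transformer denoiser, and the feature vectors, action length, step count, and beta schedule are all assumed values chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, as in the standard DDPM setup."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def toy_noise_predictor(noisy_actions, step, cond):
    """Hypothetical placeholder for a trained Transformer denoiser.

    A real model would attend over visual and tactile tokens; this one only
    mixes in the mean of the conditioning vector so the example runs.
    The `step` argument is accepted but unused here.
    """
    bias = math.tanh(sum(cond) / len(cond))
    return [0.1 * a + bias for a in noisy_actions]


def sample_actions(visual_feat, tactile_feat, action_len=14, num_steps=50, seed=0):
    """Denoise a random vector into an action vector (DDPM reverse process)."""
    rng = random.Random(seed)
    # Both modalities come from the same time step and are fused by concatenation.
    cond = list(visual_feat) + list(tactile_feat)

    betas = linear_beta_schedule(num_steps)
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(action_len)]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [
            inv_sqrt_alpha * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
            for xi, ei in zip(x, eps)
        ]
    return x


actions = sample_actions(visual_feat=[0.2, -0.1, 0.5], tactile_feat=[0.0, 0.3])
print(len(actions))
```

The point of the sketch is the conditioning path: because the sensor delivers tactile and visual signals simultaneously, both feature sets describe the same instant and can be passed to the denoiser together at every reverse step.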
Authors

Yuyang Li
Institute for AI, Peking University
Robotic Manipulation, Tactile Sensing, Human-Object Interaction

Yinghan Chen
Institute for Artificial Intelligence, Peking University; Beijing Key Lab of Behavior and Mental Health, Peking University; Department of Computer Science and Technology, University of Cambridge; State Key Lab for General Artificial Intelligence

Zihang Zhao
PhD Candidate, Peking University
manipulation, tactile robotics

Puhao Li
Ph.D. Student, Tsinghua University
Computer Vision, Robotics, Machine Learning

Tengyu Liu
Beijing Institute for General Artificial Intelligence
computer vision, human object interaction, human motion generation, grasping

Siyuan Huang
Beijing Institute for General Artificial Intelligence; State Key Lab for General Artificial Intelligence

Yixin Zhu
Assistant Professor, Peking University
Computer Vision, Visual Reasoning, Human-Robot Teaming