CapsDT: Diffusion-Transformer for Capsule Robot Manipulation

📅 2025-06-19
🤖 AI Summary
This study addresses the challenges of unintuitive human–robot interaction and low operational efficiency in gastrointestinal capsule robots. We propose the first vision–language–action (VLA) end-to-end control model designed specifically for gastric environments. Methodologically, we introduce the Diffusion Transformer into the control framework of miniature untethered medical robots for the first time, integrating Vision Transformers (ViT) with large language models to enable multimodal command understanding; we further develop a gastric physical simulation platform and a four-level task dataset. Our contributions are: (1) bridging the gap in VLA research for untethered microrobots, and (2) proposing a generalizable paradigm for generating motion signals for magnetically actuated capsules. Experiments demonstrate state-of-the-art performance on gastric simulation tasks and a 26.25% success rate under realistic environment emulation, significantly outperforming existing baselines.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. Integrating VLA models into endoscopy robots allows more intuitive and efficient interaction between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT can infer the corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, in which a capsule robot is controlled by a robotic-arm-held magnet, addressing four endoscopy tasks of different difficulty levels and creating corresponding capsule robot datasets within a stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, reaching state-of-the-art performance across task levels and a 26.25% success rate in real-world simulation manipulation.
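To make the abstract's pipeline concrete, the sketch below mimics the general shape of a diffusion-policy inference loop: vision and language embeddings condition an iterative denoiser that turns Gaussian noise into an action vector. This is a minimal toy illustration, not the paper's implementation; all dimensions, the linear "denoiser" weights, and the placeholder encoders are assumptions, and in CapsDT the encoders would be a ViT and a language model feeding a Diffusion Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: vision/text embedding sizes, action size,
# and number of reverse-diffusion steps (none of these come from the paper).
VIS, TXT, ACT, T = 32, 16, 6, 10
CTX = VIS + TXT

# Frozen random weights standing in for a trained Diffusion Transformer head.
W = rng.standard_normal((ACT, ACT + CTX + 1)) * 0.05

def toy_denoiser(noisy_action, context, t):
    """Predict the noise component; a linear stand-in for the transformer."""
    inp = np.concatenate([noisy_action, context, [t / T]])
    return W @ inp

def infer_action(vision_emb, text_emb, steps=T):
    """Reverse-diffusion loop: start from pure noise, refine toward an action."""
    context = np.concatenate([vision_emb, text_emb])
    action = rng.standard_normal(ACT)               # noise at t = steps
    for t in range(steps, 0, -1):
        eps_hat = toy_denoiser(action, context, t)
        action = action - (1.0 / steps) * eps_hat   # simplified update rule
    return action

vision_emb = rng.standard_normal(VIS)  # placeholder for a ViT frame embedding
text_emb = rng.standard_normal(TXT)    # placeholder for an instruction embedding
signal = infer_action(vision_emb, text_emb)
print(signal.shape)  # (6,)
```

In a real system the 6-dimensional output could correspond to, e.g., a magnet pose command for the robotic arm, and the denoiser would be a transformer trained on the task datasets described above.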
Problem

Research questions and friction points this paper is trying to address.

Exploring VLA models for endoscopy capsule robot performance
Integrating VLA models to enhance human-robot interaction in endoscopy
Developing CapsDT for robotic control in stomach endoscopy tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer model for capsule robots
Vision-Language-Action integration for endoscopy
Magnet-controlled robotic arm for capsule manipulation
Xiting He
Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China; and also with the CUHK Shenzhen Research Institute, Shenzhen, China
Mingwu Su
Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China; and also with the CUHK Shenzhen Research Institute, Shenzhen, China
Xinqi Jiang
Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China; and also with the CUHK Shenzhen Research Institute, Shenzhen, China
Long Bai
Research Assistant, Institute of Computing Technology, Chinese Academy of Sciences
Event-Centric Analysis · Knowledge Graph · Natural Language Processing
Jiewen Lai
CUHK
Medical Mechatronics · Continuum Robots · Soft Robotics · Robot Control
Hongliang Ren
Chinese University of Hong Kong | National University of Singapore | JHU/Harvard(RF) | CUHK(PhD)
Biorobotics & intelligent systems · medical mechatronics · continuum/soft flexible robots/sensors · multisensory perception