Multi-Modal Gesture Recognition from Video and Surgical Tool Pose Information via Motion Invariants

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient accuracy in gesture recognition for robot-assisted surgery, this paper proposes a multimodal relational graph network that integrates video, surgical-instrument pose, and geometric motion invariants, specifically curvature and torsion. It is the first work to incorporate differential-geometric motion invariants into surgical gesture modeling, uncovering the intrinsic geometric structure of instrument trajectories and overcoming the limitations of conventional pose-only representations (e.g., position and quaternion). The authors design a tri-stream feature-fusion architecture (video, pose, and invariants) coupled with a relational graph neural network to enable frame-level real-time recognition. Evaluated on the JIGSAWS suturing dataset, the method achieves 90.3% frame-wise accuracy, substantially outperforming baseline approaches and demonstrating that geometry-aware modeling significantly enhances the robustness and discriminative power of surgical gesture recognition.
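Curvature and torsion are the classical differential-geometric invariants of a space curve. Below is a minimal sketch, assuming a uniformly sampled tool-tip trajectory, of how both can be estimated with finite differences; the function name and numerical details are illustrative, not taken from the paper's code.

```python
import numpy as np

def curvature_torsion(positions: np.ndarray, dt: float):
    """Estimate curvature and torsion along a discrete 3D tool-tip
    trajectory (shape [T, 3]) sampled every dt seconds.

    kappa = |r' x r''| / |r'|^3
    tau   = (r' x r'') . r''' / |r' x r''|^2
    """
    # First, second, and third time derivatives of position.
    d1 = np.gradient(positions, dt, axis=0)
    d2 = np.gradient(d1, dt, axis=0)
    d3 = np.gradient(d2, dt, axis=0)

    cross = np.cross(d1, d2)                    # r' x r'' per frame
    speed = np.linalg.norm(d1, axis=1)          # |r'|
    cross_norm = np.linalg.norm(cross, axis=1)  # |r' x r''|

    eps = 1e-9  # avoid division by zero on near-straight segments
    kappa = cross_norm / np.maximum(speed**3, eps)
    tau = np.einsum("ij,ij->i", cross, d3) / np.maximum(cross_norm**2, eps)
    return kappa, tau
```

As a sanity check, on a helix r(t) = (a cos t, a sin t, b t) both signals are constant: kappa = a / (a^2 + b^2) and tau = b / (a^2 + b^2).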

📝 Abstract
Recognizing surgical gestures in real time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. Current robotic surgical systems provide rich multi-modal data such as video and kinematics. While some recent works in multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representations. Therefore, we propose combining motion-invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between the different data streams. We show that gesture recognition improves when invariant signals are combined with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion-invariant signals coupled with position represent gesture motion better than traditional position-and-quaternion representations, highlighting the need for geometry-aware modeling of kinematics in gesture recognition.
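The abstract's contrast between position-plus-quaternion and position-plus-invariant representations can be made concrete with a small feature-construction sketch. Function names and array shapes here are assumptions for illustration, not the paper's preprocessing code.

```python
import numpy as np

def pose_quaternion_features(positions: np.ndarray,
                             quaternions: np.ndarray) -> np.ndarray:
    """Baseline per-frame kinematics: [x, y, z, qx, qy, qz, qw] -> [T, 7]."""
    return np.concatenate([positions, quaternions], axis=1)

def position_invariant_features(positions: np.ndarray,
                                kappa: np.ndarray,
                                tau: np.ndarray) -> np.ndarray:
    """Geometry-aware per-frame kinematics: position plus the curvature
    and torsion signals (e.g., from the sketch above) -> [T, 5]."""
    return np.concatenate([positions, kappa[:, None], tau[:, None]], axis=1)
```

Either feature matrix can then be fed frame by frame to the kinematics stream of the recognition network.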
Problem

Research questions and friction points this paper is trying to address.

Recognizing surgical gestures in real time as a step toward automated activity recognition and surgical automation.
Improving gesture recognition by adding motion-invariant measures (curvature and torsion).
Enhancing neural networks with geometry-aware modeling of kinematics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines motion invariants (curvature and torsion) with vision and kinematics data
Uses a relational graph network to integrate the data streams (see the sketch after this list)
Improves gesture recognition through geometry-aware modeling
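As a rough illustration of the relational-graph integration idea, the sketch below treats the three streams (video, pose, invariants) as nodes of a small fully connected graph and runs one round of pairwise message passing. The layer sizes, residual update, and single message-passing round are assumptions for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RelationalGraphFusion(nn.Module):
    """Tri-stream fusion sketch: each modality's per-frame feature
    becomes one graph node; nodes exchange messages over all ordered
    pairs, then are pooled for frame-level gesture classification."""

    def __init__(self, d_video: int, d_pose: int, d_inv: int,
                 d_hidden: int = 128, n_gestures: int = 10):
        super().__init__()
        # Per-stream encoders project each modality to a shared node space.
        self.enc_video = nn.Linear(d_video, d_hidden)
        self.enc_pose = nn.Linear(d_pose, d_hidden)
        self.enc_inv = nn.Linear(d_inv, d_hidden)
        # Relation function applied to each ordered node pair.
        self.relation = nn.Sequential(
            nn.Linear(2 * d_hidden, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(d_hidden, n_gestures)

    def forward(self, video, pose, inv):
        # nodes: [B, 3, D] -- one node per data stream.
        nodes = torch.stack(
            [self.enc_video(video), self.enc_pose(pose), self.enc_inv(inv)],
            dim=1)
        B, N, D = nodes.shape
        # Messages over all ordered pairs (i -> j), mean-aggregated per node.
        src = nodes.unsqueeze(2).expand(B, N, N, D)
        dst = nodes.unsqueeze(1).expand(B, N, N, D)
        messages = self.relation(torch.cat([src, dst], dim=-1))  # [B,N,N,D]
        nodes = nodes + messages.mean(dim=2)   # residual node update
        return self.classifier(nodes.mean(dim=1))  # per-frame gesture logits
```

For per-frame inputs of shape [B, d_video], [B, d_pose], and [B, d_inv], the module returns gesture logits of shape [B, n_gestures], so it can be evaluated frame by frame as required for real-time recognition.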
Jumanh K. Atoum
Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA
Garrison L. H. Johnston
Department of Mechanical Engineering, Vanderbilt University, Nashville, TN 37235, USA
Nabil Simaan
Professor of Mechanical Engineering, Computer Science and Otolaryngology
medical robotics, robotics, continuum robots, surgical robotics, mechanisms
Jie Ying Wu
Assistant Professor in CS, Vanderbilt University
Medical Robotics, Modelling and Simulation, Machine Learning, Telerobotics