DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models

📅 2025-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing challenges in robotic arm manipulation—including multimodal perception, real-time decision-making, and adaptability in human-robot collaboration—this paper proposes a modular, interpretable deep multimodal learning framework. Methodologically, it fuses image sequences (encoded via vision pre-trained models such as VGG16) with robot state data using a late-fusion strategy to jointly regress continuous control actions. Departing from end-to-end black-box and pure reinforcement learning paradigms, the framework integrates lightweight random forest regression to enhance interpretability while preserving real-time inference capability. Evaluated on the BridgeData V2 and Kuka datasets, it achieves mean squared errors of 0.0021 and 0.0028, respectively, demonstrating high accuracy, robustness, and feasibility for edge deployment. The core contribution is the first integration of an efficient, interpretable regressor into a multimodal late-fusion architecture, balancing performance, transparency, and low-latency responsiveness to enable adaptive physical human-robot collaboration.

📝 Abstract
This paper presents a novel deep learning framework for robotic arm manipulation that integrates multimodal inputs using a late-fusion strategy. Unlike traditional end-to-end or reinforcement learning approaches, our method processes image sequences with pre-trained models and robot state data with machine learning algorithms, fusing their outputs to predict continuous action values for control. Evaluated on BridgeData V2 and Kuka datasets, the best configuration (VGG16 + Random Forest) achieved MSEs of 0.0021 and 0.0028, respectively, demonstrating strong predictive performance and robustness. The framework supports modularity, interpretability, and real-time decision-making, aligning with the goals of adaptive, human-in-the-loop cyber-physical systems.
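The late-fusion pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions are assumptions, and random arrays stand in for VGG16 image embeddings, robot proprioceptive state, and continuous action targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for the paper's inputs (all shapes are assumptions):
# pooled embeddings from a pre-trained vision backbone such as VGG16,
# plus a low-dimensional robot state vector (e.g. joint angles).
n_samples, img_dim, state_dim, action_dim = 200, 512, 7, 4
img_feats = rng.normal(size=(n_samples, img_dim))      # placeholder for VGG16 features
robot_state = rng.normal(size=(n_samples, state_dim))  # placeholder for robot state
actions = rng.normal(size=(n_samples, action_dim))     # continuous control targets

# Late fusion: concatenate the per-modality representations, then regress
# continuous action values with a lightweight random forest, whose feature
# importances give a handle on interpretability.
fused = np.concatenate([img_feats, robot_state], axis=1)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(fused, actions)  # multi-output regression over action dimensions

pred = model.predict(fused)
mse = float(np.mean((pred - actions) ** 2))
print(pred.shape, mse)
```

After training, `model.feature_importances_` indicates how much each fused feature (image vs. state dimensions) contributes to the predicted actions, which is the kind of transparency the framework trades on.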
Problem

Research questions and friction points this paper is trying to address.

Develops robotic arm control via multimodal deep learning
Integrates vision and state data for action prediction
Enables modular, interpretable, real-time manipulation decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep multimodal learning with late-fusion strategy
Pre-trained models for image sequence processing
Modular framework supporting real-time decision-making
Sathish Kumar
Department of Computer Science, Cleveland State University, Cleveland, OH, USA
Swaroop Damodaran
Department of Computer Science, Cleveland State University, Cleveland, OH, USA
Naveen Kumar Kuruba
Department of Computer Science, Cleveland State University, Cleveland, OH, USA
Sumit Kumar Jha
University of Florida
Arvind Ramanathan
Argonne National Laboratory
Machine Learning · Computational Biology · Molecular biophysics · enzyme catalysis · higher-order statistics