VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of fully fine-tuning pre-trained models, their excessive parameter overhead, and insufficient dynamic modeling in RGB-Event multimodal recognition, this paper proposes the first Parameter-Efficient Fine-Tuning (PEFT) framework tailored for Vision Transformers (ViTs) on this task. Methodologically, the authors design a two-stage LoRA architecture that jointly leverages modality-specific and modality-shared adaptation, and introduce a frame-difference network to explicitly encode motion cues from event streams, enabling efficient joint representation and Transformer-based fusion of RGB frames and event data. The contributions are threefold: (1) the first systematic application of PEFT to RGB-Event recognition; (2) a novel two-stage collaborative LoRA mechanism; and (3) integration of frame-difference-derived motion priors to enhance dynamic perception. Experiments show that fewer than 0.5% of the parameters are trainable while achieving state-of-the-art accuracy across multiple benchmarks and a 3.2× inference speedup. Code and models are publicly available.

📝 Abstract
Pattern recognition leveraging both RGB and Event cameras can significantly enhance performance by deploying deep neural networks with a fine-tuning strategy. Inspired by the success of large models, introducing such models can further improve the performance of multi-modal tasks. However, fully fine-tuning these models is inefficient, and lightweight fine-tuning methods such as LoRA and Adapter have been proposed to achieve a better balance between efficiency and performance. To our knowledge, no existing work has conducted parameter-efficient fine-tuning (PEFT) for RGB-Event recognition based on pre-trained foundation models. To address this issue, this paper proposes a novel PEFT strategy to adapt pre-trained vision foundation models for RGB-Event-based classification. Specifically, given the RGB frames and event streams, we extract RGB and event features with the vision foundation model ViT using a modality-specific LoRA tuning strategy. The frame difference of the dual modalities is also considered to capture motion cues via a frame-difference backbone network. These features are concatenated and fed into high-level Transformer layers for efficient multi-modal feature learning via modality-shared LoRA tuning. Finally, the resulting features are concatenated and fed into a classification head to achieve efficient fine-tuning. The source code and pre-trained models will be released at https://github.com/Event-AHU/VELoRA.
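The two-stage LoRA scheme described above can be sketched as follows. This is a minimal numpy toy, not the paper's implementation: dimensions, the `lora_linear` helper, and the use of a token difference as a stand-in for the frame-difference backbone are all illustrative assumptions. It shows the core LoRA idea (a frozen weight plus a trainable low-rank update, with the B factor zero-initialized so training starts from the pre-trained model) and how modality-specific and modality-shared adapters share one frozen ViT weight.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_linear(x, W, A, B, alpha=8):
    # Frozen pre-trained weight W plus trainable low-rank update B @ A,
    # scaled by alpha / r as in standard LoRA.
    r = A.shape[0]
    return x @ W.T + ((x @ A.T) @ B.T) * (alpha / r)

d, r = 16, 4                               # toy feature dim and LoRA rank
W = rng.standard_normal((d, d))            # frozen ViT projection (shared)

# Stage 1: modality-specific LoRA (separate A/B per modality, B zero-init)
A_rgb, B_rgb = rng.standard_normal((r, d)), np.zeros((d, r))
A_evt, B_evt = rng.standard_normal((r, d)), np.zeros((d, r))
# Stage 2: modality-shared LoRA for the high-level fusion layers
A_sh, B_sh = rng.standard_normal((r, d)), np.zeros((d, r))

rgb_tok = rng.standard_normal((1, d))      # stand-in for a ViT RGB token
evt_tok = rng.standard_normal((1, d))      # stand-in for an event token
diff_tok = rgb_tok - evt_tok               # crude motion cue; the paper uses
                                           # a dedicated frame-difference net

h_rgb = lora_linear(rgb_tok, W, A_rgb, B_rgb)
h_evt = lora_linear(evt_tok, W, A_evt, B_evt)
h_diff = lora_linear(diff_tok, W, A_sh, B_sh)
fused = np.concatenate([h_rgb, h_evt, h_diff], axis=-1)  # -> classifier head
```

Because B is zero-initialized, each adapted layer initially reproduces the frozen path exactly; only the small A/B factors (rank r per adapter) would receive gradients, which is where the sub-0.5% trainable-parameter budget comes from.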
Problem

Research questions and friction points this paper is trying to address.

Pre-trained Model
RGB and Event Information Recognition
Efficient Fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

VELoRA
Efficient Fine-tuning
RGB-Event Classification
Langlang Chen
School of Electronic and Information Engineering, Anhui University, Hefei 230601, China
Haoxiang Yang
The Chinese University of Hong Kong, Shenzhen
Stochastic Optimization, Energy Systems
Pengpeng Shao
Tsinghua University, Beijing, China
Haoyu Song
School of Computer Science and Technology, Anhui University, Hefei, China
Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei, China
Zhicheng Zhao
Associate Professor at the School of Artificial Intelligence, Anhui University
Computer Vision
Yaowei Wang
The Hong Kong Polytechnic University
Yonghong Tian
Peng Cheng Laboratory, Shenzhen, China, and National Engineering Laboratory for Video Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing, China