Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the engineering bottlenecks in multimodal continual instruction tuning (MCIT): existing methods typically require intrusive modifications to the base model code, which leads to implementation complexity, architectural fragmentation, and difficulty in reproducing and fairly comparing results. To overcome these limitations, the authors propose Prism, the first plug-and-play, reproducible framework specifically designed for MCIT. Through modular decoupling and a registration mechanism, the framework cleanly separates algorithmic development from the backbone multimodal large language model, so new strategies can be integrated without altering the original model code. The framework is compatible with mainstream large-scale training pipelines, substantially lowers the barrier to entry, makes methods easier to compare, and improves experimental reproducibility. It is publicly released to accelerate the development and evaluation of novel MCIT approaches.
📝 Abstract
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipelines, enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.
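To make the plugin registration idea concrete, here is a minimal Python sketch of how a registry can keep a continual-learning strategy separate from the backbone. It is an illustration only and is not taken from the Prism repository: the names `PLUGIN_REGISTRY`, `register_plugin`, `ContinualPlugin`, `Backbone`, `TaskLogPlugin`, and `run_continual`, as well as the `before_task`/`after_task` hooks, are all hypothetical.

```python
from typing import Callable, Dict, List, Type


class ContinualPlugin:
    """Hypothetical base class: every hook defaults to a no-op."""

    def before_task(self, model: "Backbone", task_id: int) -> None:
        pass

    def after_task(self, model: "Backbone", task_id: int) -> None:
        pass


# Hypothetical global registry mapping a name to a plugin class.
PLUGIN_REGISTRY: Dict[str, Type[ContinualPlugin]] = {}


def register_plugin(name: str) -> Callable[[Type[ContinualPlugin]], Type[ContinualPlugin]]:
    """Class decorator that records a plugin class under `name`."""

    def decorator(cls: Type[ContinualPlugin]) -> Type[ContinualPlugin]:
        if name in PLUGIN_REGISTRY:
            raise ValueError(f"plugin {name!r} is already registered")
        PLUGIN_REGISTRY[name] = cls
        return cls

    return decorator


class Backbone:
    """Stand-in for an MLLM; plugin code never edits this class."""

    def __init__(self) -> None:
        self.trained_tasks: List[int] = []

    def fit(self, task_id: int) -> None:
        # Placeholder for real training on one task.
        self.trained_tasks.append(task_id)


@register_plugin("task_log")
class TaskLogPlugin(ContinualPlugin):
    """Toy strategy: records when each task starts and ends."""

    def __init__(self) -> None:
        self.events: list = []

    def before_task(self, model: Backbone, task_id: int) -> None:
        self.events.append(("start", task_id))

    def after_task(self, model: Backbone, task_id: int) -> None:
        self.events.append(("end", task_id, len(model.trained_tasks)))


def run_continual(model: Backbone, plugin_name: str, num_tasks: int) -> ContinualPlugin:
    """Train on tasks in sequence, calling the plugin's hooks around each one."""
    plugin = PLUGIN_REGISTRY[plugin_name]()
    for task_id in range(num_tasks):
        plugin.before_task(model, task_id)
        model.fit(task_id)
        plugin.after_task(model, task_id)
    return plugin
```

In this sketch, adding a new strategy means writing one decorated class and selecting it by name, for example `run_continual(Backbone(), "task_log", 2)`; the `Backbone` class and the training loop stay unchanged. How Prism actually defines its hooks and registry should be checked against the repository.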
Problem

Research questions and friction points this paper is trying to address.

Multimodal Continual Instruction Tuning
Engineering Bottlenecks
Code Reusability
Fair Comparison
Structural Fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Continual Instruction Tuning
Plugin-based Architecture
Reproducible Infrastructure
Scalable Training
Codebase Decoupling
Authors
Jun-Tao Tang
School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Yu-Cheng Shi
National Key Laboratory for Novel Software Technology, Nanjing University, China
Zhen-Hao Xie
School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Da-Wei Zhou
Associate Researcher, Nanjing University
Incremental Learning, Continual Learning, Open-World Learning, Model Reuse