🤖 AI Summary
This work addresses the engineering bottlenecks in multimodal continual instruction tuning (MCIT). Existing methods typically require intrusive modifications to the base model code, which adds implementation complexity, fragments architectures across methods, and makes reproduction and fair comparison difficult. To overcome these limitations, the authors propose Prism, a plug-and-play, reproducible codebase designed specifically for MCIT. Through modular decoupling and a plugin registration mechanism, Prism separates algorithmic development from the backbone multimodal large language model, so new strategies can be integrated without altering the original model code. Prism is compatible with widely used large-scale training pipelines, which lowers the barrier to entry, makes methods easier to compare, and improves experimental reproducibility. The code is publicly released to accelerate the development and evaluation of new MCIT approaches.
📝 Abstract
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plugin-based, reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, so that new strategies can be integrated as independent plugins without modifying the underlying MLLM codebase. This eliminates structural fragmentation and accelerates method development. Prism natively supports widely used large-scale training pipelines, enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.