🤖 AI Summary
This work addresses the limitations of existing continual learning benchmarks for video large language models (VLLMs), which suffer from task redundancy and neglect pre-trained capabilities, and thus fail to effectively evaluate catastrophic forgetting. To this end, we propose CL-VISTA, the first low-redundancy, high-distribution-shift continual learning benchmark tailored for VLLMs, encompassing eight diverse tasks spanning perception, understanding, and reasoning, along with six evaluation protocols. Leveraging a multi-task curation strategy, a distribution-shift induction mechanism, and a three-dimensional evaluation framework assessing performance, computation, and memory, we systematically benchmark ten state-of-the-art algorithms. Our experiments reveal that current methods struggle to balance anti-forgetting robustness, generalization, and resource efficiency, highlighting CL-VISTA's ability to distinguish genuine intelligence gains from task-specific overfitting.
📝 Abstract
Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding with Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
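To make the notion of "exposing catastrophic forgetting" concrete, the sketch below computes two standard continual-learning metrics, average accuracy and average forgetting, from a task-accuracy matrix. This is an illustrative example of generic CL evaluation, not CL-VISTA's actual protocols; the function names, the toy numbers, and the exact metric definitions are assumptions.

```python
# Illustrative sketch (not CL-VISTA's actual protocol): standard
# continual-learning metrics over a sequence of T tasks, where
# acc[i][j] = accuracy on task j measured after training on tasks 0..i.

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    T = len(acc)
    return sum(acc[T - 1][j] for j in range(T)) / T

def average_forgetting(acc):
    """Mean drop from each task's best earlier accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the final task has not yet had a chance to be forgotten
        best = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best - acc[T - 1][j])
    return sum(drops) / len(drops)

# Hypothetical 3-task run: accuracy on earlier tasks degrades as new tasks arrive.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.85, 0.00],
    [0.60, 0.75, 0.90],
]
print(round(average_accuracy(acc), 3))   # 0.75
print(round(average_forgetting(acc), 3)) # 0.15
```

A benchmark with high task redundancy would show near-zero drops in this matrix even for weak CL methods, which is precisely the failure mode the diverse task suite is designed to avoid.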