🤖 AI Summary
This work addresses the limitations of existing continual learning benchmarks for video large language models (VLLMs), which suffer from task redundancy and neglect pre-trained capabilities, and thus fail to effectively evaluate catastrophic forgetting. To this end, we propose CL-VISTA, the first low-redundancy, high-distribution-shift continual learning benchmark tailored for VLLMs, encompassing eight diverse tasks spanning perception, understanding, and reasoning, along with six evaluation protocols. Leveraging a multi-task curation strategy, a distribution-shift induction mechanism, and a three-dimensional evaluation framework assessing performance, computation, and memory, we systematically benchmark ten state-of-the-art algorithms. Our experiments reveal that current methods struggle to balance anti-forgetting robustness, generalization, and resource efficiency, highlighting CL-VISTA's ability to distinguish genuine intelligence gains from task-specific overfitting.
📝 Abstract
Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding with Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
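To make the notion of "exposing catastrophic forgetting" concrete, the sketch below computes two standard continual-learning metrics, average accuracy and average forgetting, from a task-accuracy matrix. This is an illustrative example of generic CL evaluation, not CL-VISTA's actual protocols; the function names, the toy numbers, and the exact metric definitions are assumptions.

```python
# Illustrative sketch (not CL-VISTA's actual protocol): standard
# continual-learning metrics over a sequence of T tasks, where
# acc[i][j] = accuracy on task j measured after training on tasks 0..i.

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    T = len(acc)
    return sum(acc[T - 1][j] for j in range(T)) / T

def average_forgetting(acc):
    """Mean drop from each task's best earlier accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the final task has not yet had a chance to be forgotten
        best = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best - acc[T - 1][j])
    return sum(drops) / len(drops)

# Hypothetical 3-task run: accuracy on earlier tasks degrades as new tasks arrive.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.85, 0.00],
    [0.60, 0.75, 0.90],
]
print(round(average_accuracy(acc), 3))   # 0.75
print(round(average_forgetting(acc), 3)) # 0.15
```

A benchmark with high task redundancy would show near-zero drops in this matrix even for weak CL methods, which is precisely the failure mode the diverse task suite is designed to avoid.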