Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

πŸ“… 2025-07-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address catastrophic forgetting and parameter interference in video-language continual learning, this paper proposes Bisecle, a novel framework inspired by hippocampal memory mechanisms. Bisecle is the first to introduce rapid binding and pattern separation into multimodal video-language continual learning. It employs multi-directional supervision and contrastive prompt learning to jointly model cross-modal relationships while isolating task-specific knowledge. Leveraging a frozen backbone combined with parameter-efficient fine-tuning, Bisecle mitigates forgetting without incurring significant computational overhead. Extensive experiments on multiple VideoQA benchmarks demonstrate that Bisecle substantially improves both backward transfer (i.e., retention of old-task performance) and forward transfer (i.e., generalization to new tasks), validating its effectiveness and robustness in dynamic video-stream scenarios.
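The frozen-backbone strategy described in the summary can be illustrated with a toy sketch (plain NumPy, not the paper's code): all backbone weights stay fixed, and only a small prompt vector receives gradient updates, so continual adaptation touches a tiny fraction of the parameters. The linear "backbone" and the additive prompt here are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

# Toy "backbone": a fixed linear map; only the prompt is trainable.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # frozen backbone weights
prompt = np.zeros(4)          # small trainable prompt (PEFT-style)
x, y = rng.normal(size=4), rng.normal(size=4)

def loss(prompt):
    # The prompt is added to the input before the frozen backbone.
    pred = W @ (x + prompt)
    return 0.5 * np.sum((pred - y) ** 2)

# One manual gradient step on the prompt only; W is never updated.
grad = W.T @ (W @ (x + prompt) - y)
W_before = W.copy()
prompt = prompt - 0.01 * grad
```

After the step the loss drops while the backbone is bit-for-bit unchanged, which is the essence of parameter-efficient continual adaptation.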

πŸ“ Abstract
Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Given the prohibitive computational cost of fine-tuning models on new tasks, typically only a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid binding and pattern separation mechanisms in the hippocampus, in this work we propose Bisecle for video-language continual learning, where a multi-directional supervision module captures more cross-modal relationships and a contrastive prompt learning scheme isolates task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Address catastrophic forgetting in continual learning for video understanding
Mitigate update conflicts in parameter-efficient multimodal foundation models
Enhance cross-task generalization in video-language continual learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-directional supervision captures cross-modal relationships
Contrastive prompt learning isolates task-specific knowledge
Binding and separation processes enhance memory retention
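As a rough illustration of the pattern-separation idea behind contrastive prompt learning, a hypothetical sketch (not the authors' implementation): task-specific prompts can be driven apart by penalizing their pairwise cosine similarity, so knowledge for one task is stored away from the others.

```python
import numpy as np

def separation_loss(prompts):
    """Mean pairwise cosine similarity between task prompts.

    Lower is better: a contrastive objective of this shape pushes
    prompts of different tasks apart, isolating task knowledge.
    """
    P = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    sim = P @ P.T                          # cosine similarity matrix
    n = len(prompts)
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop self-similarity
    return off_diag.mean()

# Overlapping prompts score worse (higher) than well-separated ones.
overlapping = np.array([[1.0, 0.1], [1.0, -0.1]])
separated = np.array([[1.0, 0.0], [0.0, 1.0]])
print(separation_loss(overlapping) > separation_loss(separated))  # True
```

Minimizing such a term alongside the task loss is one simple way to realize the hippocampus-inspired separation of stored experiences that the bullets above describe.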
πŸ”Ž Similar Papers
No similar papers found.