HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles in continual video question answering (VideoQA) with multimodal LLMs: interference between tasks and the memory cost of storing task-specific prompts. The authors propose HyperTokens, a Transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates under a fixed memory budget. To limit forgetting, meta-inspired regularizers look ahead to avoid sharp, task-specific optimization directions and anchor the evolving generator to prior tasks; the authors relate this objective to sharpness-aware optimization, which helps explain why it favors flatter cross-task minima. The method also uses lightweight auxiliary multimodal supervision through shared generation weights, with causally motivated objectives and surrogate mutual-information losses that regularize anti-causal cross-modal directions. On two standard continual VideoQA benchmarks, HyperTokens reports higher average accuracy with substantially lower forgetting, and it transfers robustly under a newly introduced cross-modal ImageQA-to-VideoQA continual protocol.
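To make the "tokens on demand under a fixed memory budget" idea concrete, here is a minimal sketch. It is not the authors' architecture: the Transformer generator is replaced by a single shared linear map for brevity, and the class name `TokenGenerator`, the `task_code` conditioning vector, and all dimensions are hypothetical. The point it illustrates is only that prompt tokens are computed from shared weights when needed rather than stored per task.

```python
import random


class TokenGenerator:
    """Hypothetical stand-in for a token generator.

    Assumption: one shared linear map replaces the paper's Transformer,
    and a small task code is used as the conditioning input.
    """

    def __init__(self, code_dim, num_tokens, token_dim, seed=0):
        rng = random.Random(seed)
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        out_dim = num_tokens * token_dim
        # Shared weights: their size does not depend on the number of tasks.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(code_dim)]
            for _ in range(out_dim)
        ]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0])

    def generate(self, task_code):
        # Compute the prompt tokens on demand; nothing is cached per task.
        flat = [sum(w * c for w, c in zip(row, task_code)) for row in self.weight]
        d = self.token_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.num_tokens)]


gen = TokenGenerator(code_dim=4, num_tokens=3, token_dim=8)
tokens_task_a = gen.generate([1.0, 0.0, 0.0, 0.0])  # hypothetical task code
tokens_task_b = gen.generate([0.0, 1.0, 0.0, 0.0])  # a second hypothetical task
```

In this toy setup the two tasks receive different tokens while the generator's parameter count stays the same, which is the memory property the summary describes. How the actual method conditions its generator is not specified in this summary.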

📝 Abstract

Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA→VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
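The abstract links the proposed regularisers to sharpness-aware optimisation. As background only, the sketch below shows one generic sharpness-aware (SAM-style) update on a toy two-parameter loss. It is not the paper's objective; the loss, the function names, and the values of `lr` and `rho` are illustrative assumptions. The step first looks ahead to an approximate worst-case point within radius `rho`, then descends using the gradient evaluated there, which penalises sharp directions more strongly than plain gradient descent would.

```python
import math


def loss(w):
    # Toy loss (assumption): sharp along w[0], flat along w[1].
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], 0.1 * w[1]]


def sam_step(w, lr=0.05, rho=0.1):
    """One generic sharpness-aware update (not the paper's regulariser)."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against zero gradient
    # Look ahead: perturb the weights towards the locally worst direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend from the original weights using the look-ahead gradient.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w0 = [1.0, 1.0]
w1 = sam_step(w0)
```

On this toy problem a single step lowers the loss, and most of the movement happens along the sharp coordinate `w[0]`, since the look-ahead gradient is largest there.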

Problem

Research questions and friction points this paper is trying to address.

Continual VideoQA

task interference

catastrophic forgetting

multimodal LLMs

prompt storage

Innovation

Methods, ideas, or system contributions that make the work stand out.

HyperTokens

continual learning

video-language understanding