COLT: Enhancing Video Large Language Models with Continual Tool Usage

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large language models (V-LLMs) operate under a static tool library assumption, rendering them ill-suited for real-world scenarios involving continuously evolving tools and streaming inputs—leading to poor generalization and catastrophic forgetting. To address this, we propose a Dynamic Learnable Tool Codebook: a dedicated memory module enabling incremental injection of novel tools while preserving stable representations of historical ones. We further introduce an instruction-similarity-driven dynamic tool retrieval mechanism and a continual learning optimization strategy. Additionally, we construct VideoToolBench—the first video-oriented benchmark for tool usage evaluation. Extensive experiments on multiple V-LLM benchmarks and VideoToolBench demonstrate significant improvements in tool selection accuracy and continual adaptability. Our approach achieves, for the first time, efficient and robust tool utilization by open-source V-LLMs under continuous, streaming tool updates.

📝 Abstract
The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering 'catastrophic forgetting' of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
Problem

Research questions and friction points this paper is trying to address.

Video LLMs struggle with evolving tool streams
Existing methods cannot handle perpetually changing tool data
Need to prevent catastrophic forgetting of learned tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable tool codebook as memory system
Dynamic tool selection via instruction-tool similarity
Continual learning without catastrophic forgetting
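The codebook-plus-retrieval idea in the bullets above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: COLT uses learned tool embeddings inside a video LLM, whereas here the tool names, the hand-picked feature vectors, and the cosine scoring are all assumptions made for demonstration. The key property it shows is that new tools are injected incrementally without modifying existing entries.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either is zero-length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class ToolCodebook:
    """Toy tool codebook: maps each tool name to a feature vector.

    New tools are added incrementally; existing entries are never
    rewritten, which is the property that avoids forgetting old tools.
    """
    def __init__(self):
        self.entries = {}  # tool name -> feature vector

    def add_tool(self, name, feature):
        # Incremental injection of a novel tool.
        self.entries[name] = feature

    def retrieve(self, instruction_feature, top_k=1):
        # Rank tools by similarity between the instruction feature
        # and each tool's codebook feature; return the top-k names.
        scored = sorted(
            self.entries.items(),
            key=lambda kv: cosine(instruction_feature, kv[1]),
            reverse=True,
        )
        return [name for name, _ in scored[:top_k]]

codebook = ToolCodebook()
codebook.add_tool("caption", [1.0, 0.0])  # hypothetical tools and vectors
codebook.add_tool("detect", [0.0, 1.0])
print(codebook.retrieve([0.9, 0.1]))  # → ['caption']
```

In the actual method, the instruction feature would come from the LLM's encoding of the user query and the codebook entries would be trainable, but the selection step reduces to this kind of nearest-neighbor lookup.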
Authors

Yuyang Liu
Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, United Arab Emirates

Xinyuan Shi
Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, United Arab Emirates

Bang Yang
Peng Cheng Laboratory
Immersive Technology, Multimodal Learning, AI in Healthcare

Peilin Zhou
HKUST; Peking University
Sequential Recommendation, Natural Language Processing

Jiahua Dong
Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, United Arab Emirates

Long Chen
Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, United Arab Emirates

Ian Reid
Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, United Arab Emirates

Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer Vision, Embodied AI, Machine Learning