BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

📅 2026-04-13

🤖 AI Summary

Existing video multimodal large language models that ground objects by serializing bounding box coordinates as text suffer from a modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. The authors instead render colored bounding boxes and trajectory trails directly onto video frames as visual prompts, keeping only a concise color-to-object legend as text. This cuts text token usage by 87–93%, preserves full temporal resolution, and encodes inter-frame motion direction and speed within each keyframe. On five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, and IntentQA), the method surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks.
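To see why the text-coordinate paradigm is costly, the following minimal sketch compares the two kinds of prompt text. The object names, box values, color names, and serialization format are all hypothetical and are not taken from the paper. Word counts are only a rough stand-in for tokenizer counts, so the sketch does not attempt to reproduce the reported 87–93% figure. It only illustrates that coordinate text grows with the number of frames while a legend does not.

```python
# Hypothetical tracks: object name -> list of (x1, y1, x2, y2) boxes, one per frame.
def make_tracks(num_frames):
    return {
        "person": [(10 + t, 20, 40 + t, 50) for t in range(num_frames)],
        "dog": [(100, 60 + t, 130, 90 + t) for t in range(num_frames)],
    }


def text_coordinate_prompt(tracks):
    """Serialize every box of every frame as text (the paradigm being replaced)."""
    lines = []
    for name, boxes in tracks.items():
        for t, (x1, y1, x2, y2) in enumerate(boxes):
            lines.append(f"frame {t}: {name} [{x1}, {y1}, {x2}, {y2}]")
    return "\n".join(lines)


def legend_prompt(tracks, color_names):
    """Keep only a color-to-object legend; the boxes live in the pixels instead."""
    return "\n".join(f"{color}: {name}" for color, name in zip(color_names, tracks))


tracks = make_tracks(16)
coord_text = text_coordinate_prompt(tracks)
legend_text = legend_prompt(tracks, ["red", "blue"])

# Whitespace-separated words as a crude proxy for text length (not real tokens).
print(len(coord_text.split()), "words of coordinates vs",
      len(legend_text.split()), "words of legend")
```

Because the legend has one line per object rather than one per object per frame, its length stays fixed however many frames are sampled, which is the property that removes the pressure to downsample in time.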

📝 Abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
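The visual prompting step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the frame representation (nested lists of RGB tuples), the helper names, the `trail_len` parameter, and the example track are all assumptions made for illustration. A real pipeline would more likely draw on decoded video frames with an imaging library.

```python
def blank_frame(width, height, color=(0, 0, 0)):
    """A frame as rows of RGB tuples, indexed frame[y][x]."""
    return [[color for _ in range(width)] for _ in range(height)]


def set_pixel(frame, x, y, color):
    # Ignore writes that fall outside the frame.
    if 0 <= y < len(frame) and 0 <= x < len(frame[0]):
        frame[y][x] = color


def draw_box(frame, box, color):
    """Draw a one-pixel box outline with inclusive corners (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    for x in range(x1, x2 + 1):
        set_pixel(frame, x, y1, color)
        set_pixel(frame, x, y2, color)
    for y in range(y1, y2 + 1):
        set_pixel(frame, x1, y, color)
        set_pixel(frame, x2, y, color)


def draw_line(frame, p0, p1, color):
    """Draw a straight segment by linear interpolation between two points."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        x = round(x0 + (x1 - x0) * i / steps)
        y = round(y0 + (y1 - y0) * i / steps)
        set_pixel(frame, x, y, color)


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def render_keyframe(frame, tracks, palette, t, trail_len=4):
    """Overlay each object's box at time t and a trail through its recent centers.

    Returns the modified frame and the color-to-object legend text.
    """
    legend = []
    for (name, boxes), (color_name, rgb) in zip(tracks.items(), palette):
        past = [center(b) for b in boxes[max(0, t - trail_len): t + 1]]
        for p0, p1 in zip(past, past[1:]):
            draw_line(frame, p0, p1, rgb)  # trail shows direction; its length reflects speed
        draw_box(frame, boxes[t], rgb)
        legend.append(f"{color_name}: {name}")
    return frame, "\n".join(legend)


# Hypothetical track: one object moving right by 3 pixels per frame.
tracks = {"person": [(2 + 3 * t, 2, 8 + 3 * t, 8) for t in range(5)]}
palette = [("red", (255, 0, 0))]
frame, legend = render_keyframe(blank_frame(32, 16), tracks, palette, t=4)
print(legend)
```

In this sketch the trail is a polyline through the box centers of the preceding frames, so a single keyframe carries both where the object is and how it got there. A faster object leaves a longer trail over the same number of frames.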

Problem

Research questions and friction points this paper is trying to address.

object grounding

multimodal large language models

video question answering

spatial-temporal understanding

modality mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

BoxTuning

visual prompting

multimodal fine-tuning