How Can Objects Help Video-Language Understanding?

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the impact of explicitly incorporating object representations on video-language understanding in multimodal large language models (MLLMs), addressing the necessity and efficient implementation of object-centric modeling. Method: We propose a lightweight symbolic object representation scheme, seamlessly integrated into MLLM architectures via learnable adapters; we systematically compare symbolic versus distributed object representations and validate that explicit perception module fusion outperforms implicit inductive biases. Results: Experiments across five mainstream video question answering benchmarks demonstrate that our approach achieves state-of-the-art performance with higher data efficiency. It is the first work to empirically establish that symbolic object representations simultaneously preserve strong generalization capability and exhibit superior integration friendliness—offering a concise, effective, and scalable object modeling paradigm for visual grounding in MLLMs.

Technology Category

Application Category

📝 Abstract
How multimodal large language models (MLLMs) perceive the visual world remains a mystery. To one extreme, object and relation modeling may be implicitly implemented with inductive biases, for example by treating objects as tokens. To the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore spatial configuration of the objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g., distributed versus symbolic) and integration difficulty (e.g., data-efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representation remains necessary, and the symbolic objects can be most easily integrated while being performant for question answering. We hope our findings can encourage the community to explore the explicit integration of perception modules into MLLM design. Our code and models will be publicly released.
Problem

Research questions and friction points this paper is trying to address.

How objects enhance video-language understanding in MLLMs
Trade-off between object representation expressiveness and integration difficulty
Explicit object-centric representation improves video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit integration of object-centric representation
Symbolic objects for easy integration
Trade-off between expressiveness and integration difficulty
🔎 Similar Papers
No similar papers found.
Zitian Tang
Zitian Tang
Brown University
Artificial IntelligenceMultimodal Machine Learning
S
Shijie Wang
Brown University
Junho Cho
Junho Cho
Samsung Electronics
J
Jaewook Yoo
Samsung Electronics
C
Chen Sun
Brown University