Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large multimodal models exhibit poor temporal robustness in video temporal reasoning: they over-rely on textual priors while neglecting visual dynamics, leading to sharp performance degradation under temporal inconsistency perturbations. To address this, we introduce TemRobBench—the first benchmark explicitly designed for evaluating temporal robustness—featuring a novel decoupled evaluation paradigm that applies independent temporal perturbations to visual and textual modalities. We further propose Panoramic Direct Preference Optimization (PanoDPO), the first method enabling joint alignment of temporal preferences across vision and language modalities. PanoDPO integrates multimodal feature disentanglement with collaborative training. Extensive experiments across 16 state-of-the-art multimodal models reveal widespread temporal fragility. Our approach significantly enhances temporal reasoning robustness, achieving an average accuracy gain of 12.7% on TemRobBench, alongside improved generalization and reliability.

📝 Abstract

Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been largely ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately into the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they over-rely on prior knowledge and textual context in adversarial environments while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model's robustness and reliability in temporal analysis.

Problem

Research questions and friction points this paper is trying to address.

Assessing LMM robustness against temporal inconsistency perturbations

Addressing over-reliance on prior knowledge in adversarial settings

Enhancing temporal analysis via multimodal feature integration
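To make the idea of perturbing each modality independently concrete, here is a minimal sketch. The summary does not specify which perturbations TemRobBench uses, so everything here is an illustrative assumption: the `Sample` type, the choice of frame reversal as the visual perturbation, and the choice of a prepended misleading statement as the textual perturbation.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Sample:
    """Hypothetical video-QA item; frame ids stand in for decoded images."""
    frames: List[str]
    question: str
    answer: str


def perturb_visual(sample: Sample) -> Sample:
    """Assumed visual perturbation: reverse the frame order, leave text untouched."""
    return replace(sample, frames=list(reversed(sample.frames)))


def perturb_text(sample: Sample, misleading_context: str) -> Sample:
    """Assumed textual perturbation: prepend a temporally inconsistent statement,
    leave frames untouched."""
    return replace(sample, question=f"{misleading_context} {sample.question}")


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of exact-match predictions; compare clean vs. perturbed runs."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Because each function changes only one modality, a drop in `accuracy` between the clean and perturbed inputs can be attributed to that modality alone, which is the point of a decoupled evaluation.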

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TemRobBench for temporal robustness assessment

Proposes PanoDPO to enhance multimodal feature integration

Evaluates 16 LMMs revealing over-reliance on prior knowledge
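The summary does not give PanoDPO's objective, so the sketch below is not the authors' method. It shows the standard DPO loss for a single preference pair, then a hypothetical way to combine a text-side pair and a vision-side pair as a weighted sum. The function `combined_preference_loss` and its `visual_weight` parameter are assumptions introduced only for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


def combined_preference_loss(text_pair, visual_pair,
                             beta: float = 0.1,
                             visual_weight: float = 1.0) -> float:
    """Hypothetical combination (an assumption, not taken from the paper):
    one DPO term for a text-side preference pair plus a weighted DPO term
    for a vision-side pair. Each pair is a 4-tuple of log-probabilities:
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected)."""
    text_term = dpo_loss(*text_pair, beta=beta)
    visual_term = dpo_loss(*visual_pair, beta=beta)
    return text_term + visual_weight * visual_term
```

When the policy matches the reference model, each term equals log 2; the loss falls below that as the policy assigns relatively more probability to the chosen response than the reference does.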
