Structured Causal Video Reasoning via Multi-Objective Alignment

📅 2026-04-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing video large language models (Video-LLMs): because they rely on unstructured reasoning, they model temporal causality weakly, which makes inference inefficient and causally fragile. The authors propose building Structured Event Facts, a compact representation of salient events and their causal relationships, before the reasoning stage, and use it as an explicit prior that guides concise, causally grounded video reasoning. Training uses the CausalFact-60K dataset and a four-stage pipeline: facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. Because the RL stage must balance structural completeness and causal fidelity against reasoning length, the authors formulate it as a Multi-Objective Reinforcement Learning (MORL) problem and optimize explicitly toward the Pareto frontier. The resulting model, Factum-4B, is reported to reason more reliably, to produce intermediate evidence that is easier to verify, and to perform better on video understanding tasks that require fine-grained temporal inference.

📝 Abstract

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives: structural completeness and causal fidelity must be balanced against reasoning length, which makes the optimization difficult. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
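
The trade-off described in the abstract can be illustrated with a minimal sketch of Pareto dominance over three reward components. This is not the authors' training procedure, which the abstract does not specify. The `RewardVector` type, its field names, and the example scores are hypothetical, and every component is assumed to be scaled so that higher is better (conciseness being a score that falls as reasoning length grows).

```python
from typing import NamedTuple


class RewardVector(NamedTuple):
    # Hypothetical per-response scores; higher is assumed better for all three.
    structural_completeness: float
    causal_fidelity: float
    conciseness: float  # assumed to decrease as reasoning length grows


def dominates(a: RewardVector, b: RewardVector) -> bool:
    """Return True if a is at least as good as b on every objective
    and strictly better on at least one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates: list[RewardVector]) -> list[RewardVector]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative scores only; they are not results from the paper.
candidates = [
    RewardVector(0.9, 0.8, 0.3),  # thorough and faithful, but long
    RewardVector(0.7, 0.7, 0.9),  # shorter, slightly less complete
    RewardVector(0.6, 0.6, 0.2),  # worse than the first on every objective
]
front = pareto_front(candidates)
```

In this example the first two candidates are both kept, because neither is better on every objective, while the third is dropped. A scalar reward would have to fix one weighting between length and fidelity in advance; the Pareto view keeps every non-dominated trade-off visible.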

Problem

Research questions and friction points this paper is trying to address.

structured reasoning

causal inference

video understanding

temporal relations

Video-LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Event Facts

Multi-Objective Reinforcement Learning

Causal Video Reasoning

Video-LLMs

Pareto-Frontier Optimization

🔎 Similar Papers

MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning

2024-09-26 · Citations: 0

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

2024-06-10 · arXiv.org · Citations: 3