🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on forecasting future events from audio-visual cues, as existing benchmarks focus mainly on retrospective understanding. The authors introduce FutureOmni, the first benchmark designed to assess MLLMs' ability to predict future events by combining cross-modal causal and temporal reasoning with internal knowledge. Built via a scalable LLM-assisted, human-in-the-loop pipeline, the benchmark comprises 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations of 20 leading models (13 omni-modal, 7 video-only) show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, the authors curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy, which improves both future forecasting and generalization on FutureOmni and on popular audio-visual and video-only benchmarks.
📝 Abstract
Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. Evaluated models must perform cross-modal causal and temporal reasoning and effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and on popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).