DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing temporal point process (TPP) models are largely limited to unimodal data, and no benchmark supports joint modeling of time, text, and vision. This work introduces DanmakuTPPBench, the first multimodal TPP benchmark tailored to the danmaku (real-time bullet-comment) scenario, comprising (i) DanmakuTPP-Events, a dataset of events precisely aligned across timestamps, comment text, and video frames, and (ii) DanmakuTPP-QA, a spatiotemporal multimodal reasoning question-answering set generated by collaborating LLMs and multimodal LLMs (MLLMs). Methodologically, the work extends TPP modeling for the first time to jointly capture textual, visual, and temporal dynamics, formalizing a natural multimodal event paradigm driven by danmaku, and designs an LLM+MLLM collaborative pipeline for high-quality QA generation. Experiments expose critical bottlenecks in current TPP models and MLLMs for multimodal dynamic modeling and establish strong baselines; all code and data are released publicly to advance the integration of TPPs and multimodal foundation models.

📝 Abstract
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench

Problem

Research questions and friction points this paper is trying to address.

Lack of multi-modal datasets for Temporal Point Process modeling
Need for joint temporal, textual, and visual event reasoning
Significant performance gaps in how current methods model multi-modal event dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal TPP benchmark with Danmaku data
LLM-powered QA dataset for complex reasoning
Integration of temporal, textual, visual information
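As background for the modeling task above, a temporal point process is specified by a conditional intensity λ(t), and classical TPP baselines are fit by maximum likelihood. Below is a minimal sketch of the log-likelihood of an exponential-kernel Hawkes process, the kind of unimodal baseline such benchmarks evaluate; the function and parameter names are illustrative, not from the paper:

```python
import math

def hawkes_loglik(times, T, mu, alpha, beta):
    """Log-likelihood of a Hawkes process with intensity
    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))
    for event times `times` observed on the window [0, T]."""
    loglik = 0.0
    for i, t in enumerate(times):
        # Intensity at event t, driven by all earlier events
        lam = mu + sum(alpha * math.exp(-beta * (t - s)) for s in times[:i])
        loglik += math.log(lam)
    # Compensator: integral of the intensity over [0, T]
    compensator = mu * T + sum(
        (alpha / beta) * (1.0 - math.exp(-beta * (T - t))) for t in times
    )
    return loglik - compensator
```

With `alpha = 0` this reduces to a homogeneous Poisson process, so the expression collapses to `n * log(mu) - mu * T`, a handy sanity check when implementing such baselines.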
Yue Jiang
Fudan University
Jichu Li
Center for Applied Statistics and School of Statistics, Renmin University of China
Yang Liu
Alibaba Cloud
Dingkang Yang
ByteDance
Multimodal Learning, Generative AI, Embodied AI
Feng Zhou
Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Quyu Kong
Alibaba Cloud
Multimodal LLM, Information Diffusion Modeling, Machine Learning