🤖 AI Summary
This work addresses the challenge of modeling long-range dependencies in asynchronous multimodal event sequences comprising visual and textual modalities, where existing attention-based temporal point process (TPP) models often fail to generate coherent, contextually rich text because of excessive sequence lengths. To overcome this limitation, the authors propose a novel multimodal long-range modeling framework that integrates large language models (LLMs) with TPPs. The key innovations are the first effective incorporation of the visual modality into LLM-based TPPs and an adaptive sequence compression mechanism grounded in temporal similarity, which substantially reduces input length while preserving critical event patterns. Using a two-stage training strategy (pretraining on compressed sequences, then fine-tuning on downstream tasks), the proposed model achieves state-of-the-art performance on benchmarks such as DanmakuTPP-QA, significantly outperforming existing approaches in both event prediction accuracy and text generation quality.
📝 Abstract
Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.
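The abstract does not specify how the temporal-similarity compression works, so the following is only a minimal, hypothetical sketch of the general idea: merge consecutive events of the same type whose inter-event gap falls below an adaptive threshold derived from the sequence's own gap statistics. The event representation, the same-type merge rule, and the `ratio` parameter are all illustrative assumptions, not details from the paper.

```python
def compress(events, ratio=0.5):
    """Hypothetical temporal-similarity compression (not the paper's method).

    events: list of (timestamp, type) tuples, sorted by timestamp.
    ratio:  fraction of the mean inter-event gap used as the adaptive
            merge threshold (an assumed parameterization).
    """
    if len(events) < 3:
        return list(events)
    # Inter-event gaps drive the adaptive threshold.
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(events, events[1:])]
    threshold = ratio * (sum(gaps) / len(gaps))
    compressed = [events[0]]
    for t, k in events[1:]:
        prev_t, prev_k = compressed[-1]
        # Drop an event that repeats the previous type within a small gap;
        # bursts of near-duplicate events collapse to one representative.
        if k == prev_k and t - prev_t <= threshold:
            continue
        compressed.append((t, k))
    return compressed

events = [(0.0, "A"), (0.1, "A"), (0.2, "A"), (5.0, "B"), (5.1, "B"), (12.0, "A")]
print(compress(events))  # → [(0.0, 'A'), (5.0, 'B'), (12.0, 'A')]
```

A burst of six events collapses to three while the coarse temporal pattern (A-burst, B-burst, late A) survives, which is the property the paper's mechanism is described as preserving.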