Multimodal Banking Dataset: Understanding Client Needs through Event Sequences

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
The financial domain has long lacked large-scale, multi-source, real-world event sequence datasets, hindering research on multimodal temporal modeling. To address this, we introduce MBD—the first industrial-grade, open-source multimodal banking event sequence dataset—comprising 1.5 million corporate clients, 950 million transactions, 1 billion geolocation records, 5 million customer-service dialogue embeddings, and aggregated product-purchase data. We propose three core techniques: cross-modal event alignment, differential-privacy-based anonymization, and multimodal temporal encoding. Further, we define two benchmark tasks—marketing response prediction and client matching. Extensive experiments demonstrate that our multimodal baselines significantly outperform unimodal counterparts, effectively overcoming modeling bottlenecks imposed by financial data silos and privacy constraints. MBD establishes a new infrastructure for reproducible, scalable research in financial intelligent decision-making.

📝 Abstract
Financial organizations collect large amounts of client data that typically have a temporal (sequential) structure and come from various sources (modalities). Due to privacy issues, there are no large-scale open-source multimodal datasets of event sequences, which significantly limits research in this area. In this paper, we present MBD, an industrial-scale, publicly available multimodal banking dataset that contains more than 1.5M corporate clients with several modalities: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support, and monthly aggregated purchases of four bank products. All entries are properly anonymized from real proprietary bank data. Using this dataset, we introduce a novel benchmark with two business tasks: campaigning (purchase prediction in the next month) and matching of clients. We provide numerical results that demonstrate the superiority of our multimodal baselines over single-modal techniques for each task. As a result, the proposed dataset can open new perspectives and facilitate the future development of practically important large-scale multimodal algorithms for event sequences.
HuggingFace Link: https://huggingface.co/datasets/ai-lab/MBD
GitHub Link: https://github.com/Dzhambo/MBD
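
The campaigning task in the abstract (purchase prediction in the next month) can be illustrated with a minimal sketch of how next-month labels might be derived from monthly aggregated purchases. The row schema `(client_id, month, product)`, the product names, and the helper `next_month_targets` are assumptions made for illustration only; they are not taken from the MBD release and may not match its actual layout.

```python
from collections import defaultdict

# Hypothetical product names; the abstract only says there are four bank products.
PRODUCTS = ("product_1", "product_2", "product_3", "product_4")


def next_month_targets(purchase_rows, months):
    """Build binary next-month purchase labels.

    purchase_rows: iterable of (client_id, month, product) tuples
                   (assumed schema; month is a contiguous integer index).
    months:        sorted list of month indices covered by the data.
    Returns {(client_id, month): [0/1 per product in PRODUCTS]} for every
    month that has an observed following month.
    """
    bought = defaultdict(set)
    clients = set()
    for client_id, month, product in purchase_rows:
        bought[(client_id, month)].add(product)
        clients.add(client_id)

    targets = {}
    for client_id in sorted(clients):
        # The last month is skipped because its future is not observed.
        for month in months[:-1]:
            future = bought.get((client_id, month + 1), set())
            targets[(client_id, month)] = [int(p in future) for p in PRODUCTS]
    return targets


# Toy rows, invented for this example only.
rows = [
    ("c1", 0, "product_1"),
    ("c1", 1, "product_3"),
    ("c2", 1, "product_1"),
    ("c2", 1, "product_4"),
]
targets = next_month_targets(rows, months=[0, 1, 2])
print(targets[("c1", 0)])  # labels for c1's purchases in month 1
```

In this framing, features for a given `(client_id, month)` pair would be computed only from events up to that month, so the label never leaks into the inputs. Note that the sketch only covers clients with at least one purchase row; clients with no purchases would need to be added from a separate client list.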
Problem

Research questions and friction points this paper is trying to address.

Lack of large open-source multimodal banking datasets for deep learning
Need for anonymized real-world client data from multiple sources
Absence of benchmarks for financial multimodal sequence analysis tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

First industrial-scale multimodal banking dataset
Anonymized data preserving significant information
Fusion baselines outperform single-modal techniques