Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment

📅 2024-07-25
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the challenges of sparse modality pairing and scarce labeled data in multimodal perception, this paper proposes a scalable modality alignment paradigm that decomposes N-modality joint alignment into incremental binary alignments, enabling dynamic integration of new modalities. Methodologically, it integrates multi-stage contrastive learning, modality-decoupled encoders, dynamic weight calibration, and cross-modal projection alignment, and conducts joint pretraining across six modalities: Wi-Fi, mmWave, IMU, LiDAR, video, and depth. Key contributions include the first scalable alignment mechanism, a modality contribution balancing strategy, and a sparse-pairing mitigation technique, culminating in plug-and-play single- and multimodal foundation models for perception. Evaluated on eight human activity recognition benchmarks, the approach achieves an average 12% accuracy gain for single-modal inputs and up to 22% improvement for multimodal fusion—substantially outperforming state-of-the-art methods. It further supports cross-modal retrieval and large language model–enabled collaborative perception and understanding.
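As a rough illustration of the decomposition described above, the sketch below shows what a single binary-modality alignment step could look like using a symmetric contrastive (CLIP-style InfoNCE) objective on paired samples. The encoder architecture, dimensions, and hyperparameters are illustrative assumptions, not Babel's actual implementation.

```python
# Hypothetical sketch of one binary-modality alignment step (not Babel's actual code).
# Two modality encoders are pulled into a shared embedding space with a
# symmetric contrastive (InfoNCE / CLIP-style) loss over paired samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Placeholder encoder; Babel's real encoders are modality-specific."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def binary_alignment_loss(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings from modalities A and B."""
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: align an IMU encoder to a video encoder on a batch of paired samples.
imu_enc, video_enc = ModalityEncoder(in_dim=64), ModalityEncoder(in_dim=1024)
imu_batch, video_batch = torch.randn(32, 64), torch.randn(32, 1024)
loss = binary_alignment_loss(imu_enc(imu_batch), video_enc(video_batch))
loss.backward()
```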

📝 Abstract
This paper presents Babel, an expandable modality alignment model designed specifically for multi-modal sensing. While there has been considerable work on multi-modality alignment, existing approaches struggle to incorporate multiple sensing modalities effectively because of data scarcity constraints; how to exploit multi-modal sensing data with only partial pairings remains an unresolved challenge. Babel tackles this challenge by introducing the concept of expandable modality alignment: the key idea is to transform N-modality alignment into a series of binary-modality alignments. Novel techniques are also proposed to further mitigate the data scarcity issue and to balance the contribution of each newly incorporated modality against the previously established alignment during the expansion process. A comprehensive implementation is provided. In the pre-training phase, Babel currently aligns six sensing modalities: Wi-Fi, mmWave, IMU, LiDAR, video, and depth. In the deployment phase, as a foundation model, any single modality or combination of aligned modalities can be selected from Babel and applied to downstream tasks. Evaluation demonstrates Babel's strong performance on eight human activity recognition datasets against a broad range of baselines, including state-of-the-art single-modal sensing networks, a multi-modal sensing framework, and multi-modal large language models. Babel not only improves individual-modality sensing (12% average accuracy improvement) but also effectively fuses the modalities available at deployment (up to 22% accuracy increase). Case studies further highlight emerging applications empowered by Babel, including cross-modality retrieval (i.e., sensing imaging) and bridging LLMs for sensing comprehension.
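The abstract's deployment description, selecting any single modality or combination of aligned modalities for a downstream task, could look roughly like the sketch below. The averaging-based fusion, the `FusionHead` class, and the stand-in encoders are assumptions made for illustration, not Babel's published interface.

```python
# Hypothetical deployment sketch (assumed API, not Babel's actual interface):
# any subset of the pre-aligned modality encoders is selected, their embeddings
# are fused in the shared space, and a lightweight head is trained for the
# downstream task (e.g., human activity recognition).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Mean-fuse embeddings from the chosen modalities, then classify."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, embeddings: dict) -> torch.Tensor:
        fused = torch.stack(list(embeddings.values()), dim=0).mean(dim=0)
        return self.classifier(fused)

# Pretend `encoders` maps modality names to the frozen, pre-aligned encoders.
encoders = {"wifi": nn.Linear(128, 256), "imu": nn.Linear(64, 256)}   # stand-ins
for enc in encoders.values():
    enc.requires_grad_(False)                      # keep the alignment frozen

head = FusionHead(embed_dim=256, num_classes=10)   # e.g., 10 activity classes
batch = {"wifi": torch.randn(8, 128), "imu": torch.randn(8, 64)}
embeddings = {m: encoders[m](x) for m, x in batch.items()}
logits = head(embeddings)                          # (8, 10) class scores
```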
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-modal sensing with partial data pairings
Mitigates data scarcity in multi-modality alignment tasks
Enhances sensing accuracy across six diverse modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expandable modality alignment for multi-modal sensing
Binary-modality alignments to simplify N-modality alignment
Novel techniques to mitigate data scarcity and balance new-modality contributions (see the sketch after this list)
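The sketch below illustrates, under stated assumptions, how an expansion loop might add one modality at a time while protecting the previously established alignment, here by freezing earlier encoders and weighting the new pair's loss. The `expand_alignment` helper and the freezing/weighting scheme are hypothetical; Babel's actual contribution-balancing technique may differ.

```python
# Hypothetical sketch of the expandable-alignment loop (illustrative only; the
# freezing and weighting scheme is an assumption, not Babel's published recipe).
# Each new modality is aligned against an anchor modality already in the shared
# space via a binary contrastive loss, while earlier encoders stay frozen.
import torch

def expand_alignment(aligned, new_name, new_encoder, pair_loader, optimizer,
                     binary_alignment_loss, new_weight: float = 1.0):
    """Add `new_encoder` to the dict of `aligned` encoders using paired data
    between the new modality and anchor modalities already in `aligned`."""
    for enc in aligned.values():
        enc.requires_grad_(False)                  # protect established alignments
    for anchor_name, x_new, x_anchor in pair_loader:   # whichever pairings exist
        z_new = new_encoder(x_new)
        with torch.no_grad():
            z_anchor = aligned[anchor_name](x_anchor)
        loss = new_weight * binary_alignment_loss(z_new, z_anchor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    aligned[new_name] = new_encoder                # the space now covers one more modality
    return aligned
```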