Multi-modal video data-pipelines for machine learning with minimal human supervision

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing models are largely restricted to unimodal or bimodal architectures, limiting their capacity to efficiently integrate the rich diversity of visual modalities present in real-world scenarios. To address this, we propose a low-supervision, fully automated multimodal video data pipeline that enables programmable composition and joint learning across heterogeneous visual modalities—including RGB, depth, optical flow, and edge maps. Our key contributions are: (1) PHG-MAE, a lightweight multimodal self-supervised encoder (<1M parameters), which leverages pretrained expert models and knowledge distillation to achieve performance on par with 300M-parameter large models; and (2) seamless integration of off-the-shelf modules (e.g., DPT) to enable real-time semantic segmentation and near-real-time depth estimation from handheld or webcam video on commodity hardware. Extensive experiments demonstrate the pipeline’s efficiency, scalability, and strong generalization under resource-constrained conditions.

📝 Abstract
The real world is inherently multi-modal. Our tools observe it and take digital snapshots of it, such as videos or sounds, but much of it is lost in the process. Similarly, for actions and information passed between humans, written language is used as a form of communication. Traditionally, machine-learning models have been unimodal (e.g. RGB -> semantics, or text -> sentiment class). Recent trends move towards bi-modality, where images and text are learned jointly; however, to truly understand the world, we need to integrate all of these independent modalities. In this work we combine as many visual modalities as possible using little to no human supervision. To do this, we apply pre-trained experts, and procedural combinations of their outputs, on top of raw videos using a fully autonomous data pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, efficiently distilled into a low-parameter (<1M) variant, achieves competitive results compared to models of ~300M parameters. We deploy this model and analyze the use case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models within the same framework, such as DPT for near-real-time depth estimation.
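The core idea of the pipeline, running pre-trained experts over raw video frames to produce extra modalities with no human labelling, can be sketched as follows. This is a minimal illustration with toy stand-in "experts" (the function names and the luminance/gradient heuristics are assumptions for demonstration, not the paper's actual models, which would be real depth, optical-flow, or edge networks):

```python
import numpy as np

# Hypothetical stand-ins for pretrained expert models; in the real
# pipeline these would be actual off-the-shelf networks (e.g. a depth
# or edge-detection model) run over raw video frames.
def depth_expert(rgb):
    # toy "depth" proxy: per-pixel luminance, shape (H, W, 1)
    return rgb.mean(axis=-1, keepdims=True)

def edge_expert(rgb):
    # toy "edge" proxy: horizontal gradient magnitude, shape (H, W, 1)
    grad = np.abs(np.diff(rgb.mean(axis=-1), axis=1, prepend=0.0))
    return grad[..., None]

def build_multimodal_frame(rgb, experts):
    """Run every expert on one RGB frame and concatenate the resulting
    pseudo-labelled modalities channel-wise with the original frame."""
    channels = [rgb] + [expert(rgb) for expert in experts]
    return np.concatenate(channels, axis=-1)

frame = np.random.rand(64, 64, 3).astype(np.float32)
multimodal = build_multimodal_frame(frame, [depth_expert, edge_expert])
print(multimodal.shape)  # (64, 64, 5): RGB + depth + edges
```

A multi-modal model such as PHG-MAE would then consume tensors like `multimodal` during self-supervised training, with the expert outputs acting as free supervision.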
Problem

Research questions and friction points this paper is trying to address.

Developing multi-modal video pipelines with minimal human supervision
Integrating diverse visual modalities using pre-trained experts
Enabling efficient real-time semantic segmentation on commodity hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal video pipelines with minimal supervision
Pre-trained experts combined procedurally on raw videos
PHG-MAE model efficiently distilled to under 1M parameters
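The third point, compressing a large model into a <1M-parameter student, rests on knowledge distillation: the small model is trained to imitate the large model's outputs on unlabelled data. A minimal numerical sketch of that idea (linear teacher/student and the learning rate are illustrative assumptions, not the paper's training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen teacher: a large linear map standing in for a
# ~300M-parameter model's predictions.
W_teacher = rng.normal(size=(16, 4))

# Student with far fewer parameters, trained to match the teacher.
W_student = np.zeros((16, 4))
lr = 0.01

x = rng.normal(size=(256, 16))   # unlabelled inputs
target = x @ W_teacher           # teacher outputs act as soft labels

losses = []
for _ in range(200):
    pred = x @ W_student
    err = pred - target
    losses.append(float((err ** 2).mean()))
    # gradient descent on the mean-squared distillation loss
    W_student -= lr * x.T @ err / len(x)

print(losses[0] > losses[-1])  # loss drops as the student imitates the teacher
```

No ground-truth labels appear anywhere in the loop, which is what lets distillation run inside a fully automated, low-supervision pipeline.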