StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently processing asynchronous multimodal social signals (video, text, and audio) in real-time live streaming, where latency, computational cost, and decision accuracy must be traded off under limited contextual information. To this end, the authors propose StreamSense, a framework that handles routine timestamps with a lightweight streaming encoder, selectively invokes a vision-language model (VLM) only for ambiguous or difficult samples, and defers decisions when context is insufficient. The encoder is trained with cross-modal contrastive learning and an IoU-weighted loss to mitigate label noise and interference across segment boundaries. Evaluated on sentiment classification and hate content moderation tasks, StreamSense significantly outperforms purely VLM-based streaming methods, achieving higher accuracy while substantially reducing average latency and computational overhead.
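The cross-modal contrastive term mentioned above can be illustrated with a generic symmetric InfoNCE loss over paired visual/text embeddings. This is a minimal sketch, not the paper's actual formulation: the temperature value and the use of plain InfoNCE are assumptions.

```python
import numpy as np

def info_nce(vis, txt, tau=0.07):
    """Symmetric InfoNCE over row-aligned visual/text embeddings.

    A generic stand-in for the paper's cross-modal contrastive term:
    matched (visual, text) pairs sit on the diagonal of the similarity
    matrix and are pulled together; mismatched pairs are pushed apart.
    The temperature tau=0.07 is a common default, not from the paper.
    """
    # L2-normalize so dot products are cosine similarities.
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = vis @ txt.T / tau  # (N, N) similarity logits

    def xent(logits):
        # Cross-entropy with the diagonal as the positive class per row.
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the vision->text and text->vision directions.
    return 0.5 * (xent(sim) + xent(sim.T))
```

With well-aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is the signal used to align visual/audio cues with textual ones.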

📝 Abstract
Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
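The accept/escalate/defer logic described in the abstract can be sketched as a confidence-threshold router. The thresholds, the softmax-confidence criterion, and the `vlm` stand-in below are all hypothetical illustrations; the paper does not specify its routing rule here.

```python
import numpy as np

TAU_ACCEPT = 0.85  # hypothetical: accept the encoder's prediction outright
TAU_DEFER = 0.40   # hypothetical: below this, context is too thin -> defer

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(encoder_logits, vlm_expert):
    """Per-timestamp decision: accept, escalate to the VLM, or defer.

    Most timestamps are handled by the lightweight encoder alone; only
    ambiguous mid-confidence cases pay the VLM's latency cost, and
    low-confidence cases wait for more streaming context.
    """
    probs = softmax(encoder_logits)
    conf = probs.max()
    if conf >= TAU_ACCEPT:
        return "accept", int(probs.argmax())   # encoder is confident
    if conf < TAU_DEFER:
        return "defer", None                   # insufficient context
    return "escalate", vlm_expert(encoder_logits)  # hard/ambiguous case

# Toy VLM stand-in for illustration: just argmax over the logits.
vlm = lambda z: int(np.argmax(z))

decision, label = route(np.array([4.0, 0.1, 0.2]), vlm)
```

The key design point is that average latency is dominated by the cheap path, since the VLM is invoked only for the ambiguous band between the two thresholds.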
Problem

Research questions and friction points this paper is trying to address.

streaming social task detection
real-time monitoring
asynchronous multimodal signals
live streaming
social signal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective routing
streaming social task detection
vision-language model
cross-modal contrastive learning
IoU-weighted loss