StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently processing asynchronous multimodal social signals (video, text, and audio) in real-time live streaming, where latency, computational cost, and decision accuracy must be traded off under limited contextual information. To this end, the authors propose StreamSense, a framework that handles routine timestamps with a lightweight streaming encoder, selectively invokes a vision-language model (VLM) only for ambiguous or difficult samples, and defers decisions when context is insufficient. The encoder is trained with cross-modal contrastive learning and an IoU-weighted loss to mitigate label noise and interference across segment boundaries. Evaluated on sentiment classification and hate content moderation tasks, StreamSense significantly outperforms purely VLM-based streaming methods, achieving higher accuracy while substantially reducing average latency and computational overhead.
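The cross-modal contrastive term mentioned above can be illustrated with a generic symmetric InfoNCE loss over paired visual/text embeddings. This is a minimal sketch, not the paper's actual formulation: the temperature value and the use of plain InfoNCE are assumptions.

```python
import numpy as np

def info_nce(vis, txt, tau=0.07):
    """Symmetric InfoNCE over row-aligned visual/text embeddings.

    A generic stand-in for the paper's cross-modal contrastive term:
    matched (visual, text) pairs sit on the diagonal of the similarity
    matrix and are pulled together; mismatched pairs are pushed apart.
    The temperature tau=0.07 is a common default, not from the paper.
    """
    # L2-normalize so dot products are cosine similarities.
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = vis @ txt.T / tau  # (N, N) similarity logits

    def xent(logits):
        # Cross-entropy with the diagonal as the positive class per row.
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the vision->text and text->vision directions.
    return 0.5 * (xent(sim) + xent(sim.T))
```

With well-aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is the signal used to align visual/audio cues with textual ones.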

📝 Abstract
Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
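The accept/escalate/defer logic described in the abstract can be sketched as a confidence-threshold router. The thresholds, the softmax-confidence criterion, and the `vlm` stand-in below are all hypothetical illustrations; the paper does not specify its routing rule here.

```python
import numpy as np

TAU_ACCEPT = 0.85  # hypothetical: accept the encoder's prediction outright
TAU_DEFER = 0.40   # hypothetical: below this, context is too thin -> defer

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(encoder_logits, vlm_expert):
    """Per-timestamp decision: accept, escalate to the VLM, or defer.

    Most timestamps are handled by the lightweight encoder alone; only
    ambiguous mid-confidence cases pay the VLM's latency cost, and
    low-confidence cases wait for more streaming context.
    """
    probs = softmax(encoder_logits)
    conf = probs.max()
    if conf >= TAU_ACCEPT:
        return "accept", int(probs.argmax())   # encoder is confident
    if conf < TAU_DEFER:
        return "defer", None                   # insufficient context
    return "escalate", vlm_expert(encoder_logits)  # hard/ambiguous case

# Toy VLM stand-in for illustration: just argmax over the logits.
vlm = lambda z: int(np.argmax(z))

decision, label = route(np.array([4.0, 0.1, 0.2]), vlm)
```

The key design point is that average latency is dominated by the cheap path, since the VLM is invoked only for the ambiguous band between the two thresholds.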
Problem

Research questions and friction points this paper is trying to address.

streaming social task detection
real-time monitoring
asynchronous multimodal signals
live streaming
social signal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective routing
streaming social task detection
vision-language model
cross-modal contrastive learning
IoU-weighted loss