🤖 AI Summary
This work addresses streaming video understanding under strict latency constraints, where information density varies over time and static approaches cannot balance accuracy against real-time performance. The authors propose R3-Streaming (Remember, Respond, Reason), a framework that formulates the task as a cascaded control problem and progressively refines the information state through three stages: memory compression, response readiness assessment, and computational routing. Key innovations include an age-aware forgetting policy for memory compression; TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse; and a multi-model collaborative inference mechanism. Evaluated on OVO-Bench and StreamingBench, the method achieves state-of-the-art results among streaming MLLMs, scoring 57.92 and 76.36, respectively, while reducing visual token usage by 95% to 96%.
📝 Abstract
Streaming video requires handling dynamic information density under strict latency budgets. Yet existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, which forces a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system sequentially compresses memory, judges response readiness, and routes computation, so that each downstream decision builds on a progressively refined information state. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, motivated by the observation that aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
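The Remember, Respond, Reason cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The exponential half-life budget, the prefix truncation used as compression, and all names (`Frame`, `tokens_to_keep`, `compress_memory`, `answer_query`, `is_ready`, `is_hard`) are assumptions introduced here for illustration. In particular, the router is a plain callable and does not model TB-GRPO training.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Frame:
    timestamp: float   # seconds since stream start
    tokens: List[int]  # stand-in for a frame's visual tokens


def tokens_to_keep(age: float, full: int, half_life: float = 8.0, floor: int = 1) -> int:
    """Hypothetical age-aware budget: halve the token budget every `half_life` seconds."""
    budget = int(full * 0.5 ** (age / half_life))
    return max(floor, min(full, budget))


def compress_memory(frames: List[Frame], now: float, half_life: float = 8.0) -> List[Frame]:
    """Remember: older frames keep fewer tokens (prefix truncation is a placeholder)."""
    compressed = []
    for frame in frames:
        keep = tokens_to_keep(now - frame.timestamp, len(frame.tokens), half_life)
        compressed.append(Frame(frame.timestamp, frame.tokens[:keep]))
    return compressed


Judge = Callable[[List[Frame], str], bool]
Model = Callable[[List[Frame], str], str]


def answer_query(
    frames: List[Frame],
    now: float,
    query: str,
    is_ready: Judge,      # assumed readiness judge
    is_hard: Judge,       # assumed router; the paper trains this decision with TB-GRPO
    fast_model: Model,
    strong_model: Model,
) -> Optional[str]:
    """Run the three stages in order; each stage sees the output of the previous one."""
    memory = compress_memory(frames, now)       # Remember
    if not is_ready(memory, query):             # Respond: wait if evidence is insufficient
        return None
    model = strong_model if is_hard(memory, query) else fast_model  # Reason: route compute
    return model(memory, query)
```

Under these assumptions, three 16-token frames captured at 0 s, 8 s, and 16 s and queried at 16 s retain 4, 8, and 16 tokens respectively, and a query is sent to the stronger model only after the readiness check passes and the router flags it as hard.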