Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time AI video dialogue faces two core challenges — high multimodal large language model (MLLM) inference latency and network instability — which together disrupt smooth human-AI interaction. To address this, we propose an AI-native real-time communication framework that fundamentally shifts network requirements from "humans watching video" to "AI understanding video." Our method introduces context-aware video stream encoding, loss-resilient adaptive frame-rate control, and a historical-frame compensation mechanism for packet-loss recovery. Additionally, we establish DeViBench, the first benchmark specifically designed for evaluating MLLM performance on low-quality video inputs. Experiments demonstrate that our framework achieves a 58% average bitrate reduction without compromising MLLM visual understanding accuracy, while reducing end-to-end latency by 37%, thereby enabling human-like real-time interactive experiences.
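The context-aware encoding idea — spending nearly the entire bitrate budget on regions that matter for the chat — can be illustrated with a minimal sketch. All names, region labels, and importance scores below are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch: split a total bitrate budget across video regions in
# proportion to their chat-importance scores, so chat-relevant regions
# receive almost all of the bits. A small per-region floor keeps every
# region decodable. Region names and scores are invented for illustration.

def allocate_bitrate(total_kbps, region_importance, floor_kbps=2.0):
    """Allocate total_kbps across regions proportionally to importance,
    after reserving a small floor for each region."""
    reserved = floor_kbps * len(region_importance)
    budget = max(total_kbps - reserved, 0.0)
    total_importance = sum(region_importance.values())
    return {
        region: floor_kbps + budget * (score / total_importance)
        for region, score in region_importance.items()
    }

# Example: the speaker's face dominates the budget; background gets scraps.
regions = {"speaker_face": 0.7, "held_object": 0.25, "background": 0.05}
allocation = allocate_bitrate(500.0, regions)
```

How importance scores are produced (e.g., from the dialogue context or an attention signal) is the substantive part of the paper's method; this sketch only shows the budget-splitting step that would follow.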

📝 Abstract
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty and instability, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, exploring the network requirement shift from "humans watching video" to "AI understanding video". To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate that leverages previous frames to substitute for lost/delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.
Problem

Research questions and friction points this paper is trying to address.

Reducing end-to-end latency in human-AI video chat, where MLLM inference consumes most of the response-time budget
Allocating video bitrate for AI understanding rather than human viewing, without sacrificing MLLM accuracy
Mitigating the effects of network instability and packet loss on real-time AI video communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware Video Streaming allocates bitrate almost exclusively to chat-important regions
Loss-Resilient Adaptive Frame Rate substitutes previous frames for lost or delayed ones, avoiding packet retransmission
DeViBench, the first benchmark evaluating the impact of degraded video quality on MLLM accuracy