EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing streaming multi-modal video generation models suffer from high latency, temporal instability, spatial blurriness, temporal drift, and audio-lip asynchrony, hindering their applicability in real-time scenarios. This work proposes EchoTorrent, a framework that achieves efficient and stable few-step autoregressive generation through four synergistic innovations: multi-teacher knowledge distillation, adaptive classifier-free guidance calibration (ACC-DMD), forced alignment of long-tail frames within a causal-bidirectional hybrid architecture, and pixel-domain optimization of the VAE decoder. EchoTorrent substantially enhances temporal consistency, identity preservation, and lip-sync accuracy, mitigating the multimodal degradation inherent in streaming generation and achieving a markedly better trade-off between efficiency and generation quality.

📝 Abstract
Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation such as spatial blurring, temporal drift, and lip desynchronization, and leaving an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel scheme with a fourfold design: (1) Multi-Teacher Training, which fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts that sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigating spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner, which optimizes the VAE decoder in the pixel domain to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially improved temporal consistency, identity preservation, and audio-lip synchronization.
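The motivation for ACC-DMD rests on the cost of standard classifier-free guidance (CFG), which requires two forward passes per denoising step (conditional and unconditional); a distilled student that absorbs the guided correction into its weights needs only one. A minimal numpy sketch of the two-pass baseline, with a toy linear `denoiser` standing in for the diffusion backbone (all names and shapes are hypothetical, not the paper's code):

```python
import numpy as np

def denoiser(x, audio_emb, W):
    """Toy linear stand-in for one denoising step of a video backbone."""
    return np.concatenate([x, audio_emb], axis=-1) @ W

def cfg_two_pass(x, audio_emb, W, scale=4.0):
    """Standard audio CFG: two forward passes per step.
    Output extrapolates from the unconditional toward the conditional
    prediction by `scale`; scale=1.0 recovers the plain conditional pass."""
    cond = denoiser(x, audio_emb, W)                     # conditional pass
    uncond = denoiser(x, np.zeros_like(audio_emb), W)    # unconditional pass
    return uncond + scale * (cond - uncond)
```

A single-pass student, as targeted by ACC-DMD's distillation, would simply call `denoiser(x, audio_emb, W_student)` once per step, halving the per-step compute.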
Problem

Research questions and friction points this paper is trying to address.

multi-modal video generation
streaming inference
temporal stability
audio-lip synchronization
real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Video Generation
Multi-Teacher Knowledge Transfer
Adaptive CFG Calibration
Hybrid Long Tail Forcing
VAE Decoder Refinement
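Of these, Hybrid Long Tail Forcing can be illustrated concretely: during long-horizon self-rollout training, the alignment loss is masked so that only the final few (tail) frames are supervised. A hedged numpy sketch, where the function name, shapes, and use of MSE are assumptions rather than the paper's implementation:

```python
import numpy as np

def tail_forcing_loss(pred, target, tail_len=4):
    """MSE computed only over the final `tail_len` frames of a rollout.

    pred, target: arrays of shape (batch, frames, feature_dim).
    Earlier frames receive no supervision, so the model is free to
    self-rollout while only the tail is forced toward the reference.
    """
    diff = pred[:, -tail_len:] - target[:, -tail_len:]
    return float(np.mean(diff ** 2))
```

Because early frames carry zero loss, gradients (in a differentiable framework) would flow only through the tail, which is the mechanism the abstract credits with curbing drift without over-constraining the rollout.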