LPM 1.0: Video-based Character Performance Model

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “performance trilemma” in video-driven character animation—where expressiveness, real-time capability, and identity consistency are difficult to achieve simultaneously—by proposing the first interactive character performance framework tailored for full-duplex audiovisual dialogue. Leveraging a newly curated high-quality multimodal character dataset, the authors train a 17-billion-parameter diffusion Transformer model (Base LPM) and further develop an online variant (Online LPM) through identity-aware multi-reference extraction and causal streaming distillation, enabling real-time generation of indefinitely long videos. Evaluated on the newly introduced LPM-Bench benchmark, the proposed method achieves state-of-the-art performance, simultaneously preserving strong identity consistency, high expressiveness, and low latency, making it well-suited for applications such as conversational agents, virtual live streaming, and game NPCs.
📝 Abstract
Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
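The abstract describes an inference loop in which, given a character image with identity-aware references, the distilled causal streaming generator (Online LPM) turns user audio into listening video and synthesized agent audio into speaking video, chunk by chunk. A minimal sketch of that full-duplex loop is below; all names (`stream_performance`, `VideoChunk`, the chunk sizes) are hypothetical illustrations, not the paper's API, and the actual model call is mocked.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class VideoChunk:
    frames: int          # number of frames in this generated chunk
    mode: str            # "listening" (driven by user audio) or "speaking"
    identity_refs: int   # number of identity references conditioned on

def stream_performance(audio_chunks: List[dict],
                       identity_refs: List[str],
                       fps: int = 25,
                       chunk_seconds: float = 0.5) -> Iterator[VideoChunk]:
    """Causal streaming loop: each incoming audio chunk is mapped to a short
    video chunk, conditioned on the same identity references throughout so
    the character stays identity-stable over an unbounded session."""
    frames_per_chunk = int(fps * chunk_seconds)
    for chunk in audio_chunks:
        # User audio drives listening/reacting video; synthesized agent
        # audio drives speaking video (the full-duplex split in the paper).
        mode = "listening" if chunk["source"] == "user" else "speaking"
        # A real system would run the distilled causal DiT here; we mock it.
        yield VideoChunk(frames=frames_per_chunk, mode=mode,
                        identity_refs=len(identity_refs))

refs = ["front_face.png", "three_quarter.png"]        # identity-aware references
audio = [{"source": "user"}, {"source": "agent"}]     # alternating turns
chunks = list(stream_performance(audio, refs))
```

Because generation is causal and per-chunk, the loop can run indefinitely at real-time speed, which is the property the abstract highlights for conversational agents and live streaming characters.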
Problem

Research questions and friction points this paper is trying to address.

performance trilemma · identity stability · real-time inference · expressiveness · conversational performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

performance trilemma · multimodal diffusion transformer · identity-consistent generation · real-time character animation · interactive performance benchmark