🤖 AI Summary
This work addresses the challenge of balancing interactivity and reconstruction quality in highly compressed human video. We propose the first end-to-end differentiable human video compression framework enabling semantic-level interactivity. Methodologically, we leverage a 3D human model to decouple motion into editable semantic embeddings; integrate mesh-based motion field evolution with a generative decoder; and achieve real-time, bitstream-level semantic editing and controllable reconstruction without pre- or post-processing. Our contributions are threefold: (1) the first human video codec supporting semantic-level interactivity; (2) superior rate-distortion performance over VVC and state-of-the-art generative compression methods at ultra-low bitrates; and (3) simultaneous high-fidelity reconstruction and millisecond-latency semantic manipulation. The framework establishes a new paradigm for real-time digital human communication in the metaverse.
📝 Abstract
In this paper, we propose to compress human body video with interactive semantics, making video coding interactive and controllable through manipulation of semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle the nonlinear dynamics and complex motion of human body signals into a series of configurable embeddings, which can be controllably edited, compactly compressed, and efficiently transmitted. The proposed decoder then evolves mesh-based motion fields from the decoded semantics to reconstruct high-quality human body video. Experimental results show that the proposed framework achieves promising compression performance for human body videos at ultra-low bitrates compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the framework enables interactive human body video coding without any additional pre- or post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.