🤖 AI Summary
Existing image representation methods struggle to fully disentangle semantics, geometry, and texture while preserving high-fidelity reconstruction, which limits fine-grained controllable editing. This work proposes a hierarchical proxy-embedding parametric representation that constructs a semantic-aware hierarchical proxy geometry and embeds multi-scale implicit textures into geometry-aware proxy nodes. For the first time, this approach fully disentangles the three attributes into independent parameter spaces, enabling high-quality background completion and physics-driven animation without relying on generative models. By integrating adaptive Bézier fitting, iterative region subdivision, and local feature indexing, the method attains state-of-the-art reconstruction quality with significantly fewer parameters on benchmarks including ImageNet, OIR-Bench, and HumanEdit, while supporting intuitive interaction and real-time animation, significantly outperforming existing generative approaches.
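The adaptive Bézier fitting named above can be pictured as a recursive least-squares fit that splits a contour wherever a single cubic segment exceeds an error tolerance. The sketch below is illustrative only: the function names, the uniform parameterization, the error metric, and the split-at-midpoint heuristic are our assumptions, not the paper's actual implementation.

```python
import numpy as np

def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bézier curve (4 control points) at parameters t."""
    p0, p1, p2, p3 = ctrl
    t = t[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def fit_cubic(points):
    """Least-squares fit of one cubic segment to ordered contour points,
    assuming a uniform parameterization t in [0, 1]."""
    t = np.linspace(0.0, 1.0, len(points))
    # Bernstein basis matrix, shape (num_points, 4)
    B = np.stack([(1 - t) ** 3, 3 * (1 - t) ** 2 * t,
                  3 * (1 - t) * t ** 2, t ** 3], axis=1)
    ctrl, *_ = np.linalg.lstsq(B, points, rcond=None)
    return ctrl

def adaptive_fit(points, tol=0.5, max_depth=8):
    """Recursively split the contour until every segment fits within tol."""
    ctrl = fit_cubic(points)
    t = np.linspace(0.0, 1.0, len(points))
    err = np.max(np.linalg.norm(cubic_bezier(ctrl, t) - points, axis=1))
    if err <= tol or max_depth == 0 or len(points) < 8:
        return [ctrl]
    mid = len(points) // 2  # split at the midpoint and recurse on both halves
    return (adaptive_fit(points[:mid + 1], tol, max_depth - 1)
            + adaptive_fit(points[mid:], tol, max_depth - 1))
```

On a near-linear contour this returns a single segment, while a strongly curved contour (e.g. a semicircle) is subdivided until each piece meets the tolerance, which is the adaptivity the summary refers to.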
📝 Abstract
Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bézier fitting and iterative subdivision and meshing of interior regions. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and independent semantic editing at the instance or part level. In addition, we introduce a locality-adaptive feature indexing mechanism that ensures spatial texture coherence and further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.
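The integration of proxy nodes with Position-Based Dynamics mentioned in the abstract can be sketched as a standard PBD step: predict node positions under gravity, then iteratively project distance constraints along proxy edges. This is a minimal generic PBD sketch under our own assumptions (the function name, the time step, and the pinning scheme are illustrative), not the paper's animation pipeline.

```python
import numpy as np

def pbd_step(pos, prev, edges, rest_len, dt=1 / 60, gravity=(0.0, -9.8),
             iters=10, pinned=()):
    """One Position-Based Dynamics step over 2-D proxy nodes.

    pos, prev : (n, 2) current and previous node positions
    edges     : list of (i, j) index pairs connecting proxy nodes
    rest_len  : rest length per edge
    pinned    : indices of nodes that stay fixed
    Returns the new (pos, prev) pair for Verlet-style integration.
    """
    vel = (pos - prev) / dt
    pred = pos + dt * vel + dt * dt * np.asarray(gravity)  # predict positions
    for _ in range(iters):
        # Project each distance constraint: move both endpoints half-way
        # toward the rest length, then re-pin fixed nodes.
        for (i, j), r in zip(edges, rest_len):
            d = pred[j] - pred[i]
            ln = np.linalg.norm(d)
            if ln < 1e-9:
                continue
            corr = 0.5 * (ln - r) / ln * d
            pred[i] += corr
            pred[j] -= corr
        pred[list(pinned)] = pos[list(pinned)]
    return pred, pos
```

Driving the proxy nodes this way and re-rendering the embedded implicit textures at the displaced node positions is the kind of lightweight loop that makes real-time, physically plausible animation feasible without a generative model in the loop.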