🤖 AI Summary
To address the challenge of high-fidelity, physically plausible rendering of human-object interaction (HOI) scenes from sparse viewpoints, this paper proposes a novel framework integrating 3D Gaussian splatting with differentiable physics simulation. Methodologically, it introduces two key components: (1) a dual-module architecture for human pose refinement and sparse-view contact heatmap prediction, explicitly encoding geometric contact constraints; and (2) the first incorporation of physical constraints—namely, contact forces and interpenetration suppression—into the Gaussian optimization objective, enabling joint optimization of visual fidelity and dynamical plausibility. Evaluated on the HODome dataset, our method achieves significant PSNR/SSIM improvements over state-of-the-art methods, operates at 3× real-time inference speed, and substantially outperforms prior work in physical consistency—reducing contact force error by 42% and penetration volume by 58%. Furthermore, it generalizes effectively to hand-object grasping rendering tasks.
📝 Abstract
Rendering realistic human-object interactions (HOIs) from sparse-view inputs is challenging due to occlusions and incomplete observations, yet crucial for various real-world applications. Existing methods always struggle with either low rendering qualities (eg, visual fidelity and physically plausible HOIs) or high computational costs. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient and physically plausible HOI rendering from sparse views. Specifically, HOGS combines 3D Gaussian Splatting with a physics-aware optimization process. It incorporates a Human Pose Refinement module for accurate pose estimation and a Sparse-View Human-Object Contact Prediction module for efficient contact region identification. This combination enables coherent joint rendering of human and object Gaussians while enforcing physically plausible interactions. Extensive experiments on the HODome dataset demonstrate that HOGS achieves superior rendering quality, efficiency, and physical plausibility compared to existing methods. We further show its extensibility to hand-object grasp rendering tasks, presenting its broader applicability to articulated object interactions.