🤖 AI Summary
Existing single-image human-scene interaction reconstruction methods struggle to balance speed and physical plausibility: optimization-based approaches are accurate but slow, while feed-forward methods are fast yet lack explicit interaction modeling, often producing floating or interpenetration artifacts. This work proposes GRAFT, a learned interaction prior that builds body-anchored tokens grounded in the scene via geometric probes and uses a lightweight recurrent transformer to predict interaction gradients, turning geometric fitting into efficient feed-forward inference. Experiments show that GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods, matches the interaction quality of optimization-based approaches at roughly 50 times lower runtime, generalizes to in-the-wild multi-person scenes, and is preferred in 64.8% of cases in a three-way user study.
📝 Abstract
Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20 s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts.
Our key insight is that geometry-based human-scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene.
GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces.
A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry.
GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining.
Experiments show that GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft
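The probe, update, and re-probe loop described in the abstract can be illustrated with a toy sketch. This is not GRAFT itself: the names `geometric_probe`, `toy_update`, and `refine` are hypothetical, the learned transformer is replaced by a hand-written averaging rule, the scene is assumed to be a point cloud, and the human is reduced to a few anchor points with a single global translation as the only parameter.

```python
import numpy as np

def geometric_probe(anchors, scene_points):
    """For each body anchor, return the offset vector to its nearest scene point."""
    diff = scene_points[None, :, :] - anchors[:, None, :]  # (A, S, 3)
    dist = np.linalg.norm(diff, axis=-1)                   # (A, S)
    nearest = dist.argmin(axis=1)                          # (A,)
    return diff[np.arange(len(anchors)), nearest]          # (A, 3)

def toy_update(probe_offsets, contact_mask, step=0.5):
    """Stand-in for the learned model (an assumption, not the paper's network):
    move the anchors expected to be in contact part of the way to the scene."""
    return step * probe_offsets[contact_mask].mean(axis=0)

def refine(anchors, scene_points, contact_mask, n_iters=10):
    """Recurrently probe the scene, predict a corrective update, and apply it."""
    translation = np.zeros(3)
    for _ in range(n_iters):
        offsets = geometric_probe(anchors + translation, scene_points)
        translation = translation + toy_update(offsets, contact_mask)
    return translation

# Toy scene: a flat floor at z = 0 sampled on a regular grid.
grid = np.linspace(-1.0, 1.0, 21)
xx, yy = np.meshgrid(grid, grid)
floor = np.stack([xx.ravel(), yy.ravel(), np.zeros(xx.size)], axis=1)

# Toy "human": two feet floating 0.3 above the floor, plus a head anchor.
anchors = np.array([[0.0, 0.0, 0.3], [0.2, 0.0, 0.3], [0.1, 0.0, 1.9]])
contact_mask = np.array([True, True, False])

print(refine(anchors, floor, contact_mask))
```

Because the scene is probed again after every update, the same loop corrects both floating (anchors above the floor) and interpenetration (anchors below it). In this sketch the correction direction comes from the probe offsets alone, whereas the abstract describes a learned prior that can also use image features.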