InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

📅 2026-04-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in human-object-scene interaction (HOSI) generation: reasoning over a dynamically changing environment and the scarcity of annotated data. The authors propose a coarse-to-fine, instruction-conditioned generation framework aligned with the iterative denoising of a consistency model. The approach combines trajectory-driven dynamic scene perception, collision-avoiding physical guidance that does not require fine-grained scene geometry, and a hybrid training strategy that injects voxelized scene occupancy into HOI data and trains jointly with HSI data to mitigate data scarcity. The method achieves state-of-the-art performance on both HOSI and HOI generation, improves the realism and temporal consistency of synthesized interactions, generalizes well to unseen scenes, and supports real-time generation.

📝 Abstract

Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI) generation, HOSI generation requires reasoning over dynamic object-scene changes, yet it suffers from limited annotated data. To address these issues, we propose a coarse-to-fine, instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy: at each denoising step of the consistency model, trajectories from the preceding refinement are used to update the scene context, which then conditions the subsequent refinement, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, as well as strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/
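The dynamic perception idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `voxelize`, `toy_consistency_fn`, `sample`, the noise levels, and the voxel size are all hypothetical names and values chosen for illustration. The only part taken from the general literature is the multistep consistency sampling pattern (denoise, re-noise to a lower level, denoise again); the step that refreshes the occupancy context from the previous estimate is an assumption about how "updating scene context" might look in code.

```python
import math
import random

def voxelize(points, voxel_size=0.5):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

def toy_consistency_fn(x, sigma, occupancy, target):
    """Hypothetical stand-in for a trained consistency model f(x, sigma, context).

    It blends each noisy waypoint with a known clean path. A real model would
    be a network conditioned on the instruction and on `occupancy`; this toy
    ignores `occupancy` and takes it only to show where the context enters.
    """
    w = 1.0 / (1.0 + sigma)  # the noisier the input, the less it is trusted
    return [tuple(w * xi + (1.0 - w) * ti for xi, ti in zip(p, t))
            for p, t in zip(x, target)]

def sample(static_points, target, sigmas=(80.0, 10.0, 1.0), eps=0.002, seed=0):
    """Multistep consistency sampling with the scene context refreshed per step."""
    rng = random.Random(seed)
    static_occ = voxelize(static_points)
    occupancy = set(static_occ)  # initial context: static scene only

    # Coarse estimate: one denoising call from pure noise at the top level.
    x = [tuple(rng.gauss(0.0, sigmas[0]) for _ in range(3)) for _ in target]
    x = toy_consistency_fn(x, sigmas[0], occupancy, target)
    history = [x]

    for sigma in sigmas[1:]:
        # Assumed "dynamic perception" step: add the voxels covered by the
        # previous trajectory estimate to the static scene occupancy.
        occupancy = static_occ | voxelize(x)
        # Re-noise the estimate to the next, lower noise level, then refine.
        scale = math.sqrt(sigma ** 2 - eps ** 2)
        x_noisy = [tuple(c + scale * rng.gauss(0.0, 1.0) for c in p) for p in x]
        x = toy_consistency_fn(x_noisy, sigma, occupancy, target)
        history.append(x)

    return x, occupancy, history

if __name__ == "__main__":
    scene = [(0.1, 0.2, 0.3), (2.0, 0.0, 0.0)]             # static scene points
    path = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0), (1.0, 0.0, 1.0)]  # clean object path
    final, occ, hist = sample(scene, path)
    print(len(hist), "refinement steps;", len(occ), "occupied voxels in context")
```

The point of the sketch is the control flow: each refinement sees a context built from the preceding estimate rather than a fixed scene, which is what distinguishes this loop from plain multistep sampling.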

Problem

Research questions and friction points this paper is trying to address.

Human-Object-Scene Interaction

Dynamic Perception

Data Scarcity

Physical Artifacts

Interaction Generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Perception

Consistency Model

Bump-Aware Guidance

Hybrid Training Strategy

Human-Object-Scene Interaction
