ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

2025-12-28
Citations: 0
Influential: 0
AI Summary
Existing HOI video generation methods suffer from two key limitations: (1) cross-view geometric inconsistency of objects and (2) excessive reliance on fine-grained hand mesh annotations. To address these, we propose a hand-mesh-free multi-view HOI video generation framework. First, we introduce an RCM-cache mechanism to jointly encode object geometry and precisely model 6-DoF spatial transformations. Second, we design a Diffusion Transformer (DiT)-based generative architecture that fuses Relative Coordinate Maps (RCMs) with 3D object encodings. Third, we adopt a progressive curriculum learning strategy to mitigate data scarcity and reduce annotation dependency. Experiments demonstrate significant improvements in cross-view geometric consistency, interaction plausibility, and identity stability, while preserving motion smoothness and fine-grained manipulation capability.

๐Ÿ“ Abstract
Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms for injecting multi-view information about the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object depiction, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCMs) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for the scarcity of HOI datasets and make use of existing datasets, we further design a training curriculum that progressively enhances the model's capabilities and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.
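To make the two ingredients named in the abstract more concrete, the sketch below illustrates a generic coordinate-map idea together with a 6-DoF rigid transform. It is an assumption-laden illustration, not the paper's definition of an RCM: the function names (`rotation_z`, `apply_pose`, `canonical_coords`) are hypothetical, and the bounding-box normalization is one common way to assign each object point a pose-independent code. The point it shows is that such codes stay attached to the same surface point however the object is rotated or translated, which is what makes them usable as a cross-view geometry cue.

```python
import math

def rotation_z(theta):
    """Return a 3x3 rotation matrix about the z-axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(points, rotation, translation):
    """Apply a rigid 6-DoF pose (3 rotational + 3 translational DoF) to 3D points."""
    posed = []
    for p in points:
        posed.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return posed

def canonical_coords(points):
    """Map object-space points into [0, 1]^3 using the object's bounding box.

    Assumption: this normalization is illustrative; the paper's RCM may be
    defined differently.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    spans = [(maxs[i] - mins[i]) or 1.0 for i in range(3)]  # avoid divide-by-zero
    return [tuple((p[i] - mins[i]) / spans[i] for i in range(3)) for p in points]

# Eight vertices of a cube centred at the origin (a stand-in 3D object).
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]

# Codes are computed once in object space, so they do not depend on the pose.
codes = canonical_coords(cube)

# Move the object: rotate 90 degrees about z, then shift 5 units along z.
posed = apply_pose(cube, rotation_z(math.pi / 2), (0.0, 0.0, 5.0))

# Each posed vertex keeps the code of the vertex it came from.
for code, position in zip(codes, posed):
    print(code, "->", tuple(round(v, 3) for v in position))
```

Rendering these per-point codes into image space for a given camera would give a per-pixel coordinate map; that rendering step is omitted here.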
Problem

Research questions and friction points this paper is trying to address.

Generates geometry-consistent human-object interaction videos
Reduces reliance on fine-grained hand mesh annotations
Enhances cross-view consistency with multi-view object information
Innovation

Methods, ideas, or system contributions that make the work stand out.

RCM-cache mechanism for geometry consistency
Progressive curriculum learning for dataset scarcity
Simplified human conditioning with 3D object inputs
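The progressive curriculum listed above can be pictured as a step-based schedule that unlocks conditioning signals stage by stage. The following is only a minimal sketch under stated assumptions: the stage names, step thresholds, and the grouping of conditions are hypothetical placeholders, since this summary does not specify the paper's actual stages or data sources.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical stage label
    start_step: int    # first training step at which the stage is active
    conditions: tuple  # conditioning signals enabled during the stage

# Hypothetical schedule; thresholds and condition names are placeholders,
# not values reported by the paper.
SCHEDULE = (
    Stage("stage_1", 0, ("object_rcm",)),
    Stage("stage_2", 10_000, ("object_rcm", "human_condition")),
    Stage("stage_3", 30_000, ("object_rcm", "human_condition", "interaction")),
)

def stage_for_step(step, schedule=SCHEDULE):
    """Return the latest stage whose start_step is <= step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    current = schedule[0]
    for stage in schedule:
        if step >= stage.start_step:
            current = stage
    return current

for step in (0, 9_999, 10_000, 50_000):
    stage = stage_for_step(step)
    print(step, stage.name, stage.conditions)
```

A training loop would call `stage_for_step` each iteration to decide which datasets and conditioning inputs to use, so earlier stages can draw on datasets that lack the annotations needed later.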
Bangya Liu
University of Wisconsin-Madison
Xinyu Gong
TikTok
Computer vision
Zelin Zhao
Georgia Institute of Technology
Machine Learning, Computer Vision, Theory
Ziyang Song
The Hong Kong Polytechnic University
Yulei Lu
ByteDance
Suhui Wu
ByteDance
Jun Zhang
ByteDance
Suman Banerjee
Department of CSE, IIT Jammu
Algorithmic Data Management, Social Network Analysis, Graph Theory and Graph Algorithms, Parameterized Complexity
Hao Zhang
ByteDance