SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing controllable image generation methods struggle to simultaneously and precisely control the 9-degree-of-freedom poses (3D position, scale, and orientation) of multiple objects, exhibiting a clear trade-off between pose-control accuracy and visual fidelity. To address this, we propose the CNOCS (Canonical Normalized Object Coordinate System) map, a geometry-aware representation that encodes 9D pose from the camera view, together with a branched network architecture that decouples pose encoding from image synthesis. We further introduce a two-stage training strategy with reinforcement learning that rebalances the skewed pose distribution, improving performance on low-frequency poses, and we integrate user-specific personalization weights to enable customized pose control for reference subjects. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in pose-control accuracy, geometric consistency, and image fidelity, and that it generalizes well to complex scene editing and layout customization tasks. The source code is publicly available.

📝 Abstract
Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network into the pre-trained base model and leverages a new representation, the CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
Problem

Research questions and friction points this paper is trying to address.

Achieving simultaneous 9D pose control for multiple objects in image generation
Addressing limited controllability and quality degradation in multi-object scenes
Mitigating data imbalance issues for low-frequency poses during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branched network added to pre-trained base model
CNOCS map representation encodes 9D pose information
Two-stage training with reinforcement learning strategy
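To make the "9D pose" terminology concrete, the sketch below shows one common parameterization: 3 DoF for position, 3 for per-axis scale, and 3 for orientation, applied to the corners of a canonical unit cube in camera space. This is an illustrative toy in NumPy, not the paper's CNOCS implementation; the function names and the Euler-angle convention are assumptions for demonstration only.

```python
# Illustrative sketch of a 9-DoF object pose (position + scale + orientation).
# NOT the paper's code; parameterization and conventions are assumed here.
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from X-Y-Z Euler angles in radians (assumed convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_object(position, scale, euler, points):
    """Map points from a canonical object frame into camera space:
    scale per axis, rotate, then translate."""
    R = euler_to_matrix(*euler)
    return (R @ (points * scale).T).T + position

# Canonical unit cube centred at the origin (8 corners), analogous to the
# canonical object frame that a NOCS-style representation normalizes to.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])

cam_corners = pose_object(
    position=np.array([0.0, 0.0, 5.0]),      # 3 DoF: 3D location
    scale=np.array([1.0, 2.0, 1.0]),         # 3 DoF: per-axis size
    euler=np.array([0.0, np.pi / 2, 0.0]),   # 3 DoF: orientation
    points=corners,
)
```

Projecting such posed corners through a camera model is one way to rasterize a per-object, per-pixel pose map of the kind the CNOCS map provides as conditioning.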