🤖 AI Summary
Designers need efficient ways to extract standardized, front-facing, and reusable design assets from open-scene images, but existing generative models struggle to simultaneously ensure high fidelity, canonical front-view alignment, and robustness, particularly under occlusion and perspective distortion. This paper introduces the first generative framework tailored to design asset extraction. Its key idea is an inverse-paste mechanism used to construct a reward model: extracted assets are pasted back into their reference sources, enabling closed-loop reinforcement optimization that substantially mitigates hallucination and improves prompt adherence. Built on a diffusion architecture, the method is pretrained on over 200K synthetic image-subject pairs, fine-tuned with closed-loop reward feedback, and rigorously evaluated on a real-world benchmark. Experiments demonstrate state-of-the-art performance in design asset extraction, yielding high-fidelity, front-view-aligned, and editable outputs, and the framework has been validated within real-world design workflows.
📝 Abstract
Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to benefit significantly from generative capabilities. Although open-world scenes provide ample raw material for designers, efficiently extracting high-quality, standardized assets from them remains a challenge. To address this, we introduce AssetDropper, the first framework designed to extract assets from reference images, providing artists with an open-world asset palette. Our model extracts a front view of a selected subject from the input image, handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research on downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to close the loop with feedback: it performs the inverse task of pasting the extracted assets back into the reference sources, which adds a consistency signal during training and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Project page: AssetDropper.github.io.
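The inverse-paste reward can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's implementation: the function name `inverse_paste_reward`, the assumption that the extracted asset is already warped into scene coordinates, and the choice of masked MSE mapped through an exponential as the consistency score. The idea is simply that an asset which pastes back cleanly into its reference source earns a high reward.

```python
import numpy as np

def inverse_paste_reward(reference: np.ndarray,
                         extracted: np.ndarray,
                         mask: np.ndarray) -> float:
    """Toy consistency reward for the inverse-paste task.

    reference: (H, W, 3) float array, the original scene image.
    extracted: (H, W, 3) float array, the extracted asset already
               warped back into scene coordinates (an assumption).
    mask:      (H, W) bool array marking the subject region.

    Returns a reward in (0, 1]; higher means the pasted-back asset
    is more consistent with the reference in the subject region.
    """
    # Paste the extracted asset into the reference at the masked region.
    pasted = np.where(mask[..., None], extracted, reference)
    # Score pixel consistency inside the subject region only.
    mse = np.mean((pasted[mask] - reference[mask]) ** 2)
    # Monotone map from error to a bounded reward.
    return float(np.exp(-mse))
```

In a closed loop, this scalar would weight or rank the extractor's samples during reinforcement fine-tuning: a perfect paste-back yields a reward of 1.0, and the reward decays as the extracted asset diverges from the source region.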