🤖 AI Summary
Existing text-driven 3D editing methods that apply Score Distillation Sampling (SDS) directly face a trade-off between consistency with the source avatar and fidelity to the textual instruction, often producing geometric distortions, artifacts, or blurred textures. To address this, we propose SDS for Editing (SDS-E), which (i) selectively incorporates subterms of SDS across diffusion timesteps to preserve structural integrity; (ii) adds a spatial smoothness regularization to enforce texture continuity; and (iii) uses a gradient-based viewpoint sampling strategy to improve multi-view consistency. Crucially, SDS-E leaves the original 3D human geometry and animation sequence unchanged while significantly improving the sharpness and semantic fidelity of texture edits. Extensive experiments demonstrate that SDS-E outperforms state-of-the-art 3D editing methods both qualitatively and quantitatively.
📝 Abstract
We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing, as they destroy consistency with the source avatar. Instead, we propose an alternative, SDS for Editing (SDS-E), that selectively incorporates subterms of SDS across diffusion timesteps. We further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp, high-fidelity detail. InstructHumans significantly outperforms existing 3D editing methods, producing edits that remain consistent with the initial avatar while staying faithful to the textual instructions. Project page: https://jyzhu.top/instruct-humans
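For context, the standard SDS gradient that the abstract builds on can be written as below, together with the usual classifier-free-guidance split of the noise residual into subterms. This is the general formulation only, not SDS-E itself: the abstract does not state which subterms SDS-E keeps at which timesteps, and the symbols used here ($\theta$ for the texture parameters, $x$ for the rendered image, $x_t$ for its noised version, $w(t)$ for the timestep weight, $s$ for the guidance scale, $y$ for the instruction) are chosen for illustration.

```latex
% Standard SDS gradient for a rendered image x (a function of parameters \theta),
% with noised image x_t and sampled noise \epsilon
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]

% Classifier-free guidance splits the residual into two subterms:
% an unconditional denoising term and a conditioning (instruction) term
\hat{\epsilon}_\phi(x_t; y, t) - \epsilon
  = \underbrace{\bigl(\epsilon_\phi(x_t; \varnothing, t) - \epsilon\bigr)}_{\text{unconditional term}}
  + s\,\underbrace{\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)}_{\text{conditioning term}}
```

Read this way, "selectively incorporating subterms across diffusion timesteps" would mean weighting or dropping such terms as a function of $t$; the exact selection rule is part of the method and is not given in the abstract.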