🤖 AI Summary
This work addresses the loss of fine details in subject-driven generation caused by scale and viewpoint variations. To this end, the authors propose FlowFixer, a framework that recovers high-fidelity details through direct image-to-image translation from visual references, thereby circumventing the semantic ambiguity inherent in textual prompts. The method employs a one-step denoising strategy to self-supervise the generation of training data that realistically simulates common generative artifacts. Furthermore, a novel detail fidelity metric based on keypoint matching is introduced, overcoming the limitations of conventional evaluation approaches that rely solely on semantic similarity. Experimental results demonstrate that FlowFixer consistently outperforms existing methods in both qualitative and quantitative assessments, establishing a new benchmark for high-fidelity detail generation in subject-driven synthesis.
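The summary's key training trick is to noise a clean reference and then denoise it in a single step, which recovers global structure but not high-frequency detail, yielding a (degraded, clean) pair that mimics real SDG errors. The paper's actual implementation is not shown here; the following is a minimal NumPy sketch of that pipeline in which a box blur stands in for one step of a pretrained diffusion denoiser (the `denoise_fn` hook and its placeholder are assumptions for illustration).

```python
import numpy as np

def one_step_degrade(image, noise_level=0.5, denoise_fn=None, rng=None):
    """Build a self-supervised (degraded, clean) training pair, sketching a
    one-step denoising scheme: noise the clean image, then apply a single
    denoising step. One coarse step preserves global structure but discards
    high-frequency detail, loosely simulating subject-driven-generation errors.

    `denoise_fn` is a stand-in hook for one step of a pretrained diffusion
    denoiser (hypothetical here); by default a toy 3x3 box blur is used.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    if denoise_fn is None:
        def denoise_fn(x):
            # Toy placeholder: a 3x3 box blur mimics the low-pass effect
            # of a single coarse denoising step.
            pad = np.pad(x, 1, mode="edge")
            out = np.zeros_like(x)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    out += pad[1 + dy : 1 + dy + x.shape[0],
                               1 + dx : 1 + dx + x.shape[1]]
            return out / 9.0
    noisy = image + noise_level * rng.standard_normal(image.shape)
    degraded = np.clip(denoise_fn(noisy), 0.0, 1.0)
    return degraded, image  # (network input, reconstruction target)
```

In a real setting, `image` would be a clean subject reference and `denoise_fn` a single reverse-diffusion step, so the refinement network learns to restore exactly the kind of detail that one-step denoising erases.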
📝 Abstract
We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores the fine details lost when a subject's scale or perspective changes during generation. FlowFixer performs direct image-to-image translation from visual references, avoiding the ambiguity of language prompts. To enable image-to-image training, we introduce a one-step denoising scheme that generates self-supervised training data: it automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to assess detail fidelity beyond the semantic similarity typically measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
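The abstract's keypoint-matching metric is not specified in detail here; a toy NumPy stand-in can still convey the idea of scoring detail fidelity by matching local keypoint neighborhoods between the reference and the generated image (the gradient-based detector, patch correlation, and search window below are all illustrative assumptions, not the paper's method).

```python
import numpy as np

def keypoints(img, k=10):
    # Crude detector: the k strongest gradient-magnitude responses,
    # a stand-in for a real keypoint detector such as SIFT.
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    mag[:2, :] = 0; mag[-2:, :] = 0; mag[:, :2] = 0; mag[:, -2:] = 0
    idx = np.argsort(mag.ravel())[-k:]
    return np.stack(np.unravel_index(idx, img.shape), axis=1)

def patch(img, y, x, r=2):
    # Zero-mean, unit-norm local patch, so dot products give NCC in [-1, 1].
    p = img[y - r : y + r + 1, x - r : x + r + 1].astype(float)
    p = p - p.mean()
    n = np.linalg.norm(p)
    return p / n if n > 0 else p

def detail_fidelity(reference, generated, k=10, r=2):
    """Toy detail-fidelity score: for each reference keypoint, take the best
    normalized-correlation patch match in a small window of the generated
    image and average the scores. Near 1 means fine details survived."""
    scores = []
    for y, x in keypoints(reference, k):
        ref_p = patch(reference, y, x, r)
        best = -1.0
        # Local search only: a preserved detail should stay near its location.
        for yy in range(max(r, y - 3), min(generated.shape[0] - r, y + 4)):
            for xx in range(max(r, x - 3), min(generated.shape[1] - r, x + 4)):
                best = max(best, float((ref_p * patch(generated, yy, xx, r)).sum()))
        scores.append(best)
    return float(np.mean(scores))
```

Unlike CLIP or DINO similarity, which can stay high even when textures and fine structures are wrong, a keypoint-level score of this shape drops as soon as local detail around salient points is blurred or replaced.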