🤖 AI Summary
This work addresses the challenge of balancing performance and controllability in multimodal understanding and generation, which is often hindered by conflicting objectives and entangled representations. The authors propose UniDFlow, a framework that unifies understanding, generation, and editing under discrete flow matching. Task-specific low-rank adapters decouple understanding from generation, while a reference-based multimodal preference alignment mechanism optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. Evaluated across eight benchmarks, the method achieves state-of-the-art performance and shows strong zero-shot generalization. Without explicit task-specific training, it supports inpainting, in-context image generation, reference-based editing, and compositional generation.
📝 Abstract
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves state-of-the-art performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
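To make the idea of decoupling via task-specific low-rank adapters concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `TaskLoRALinear`, the layer sizes, the `alpha / rank` scaling, and the zero initialization of `B` are assumptions borrowed from the standard LoRA formulation, and the abstract does not say where the adapters are attached. The sketch only shows how one frozen shared weight can serve two tasks while each task's low-rank update stays separate.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class TaskLoRALinear:
    """Hypothetical layer: frozen shared weight W plus one low-rank update per task."""

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.scale = alpha / rank  # standard LoRA scaling (assumed here)
        # Shared base weight (d_out x d_in), treated as frozen.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # One adapter per task: A is (rank x d_in), B is (d_out x rank).
        # B starts at zero, so every adapter is initially a no-op.
        self.adapters = {
            task: {
                "A": [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for task in tasks
        }

    def forward(self, x, task):
        """Compute W x + scale * B_task (A_task x) for the selected task."""
        adapter = self.adapters[task]
        base = matvec(self.W, x)
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = TaskLoRALinear(d_in=4, d_out=3, rank=2, tasks=["understanding", "generation"])
x = [1.0, 0.5, -0.5, 2.0]

before = layer.forward(x, "understanding")
# Simulate a training update that touches only the generation adapter.
layer.adapters["generation"]["B"][0][0] = 1.0
after = layer.forward(x, "understanding")
generated = layer.forward(x, "generation")
```

In this toy setup, changing the generation adapter leaves the understanding output untouched, which is the sense in which separate adapters avoid objective interference between the two tasks.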