FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models treat image understanding and editing as disjoint tasks, hindering unified support for referential-expression-driven interactive image editing. To address this, we propose the first unified architecture jointly optimizing segmentation-aware perception and generative modeling. Our method introduces a dual-branch visual encoder and a MoVQGAN tokenizer, leveraging referential segmentation masks as spatial conditioning to progressively guide a diffusion-based decoder for object-level controllable generation. The framework end-to-end integrates referential segmentation perception with object-centric generation, eliminating cascaded multi-model pipelines. Evaluated across three core tasks—multimodal understanding, referring expression segmentation, and controllable image generation—our approach achieves state-of-the-art performance. Notably, it significantly enhances segmentation-generation synergy, establishing a scalable, unified paradigm for interactive visual editing.

Technology Category

Application Category

📝 Abstract
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat"what to see"and"how to edit"separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.
Problem

Research questions and friction points this paper is trying to address.

Unify visual understanding and generative modeling for editing
Integrate segmentation-aware perception and object-centric generation
Align visual encoding, segmentation, and generation modules effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified LVLM integrating segmentation and generation
Dual-branch encoder for global and spatial details
MoVQGAN tokenizer enhances visual generation quality
🔎 Similar Papers
No similar papers found.
F
Fan Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Yousong Zhu
Yousong Zhu
Associate Professor, Chinese Academy of Sciences, Institute of Automation
Multimodal Large Language ModelsSelf-supervised LearningObject Detection
X
Xin Li
Peng Cheng Laboratory, Shenzhen, China
Yufei Zhan
Yufei Zhan
Institute of Automation, Chinese Academy of Science
Computer VisionLarge Multimodal ModelsGrounding and Detection
H
Hongyin Zhao
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China
S
Shurong Zheng
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Yaowei Wang
Yaowei Wang
The Hong Kong Polytechnic University
M
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China
J
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Science, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Wuhan AI Research, Wuhan, China