LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This survey addresses key limitations of Large Multimodal Models (LMMs): imprecise object-level localization, difficulty preserving object identity across interactions, and low accuracy in region-specific editing. It reviews recent work that combines LMMs with object-centric vision, organized around four tasks: object-centric visual understanding, referring segmentation, visual editing, and visual generation. Together these span the pipeline from scene parsing to precise manipulation. The authors summarize the key modeling paradigms, learning strategies, and evaluation protocols behind these capabilities, and outline open challenges and future directions: robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.

📝 Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

Problem

Research questions and friction points this paper is trying to address.

object-centric vision

multimodal models

object-level grounding

spatial reasoning

visual manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

object-centric vision

Large Multimodal Models

referring segmentation