ObjEmbed: Towards Universal Multimodal Object Embeddings

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language models: weak alignment between fine-grained image regions and textual phrases, which hinders object-level semantic understanding. To overcome this, the authors propose ObjEmbed, a novel embedding approach for multimodal large language models that decomposes an image into per-object region embeddings plus a global embedding, enabling unified handling of both region-level and image-level tasks. The core innovations are an object-oriented dual-embedding strategy that jointly models semantic correspondence and Intersection-over-Union (IoU) prediction, and an efficient single-pass encoding architecture. The design supports visual grounding as well as local and global cross-modal retrieval, and yields significant improvements in fine-grained semantic discrimination and retrieval accuracy across 18 diverse benchmarks.

📝 Abstract
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
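The abstract's matching rule, in which each region's score combines semantic similarity with a predicted IoU, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the multiplicative fusion of cosine similarity and predicted IoU is an assumption, and all function and variable names here are hypothetical.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def object_match_score(object_emb: np.ndarray,
                       text_emb: np.ndarray,
                       predicted_iou: float) -> float:
    # Combine semantic matching (object embedding vs. phrase embedding)
    # with predicted localization quality. The paper says the final score
    # "combines" the two; multiplication is an illustrative assumption.
    return cosine_sim(object_emb, text_emb) * predicted_iou

# Toy example: rank two candidate regions against one text query.
rng = np.random.default_rng(0)
text = rng.normal(size=64)
region_a = text + 0.1 * rng.normal(size=64)  # semantically close region
region_b = rng.normal(size=64)               # unrelated region

score_a = object_match_score(region_a, text, predicted_iou=0.9)
score_b = object_match_score(region_b, text, predicted_iou=0.9)
print(score_a > score_b)  # the semantically matching region ranks higher
```

Weighting the semantic score by predicted IoU means a region that matches the phrase but is poorly localized is demoted, which is the retrieval-accuracy benefit the abstract claims.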
Problem

Research questions and friction points this paper is trying to address.

fine-grained alignment
vision-language understanding
multimodal object embeddings
visual grounding
image-text alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Oriented Embedding
Multimodal Alignment
Fine-Grained Visual Grounding
IoU-Aware Embedding
Efficient Multimodal Encoding
Authors

Shenghao Fu (Sun Yat-sen University)
computer vision · object detection · large multi-modal models

Yukun Su (WeChat, Tencent)
Computer Vision · Deep Learning · Computer Graphics

Fengyun Rao (Independent Researcher)

Jing Lyu (Shanghai Jiao Tong University)
power electronics · stability · renewable energy grid integration · high-voltage DC transmission

Xiaohua Xie (School of Computer Science and Engineering, Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; Pazhou Laboratory (Huangpu), China)

Wei-Shi Zheng (Professor, Sun Yat-sen University)
Computer Vision · Pattern Recognition · Machine Learning