Referring to Any Person

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring expression grounding models are constrained by a one-to-one reference assumption, which makes them inadequate for real-world scenarios involving multiple persons, diverse attributes, and cross-modal person references. This work formally defines referring to any person as a task spanning five aspects of referable entities and three core properties: multiplicity, attribute richness, and cross-modality. It introduces HumanRef, a high-quality multi-person referring benchmark, and proposes RexSeek, an end-to-end framework built on a multimodal large language model that jointly optimizes vision-language alignment, fine-grained attribute comprehension, and decoupled multi-instance localization. On HumanRef, RexSeek substantially outperforms state-of-the-art methods; it also maintains leading performance on the RefCOCO series and demonstrates strong cross-category generalization.

📝 Abstract
Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek.
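As a rough illustration of the detector-plus-MLLM design mentioned in the abstract, the Python sketch below shows one way such a pipeline could be wired: a person detector proposes candidate boxes, and a multimodal LLM is asked which indexed candidates match the referring expression, allowing zero, one, or many matches. The functions `detect_persons` and `query_mllm`, and the prompt format, are hypothetical placeholders for illustration only, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def detect_persons(image) -> List[Box]:
    """Hypothetical person detector: returns a candidate box for every
    person in the image (replace with a real detector)."""
    raise NotImplementedError


def query_mllm(image, prompt: str) -> List[int]:
    """Hypothetical multimodal LLM call: given the image and a prompt that
    lists indexed candidate boxes, returns the indices of boxes matching
    the referring expression (replace with a real MLLM)."""
    raise NotImplementedError


def refer_to_persons(image, expression: str) -> List[Box]:
    # Stage 1: propose candidate person boxes. Multiplicity is handled here,
    # since every detected person is a candidate rather than a single "best" one.
    candidates = detect_persons(image)

    # Stage 2: ask the multimodal LLM which candidates the expression refers to.
    box_list = "; ".join(
        f"[{i}] ({b.x1:.0f},{b.y1:.0f},{b.x2:.0f},{b.y2:.0f})"
        for i, b in enumerate(candidates)
    )
    prompt = (
        f"Candidate person boxes: {box_list}. "
        f"Return the indices of the people matching: '{expression}'."
    )
    selected = query_mllm(image, prompt)

    # Unlike one-to-one referring, the result may be empty or contain several boxes.
    return [candidates[i] for i in selected if 0 <= i < len(candidates)]
```

In this kind of decoupled design, localization quality comes from the detector while the language model only has to select among discrete candidates, which is one plausible reason a model of this form can handle multi-person references that one-to-one grounding models miss.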
Problem

Research questions and friction points this paper is trying to address.

Detect individuals using natural language descriptions
Overcome limitations of one-to-one referring benchmarks
Develop a robust model for real-world usability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal large language model integration
HumanRef dataset for real-world applications
RexSeek model excels in human referring
🔎 Similar Papers
No similar papers found.
Qing Jiang
PhD student, South China University of Technology
Computer Vision · Open-set Object Detection
Lin Wu
International Digital Economy Academy (IDEA), South China University of Technology
Zhaoyang Zeng
International Digital Economy Academy
Computer Vision · Multimedia Understanding
Tianhe Ren
PhD student of Electrical and Electronic Engineering, The University of Hong Kong
Computer Vision · Machine Learning · Multi-Modality
Yuda Xiong
International Digital Economy Academy (IDEA)
Yihao Chen
International Digital Economy Academy (IDEA)
Lei Zhang
International Digital Economy Academy (IDEA)