Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing referring expression grounding models treat the task as direct bounding-box regression, which offers little interpretability and no ability to abstain, making it hard to verify a prediction's rationale or reject invalid expressions. Method: The authors propose the first structured chain-of-thought (CoT) paradigm for referring expression grounding, enabling verifiable explanations and reliable abstention through explicit, candidate-wise matching. They introduce HumanRef-CoT, a large-scale CoT dataset generated with GPT-4o and refined by human annotators, and train with a two-stage strategy, supervised fine-tuning followed by GRPO-based reinforcement learning, using an architecture that first enumerates candidate objects and then reasons over each instance. Results: Experiments show significant gains in in-domain accuracy and explanation quality, effective hallucination suppression, and strong cross-domain generalization, establishing a more trustworthy paradigm for vision-language referring expression grounding.

📝 Abstract
Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit chain-of-thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
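The abstract's enumerate-then-verify paradigm can be illustrated with a minimal sketch. This is not the authors' code: the candidate tuples, the toy attribute matcher, and the `refer` helper are all illustrative assumptions standing in for the model's actual visual reasoning, but the control flow mirrors the described pipeline (enumerate candidates, check each against the expression, abstain when none match).

```python
# Illustrative sketch of candidate-wise CoT referring with abstention.
# The attribute-based matcher is a toy stand-in for the model's reasoning.

def refer(candidates, expression, matches):
    """Evaluate each candidate against the expression; abstain if none match.

    candidates: list of (box, attributes) tuples
    expression: the referring expression string
    matches:    predicate deciding whether a candidate satisfies the expression
    """
    trace, selected = [], []
    for i, (box, attrs) in enumerate(candidates):   # Step 1: enumerate candidates
        ok = matches(attrs, expression)             # Step 2: per-candidate check
        trace.append(f"candidate {i}: {'match' if ok else 'no match'}")
        if ok:
            selected.append(box)
    if not selected:                                # Step 3: abstain when nothing fits
        trace.append("no object satisfies the expression; abstaining")
    return selected, trace

# Toy usage: two people, only one wearing a red shirt.
people = [((10, 10, 50, 120), {"shirt": "red"}),
          ((70, 10, 110, 120), {"shirt": "blue"})]
boxes, trace = refer(people, "person in a red shirt",
                     lambda attrs, expr: attrs["shirt"] in expr)
print(boxes)  # [(10, 10, 50, 120)]
```

The per-candidate trace is what makes the prediction verifiable, and the empty-selection branch is what allows trustworthy rejection of expressions with no matching object.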
Problem

Research questions and friction points this paper is trying to address.

Develop a grounded object referring model with explainable predictions.
Ensure verifiable reasoning linking predictions to visual evidence.
Improve trustworthiness by abstaining on unmatched expressions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought reasoning for object referring
Large-scale CoT-style dataset construction
Two-stage training with GRPO-based RL
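The GRPO stage mentioned above scores several sampled reasoning traces for the same query and normalizes each reward within the group, avoiding a separate value network. A minimal sketch of that group-relative advantage computation (the reward values here are assumptions for illustration, not the paper's exact reward design):

```python
# Minimal sketch of GRPO's group-relative advantage, the core of the RL stage.
import statistics

def grpo_advantages(rewards):
    """Normalize each sampled rollout's reward within its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Rewards for 4 sampled reasoning traces of one referring query
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Traces that score above their group's mean get positive advantages and are reinforced; the policy update itself (clipped ratios, KL penalty) is omitted here for brevity.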