RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited spatial grounding capability of current multimodal large language models (MLLMs), which are predominantly pretrained on RGB images and struggle to perceive non-RGB visual modalities such as infrared, depth, and event data in complex scenes. To overcome this limitation, the authors propose Visual Modality Chain-of-Thought (VM-CoT), a framework that constructs cross-modal reasoning pathways through an Understand-Associate-Validate (UAV) prompting strategy. They further introduce a two-stage training paradigm comprising Cold-Start Supervised Fine-Tuning (CS-SFT) followed by Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT) based on GRPO, augmented with a Modality-understanding Spatio-Temporal (MuST) reward. The study establishes the first RGBX-Grounding benchmark and demonstrates a 22.71% performance gain over baselines across three RGBX grounding tasks, significantly enhancing MLLMs' comprehension of multimodal inputs and their precision in spatial grounding.

📝 Abstract
Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLMs' perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs' RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.
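The abstract's ST-RFT stage builds on GRPO, which scores a group of sampled responses and converts their rewards into group-relative advantages. The paper does not publish the MuST reward formula, so the sketch below uses GRPO's standard group normalization with a placeholder reward list; the function name and values are illustrative assumptions, not the authors' implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each sampled
    response's reward is normalized by the group's mean and
    standard deviation, so advantages sum to zero within a group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for one group of sampled grounding outputs
# (stand-ins for MuST reward scores, which the paper does not specify).
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = grpo_advantages(rewards)
```

Responses rewarded above the group mean receive positive advantages and are reinforced; those below the mean are suppressed, which is what lets GRPO dispense with a learned value critic.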
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
RGB modality
infrared
depth
event data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Modality Chain-of-Thought
Multimodal Grounding
Reinforcement Fine-Tuning
RGBX Benchmark
Modality Generalization