RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited spatial grounding capability of current multimodal large language models (MLLMs), which are predominantly pretrained on RGB images and struggle to perceive non-RGB visual modalities such as infrared, depth, and event data in complex scenes. To overcome this limitation, the authors propose Visual Modality Chain-of-Thought (VM-CoT), a framework that constructs cross-modal reasoning pathways through an Understand-Associate-Validate (UAV) prompting strategy. They further introduce a two-stage training paradigm comprising Cold-Start Supervised Fine-Tuning (CS-SFT) followed by Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT) based on GRPO, augmented with a Modality-understanding Spatio-Temporal (MuST) reward. The study establishes the first RGBX-Grounding benchmark and demonstrates a 22.71% performance gain over baselines across three RGBX grounding tasks, significantly enhancing MLLMs' comprehension of multimodal inputs and their precision in spatial grounding.

📝 Abstract
Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLMs' perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs' RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.
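The abstract's ST-RFT stage builds on GRPO, which scores a group of sampled responses and converts their rewards into group-relative advantages. The paper does not publish the MuST reward formula, so the sketch below uses GRPO's standard group normalization with a placeholder reward list; the function name and values are illustrative assumptions, not the authors' implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each sampled
    response's reward is normalized by the group's mean and
    standard deviation, so advantages sum to zero within a group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for one group of sampled grounding outputs
# (stand-ins for MuST reward scores, which the paper does not specify).
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = grpo_advantages(rewards)
```

Responses rewarded above the group mean receive positive advantages and are reinforced; those below the mean are suppressed, which is what lets GRPO dispense with a learned value critic.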
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
RGB modality
infrared
depth
event data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Modality Chain-of-Thought
Multimodal Grounding
Reinforcement Fine-Tuning
RGBX Benchmark
Modality Generalization