See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

📅 2025-12-26
🤖 AI Summary
Large vision-language models (VLMs) often rely on coarse-grained or external visual cues, leading to underused fine-grained evidence, weak cross-domain generalization, and high inference overhead. To address this, we propose Bi-directional Perceptual Shaping (BiPS), a training-time mechanism that generates question-driven “where-to-look” signals to guide the model toward salient visual regions (e.g., chart line segments). BiPS introduces dual KL-based constraints: KL-consistency encourages comprehensive coverage of question-relevant pixels, while KL-separation suppresses textual shortcuts and enforces genuine visual grounding. Combined with question-conditioned mask generation, dual-view construction (evidence-preserving vs. evidence-ablated), and end-to-end joint training, BiPS improves Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and substantially improves robustness and generalization to unseen datasets and diverse image modalities, including charts, handwritten text, and other fine-grained visual inputs.
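The dual constraints suggest a compact loss formulation. Below is a minimal PyTorch-style sketch under our own assumptions: the function name `bips_kl_losses`, the temperature `tau`, the hinge margin, and the KL directions are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of BiPS-style dual KL constraints over answer-token logits
# from three forward passes; names and hyperparameters are assumptions.
import torch.nn.functional as F

def bips_kl_losses(logits_full, logits_keep, logits_ablate,
                   tau: float = 1.0, margin: float = 1.0):
    """Return (consistency, separation) losses for (batch, vocab) logits.

    logits_full   : model output given the original image
    logits_keep   : output given the evidence-preserving view
    logits_ablate : output given the evidence-ablated view
    """
    p_full = F.softmax(logits_full / tau, dim=-1)

    # KL-consistency: pull the evidence-preserving view toward the
    # full-image answer distribution (minimized directly).
    log_q_keep = F.log_softmax(logits_keep / tau, dim=-1)
    kl_cons = F.kl_div(log_q_keep, p_full, reduction="batchmean")

    # KL-separation: push the evidence-ablated view away from the
    # full-image distribution; a hinge keeps the loss bounded.
    log_q_ablate = F.log_softmax(logits_ablate / tau, dim=-1)
    kl_sep = F.kl_div(log_q_ablate, p_full, reduction="batchmean")
    loss_sep = F.relu(margin - kl_sep)
    return kl_cons, loss_sep
```

A plausible joint objective would combine these with the ordinary answer loss, e.g. `loss = task_loss + lam1 * kl_cons + lam2 * loss_sep`; the paper's actual weighting is not given here.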

📝 Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
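Read literally, the abstract suggests a two-term shaping objective on top of the task loss. The formalization below is our own reading; the weights λ₁ and λ₂, the margin m, and the view notation I_keep / I_abl are assumptions, not the paper's notation.

```latex
% Assumed formalization of the BiPS objective (notation ours):
% I = original image, I_keep = evidence-preserving view,
% I_abl = evidence-ablated view, q = question, y = answer tokens.
\mathcal{L} = \mathcal{L}_{\mathrm{task}}
  + \lambda_{1}\,\mathrm{KL}\!\left( p_{\theta}(y \mid I, q) \,\middle\|\, p_{\theta}(y \mid I_{\mathrm{keep}}, q) \right)
  + \lambda_{2}\,\max\!\left( 0,\; m - \mathrm{KL}\!\left( p_{\theta}(y \mid I, q) \,\middle\|\, p_{\theta}(y \mid I_{\mathrm{abl}}, q) \right) \right)
```

The first KL term pulls the evidence-preserving view toward the full-image distribution; the hinged second term pushes the evidence-ablated view away from it, penalizing answers recoverable from text alone.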
Problem

Research questions and friction points this paper is trying to address.

Improves utilization of fine-grained visual evidence in multimodal reasoning
Strengthens cross-domain generalization of vision-language models
Reduces inference-time cost while preventing text-only shortcuts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-directional perceptual shaping for multimodal reasoning
KL-consistency constraint for coarse but complete visual coverage
KL-separation constraint to prevent text-only shortcuts (a dual-view construction sketch follows this list)
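To make the dual-view idea concrete, here is a minimal sketch of how the two views could be built from a question-conditioned soft mask; `make_views`, the mid-gray fill, and the soft-mask convention are illustrative assumptions rather than the paper's recipe.

```python
# Illustrative dual-view construction from a question-conditioned mask.
import torch

def make_views(image: torch.Tensor, mask: torch.Tensor, fill: float = 0.5):
    """Build evidence-preserving and evidence-ablated views.

    image : (C, H, W) image tensor with values in [0, 1]
    mask  : (H, W) soft mask in [0, 1], near 1 on question-relevant pixels
    fill  : constant for masked-out pixels (mid-gray, by assumption)
    """
    m = mask.unsqueeze(0)                        # broadcast over channels
    keep_view = image * m + fill * (1.0 - m)     # keep only relevant regions
    ablate_view = image * (1.0 - m) + fill * m   # remove relevant regions
    return keep_view, ablate_view
```

Running the original image and both views through the same VLM yields the three answer distributions compared by the KL-consistency and KL-separation constraints above.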
👥 Authors
Shuoshuo Zhang, Tsinghua University
Yizhen Zhang, Tsinghua University
Jingjing Fu, Microsoft Research (image/video processing)
Lei Song, Microsoft Research
Jiang Bian, Microsoft Research
Yujiu Yang, SIGS, Tsinghua University (Machine Learning, Natural language processing, Computer vision)
Rui Wang, Microsoft Research