See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

📅 2025-12-26
🤖 AI Summary
Large vision-language models (VLMs) often rely on coarse-grained or external visual cues, leading to underused fine-grained evidence, weak cross-domain generalization, and high inference overhead. To address this, we propose Bi-directional Perceptual Shaping (BiPS), a training-time mechanism that generates question-driven “where-to-look” signals to guide the model toward salient visual regions (e.g., chart line segments). BiPS introduces dual KL-based constraints: KL-consistency encourages comprehensive coverage of question-relevant pixels, while KL-separation suppresses textual shortcuts and enforces genuine visual grounding. Combined with question-conditioned mask generation, dual-view construction (evidence-preserving vs. evidence-ablated), and end-to-end joint training, BiPS improves Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and substantially improves robustness and generalization to unseen datasets and diverse image modalities, including charts, handwritten text, and other fine-grained visual inputs.
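The dual constraints suggest a compact loss formulation. Below is a minimal PyTorch-style sketch under our own assumptions: the function name `bips_kl_losses`, the temperature `tau`, the hinge margin, and the KL directions are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of BiPS-style dual KL constraints over answer-token logits
# from three forward passes; names and hyperparameters are assumptions.
import torch.nn.functional as F

def bips_kl_losses(logits_full, logits_keep, logits_ablate,
                   tau: float = 1.0, margin: float = 1.0):
    """Return (consistency, separation) losses for (batch, vocab) logits.

    logits_full   : model output given the original image
    logits_keep   : output given the evidence-preserving view
    logits_ablate : output given the evidence-ablated view
    """
    p_full = F.softmax(logits_full / tau, dim=-1)

    # KL-consistency: pull the evidence-preserving view toward the
    # full-image answer distribution (minimized directly).
    log_q_keep = F.log_softmax(logits_keep / tau, dim=-1)
    kl_cons = F.kl_div(log_q_keep, p_full, reduction="batchmean")

    # KL-separation: push the evidence-ablated view away from the
    # full-image distribution; a hinge keeps the loss bounded.
    log_q_ablate = F.log_softmax(logits_ablate / tau, dim=-1)
    kl_sep = F.kl_div(log_q_ablate, p_full, reduction="batchmean")
    loss_sep = F.relu(margin - kl_sep)
    return kl_cons, loss_sep
```

A plausible joint objective would combine these with the ordinary answer loss, e.g. `loss = task_loss + lam1 * kl_cons + lam2 * loss_sep`; the paper's actual weighting is not given here.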

📝 Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
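Read literally, the abstract suggests a two-term shaping objective on top of the task loss. The formalization below is our own reading; the weights λ₁ and λ₂, the margin m, and the view notation I_keep / I_abl are assumptions, not the paper's notation.

```latex
% Assumed formalization of the BiPS objective (notation ours):
% I = original image, I_keep = evidence-preserving view,
% I_abl = evidence-ablated view, q = question, y = answer tokens.
\mathcal{L} = \mathcal{L}_{\mathrm{task}}
  + \lambda_{1}\,\mathrm{KL}\!\left( p_{\theta}(y \mid I, q) \,\middle\|\, p_{\theta}(y \mid I_{\mathrm{keep}}, q) \right)
  + \lambda_{2}\,\max\!\left( 0,\; m - \mathrm{KL}\!\left( p_{\theta}(y \mid I, q) \,\middle\|\, p_{\theta}(y \mid I_{\mathrm{abl}}, q) \right) \right)
```

The first KL term pulls the evidence-preserving view toward the full-image distribution; the hinged second term pushes the evidence-ablated view away from it, penalizing answers recoverable from text alone.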
Problem

Research questions and friction points this paper is trying to address.

Improves utilization of fine-grained visual evidence in multimodal reasoning
Strengthens cross-domain generalization of vision-language models
Reduces inference-time cost while preventing text-only shortcuts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-directional perceptual shaping for multimodal reasoning
KL-consistency constraint for coarse but complete visual coverage
KL-separation constraint to prevent text-only shortcuts (a dual-view construction sketch follows this list)
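To make the dual-view idea concrete, here is a minimal sketch of how the two views could be built from a question-conditioned soft mask; `make_views`, the mid-gray fill, and the soft-mask convention are illustrative assumptions rather than the paper's recipe.

```python
# Illustrative dual-view construction from a question-conditioned mask.
import torch

def make_views(image: torch.Tensor, mask: torch.Tensor, fill: float = 0.5):
    """Build evidence-preserving and evidence-ablated views.

    image : (C, H, W) image tensor with values in [0, 1]
    mask  : (H, W) soft mask in [0, 1], near 1 on question-relevant pixels
    fill  : constant for masked-out pixels (mid-gray, by assumption)
    """
    m = mask.unsqueeze(0)                        # broadcast over channels
    keep_view = image * m + fill * (1.0 - m)     # keep only relevant regions
    ablate_view = image * (1.0 - m) + fill * m   # remove relevant regions
    return keep_view, ablate_view
```

Running the original image and both views through the same VLM yields the three answer distributions compared by the KL-consistency and KL-separation constraints above.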
👥 Authors
Shuoshuo Zhang, Tsinghua University
Yizhen Zhang, Tsinghua University
Jingjing Fu, Microsoft Research (image/video processing)
Lei Song, Microsoft Research
Jiang Bian, Microsoft Research
Yujiu Yang, SIGS, Tsinghua University (Machine Learning, Natural language processing, Computer vision)
Rui Wang, Microsoft Research