CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) often rely on superficial “shortcuts” and exhibit weak contextual understanding in complex cross-modal reasoning tasks. Method: This paper proposes a human-inspired zero-shot reasoning framework built around an interpretable “intent sketch” guidance mechanism. It establishes a three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that enables parameter-free cross-model transfer. Leveraging information-theoretic conditional entropy control and in-context engineering, the method improves information utilization and mitigates shortcut reasoning. Contribution/Results: The framework achieves consistent accuracy improvements across multiple benchmarks, with gains of up to roughly 9.51 percentage points. It demonstrates strong generalizability, robustness, and practical deployability without fine-tuning, validating its effectiveness for real-world multimodal reasoning.
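
One plausible formalization of the entropy claim, in our own notation (Q = query, M = multimodal evidence, S = selected intent sketch, Y = answer); the paper's exact derivation may differ:

```latex
% Conditioning on an additional variable cannot increase average uncertainty:
\begin{align}
  H(Y \mid Q, M, S) &\le H(Y \mid Q, M), \\
  I(Y; S \mid Q, M) &= H(Y \mid Q, M) - H(Y \mid Q, M, S) \;\ge\; 0.
\end{align}
```

In the idealized case where S is a deterministic function of (Q, M) the bound is an equality, so the practical benefit is operational: an explicit sketch surfaces information a bounded reasoner would otherwise leave unused, which is one way to read the paper's "information utilization efficiency" claim.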

📝 Abstract
Targeting the problems of "shortcuts" and insufficient contextual understanding in the complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an "intent sketch". The component is a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly constructs an "understand-plan-select" cognitive process. It generates and filters "intent sketch" strategies to guide the final reasoning, requires no parameter fine-tuning, and transfers across models solely through in-context engineering. Information-theoretic analysis shows that this process can reduce conditional entropy and improve information utilization efficiency, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni validate the method's generality and robustness; compared with their respective baselines, the complete three-module scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains of up to approximately 9.51 percentage points, demonstrating the practical value and portability of the "intent sketch" reasoning component in zero-shot scenarios.
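
A minimal sketch of what such a pipeline could look like. The module names come from the abstract; the prompts, the `llm` callable, and the selection heuristic are illustrative assumptions, not the authors' code:

```python
# Sketch of the "understand-plan-select" pipeline described in the abstract.
# Everything stays in the prompt, so no model parameters are touched.
from typing import Callable, List

def intent_perceiver(llm: Callable[[str], str], question: str, context: str) -> str:
    """Infer the questioner's underlying intent before attempting an answer."""
    return llm(f"Context:\n{context}\n\nQuestion: {question}\n"
               "Describe the questioner's underlying intent in one sentence.")

def strategy_generator(llm: Callable[[str], str], question: str,
                       intent: str, n: int = 3) -> List[str]:
    """Draft several candidate reasoning strategies ('intent sketches')."""
    return [llm(f"Intent: {intent}\nQuestion: {question}\n"
                f"Propose reasoning strategy #{i + 1}: which evidence to "
                "examine and in what order, as a short numbered plan.")
            for i in range(n)]

def strategy_selector(llm: Callable[[str], str], question: str,
                      strategies: List[str]) -> str:
    """Pick the candidate strategy most likely to answer faithfully."""
    listing = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(strategies))
    choice = llm(f"Question: {question}\n\nCandidate strategies:\n{listing}\n"
                 "Reply with the index of the best strategy only.")
    # Assumes the model complies with the index-only instruction.
    return strategies[int(choice.strip())]

def guided_answer(llm: Callable[[str], str], question: str, context: str) -> str:
    """Understand, plan, select, then reason under the chosen sketch."""
    intent = intent_perceiver(llm, question, context)
    strategies = strategy_generator(llm, question, intent)
    plan = strategy_selector(llm, question, strategies)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\n"
               f"Follow this plan step by step:\n{plan}\n\nAnswer:")
```

Because each module only consumes and produces text, the same three functions can wrap any reasoning engine, which is the sense in which the component is plug-and-play.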
Problem

Research questions and friction points this paper is trying to address.

Addresses shortcut issues in multimodal reasoning models
Enhances contextual understanding via human-like cognitive strategies
Improves zero-shot cross-modal reasoning without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play three-module pipeline for cognitive reasoning
Intent sketch strategies guide zero-shot multimodal reasoning
No parameter fine-tuning; transfers across models via in-context engineering (see the sketch below)
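
Since the pipeline lives entirely in the prompt, swapping the reasoning engine is a one-line change. A hypothetical deployment of the `guided_answer` sketch above; the OpenAI client here is our assumption for illustration, not a backend the paper specifies:

```python
# Hypothetical adapter: any chat-completion backend works as the `llm` callable.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # swap models here; the pipeline itself is unchanged
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

answer = guided_answer(llm,
                       question="Why does the speaker pause mid-sentence?",
                       context="<transcribed video and audio description>")
```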
👥 Authors
Zhoupeng Shou
NoDesk AI, Hangzhou, China

Zhiqiang You
NoDesk AI, Hangzhou, China

Fang Wang
Postdoc, Stanford University
Reading acquisition · dyslexia · cross-linguistic research · bilingualism · cognitive neuroscience

Haibo Liu
Independent Researcher, Hangzhou, China