🤖 AI Summary
This work addresses low answer confidence and inconsistent reasoning in zero-shot cross-modal (text/image/video) question answering. We propose a training-free, sub-question-guided reasoning framework that automatically decomposes an input question into multiple sub-question–answer (sub-QA) paths. Leveraging the large language model's intrinsic confidence scores over its own outputs, our method dynamically refines these sub-paths and fuses them with confidence weights to improve multi-path reasoning consistency. Our key contribution is the first confidence-driven, sub-question quality-aware refinement mechanism, together with an empirical analysis revealing the nonlinear relationship between sub-question quantity/quality and reasoning robustness. Experiments demonstrate consistent accuracy improvements across multimodal QA benchmarks, broad compatibility with both open- and closed-source QA models, and substantial gains in zero-shot generalization and decision reliability.
📝 Abstract
We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.
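The selection step described above — answering along several sub-QA paths and comparing the model's own confidence scores to pick the final answer — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: `answer_with_confidence` is a hypothetical stand-in for any QA model that returns an answer together with a self-reported confidence (e.g. an answer log-probability), and the fusion rule shown (summing confidences per candidate) is one simple instantiation of confidence-weighted selection.

```python
from collections import defaultdict

def answer_with_confidence(question, context):
    # Hypothetical stand-in for a QA model that returns an
    # (answer, confidence) pair given the question and a context
    # string; any model exposing a self-confidence score fits here.
    raise NotImplementedError

def confidence_guided_select(question, sub_qa_paths, qa_model=answer_with_confidence):
    """Sketch of confidence-guided answer selection over sub-QA paths.

    Each path is a list of (sub_question, sub_answer) pairs prepended
    to the context before the target question is answered. Answer
    candidates are fused by summing the confidences that support them,
    and the highest-scoring candidate is returned.
    """
    support = defaultdict(float)
    for path in sub_qa_paths:
        context = " ".join(f"Q: {q} A: {a}" for q, a in path)
        answer, conf = qa_model(question, context)
        support[answer] += conf  # confidence-weighted fusion
    return max(support, key=support.get)
```

Because the routine only consumes (answer, confidence) pairs, any open- or closed-source QA model that exposes such scores can be plugged in without retraining, which mirrors the training-free, model-agnostic property claimed in the abstract.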