CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a human-like visual reasoning framework that requires no human-annotated data. It targets a limitation of current vision-language models, which rely on superficial correlations and lack the structured, causal understanding needed for compositional and verifiable reasoning. Inspired by cognitive science, the approach synthesizes structured reasoning paths through a two-stage chain-of-thought (CoT) process that combines bottom-up composition with global-to-local analysis. It then aligns hierarchical reasoning through reinforcement fine-tuning (RFT) guided by Cognitively Coherent Verifiable Rewards (CCVR). On a multi-level semantic inconsistency benchmark, the method achieves an F1 score of 83.33% across in-domain and out-of-domain settings, and ablations indicate that each component contributes to more interpretable reasoning.

📝 Abstract
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations. As a result, they often miss higher-level semantic structure and form non-causal relational understanding, which hinders compositional and verifiable reasoning. To address these limitations by incorporating models of human cognition into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
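For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score is computed for a binary inconsistency-detection task. The label convention (1 for an image-consistent statement, 0 for a lexically perturbed negative), the toy data, and the function name `f1_score` are assumptions made for illustration; the abstract does not specify how the benchmark defines its positive class or aggregates results.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = consistent statement, 0 = perturbed negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because F1 penalizes both false positives and false negatives, it is a reasonable choice when negatives differ from positives only by small lexical perturbations, where a model that accepts everything would still score high on recall alone.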
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
vision-language models
structured representation
causal relations
compositional reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

annotation-free
hierarchical reasoning
synthetic chain-of-thought
cognitive alignment
reinforcement fine-tuning
Chengyi Du
University of Electronic Science and Technology of China, Shanghai Artificial Intelligence Laboratory
Yazhe Niu
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong MMLab
Dazhong Shen
Nanjing University of Aeronautics and Astronautics
Data Mining, Generative AI
Luxin Xu
University of Electronic Science and Technology of China