CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a human-like visual reasoning framework that requires no human-annotated data. It targets a limitation of current vision-language models, which rely on superficial correlations and lack the structured, causal understanding needed for compositional and verifiable reasoning. Inspired by cognitive science, the approach synthesizes structured reasoning paths through a two-stage chain-of-thought (CoT) process that combines bottom-up composition with global-to-local, top-down guidance. It then aligns hierarchical reasoning through reinforcement fine-tuning (RFT) guided by Cognitively Coherent Verifiable Rewards (CCVR). On a multi-level semantic inconsistency benchmark, the method achieves an F1 score of 83.33% across in-domain and out-of-domain settings, and ablations indicate that each component contributes to more interpretable reasoning.

📝 Abstract

Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations. As a result, they often miss higher-level semantic structure and form a non-causal understanding of relations, which hinders compositional and verifiable reasoning. To address these limitations, we introduce human models into the reasoning process and propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. For the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
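
For readers less familiar with the reported metric, the sketch below shows how an F1 score is computed from raw classification counts. It is illustrative only: the function name `f1_score` and the counts used are assumptions chosen for the example, not figures from the paper, and many different precision/recall combinations give the same F1.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive, and false-negative counts.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (NOT from the paper): 5 inconsistencies detected correctly,
# 1 false alarm, 1 missed inconsistency.
example = f1_score(tp=5, fp=1, fn=1)
print(f"{example:.2%}")  # prints 83.33%
```

Because F1 is a harmonic mean, it stays high only when precision and recall are both high, which is why it is a common choice for inconsistency-detection benchmarks where positives and negatives may be imbalanced.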

Problem

Research questions and friction points this paper is trying to address.

visual reasoning

vision-language models

structured representation

causal relations

compositional reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

annotation-free

hierarchical reasoning

synthetic chain-of-thought