CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
AI Summary
CoTZero improves the human-like visual reasoning ability of vision-language models through annotation-free, hierarchically synthesized CoT data.
Key Contributions
- Proposes CoTZero, an annotation-free paradigm
- Designs a dual-stage data synthesis method that mimics human cognitive processes
- Introduces cognition-aligned verifiable rewards to strengthen the model's hierarchical reasoning
Methodology
CoTZero combines synthetic CoT data with reinforcement learning to improve VLMs' reasoning coherence and factual correctness, training them with cognition-aligned rewards.
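The cognition-aligned reward described above can be sketched as a simple scoring function. This is a minimal illustration, not the paper's actual implementation: the step format, the fact-matching heuristic, and the weights are all assumptions made for clarity.

```python
# Hypothetical sketch of a cognitively coherent verifiable reward.
# Assumes reasoning arrives as a list of step strings plus a final answer;
# the grounding check and 0.5/0.5 weighting are illustrative choices.

def ccvr_reward(steps, answer, gold_answer, gold_facts,
                w_steps=0.5, w_answer=0.5):
    """Return a scalar reward in [0, 1] for a reasoning trace.

    steps       -- list of intermediate reasoning strings
    answer      -- the model's final answer
    gold_answer -- reference answer for the question
    gold_facts  -- set of atomic fact strings the steps may assert
    """
    if not steps:
        return 0.0
    # Stepwise feedback: fraction of steps grounded in a known fact.
    grounded = sum(1 for s in steps if any(f in s for f in gold_facts))
    step_score = grounded / len(steps)
    # Outcome feedback: exact match against the reference answer.
    answer_score = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_steps * step_score + w_answer * answer_score
```

In an RFT loop, a scalar like this would be fed to the policy-gradient update in place of a learned reward model; the stepwise term is what gives feedback on coherence rather than only on the final answer.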
原文摘要
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations, we introduce human cognitive models into the reasoning process via CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
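The bottom-up stage's composition of atomic visual primitives into structured questions can be sketched roughly as follows. The primitive categories (objects, attributes, relations) come from the abstract; the question templates and function names are illustrative assumptions, not the paper's synthesis pipeline.

```python
# Hypothetical sketch of bottom-up question synthesis: compose atomic
# visual primitives into simple structured question forms. Templates
# are illustrative; a real pipeline would emit paired reasoning traces.

from itertools import product


def compose_questions(objects, attributes, relations):
    """Build compositional questions from atomic primitives."""
    questions = []
    # Attribute-level: ground a single property of one object.
    for obj, attr in product(objects, attributes):
        questions.append(f"Is the {obj} {attr}?")
    # Relation-level: combine two distinct objects via a relation.
    for a, b in product(objects, objects):
        if a == b:
            continue
        for rel in relations:
            questions.append(f"Is the {a} {rel} the {b}?")
    return questions
```

Incrementally composing from single-primitive questions up to multi-object relational ones is one way to realize the "compositional productivity" the abstract appeals to: harder forms are built from verified simpler ones.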