D-CoT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
AI Summary
D-CoT constrains the CoT process with control tags, improving the reasoning efficiency and accuracy of small models while reducing token consumption.
Key Contributions
- Proposes the D-CoT framework, which uses control tags to structure the CoT reasoning process
- Achieves significant performance gains and reduced computational cost on small models
- Verifies that the model internalizes the structured reasoning, maintaining high performance even without control tags
Methodology
Small language models are trained on CoT data annotated with control tags (e.g., <TEMP_LOW>, <TEMP_HIGH>) to optimize the reasoning trajectory.
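To make the training-data format concrete, here is a minimal sketch of how one supervision target with control tags might be assembled. The tag names <TEMP_LOW> (fact-checking) and <TEMP_HIGH> (multi-perspective exploration) come from the paper; the phase layout, helper name, and answer format are illustrative assumptions, not the paper's actual data pipeline.

```python
def build_dcot_target(fact_check: str, exploration: str, answer: str) -> str:
    """Assemble a hypothetical D-CoT supervision target: each reasoning
    phase is prefixed with its control tag so the small model learns a
    disciplined, phase-separated CoT trajectory. (Illustrative format,
    assumed; only the tag names come from the paper.)"""
    return (
        f"<TEMP_LOW> {fact_check}\n"
        f"<TEMP_HIGH> {exploration}\n"
        f"Answer: {answer}"
    )

# Example supervision target for a toy question.
sample = build_dcot_target(
    "Verify: the derivative of x^2 is 2x.",
    "Cross-check via the limit definition and the power rule.",
    "2x",
)
print(sample)
```

During training the model sees these tagged trajectories as scaffolding; per the abstract, the structure is internalized, so the tags can be dropped at inference time.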
Original Abstract
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.