LLM Reasoning Relevance: 9/10

LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse

Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec

arXiv: 2602.09832v1 Published: 2026-02-10 Updated: 2026-02-10

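AI Summary

Uses the reasoning an LLM generates to predict whether its labels in educational dialogue analysis are correct, improving the quality of automated analysis.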

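Main Contributions

Proposes an error detection method based on LLM reasoning
Analyzes the linguistic features of correct and incorrect reasoning
Validates the method's effectiveness in educational dialogue analysis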

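Methodology

Has LLMs generate reasoning alongside each label, encodes that reasoning with TF-IDF, trains supervised classifiers to predict whether each model prediction is correct, and uses LIWC to analyze linguistic features.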
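
The pipeline described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the reasoning strings and correctness labels are made-up toy data, and the hyperparameters (`n_estimators=100`, default TF-IDF settings) are assumptions, since the summary does not state them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy, invented reasoning strings (not from the paper's dataset).
# Target: 1 = the LLM's assigned label was correct, 0 = it was incorrect.
reasonings = [
    "The teacher asks a question because the student is stuck, therefore it is a prompt.",
    "This is feedback because the teacher evaluates the student's answer.",
    "It might be a hint, or it could be an explanation, I think.",
    "I think this could possibly be praise, but it might be a question.",
] * 5
is_correct = [1, 1, 0, 0] * 5

# TF-IDF turns each reasoning text into a sparse term-weight vector;
# the Random Forest then learns to separate correct from incorrect cases.
detector = make_pipeline(
    TfidfVectorizer(lowercase=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
detector.fit(reasonings, is_correct)

# Score a new, unseen reasoning string.
new_reasoning = ["It might be a question, I think."]
predicted = detector.predict(new_reasoning)
proba = detector.predict_proba(new_reasoning)
```

In practice, utterances whose reasoning is flagged as likely incorrect could be routed to human review; that routing step is a suggestion here rather than something the summary describes.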

Original Abstract

Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
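
To make the linguistic-marker analysis concrete, here is a small sketch that computes per-category word rates for a reasoning string. The word lists are hypothetical stand-ins built only from the example words quoted in the abstract; they are not the LIWC dictionaries, which are proprietary and far larger. Differentiation is omitted because the abstract gives no example words for it.

```python
import re

# Hypothetical mini-lexicons seeded from the abstract's examples.
# These are NOT the real LIWC category dictionaries.
MARKERS = {
    "causation": {"because", "therefore"},
    "tentativeness": {"might", "could"},
    "insight": {"think", "realize"},
}


def marker_rates(text):
    """Return the fraction of tokens in `text` that fall in each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in MARKERS.items()
    }


rates = marker_rates(
    "I think it might be a hint because the teacher could be guiding."
)
```

Under the abstract's findings, a reasoning string with comparatively high tentativeness and insight rates would be more suspect than one dominated by causal language, though a rate from such a tiny lexicon is only a rough signal.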

arXiv Categories

cs.CL
