Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
AI Summary
This study examines whether an LLM's mathematical problem-solving ability is associated with its ability to assess the correctness of students' solution steps.
Main Contributions
- Verified the association between mathematical problem-solving ability and the accuracy of assessing students' solution steps
- Found that assessment is harder than direct problem solving, especially when errors are present
- Showed that high-quality mathematical assessment requires additional capabilities, such as step tracking and precise error localization
Methodology
GPT-4 and GPT-5 are evaluated on the GSM8K and MATH subsets of PROCESSBENCH, comparing each model's accuracy at solving the original problems against its accuracy at assessing benchmark-provided solutions by predicting the earliest erroneous step.
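To make this two-task protocol concrete, here is a minimal sketch of how the solve and assess tasks could be run per benchmark item. The `Item` fields, `query_llm` stub, prompt wording, and answer parsing are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the two independent tasks run per ProcessBench-style item:
# (1) solve the original problem, (2) predict the earliest erroneous step
# in the benchmark-provided solution. `query_llm` stands in for a call to
# GPT-4 or GPT-5; prompts and parsing are simplified for illustration.
from dataclasses import dataclass

@dataclass
class Item:
    problem: str       # the original math problem
    steps: list[str]   # benchmark-provided solution steps
    answer: str        # gold final answer
    first_error: int   # earliest erroneous step index, or -1 if none

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in the actual model API here

def solve(item: Item) -> bool:
    """Task 1: solve the problem directly; crude answer-string match."""
    reply = query_llm(f"Solve this problem:\n{item.problem}")
    return item.answer in reply

def assess(item: Item) -> bool:
    """Task 2: locate the earliest erroneous step in a given solution."""
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(item.steps))
    reply = query_llm(
        f"Problem:\n{item.problem}\n\nSolution:\n{numbered}\n\n"
        "Reply with the index of the earliest erroneous step, or -1 if the "
        "solution is fully correct."
    )
    try:
        return int(reply.strip()) == item.first_error
    except ValueError:
        return False
```

Running both tasks over the same items yields, for each model, a per-item pair (solved correctly, assessed correctly), which is what the within-model comparison below is computed from.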
Original Abstract
Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
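The abstract reports statistically significant within-model associations between solving and assessment outcomes. As a minimal sketch of how such an association could be tested, the snippet below builds a 2x2 contingency table from per-item (solved correctly, assessed correctly) pairs and applies a chi-squared test via SciPy; the `records` list is a placeholder, not data from the paper.

```python
# Minimal sketch of the within-model association analysis: cross-tabulate
# solving outcomes against assessment outcomes and test for independence.
# The records below are illustrative placeholders, not the paper's data.
from scipy.stats import chi2_contingency

# One (solved_correctly, assessed_correctly) pair per benchmark item.
records = [(True, True), (True, False), (False, False), (True, True),
           (False, True), (True, True), (False, False), (True, False)]

# Rows: solved correctly / incorrectly; columns: assessed correctly / not.
table = [[0, 0], [0, 0]]
for solved, assessed in records:
    table[0 if solved else 1][0 if assessed else 1] += 1

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.4f}")

# Conditional assessment accuracy, the quantity the paper compares:
# accuracy on items the model solved correctly vs. incorrectly.
for row, label in zip(table, ("solved correctly", "solved incorrectly")):
    print(f"assessment accuracy | {label}: {row[0] / sum(row):.2%}")
```

The reported pattern corresponds to the first conditional accuracy being substantially higher than the second, with the chi-squared test indicating the two outcomes are not independent.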