Agent Tuning & Optimization 相关度: 8/10

CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou

arXiv: 2603.11957v1 发布: 2026-03-12 更新: 2026-03-12

下载 PDF arXiv 页面

AI 摘要

CHiL(L)Grader框架结合置信度估计和人机协作，实现可靠的AI辅助短答案评分。

主要贡献

提出CHiL(L)Grader框架
引入基于置信度的选择性预测
结合持续学习适应rubrics变化

方法论

利用温度缩放校准置信度，根据置信度选择性预测，并将不确定案例转给人，通过持续学习适应新数据。

原文摘要

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.

arXiv 分类

cs.CL

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类