Agent Tuning & Optimization Relevance: 8/10

CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou
arXiv: 2603.11957v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

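The CHiL(L)Grader framework combines calibrated confidence estimation with a human-in-the-loop workflow to make AI-assisted short-answer grading reliable.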

Key Contributions

  • Proposes the CHiL(L)Grader framework
  • Introduces confidence-based selective prediction
  • Uses continual learning to adapt to changing rubrics

Methodology

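The framework calibrates model confidence with post-hoc temperature scaling, automates only the predictions whose calibrated confidence is high, routes uncertain cases to human graders, and uses continual learning to adapt to new data.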
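
The calibrate-then-route step can be illustrated with a minimal sketch. This is not the authors' implementation. The grid search over temperatures, the function names (`fit_temperature`, `route`), the toy logits, and the 0.9 threshold are all assumptions chosen for illustration. The sketch only shows the general idea: fit a single temperature on held-out data by minimising negative log-likelihood, then auto-accept a grade only when the calibrated confidence clears a threshold.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimises validation NLL (simple grid search).

    The grid 0.5 .. 5.0 is an illustrative assumption, not a value from the paper.
    """
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def route(logits, temperature, threshold):
    """Return (predicted_grade, confidence, decision) for one response."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    grade = probs.index(confidence)
    decision = "auto" if confidence >= threshold else "human"
    return grade, confidence, decision

# Toy validation set: the model is very confident everywhere but wrong once,
# so the fitted temperature should come out above 1 (softening the confidences).
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]
T = fit_temperature(val_logits, val_labels)

# Hypothetical acceptance threshold of 0.9.
uncalibrated = route([4.0, 0.0, 0.0], temperature=1.0, threshold=0.9)
calibrated = route([4.0, 0.0, 0.0], temperature=T, threshold=0.9)
print(T, uncalibrated[2], calibrated[2])
```

In this toy example the same logits are accepted automatically before calibration but sent to a human grader afterwards, which is the behaviour selective prediction relies on: overconfident predictions stop passing the threshold once confidences are calibrated.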

Original Abstract

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
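
For readers unfamiliar with the metric quoted in the abstract, quadratic weighted kappa (QWK) measures agreement between two sets of ordinal grades, penalising each disagreement by the squared distance between the grades. The sketch below uses the standard definition; the function name and the toy grade lists are illustrative assumptions, and nothing here reproduces the paper's evaluation code or data.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_grades):
    """QWK for integer grades in 0 .. num_grades - 1.

    QWK = 1 - sum(w * observed) / sum(w * expected), with
    w[i][j] = (i - j)^2 / (num_grades - 1)^2.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are reference grades, columns are predictions.
    observed = [[0] * num_grades for _ in range(num_grades)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_grades))
                 for j in range(num_grades)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two marginal histograms.
            denominator += weight * true_hist[i] * pred_hist[j] / n

    if denominator == 0.0:
        # Both raters used a single identical grade; treated as perfect agreement here.
        return 1.0
    return 1.0 - numerator / denominator

# Toy grades on a 0-2 scale (made up for illustration).
perfect = quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], num_grades=3)
reversed_extremes = quadratic_weighted_kappa([0, 0, 2, 2], [2, 2, 0, 0], num_grades=3)
print(perfect, reversed_extremes)
```

Computing this score separately on the automatically accepted and the human-routed subsets gives the kind of accepted-versus-rejected gap the abstract reports.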

Tags

LLM, Educational Assessment, Human-AI Collaboration, Confidence Calibration, Continual Learning

arXiv Categories

cs.CL