LLM Reasoning relevance: 9/10

LLM-as-Judge on a Budget

Aadirupa Saha, Aniket Wagde, Branislav Kveton
arXiv: 2602.15481v1 Published: 2026-02-17 Updated: 2026-02-17

AI Summary

Proposes an LLM-evaluation optimization method based on multi-armed bandit theory that dynamically allocates compute to reduce estimation error.

Key Contributions

  • Proposes a variance-adaptive, multi-armed-bandit method for LLM evaluation.
  • Proves a worst-case bound on the method's estimation error.
  • Demonstrates experimentally that the method outperforms uniform allocation.

Methodology

Using multi-armed bandit theory and concentration inequalities, the method dynamically allocates query budget according to estimated score variances, concentrating resources on the prompt-response pairs with the highest uncertainty.
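The allocation idea above can be sketched as follows. This is a minimal illustrative simulation, not the paper's algorithm: it assumes a stochastic scoring function `judge(i)` (a stand-in for one LLM-as-judge call on pair `i`) and greedily sends each remaining query to the pair whose mean estimate currently has the largest standard error.

```python
import random
import statistics

def adaptive_allocate(judge, K, budget, warmup=2):
    """Variance-adaptive budget allocation (sketch).

    After a short warm-up that gives each pair an initial variance
    estimate, every remaining query goes to the pair whose sample mean
    has the largest standard error (sample std / sqrt(n)).
    """
    scores = [[] for _ in range(K)]
    # Warm-up: query every pair a few times so stdev is defined.
    for i in range(K):
        for _ in range(warmup):
            scores[i].append(judge(i))
    for _ in range(budget - K * warmup):
        # Standard error of the mean for each pair; a tiny floor keeps
        # zero-variance pairs from being starved forever.
        se = [max(statistics.stdev(s), 1e-9) / len(s) ** 0.5 for s in scores]
        i = max(range(K), key=lambda j: se[j])
        scores[i].append(judge(i))
    means = [statistics.mean(s) for s in scores]
    counts = [len(s) for s in scores]
    return means, counts

# Usage: three simulated pairs, the middle one far noisier than the others.
random.seed(0)
sigmas = [0.1, 1.0, 0.1]
judge = lambda i: random.gauss(0.5, sigmas[i])
means, counts = adaptive_allocate(judge, K=3, budget=100)
```

With this setup the high-variance pair ends up receiving most of the 100 queries, which is exactly the behavior the method aims for: error is driven down where uncertainty is highest.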

Original Abstract

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, $\sigma_i^2$ being the unknown score variance for pair $i \in [K]$ with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
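For intuition on where the bound comes from (a standard oracle calculation, not reproduced from the paper): with $n_i$ queries for pair $i$, the standard error of its mean score is $\sigma_i/\sqrt{n_i}$, so the worst-case error is equalized across pairs by allocating the budget proportionally to the variances.

\[
n_i = B\,\frac{\sigma_i^2}{\sum_{j=1}^K \sigma_j^2}
\quad\Longrightarrow\quad
\frac{\sigma_i}{\sqrt{n_i}}
= \sqrt{\frac{\sum_{j=1}^K \sigma_j^2}{B}}
\quad \text{for every } i \in [K],
\]

matching the stated $\tilde{O}$ rate up to logarithmic factors; the algorithmic challenge is achieving this without knowing the $\sigma_i^2$ in advance. Uniform allocation ($n_i = B/K$) instead pays $\max_i \sigma_i \sqrt{K/B}$, which can be much larger when variances are heterogeneous.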

Tags

LLM evaluation, Multi-armed bandit, Variance reduction, AI safety

arXiv Categories

cs.LG