Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
AI Summary
The paper presents the Judge Reliability Harness, a tool for assessing the reliability of LLM judges, and finds that judge performance varies substantially across models and benchmarks.
Main Contributions
- Judge Reliability Harness, an open-source tool for evaluating the reliability of LLM judges
- A systematic reliability evaluation of four state-of-the-art judges
- Evidence of meaningful performance differences of LLM judges across benchmarks and perturbation types
Methodology
The harness constructs validation suites by generating reliability tests that evaluate both binary judgment accuracy and ordinal grading performance, and it applies different perturbations to responses to test the consistency of LLM judges.
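The consistency testing described above can be sketched as follows. This is a minimal illustration, not the harness's actual API: `toy_judge` is a hypothetical stand-in for a real LLM judge call, and the perturbation functions only mimic the kinds of changes the paper names (formatting, capitalization, verbosity).

```python
# Minimal sketch of perturbation-based consistency testing for a binary judge.
# All names here (toy_judge, PERTURBATIONS, consistency_rate) are illustrative,
# not part of the Judge Reliability Harness API.

def toy_judge(response: str) -> bool:
    """Stand-in binary judge: marks the task done iff the keyword appears."""
    return "complete" in response.lower()

# Surface-level perturbations that should not change a reliable judge's verdict.
PERTURBATIONS = {
    "identity": lambda r: r,
    "extra_whitespace": lambda r: "  " + r.replace(" ", "  ") + "  ",
    "uppercase": lambda r: r.upper(),
    "verbose": lambda r: "To summarize my work in great detail: " + r,
}

def consistency_rate(judge, response: str) -> float:
    """Fraction of perturbed variants on which the judge agrees with
    its verdict on the unperturbed response."""
    baseline = judge(response)
    verdicts = [judge(p(response)) for p in PERTURBATIONS.values()]
    return sum(v == baseline for v in verdicts) / len(verdicts)

rate = consistency_rate(toy_judge, "The task is complete.")
```

A real harness would replace `toy_judge` with an API call to the judge model and aggregate these rates over a benchmark dataset; a rate below 1.0 signals the kind of formatting- or verbosity-induced inconsistency the paper reports.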
Original Abstract
We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments revealed consistency issues, measured by accuracy in judging another LLM's ability to complete a task, arising from simple text formatting changes, paraphrasing, changes in verbosity, and flipping the ground-truth label in LLM-produced responses. The code for this tool is available at: https://github.com/RANDCorporation/judge-reliability-harness