LLM Reasoning relevance: 7/10

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora
arXiv: 2602.09464v1 Published: 2026-02-10 Updated: 2026-02-10

AI Summary

AlgoVeri provides a unified benchmark for evaluating AI models' ability to generate formally verified code in Dafny, Verus, and Lean.

Key Contributions

  • Proposes the AlgoVeri benchmark, comprising verified code generation tasks for 77 classical algorithms
  • Reveals critical capability gaps across different verification systems
  • Analyzes how different language designs affect models' refinement trajectories

Methodology

Build a benchmark with unified functional contracts, evaluate models' code generation ability across the three verification systems Dafny, Verus, and Lean, and analyze the performance differences.
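To make the idea of a "unified functional contract" concrete, here is a minimal Lean 4 sketch. The `myMax` task is a hypothetical illustration, not taken from the benchmark: the contract (the theorem statement) is fixed by the benchmark authors, and the model must supply both the implementation and a machine-checked proof.

```lean
-- Hypothetical task: implement max and prove it meets its contract.
-- (Illustrative only; AlgoVeri's actual tasks and contracts may differ.)

-- Candidate implementation the model must produce.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- The functional contract: the result is an upper bound on both inputs.
-- Verification succeeds only if this proof is accepted by the kernel.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b := by
  unfold myMax
  split <;> omega
```

In Dafny or Verus the same contract would instead be phrased as `ensures` clauses discharged by an SMT solver, which is exactly the workflow difference the paper's cross-system comparison probes.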

Original Abstract

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

Tags

Formal Verification  Code Generation  Benchmarks  AI Programming

arXiv Categories

cs.SE cs.AI cs.CL