EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
AI Summary
Proposes EsoLang-Bench, a benchmark that uses obscure programming languages models encounter cold (with essentially no pre-training exposure) to evaluate the genuine reasoning ability of LLMs.
Key Contributions
- Proposes the EsoLang-Bench benchmark suite
- Uses obscure (esoteric) programming languages to evaluate LLM reasoning ability
- Shows that high LLM scores on standard benchmarks may stem from memorization
Methodology
Uses five esoteric programming languages, including Brainfuck, to evaluate LLMs on code generation tasks under different prompting strategies, and analyzes model performance across them.
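The paper's abstract mentions that models are expected to learn these languages from documentation and interpreter feedback. A minimal sketch of what such an interpreter feedback loop could look like for Brainfuck is below; this is an illustrative implementation, not the paper's actual evaluation harness, and the function name `run_bf` is our own.

```python
def run_bf(code: str, input_bytes: bytes = b"") -> bytes:
    """Interpret a Brainfuck program and return its output bytes.

    Illustrative only: a harness like EsoLang-Bench's would run
    model-generated code through an interpreter like this and feed
    the output (or errors) back to the model.
    """
    # Precompute matching bracket positions for loop jumps.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30000       # conventional 30,000-cell tape
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            # Read one input byte; EOF leaves the cell at 0.
            if inp < len(input_bytes):
                tape[ptr] = input_bytes[inp]
                inp += 1
            else:
                tape[ptr] = 0
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]   # jump past matching ]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]   # jump back to matching [
        pc += 1
    return bytes(out)


# Set cell 1 to 10*7 = 70 and print it: chr(70) == "F"
print(run_bf("++++++++++[>+++++++<-]>."))  # b'F'
```

Because correctness is checked by executing the program, this kind of setup gives an unambiguous pass/fail signal that is hard to game through surface-level pattern matching.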
Original Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.