Large Language Models are Algorithmically Blind
AI Summary
Large language models exhibit systematic deficits in algorithmic understanding and reasoning, a failure mode termed "algorithmic blindness."
Main Contributions
- Reveals the limitations of LLMs in algorithmic understanding
- Introduces the concept of "algorithmic blindness"
- Evaluates the algorithmic reasoning ability of LLMs through causal discovery experiments
Methodology
Uses causal discovery as a testbed, with data from large-scale algorithm executions serving as ground truth, to evaluate the performance of frontier LLMs.
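The abstract describes evaluating whether a model's predicted range contains the true algorithmic mean, and comparing range widths against true confidence intervals. A minimal sketch of such an interval-coverage check is below; the `Prediction` type, function names, and toy numbers are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an interval-coverage evaluation like the one the
# abstract describes. All names and data here are illustrative, not the
# paper's actual protocol.
from dataclasses import dataclass

@dataclass
class Prediction:
    low: float   # lower bound of the model's predicted range
    high: float  # upper bound of the model's predicted range

def coverage_rate(preds, true_means):
    """Fraction of instances whose predicted range contains the true algorithmic mean."""
    hits = sum(1 for p, m in zip(preds, true_means) if p.low <= m <= p.high)
    return hits / len(preds)

def mean_width(preds):
    """Average width of the predicted ranges.

    Wide ranges combined with low coverage indicate miscalibration:
    the model is vague yet still misses the true value.
    """
    return sum(p.high - p.low for p in preds) / len(preds)

# Toy data: fairly wide ranges that still mostly miss the ground-truth means.
preds = [Prediction(0.2, 0.9), Prediction(0.1, 0.5), Prediction(0.4, 0.8)]
true_means = [0.95, 0.6, 0.5]

print(coverage_rate(preds, true_means))  # only 1 of 3 ranges contains its true mean
print(mean_width(preds))
```

A real evaluation would additionally compare `mean_width` against the widths of the true confidence intervals from the algorithm executions, which is where the "wider yet still missing" pattern reported in the abstract would show up.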
Original Abstract
Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed: evaluating eight frontier LLMs against ground truth derived from large-scale algorithm executions, we find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing, and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.