SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
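AI Summary
The paper introduces SenseMath, a benchmark that evaluates structure-sensitive numerical reasoning in LLMs, and finds that LLMs lack human-like number sense.
Key Contributions
- Introduces the SenseMath benchmark for evaluating numerical reasoning in LLMs
- Defines three evaluation settings: Shortcut Use, Applicability Judgment, and Problem Generation
- Shows experimentally that LLMs can apply shortcuts but lack a structural understanding of when and why they work
Methodology
The benchmark comprises 4,800 items that vary in shortcut category, digit scale, and variant, and LLM performance is measured under the three evaluation settings.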
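The item counts can be checked against the design described in the abstract (eight shortcut categories, four digit scales, three matched variants). The sketch below is a minimal illustration only: the summary does not name the categories or scales, so they are represented by placeholder indices, and the per-cell count assumes a balanced design, which the source does not state.

```python
from itertools import product

# Counts taken from the abstract. The category and scale labels are not given
# in the summary, so plain indices stand in for them (placeholders).
N_CATEGORIES = 8
N_SCALES = 4
VARIANTS = ("strong-shortcut", "weak-shortcut", "control")
TOTAL_ITEMS = 4800

# Every (category, scale, variant) combination is one cell of the design grid.
cells = list(product(range(N_CATEGORIES), range(N_SCALES), VARIANTS))

# Only meaningful if items are spread evenly across cells (an assumption).
items_per_cell = TOTAL_ITEMS / len(cells)

print(len(cells), items_per_cell)
```

Under the balanced-design assumption, the grid has 96 cells with 50 items each.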
Original Abstract
Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.
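To make the idea of a structure-dependent shortcut concrete, here is a small sketch of one familiar mental-arithmetic trick: multiplying by a number that sits close to a round value. This is our own illustration, not an item type confirmed by the abstract. The function names and the `max_offset` threshold are hypothetical.

```python
def nearest_round(n: int) -> int:
    """Round a positive integer to the nearest multiple of its leading place value.

    Example: 98 -> 100, 73 -> 70, 4970 -> 5000.
    """
    magnitude = 10 ** (len(str(n)) - 1)
    return (n + magnitude // 2) // magnitude * magnitude


def shortcut_applies(a: int, max_offset: int = 2) -> bool:
    """Hypothetical applicability check: is `a` within `max_offset` of a round number?"""
    return abs(nearest_round(a) - a) <= max_offset


def multiply_with_shortcut(a: int, b: int) -> int:
    """Compute a * b as base * b - (base - a) * b, where base is the round neighbour of a."""
    base = nearest_round(a)
    return base * b - (base - a) * b


# 98 is two away from 100, so 98 * 47 becomes 100 * 47 - 2 * 47.
print(shortcut_applies(98), multiply_with_shortcut(98, 47))
# 73 is three away from 70, so the rewrite saves little effort here.
print(shortcut_applies(73), multiply_with_shortcut(73, 47))
```

The identity holds for any operands, so the result is always correct. What varies is whether the rewrite is cheaper than direct computation. That is the distinction the Applicability Judgment setting targets: knowing a shortcut is different from recognizing when the numbers make it worthwhile.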