Preprint

SenseMath

When Number Sense Helps Numerical Reasoning in Large Language Models
Haomin Zhuang · Xiangqi Wang · Yili Shen · Ying Cheng · Xiangliang Zhang
University of Notre Dame
How do LLMs handle number sense?

SenseMath is a controlled benchmark that measures whether LLMs can exploit numerical shortcuts rather than defaulting to step-by-step computation. Its 1,600 matched item families span 8 categories and 4 digit scales.
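To make the shortcut-versus-computation contrast concrete, here is a hypothetical illustration (not an actual benchmark item) of a shortcut-amenable expression and a matched control:

```python
# Hypothetical matched pair, for illustration only (not from SenseMath).

# "Strong" item: 999 * 47 admits a number-sense shortcut,
# since 999 = 1000 - 1 and the distributive law applies.
step_by_step = 999 * 47          # brute-force computation
shortcut = 1000 * 47 - 47        # round-number shortcut, same value
assert step_by_step == shortcut

# "Control" item: 983 * 47 has no clean round-number structure,
# so step-by-step computation is the only reliable route.
control = 983 * 47
```

The benchmark's actual item families and categories are described in the paper; this sketch only conveys the kind of structure a "valid shortcut" exploits.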

Figure 1. Benchmark construction and three-level evaluation framework (Use, Judge, Generate).
Accuracy across digit scales

Number-sense (NS) prompting uniformly increases shortcut-strategy usage, but those strategies succeed only where a valid shortcut exists, producing an accuracy asymmetry between shortcut and control items.

Can you tell if a shortcut applies?

In the J1 task, models must judge whether a math problem can be solved via a number-sense shortcut. Try it yourself, then see how five LLMs performed.

Can models create shortcut problems?

In the G2 task, models generate a matched pair of expressions: one that admits a valid shortcut (strong) and one that does not (control). Each pair is scored against 6 criteria. Shown below: magnitude-estimation examples from each model.
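As a hedged illustration of what such a strong/control pair might look like for magnitude estimation (these are hypothetical examples, not model outputs):

```python
# Hypothetical magnitude-estimation pair, for illustration only.

# Strong: compare 4.9e3 * 2.1e4 against 1e8. Rounding settles it
# without exact arithmetic: ~5e3 * 2e4 = 1e8, and since
# 4.9 * 2.1 = 10.29 > 10, the product must exceed 1e8.
assert 4.9e3 * 2.1e4 > 1e8

# Control: compare 4.3e3 * 2.3e4 against 1e8. Here
# 4.3 * 2.3 = 9.89, too close to 10 for rough rounding to decide,
# so exact computation is required.
assert 4.3e3 * 2.3e4 < 1e8
```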

Cite this work
@article{zhuang2025sensemath,
  title   = {SenseMath: When Number Sense Helps Numerical Reasoning in Large Language Models},
  author  = {Zhuang, Haomin and Wang, Xiangqi and Shen, Yili and Cheng, Ying and Zhang, Xiangliang},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}