Preprint

SenseMath

When Number Sense Helps Numerical Reasoning in Large Language Models

Haomin Zhuang · Xiangqi Wang · Yili Shen · Ying Cheng · Xiangliang Zhang

University of Notre Dame

Overview

How do LLMs handle number sense?

SenseMath is a controlled benchmark that measures whether LLMs can exploit numerical shortcuts rather than defaulting to step-by-step computation. It comprises 1,600 matched item families spanning 8 categories and 4 digit scales.
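To make "numerical shortcut" concrete, here is a small worked example. It is a hypothetical item chosen for illustration and is not taken from the benchmark: a solver with number sense notices that 998 is close to 1000 and compensates, instead of running long multiplication.

```latex
\[
998 \times 47 \;=\; (1000 - 2) \times 47 \;=\; 47000 - 94 \;=\; 46906
\]
```

A control counterpart at the same digit scale, such as $487 \times 326$, offers no comparably cheap round-number neighbour, so step-by-step computation remains the reasonable strategy.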

SenseMath benchmark overview: item family construction and evaluation pipeline

Figure 1. Benchmark construction and three-level evaluation framework (Use, Judge, Generate).

Main Results

Accuracy across digit scales

Number-sense (NS) prompting uniformly increases shortcut-strategy usage, but these strategies succeed only where valid shortcuts exist, producing an accuracy asymmetry.

Interactive Quiz — J1 Judgment

Can you tell if a shortcut applies?

In the J1 task, models must judge whether a math problem can be solved via a number-sense shortcut. Try it yourself, then see how five LLMs performed.
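The kind of judgment J1 asks for can be sketched for one narrow case, multiplication of two positive integers. The rule below is an assumption made for illustration (an operand within 1% of a round number counts as shortcut-friendly); the function names `nearest_round`, `shortcut_applies`, and `compensated_product` and the `tolerance` parameter are hypothetical and are not the benchmark's actual criteria.

```python
def nearest_round(n: int) -> int:
    """Round a positive integer to its leading digit, e.g. 498 -> 500."""
    unit = 10 ** (len(str(n)) - 1)
    # Integer arithmetic avoids float rounding surprises (halves round up).
    return (n + unit // 2) // unit * unit


def shortcut_applies(a: int, b: int, tolerance: float = 0.01) -> bool:
    """Toy J1-style judgment for a * b (hypothetical rule, not the paper's).

    Returns True when at least one operand lies within `tolerance`
    (relative distance) of a round number, so compensation is cheap.
    Single-digit operands are already round and therefore always qualify.
    """
    return any(abs(x - nearest_round(x)) <= tolerance * x for x in (a, b))


def compensated_product(a: int, b: int) -> int:
    """Multiply via the round neighbour of `a`, then correct the offset."""
    r = nearest_round(a)
    return r * b - (r - a) * b
```

Under this toy rule, `shortcut_applies(998, 47)` is true because 998 sits two away from 1000, while `shortcut_applies(487, 326)` is false. `compensated_product` always equals `a * b`; the point is only that the correction term is small when a shortcut applies.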

G2 Generation

Can models create shortcut problems?

In the G2 task, models generate a matched pair of expressions: one with a valid shortcut (strong) and one without (control). We check 6 criteria. Shown below: magnitude estimation examples from each model.

Citation

Cite this work

@article{zhuang2025sensemath,
  title   = {SenseMath: When Number Sense Helps Numerical Reasoning in Large Language Models},
  author  = {Zhuang, Haomin and Wang, Xiangqi and Shen, Yili and Cheng, Ying and Zhang, Xiangliang},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}