🤖 AI Summary
Existing evaluation of LLM mathematical reasoning leans heavily on static accuracy metrics, which miss latent defects in reasoning dynamics. Method: We propose MathBode, a dynamic diagnostic framework for LLMs that imports Bode analysis from control theory. It treats each parametric math problem as a system, drives a single parameter with a sinusoidal input, and fits the first-harmonic response of model outputs to extract gain–phase frequency-response "fingerprints." Contribution/Results: Evaluated across five closed-form problem families against a symbolic baseline, MathBode reveals, in the frequency domain, pervasive low-pass behavior and phase lag in LLMs. It quantitatively separates reasoning fidelity from consistency, yields a compact and reproducible evaluation protocol, and is fully open-sourced, including datasets and code.
📝 Abstract
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2×2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\varphi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
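The core measurement (drive one parameter sinusoidally, then compare the first harmonic of the model's answers against the exact solutions to get a gain and phase at that frequency) can be sketched as below. This is a minimal illustration, not the paper's released code: the function names, the toy linear-solve problem, and the simulated "model" response with attenuation and lag are assumptions for demonstration.

```python
import numpy as np

def first_harmonic(signal, t, freq):
    # Complex Fourier projection of the signal onto the drive frequency:
    # for A*sin(2*pi*freq*t + phi), this returns a coefficient with
    # magnitude A, so amplitude ratios and phase differences fall out directly.
    ref = np.exp(-2j * np.pi * freq * t)
    return 2.0 * np.mean(signal * ref)

def bode_point(model_out, exact_out, t, freq):
    # One gain/phase point of the Bode-style fingerprint at this frequency.
    m = first_harmonic(model_out, t, freq)
    s = first_harmonic(exact_out, t, freq)
    gain = np.abs(m) / np.abs(s)          # amplitude tracking (1 = perfect)
    phase = np.angle(m) - np.angle(s)     # lag in radians (negative = lagging)
    return gain, phase

# Toy linear-solve family: x = b / a, with b(t) driven sinusoidally.
t = np.linspace(0.0, 1.0, 400, endpoint=False)  # integer number of periods
freq = 3.0
a, b0, amp = 2.0, 10.0, 1.0
b = b0 + amp * np.sin(2 * np.pi * freq * t)
exact = b / a
# Simulated model answers: 80% of the amplitude, 0.3 rad of lag.
model = b0 / a + 0.8 * (amp / a) * np.sin(2 * np.pi * freq * t - 0.3)

G, phi = bode_point(model, exact, t, freq)
# G ≈ 0.8 (low-pass attenuation), phi ≈ -0.3 rad (phase lag)
```

Sweeping `freq` and plotting `G` and `phi` against it yields the gain and phase curves; a symbolic solver used as the baseline should sit at $G \approx 1$, $\varphi \approx 0$ across the sweep.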