🤖 AI Summary
This work addresses a critical gap in current scientific reasoning benchmarks, which predominantly emphasize answer correctness or logical coherence while overlooking the memory activation mechanisms, such as anchors and attractors, that underpin human reasoning. To bridge this gap, we propose the first memory-driven reasoning evaluation framework centered on anchors and attractors, introducing the dual-scale benchmark $A^3$-Bench, which comprises 2,198 interdisciplinary questions annotated via the SAPM protocol. We further develop the Anchor–Attractor Utilization Index (AAUI) to quantitatively assess a model's capacity for memory activation. Experiments validate the benchmark and show that effective memory activation significantly improves reasoning performance, offering a pathway toward human-like, memory-driven scientific reasoning systems.
📝 Abstract
Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory enables efficient knowledge reuse and improves the consistency and stability of reasoning. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the \textit{memory-driven} mechanisms that underlie human reasoning, which involve activating anchors and attractors and then integrating them into multi-step inference. To address this gap, we propose $A^3$-Bench (https://a3-bench.github.io), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in anchor and attractor activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor & attractor, problem, and memory development). Second, we introduce a dual-scale memory evaluation framework built on anchors and attractors, along with the Anchor--Attractor Utilization Index (AAUI), a metric that measures memory activation rates. Finally, through experiments with various base models and paradigms, we validate $A^3$-Bench and analyze how memory activation affects reasoning performance, providing insights into memory-driven scientific reasoning.
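The abstract does not give the AAUI formula, but the idea of a memory activation rate can be illustrated with a minimal sketch. Assuming (hypothetically) that AAUI is the fraction of a problem's annotated anchor and attractor concepts that actually surface in a model's reasoning trace, a toy implementation might look like this; the function name and the simple substring-matching heuristic are illustrative assumptions, not the paper's definition:

```python
def aaui(trace: str, anchors: list[str], attractors: list[str]) -> float:
    """Hypothetical AAUI sketch: the fraction of annotated anchor and
    attractor concepts that appear in the model's reasoning trace.
    Substring matching stands in for whatever matching procedure the
    benchmark actually uses."""
    cues = anchors + attractors
    if not cues:
        return 0.0
    trace_lower = trace.lower()
    activated = sum(1 for cue in cues if cue.lower() in trace_lower)
    return activated / len(cues)


# Toy example: the trace activates one of the two annotated cues.
score = aaui(
    trace="Apply Ohm's law to find the current in the loop.",
    anchors=["ohm's law"],
    attractors=["kirchhoff's voltage law"],
)
```

In practice a benchmark metric of this kind would likely use semantic matching (e.g. embedding similarity) rather than exact substrings, since models often paraphrase the concepts they activate.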