T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small language models (sLMs) struggle to reliably self-verify their own outputs under test-time compute scaling, especially on memorization-heavy verification steps such as numerical calculation and fact-checking. To address this, the authors propose T1, a tool-integrated self-verification framework that delegates these steps to external tools (e.g., a code interpreter), offloading the memorization burden from the model. The method combines tool invocation with test-time scaling and knowledge distillation from larger verifiers, supported by a theoretical analysis showing that tool integration reduces memorization demands. Experiments show that a Llama-3.2 1B model with T1 outperforms the much larger Llama-3.1 8B on MATH, and that T1 generalizes to both MATH500 and the multi-domain, knowledge-intensive MMLU-Pro. The core contribution is a tool-augmented self-verification paradigm designed specifically for sLMs.

📝 Abstract
Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
Problem

Research questions and friction points this paper is trying to address.

Exploring whether small language models can reliably self-verify their outputs during test-time compute scaling.
Addressing sLMs' limitations in memorization-heavy verification tasks such as numerical calculation and fact-checking.
Proposing tool-integrated self-verification, which offloads memorization demands to external tools to improve sLM performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-integrated self-verification for sLMs
Delegates memorization tasks to external tools
Improves test-time scaling performance (a Llama-3.2 1B model surpasses Llama-3.1 8B on MATH)
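The core idea above can be sketched in a few lines: rather than asking the small model to check arithmetic from memory, the verifier emits a short program and an external interpreter executes it. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and the `ok`-variable convention are assumptions.

```python
def verify_with_tool(check_code: str) -> bool:
    """Execute verifier-generated check code in a restricted namespace.

    `check_code` is a snippet a verifier model might emit; by convention
    (assumed here) it sets a boolean variable `ok`.
    """
    namespace = {}
    try:
        # Run with no builtins as a crude sandbox; a real system would
        # use a proper isolated code interpreter.
        exec(check_code, {"__builtins__": {}}, namespace)
    except Exception:
        return False  # code that fails to run counts as unverified
    return bool(namespace.get("ok", False))


# Example: verifying a candidate answer to "17 * 24 = ?" by computation
# instead of recall.
print(verify_with_tool("ok = (17 * 24 == 408)"))  # True
print(verify_with_tool("ok = (17 * 24 == 418)"))  # False
```

The point of the design is that correctness of the check now depends on the interpreter's exact arithmetic, not on what the small model has memorized.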