Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language models process each query in isolation, retaining no insights across attempts, so they repeatedly rediscover the same strategies and repeat the same errors on complex reasoning tasks. Method: Dynamic Cheatsheet (DC) is a lightweight test-time framework that equips a black-box LM with a persistent, self-curated memory: at inference time the model accumulates, compresses, and reuses reasoning strategies, algebraic insights, and validated code snippets across tasks, without parameter updates, fine-tuning, or human feedback. Contribution/Results: With DC, Claude 3.5 Sonnet more than doubles its accuracy on AIME math exams; GPT-4o's success rate on Game of 24 rises from 10% to 99% after it discovers and reuses a Python-based solver; both models approach near-perfect accuracy on equation balancing, where baselines stagnate around 50%; and Claude improves by 9% on GPQA-Diamond and 8% on MMLU-Pro. DC thus provides label-free, online test-time learning that adapts a model's problem-solving skills entirely at inference time.

📝 Abstract
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
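The abstract describes a test-time loop: the current cheatsheet is prepended to each query, and after answering, the model curates the memory down to concise, transferable snippets. A minimal sketch of that loop is below; the function names (`call_lm`, `answer_with_cheatsheet`, `curate`) and the `INSIGHT:` line format are our illustrative assumptions, not the paper's actual prompts or API.

```python
def call_lm(prompt: str) -> str:
    """Stand-in for a black-box LM API call (e.g. an Anthropic/OpenAI client).
    Here it returns a canned response so the sketch is runnable."""
    return "ANSWER: 42\nINSIGHT: cache validated code for arithmetic"

def answer_with_cheatsheet(query: str, cheatsheet: list[str]) -> str:
    """Prepend the accumulated memory to the query before calling the LM."""
    memory = "\n".join(f"- {m}" for m in cheatsheet)
    return call_lm(f"Cheatsheet:\n{memory}\n\nQuestion: {query}")

def curate(cheatsheet: list[str], response: str, max_items: int = 32) -> list[str]:
    """Self-curation step: keep only concise, transferable insights,
    deduplicated and capped in size (rather than storing full transcripts)."""
    new = [ln[len("INSIGHT: "):] for ln in response.splitlines()
           if ln.startswith("INSIGHT: ")]
    merged = cheatsheet + [n for n in new if n not in cheatsheet]
    return merged[-max_items:]  # keep the most recent entries

# Memory persists and evolves across otherwise independent queries.
cheatsheet: list[str] = []
for q in ["Solve Game of 24 for 3, 3, 8, 8", "Balance: 2 H2 + O2 -> ?"]:
    resp = answer_with_cheatsheet(q, cheatsheet)
    cheatsheet = curate(cheatsheet, resp)
```

The key design point, per the abstract, is that the underlying model parameters are never modified: all adaptation lives in the cheatsheet passed in-context.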
Problem

Research questions and friction points this paper is trying to address.

LMs process each query in isolation, discarding insights from previous attempts
Models repeatedly rediscover the same solutions and repeat the same mistakes
Finetuning and static retrieval cannot adapt problem-solving skills at inference time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight framework endowing black-box LMs with persistent, evolving memory
Test-time storage and reuse of strategies, code snippets, and insights without labels or parameter updates
Self-curated memory focused on concise, transferable snippets rather than full transcripts
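The Game of 24 result illustrates what DC caches: the abstract reports that GPT-4o's success rate jumped from 10% to 99% once it discovered and reused a Python-based solution. The paper's exact snippet is not reproduced here; a typical brute-force solver of the kind a model might generate and store looks like this (our sketch, not the paper's code):

```python
def solve_24(nums: list[float], target: float = 24.0, eps: float = 1e-6) -> bool:
    """Return True if the numbers can be combined with +, -, *, / to reach target.
    Exhaustively picks an ordered pair, replaces it with each possible result,
    and recurses; floats with a tolerance handle cases like 8 / (3 - 8/3)."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            a, b = nums[i], nums[j]
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:  # avoid division by zero
                candidates.append(a / b)
            if any(solve_24(rest + [v], target, eps) for v in candidates):
                return True
    return False
```

Once a validated snippet like this enters the cheatsheet, later Game of 24 queries reduce to re-running cached code rather than re-deriving the search, which is consistent with the near-perfect accuracy the paper reports on arithmetic-heavy tasks.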