Influence Functions for Efficient Data Selection in Reasoning

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the ambiguous definition of “quality” in reasoning data by proposing a novel, causality-driven data filtering paradigm. Unlike existing heuristic approaches—such as those based on problem difficulty or chain-of-thought (CoT) trajectory length—the authors introduce influence functions to reasoning tasks for the first time, enabling quantitative measurement of each CoT sample’s causal contribution to downstream model accuracy. The method integrates gradient analysis during fine-tuning, eliminating reliance on proxy metrics such as perplexity or embedding similarity, and the authors further design an efficient pruning strategy grounded in this causal attribution. Evaluated on multiple mathematical reasoning benchmarks (e.g., GSM8K, MATH), the approach achieves significant performance gains over state-of-the-art baselines using substantially less data, improving both model accuracy and data efficiency. This work establishes a principled, interpretable, and computationally tractable standard for constructing high-quality reasoning datasets.
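The summary does not give the paper's exact estimator, but the core idea—scoring each training example by how its loss gradient aligns with the validation-loss gradient—can be sketched with a first-order influence approximation (which drops the inverse-Hessian term of the classical influence function). The minimal example below uses a toy logistic-regression model and a fixed parameter vector `w`; all names and data here are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grad(w, X, y):
    # Gradient of the logistic loss w.r.t. w, one row per example:
    # grad_i = (sigmoid(x_i . w) - y_i) * x_i
    return (sigmoid(X @ w) - y)[:, None] * X

def influence_scores(w, X_train, y_train, X_val, y_val):
    # First-order influence: dot product between each training example's
    # loss gradient and the mean validation-loss gradient. A gradient step
    # on example i changes validation loss by roughly -lr * score_i, so a
    # positive score means the example helps validation accuracy and a
    # negative score flags a likely-harmful (e.g., mislabeled) sample.
    g_val = per_example_grad(w, X_val, y_val).mean(axis=0)
    g_train = per_example_grad(w, X_train, y_train)
    return g_train @ g_val

# Toy data: the third training example is mislabeled (y should be 1).
w = np.array([0.0, 1.0])                                 # fixed toy weights
X_tr = np.array([[1.0, 3.0], [1.0, -3.0], [1.0, 2.0]])   # bias + feature
y_tr = np.array([1.0, 0.0, 0.0])
X_va = np.array([[1.0, 2.0], [1.0, -2.0]])
y_va = np.array([1.0, 0.0])

scores = influence_scores(w, X_tr, y_tr, X_va, y_va)
# The mislabeled example gets the lowest (negative) influence score.
```

The paper applies this kind of causal attribution to CoT fine-tuning data with LLM gradients; the logistic model above only illustrates the scoring mechanics.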

📝 Abstract
Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.
Problem

Research questions and friction points this paper is trying to address.

Defining reasoning data quality using influence functions
Measuring causal effect of CoT examples on accuracy
Improving data selection for math reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using influence functions to measure data quality
Pruning reasoning data based on causal effects
Outperforming perplexity and embedding-based selection methods
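Given per-example influence scores like those described above, the pruning step reduces to ranking and keeping the top fraction of the data. A minimal sketch (the `keep_frac` parameter and function name are illustrative; the paper's actual pruning schedule may differ):

```python
import numpy as np

def influence_prune(scores, keep_frac=0.5):
    # Rank training examples by influence score (descending) and return
    # the sorted indices of the top keep_frac fraction; low- or
    # negative-influence examples are dropped from the fine-tuning set.
    k = max(1, int(len(scores) * keep_frac))
    order = np.argsort(scores)[::-1]  # highest influence first
    return np.sort(order[:k])

# Example: keep the top half of four scored examples.
kept = influence_prune(np.array([0.9, -0.5, 0.1, 0.4]), keep_frac=0.5)
# kept -> indices of the two highest-scoring examples: [0, 3]
```

Because harmful examples tend to receive negative scores, even aggressive pruning can improve accuracy while shrinking the dataset, which is the data-efficiency claim made in the abstract.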