Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and poor scalability of influence estimation in large language models (LLMs) and vision-language models (VLMs), this paper proposes a lightweight, single-forward-pass data valuation framework. The method eliminates reliance on Hessian approximations and model retraining, and, uniquely, enables efficient influence estimation without any gradient computation. By aligning hidden-layer representations and analyzing prediction errors, the authors derive a closed-form influence metric grounded in pretrained model representations. Evaluated across multiple benchmark tasks, the approach matches or surpasses gradient-based baselines in identifying critical fine-tuning samples and detecting mislabeled instances, while reducing computational cost by one to two orders of magnitude. This efficiency gain makes influence analysis substantially more practical and scalable for modern foundation models.

📝 Abstract
Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
Problem

Research questions and friction points this paper is trying to address.

Efficiently quantifying training sample influence for LLMs and VLMs
Eliminating computationally expensive gradient computations in data valuation
Providing scalable influence estimation without model retraining or Hessian computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forward-only framework for scalable data valuation
Closed-form expression using single forward pass
Leverages hidden representations and prediction errors
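The closed-form score described above can be sketched as follows. This is a hedged illustration, not the paper's exact formula: it assumes the common decomposition of a final-layer gradient as the outer product of the hidden representation and the prediction error (softmax probabilities minus one-hot label), under which a train–validation gradient dot product reduces to a product of two inner products and needs only forward-pass quantities. The function name `forward_only_influence` and all variable names are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over logits.
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def forward_only_influence(h_train, p_train, y_train, h_val, p_val, y_val):
    """Sketch of a forward-only influence score (illustrative, not the
    paper's exact method).

    Assumes the last-layer gradient factorizes as e ⊗ h, where h is the
    hidden representation and e = softmax(logits) - onehot(label) is the
    prediction error. Then the gradient dot product between a training
    and a validation sample reduces to a closed form with no backprop:
        <g_train, g_val> = <h_train, h_val> * <e_train, e_val>
    """
    e_train = p_train.copy()
    e_train[y_train] -= 1.0
    e_val = p_val.copy()
    e_val[y_val] -= 1.0
    # Representation alignment times prediction-error alignment.
    return float(np.dot(h_train, h_val) * np.dot(e_train, e_val))

# Toy usage: a training sample identical to the validation sample
# should receive a positive influence score.
h = np.array([1.0, 0.5, -0.3])
p = softmax(np.array([2.0, 0.1, -1.0]))
score_self = forward_only_influence(h, p, 0, h, p, 0)
```

In this factorized view, a training sample is influential for a validation sample when both their hidden representations and their error directions align, which is why a single forward pass per sample suffices.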