Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and poor scalability of influence estimation in large language models (LLMs) and vision-language models (VLMs), this paper proposes a lightweight, single-forward-pass data valuation framework. The method eliminates reliance on Hessian approximations and model retraining, and, uniquely, enables efficient influence estimation without any gradient computation. By aligning hidden-layer representations and analyzing prediction errors, the authors derive a closed-form influence metric grounded in pretrained model representations. Evaluated across multiple benchmark tasks, the approach matches or surpasses gradient-based baselines in identifying critical fine-tuning samples and detecting mislabeled instances, while reducing computational cost by one to two orders of magnitude. This efficiency gain makes influence analysis substantially more practical and scalable for modern foundation models.

📝 Abstract
Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
Problem

Research questions and friction points this paper is trying to address.

Efficiently quantifying training sample influence for LLMs and VLMs
Eliminating computationally expensive gradient computations in data valuation
Providing scalable influence estimation without model retraining or Hessian computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forward-only framework for scalable data valuation
Closed-form expression using single forward pass
Leverages hidden representations and prediction errors
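The closed-form score described above can be sketched as follows. This is a hedged illustration, not the paper's exact formula: it assumes the common decomposition of a final-layer gradient as the outer product of the hidden representation and the prediction error (softmax probabilities minus one-hot label), under which a train–validation gradient dot product reduces to a product of two inner products and needs only forward-pass quantities. The function name `forward_only_influence` and all variable names are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over logits.
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def forward_only_influence(h_train, p_train, y_train, h_val, p_val, y_val):
    """Sketch of a forward-only influence score (illustrative, not the
    paper's exact method).

    Assumes the last-layer gradient factorizes as e ⊗ h, where h is the
    hidden representation and e = softmax(logits) - onehot(label) is the
    prediction error. Then the gradient dot product between a training
    and a validation sample reduces to a closed form with no backprop:
        <g_train, g_val> = <h_train, h_val> * <e_train, e_val>
    """
    e_train = p_train.copy()
    e_train[y_train] -= 1.0
    e_val = p_val.copy()
    e_val[y_val] -= 1.0
    # Representation alignment times prediction-error alignment.
    return float(np.dot(h_train, h_val) * np.dot(e_train, e_val))

# Toy usage: a training sample identical to the validation sample
# should receive a positive influence score.
h = np.array([1.0, 0.5, -0.3])
p = softmax(np.array([2.0, 0.1, -1.0]))
score_self = forward_only_influence(h, p, 0, h, p, 0)
```

In this factorized view, a training sample is influential for a validation sample when both their hidden representations and their error directions align, which is why a single forward pass per sample suffices.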