🤖 AI Summary
This work addresses a coordination problem in large language model fine-tuning: parameter selection and data selection are usually guided by independent scoring mechanisms, which makes the two hard to coordinate and incurs redundant computation. The authors cast both as bilevel selection problems under a common validation objective and propose DualSFT (Dual-Selection Fine-Tuning), which builds a gradient interaction matrix whose column-wise and row-wise aggregations give parameter importance and data utility, respectively. This row-column correspondence yields closed-form scores for co-extracting a parameter mask and a data subset. DualSFT combines first- and second-order validation-improvement approximations with a one-shot dual-scoring pass over shared gradient statistics. On 3B-9B models, single-axis DualSFT variants improve target-task performance and the stability-plasticity trade-off within their comparison groups, and full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.
📝 Abstract
In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrogate scoring rule. Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, we propose DualSFT (Dual-Selection Fine-Tuning), a one-shot dual-scoring algorithm that produces a parameter mask and data subset from shared gradient statistics. On 3B-9B LLMs, single-axis DualSFT variants strengthen target-task performance and stability-plasticity trade-offs within their comparison groups, while full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.
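The row-column correspondence described above can be illustrated with a toy sketch. The abstract does not spell out the scoring rule, so everything here is an assumption made for illustration: the interaction matrix is taken to be the elementwise product of per-sample training gradients with a validation gradient (a first-order alignment only), the second-order term is omitted, and the names `dual_scores` and `top_k_mask` are hypothetical rather than the authors' API.

```python
import numpy as np

def dual_scores(train_grads, val_grad):
    """Assumed first-order dual scoring (not the paper's exact rule).

    train_grads: (n_samples, n_params) per-sample training gradients.
    val_grad:    (n_params,) gradient of the validation objective.
    """
    # Entry (i, j): alignment of sample i's gradient with the validation
    # gradient along parameter j.
    interaction = train_grads * val_grad[None, :]
    data_utility = interaction.sum(axis=1)      # row-wise aggregation
    param_importance = interaction.sum(axis=0)  # column-wise aggregation
    return interaction, data_utility, param_importance

def top_k_mask(scores, k):
    """Boolean mask that keeps the k highest-scoring entries."""
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[np.argsort(scores)[::-1][:k]] = True
    return mask

# Synthetic gradients stand in for real model gradients.
rng = np.random.default_rng(0)
train_grads = rng.normal(size=(8, 5))  # 8 samples, 5 parameters
val_grad = rng.normal(size=5)

interaction, data_utility, param_importance = dual_scores(train_grads, val_grad)
data_subset = top_k_mask(data_utility, k=4)      # keep 4 of 8 samples
param_mask = top_k_mask(param_importance, k=2)   # keep 2 of 5 parameters
```

Both score vectors come from the same matrix, so the gradients are computed once and shared. Under this assumed rule, the data-utility vector reduces to `train_grads @ val_grad`, the usual gradient-alignment score.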