On Task Vectors and Gradients

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Task arithmetic demonstrates strong empirical performance in model merging but lacks theoretical grounding. Method: This paper establishes, for the first time, an exact equivalence between task vectors and loss gradients, proving that a single fine-tuning step captures gradient information aligned with dominant fusion directions—revealing task arithmetic as gradient-driven approximate multitask learning. Leveraging second-order error analysis in feedforward networks and classical gradient descent theory, we rigorously justify the critical role of early-training dynamics. Contribution/Results: Experiments across seven vision benchmarks show that task vectors derived from just one epoch of fine-tuning achieve fusion performance comparable to those from fully converged models. This work bridges the theoretical gap in task arithmetic, elucidates its underlying mechanism, and offers a principled perspective for efficient model merging.

📝 Abstract
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
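The abstract's central claim can be checked numerically in a few lines. The sketch below is illustrative only: the quadratic loss, the variable names, and the toy dimensions are assumptions for demonstration, not from the paper. With full-batch gradient descent, one fine-tuning step from pretrained weights `theta_pre` yields a task vector `tau = theta_ft - theta_pre` that equals the negative gradient scaled by the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=5)   # "pretrained" weights (toy example)
target = rng.normal(size=5)      # task-specific optimum (hypothetical)

def grad_loss(theta):
    # gradient of the illustrative loss L(theta) = 0.5 * ||theta - target||^2
    return theta - target

eta = 0.1                                            # learning rate
theta_ft = theta_pre - eta * grad_loss(theta_pre)    # one gradient-descent step
tau = theta_ft - theta_pre                           # task vector

# one-epoch (one-step) equivalence: tau = -eta * grad L(theta_pre)
assert np.allclose(tau, -eta * grad_loss(theta_pre))
```

Over multiple epochs the equivalence is only approximate, with the second-order error term the paper bounds for feed-forward networks.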
Problem

Research questions and friction points this paper is trying to address.

Lack of theoretical explanation for task arithmetic effectiveness
Establishing a connection between task vectors and gradients of the task losses
Analyzing the impact of early training dynamics on model merging
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-epoch task vectors exactly equal negative learning-rate-scaled gradients
Second-order error bounded for multi-epoch training
First-epoch gradient dominates finetuning trajectory
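The "approximate multitask learning" reading of task arithmetic can also be sketched directly. In the toy setup below (quadratic losses and all names are illustrative assumptions, not from the paper), merging one-step task vectors with coefficient `lam` is exactly one gradient step on the summed multitask loss with effective rate `lam * eta`.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_pre = rng.normal(size=4)
targets = [rng.normal(size=4) for _ in range(2)]  # two hypothetical tasks

def grad(theta, t):
    # gradient of the illustrative loss L_t(theta) = 0.5 * ||theta - t||^2
    return theta - t

eta, lam = 0.1, 0.5   # learning rate and merge coefficient

# one-step task vectors: tau_i = -eta * grad L_i(theta_pre)
taus = [-eta * grad(theta_pre, t) for t in targets]

# task-arithmetic merge: theta_merged = theta_pre + lam * sum_i tau_i
theta_merged = theta_pre + lam * sum(taus)

# ...which is one gradient step on the summed (multitask) loss, rate lam * eta
multitask_step = theta_pre - lam * eta * sum(grad(theta_pre, t) for t in targets)
assert np.allclose(theta_merged, multitask_step)
```

With converged (multi-epoch) task vectors the identity becomes approximate, which is where the paper's second-order error bound applies.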