TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses why task-aware pruning improves model performance on out-of-distribution (OOD) data without improving in-distribution (ID) accuracy. We introduce the concept of “task-adapted geometry” and explain the underlying mechanism in terms of layerwise norm and pairwise-distance structure in the representation space. Our analysis indicates that pruning removes the layers that create or amplify distortion of the task-relevant geometric structure, which moves OOD representations closer to the profiles observed on ID inputs. Through controlled polynomial regression, large language model experiments, controlled distribution shifts, residual-scaling interventions, and cross-scale validation, we show that this approach consistently improves OOD accuracy and corrects geometric distortion in representations while preserving ID performance.

📝 Abstract

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
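To make the "layerwise norm and pairwise-distance profiles" concrete, here is a minimal sketch of how such profiles could be computed and compared between ID and OOD inputs. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not give exact definitions, so the mean L2 norm and the mean pairwise Euclidean distance per layer are assumed here, and the names `layer_profiles` and `profile_gap` are hypothetical.

```python
import math
from itertools import combinations


def layer_profiles(hidden_states):
    """Compute per-layer mean L2 norm and mean pairwise Euclidean distance.

    hidden_states: nested list indexed as [layer][input][dimension].
    Each layer must contain at least two input vectors.
    Assumption: these two statistics stand in for the paper's profiles.
    """
    norms, dists = [], []
    for layer in hidden_states:
        # Average length of the representation vectors at this layer.
        norms.append(sum(math.hypot(*v) for v in layer) / len(layer))
        # Average distance over all unordered pairs of inputs at this layer.
        pairs = list(combinations(layer, 2))
        dists.append(sum(math.dist(a, b) for a, b in pairs) / len(pairs))
    return norms, dists


def profile_gap(id_states, ood_states):
    """Per-layer absolute deviation of the OOD profiles from the ID profiles."""
    id_norms, id_dists = layer_profiles(id_states)
    ood_norms, ood_dists = layer_profiles(ood_states)
    norm_gap = [abs(o - i) for i, o in zip(id_norms, ood_norms)]
    dist_gap = [abs(o - i) for i, o in zip(id_dists, ood_dists)]
    return norm_gap, dist_gap


# Toy data: two layers, two inputs, two dimensions.
id_states = [[[3.0, 4.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
# The OOD copy matches ID at layer 0 and is scaled by 2 at layer 1.
ood_states = [[[3.0, 4.0], [0.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]]]
norm_gap, dist_gap = profile_gap(id_states, ood_states)
```

In this toy case the gap is zero at the first layer and nonzero at the second, which is the kind of layer a distortion-based account would flag as a pruning candidate. How the paper actually selects layers is not specified in the abstract.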

Problem

Research questions and friction points this paper is trying to address.

task-aware pruning

out-of-distribution

model capability

representation geometry

distribution shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-aware pruning

out-of-distribution generalization

representation geometry