ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data selection methods for language-model task customization suffer from two key limitations: (i) they neglect alignment between candidate data and the target task distribution, and (ii) they rely on lossy representations such as pretrained embeddings or hashed n-gram features, which introduce noise and collisions. To address these issues, this paper proposes ZIP-FIT, an embedding-free approach that directly quantifies data–task alignment via lossless gzip compression. Because compressibility tracks shared structure between a candidate example and the target distribution, the method yields efficient, robust data-importance scores without requiring pretrained representations. It is evaluated on Autoformalization and Python code generation. Results demonstrate substantial improvements over DSIR and D4: models reach their lowest cross-entropy loss up to 85.1% faster, and data selection runs up to 65.8% faster than DSIR and two orders of magnitude faster than D4. These findings support the claim that compact yet highly task-aligned ("small but precise") datasets outperform larger, less targeted ones.

📝 Abstract
Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that fail to capture the nuanced patterns needed for tasks like Autoformalization or code generation. Methods that do consider the target distribution often rely on simplistic, sometimes noisy, representations, like hashed n-gram features, which can lead to collisions and introduce noise. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution. In extensive evaluations on Autoformalization and Python code generation, ZIP-FIT significantly outperforms leading baselines like DSIR and D4. Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy loss up to 85.1% faster than baselines, demonstrating that better task alignment leads to more efficient learning. In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two orders of magnitude faster than D4. Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform larger but less targeted ones, demonstrating that a small amount of higher quality data is superior to a large amount of lower quality data. Our results imply that task-aware data selection is crucial for efficient domain adaptation, and that compression offers a principled way to measure task alignment. By showing that targeted data selection can dramatically improve task-specific performance, our work provides new insights into the relationship between data quality, task alignment, and model learning efficiency.
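The abstract describes measuring alignment between candidate training data and a target task distribution via gzip compression. The paper's exact scoring formula is not reproduced on this page; as a minimal sketch, a standard compression-based similarity measure in this family is the Normalized Compression Distance (NCD), which scores a candidate by how much better it compresses jointly with the target sample than alone. The function names (`alignment`, `ncd`) and the toy corpora below are illustrative, not the authors' implementation:

```python
import gzip

def c(data: bytes) -> int:
    """Compressed size in bytes under gzip (a proxy for Kolmogorov complexity)."""
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: near 0 = highly similar, near 1 = unrelated."""
    cx, cy = c(x.encode()), c(y.encode())
    cxy = c((x + y).encode())
    return (cxy - min(cx, cy)) / max(cx, cy)

def alignment(candidate: str, target: str) -> float:
    """Higher score = candidate shares more compressible structure with the target."""
    return 1.0 - ncd(candidate, target)

# Toy example: rank candidates against a small code-flavored target sample.
target = "def add(a, b):\n    return a + b\n" * 4
candidates = [
    "def mul(a, b):\n    return a * b\n",       # code-like, shares structure
    "Once upon a time there was a dragon.",      # prose, little shared structure
]
ranked = sorted(candidates, key=lambda s: alignment(s, target), reverse=True)
```

Under this scheme, selection reduces to scoring every candidate against a target sample and keeping the top fraction; since it only calls gzip, it needs no GPU, no pretrained model, and no feature hashing, which is consistent with the speed advantages over DSIR and D4 reported above.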
Problem

Research questions and friction points this paper is trying to address.

Improves task-specific data selection for language models
Addresses limitations of noisy n-gram feature methods
Measures data-task alignment via compression without embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gzip compression for data alignment
Outperforms DSIR and D4 baselines
Faster selection with higher quality data