🤖 AI Summary
Prior work implicitly assumes a linear relationship between data compression capability, measured in bits per character (BPC), and code intelligence in large language models (LLMs), but this hypothesis has lacked empirical validation. Method: To test it rigorously, the authors propose Format Annealing, a lightweight evaluation paradigm, and construct a large-scale, multilingual, out-of-distribution GitHub code validation set. They systematically benchmark open-source Code LLMs across diverse programming languages and tasks. Results: Experiments reveal a statistically significant logarithmic, rather than linear, relationship between BPC and code intelligence, refining the long-standing linear hypothesis: prior linear observations likely sampled only the tail of the logarithmic curve. The work establishes an empirically grounded nonlinear connection between compression efficacy and code intelligence, and introduces a fairer, more reproducible, domain-adapted evaluation standard, offering methodological support for analyzing and optimizing code-specific LLM capabilities.
📝 Abstract
Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence, but it overlooked the multifaceted nature of code, which encompasses diverse programming languages and tasks, and struggled to evaluate modern Code LLMs fairly. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To assess the intrinsic code intelligence of pre-trained LLMs efficiently and equitably, we introduce *Format Annealing*, a lightweight, transparent training methodology. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework for the code domain.
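To make the two quantitative ingredients above concrete, here is a minimal sketch of how BPC can be computed from a model's per-token log-probabilities, and how a logarithmic relationship between BPC and a benchmark score can be fit by ordinary least squares. The function names, the pure-Python fit, and the toy numbers are illustrative assumptions, not the authors' implementation:

```python
import math

def bits_per_character(token_logprobs, num_chars):
    """BPC = total negative log2-likelihood of a text / its character count.

    token_logprobs: natural-log probabilities the model assigns to each token.
    """
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / num_chars

def fit_log_model(bpcs, scores):
    """Least-squares fit of scores ~= a * ln(BPC) + b.

    A negative slope a means lower BPC (better compression) predicts a
    higher measured score, with diminishing returns along the curve.
    """
    xs = [math.log(x) for x in bpcs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Toy usage: 4 tokens, each assigned probability 0.5, over an
# 8-character string -> 4 bits total, so BPC = 0.5.
bpc = bits_per_character([math.log(0.5)] * 4, num_chars=8)
```

Note how the linear illusion arises: over a narrow, high-BPC slice of the curve, `a * ln(BPC) + b` is locally well-approximated by a straight line, which is consistent with the paper's reading of prior linear observations.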