Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

📅 2025-10-01

🤖 AI Summary

This work addresses the low latent-space utilization observed when feed-forward networks (FFNs) in large language models (LLMs) are scaled in width. It recasts FFN width design as a trade-off between tail capacity and dominant-mode capacity, defines the Spectral Utilization Index (SUI), and reports an asymmetric scaling law for soft and hard rank. Using lightweight spectral diagnostics (hard rank, soft rank, and spectral concentration), it analyzes the LLaMA, GPT-2, and nGPT families. The key contributions are: (i) an empirical identification of spectral asymmetry in FFN activation spectra, with soft rank following a near-perfect power law in FFN width while hard rank grows only sublinearly and with high variance; and (ii) quantitative evidence that dominant-mode subspaces saturate early and that added width mostly contributes low-energy tail directions. These findings offer interpretable, practical guidance for efficient FFN architecture design.

📝 Abstract

As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
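The diagnostics named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the textbook definitions: participation ratio for hard rank, and the exponential of the Shannon entropy of the normalized spectrum for soft rank. The function names, the choice of a covariance spectrum over FFN activations, and the top-k form of spectral concentration are assumptions made for illustration and may differ from the paper's exact formulation. The SUI formula is not given in this summary, so it is not reproduced here.

```python
import numpy as np


def covariance_spectrum(acts):
    """Eigenvalues of the activation covariance (rows = tokens, cols = hidden units).

    Assumption: the spectrum of interest is that of the centered activations.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Singular values s of the centered matrix give covariance eigenvalues s^2 / (n - 1).
    s = np.linalg.svd(centered, compute_uv=False)
    return s ** 2 / (acts.shape[0] - 1)


def hard_rank(eig):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    eig = np.asarray(eig, dtype=float)
    return float(eig.sum() ** 2 / (eig ** 2).sum())


def soft_rank(eig):
    """Shannon rank: exp of the entropy of the normalized eigenvalue distribution."""
    eig = np.asarray(eig, dtype=float)
    p = eig / eig.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def spectral_concentration(eig, k):
    """Fraction of total variance in the k largest eigenvalues (assumed definition)."""
    eig = np.sort(np.asarray(eig, dtype=float))[::-1]
    return float(eig[:k].sum() / eig.sum())
```

Under these definitions a flat spectrum over d directions gives hard rank and soft rank both equal to d, and a long low-energy tail raises soft rank more than hard rank, which is the kind of gap the abstract describes.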

Problem

Research questions and friction points this paper is trying to address.

Quantify spectral utilization in feed-forward networks of language models

Analyze asymmetric scaling between soft and hard spectral ranks

Optimize FFN width selection for efficient latent space usage
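The scaling comparison between soft and hard rank can be sketched as a log-log least-squares fit of rank against FFN width. This is a hedged illustration and not the paper's fitting procedure: `fit_power_law` is a hypothetical helper, and the widths and rank values in the demo are synthetic placeholders, not results from the paper.

```python
import numpy as np


def fit_power_law(widths, ranks):
    """Fit rank ~ c * width**beta by ordinary least squares in log-log space.

    Returns (beta, c, r2), where r2 is the coefficient of determination
    of the linear fit to the log-transformed data.
    """
    x = np.log(np.asarray(widths, dtype=float))
    y = np.log(np.asarray(ranks, dtype=float))
    beta, log_c = np.polyfit(x, y, 1)  # slope, then intercept
    resid = y - (beta * x + log_c)
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - (resid ** 2).sum() / ss_tot
    return float(beta), float(np.exp(log_c)), float(r2)


if __name__ == "__main__":
    # Synthetic placeholder data for illustration only (not from the paper).
    widths = [256, 512, 1024, 2048, 4096]
    soft = [2.0 * w ** 0.8 for w in widths]
    beta, c, r2 = fit_power_law(widths, soft)
    print(f"beta={beta:.3f}, c={c:.3f}, r2={r2:.4f}")
```

Running the same fit on measured soft-rank and hard-rank values lets the two exponents and fit qualities be compared directly. A high r2 for soft rank alongside a lower exponent and noisier fit for hard rank would correspond to the asymmetry described in the abstract.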

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recast width selection as a spectral utilization problem

Quantify meaningfully activated latent directions with a diagnostic suite: Hard Rank, Soft Rank, Spectral Concentration, and the composite SUI

Propose an asymmetric spectral scaling law for FFNs

🔎 Similar Papers

Emergence of a High-Dimensional Abstraction Phase in Language Transformers