Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the data inefficiency of large language models under Chinchilla-style scaling laws by introducing Semantic Tube Prediction (STP), a novel task that injects geometric priors into language modeling. The approach posits that token sequences evolve along geodesics on a semantic manifold and uses JEPA-style regularization to constrain the trajectory of hidden states, improving the signal-to-noise ratio of the training signal without explicit multi-view augmentation while preserving generative diversity. The method substantially exceeds the data efficiency implied by existing scaling laws, matching baseline accuracy on the NL-RX-SYNTH dataset with only 1/16 of the training data.

📝 Abstract
Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16$\times$ less training data on the NL-RX-SYNTH dataset, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling. Code is available at https://github.com/galilai-group/llm-jepa#stp.
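The paper does not spell out the regularizer, but the abstract's idea of confining hidden-state trajectories to a tubular neighborhood of a (locally linear) geodesic can be sketched as a hinge penalty on each state's perpendicular distance from the chord between two anchor hidden states. Everything here is an assumption for illustration: the function name `tube_penalty`, the chordal approximation of the geodesic, and the `radius` hyperparameter are hypothetical, not the authors' implementation.

```python
import numpy as np

def tube_penalty(hidden, radius=0.1):
    """Hinge penalty on hidden states straying outside a tube of the given
    radius around the chord (local-geodesic approximation) from the first
    to the last hidden state. `hidden` has shape (T, d).

    Hypothetical sketch of the STP idea, not the paper's actual loss.
    """
    h0, hT = hidden[0], hidden[-1]
    chord = hT - h0
    u = chord / (np.linalg.norm(chord) + 1e-8)   # unit direction of the chord
    rel = hidden - h0                            # states relative to the start
    along = rel @ u                              # scalar projection onto the chord
    perp = rel - np.outer(along, u)              # component perpendicular to the chord
    dist = np.linalg.norm(perp, axis=1)          # distance from the geodesic line
    # Zero penalty inside the tube; quadratic growth outside it.
    return float(np.mean(np.maximum(0.0, dist - radius) ** 2))
```

Under the Geodesic Hypothesis, trajectories that are locally linear incur zero penalty, so the term acts purely as a regularizer added to the language-modeling loss rather than replacing it.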
Problem

Research questions and friction points this paper is trying to address.

data efficiency
scaling laws
large language models
semantic manifold
training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Tube Prediction
JEPA
Geodesic Hypothesis
data efficiency
scaling laws