Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the data inefficiency of large language models under Chinchilla-style scaling laws by introducing Semantic Tube Prediction (STP), a novel task that injects geometric priors into language modeling. The approach posits that token sequences evolve along geodesics on a semantic manifold and uses JEPA-style regularization to constrain the trajectory of hidden states, improving the signal-to-noise ratio of the training signal without explicit multi-view augmentation while preserving generative diversity. The method substantially exceeds the data efficiency implied by existing scaling laws, matching baseline accuracy on the NL-RX-SYNTH dataset with only 1/16 of the training data.

📝 Abstract
Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16$\times$ less training data on the NL-RX-SYNTH dataset, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling. Code is available at https://github.com/galilai-group/llm-jepa#stp.
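The paper does not spell out the regularizer, but the abstract's idea of confining hidden-state trajectories to a tubular neighborhood of a (locally linear) geodesic can be sketched as a hinge penalty on each state's perpendicular distance from the chord between two anchor hidden states. Everything here is an assumption for illustration: the function name `tube_penalty`, the chordal approximation of the geodesic, and the `radius` hyperparameter are hypothetical, not the authors' implementation.

```python
import numpy as np

def tube_penalty(hidden, radius=0.1):
    """Hinge penalty on hidden states straying outside a tube of the given
    radius around the chord (local-geodesic approximation) from the first
    to the last hidden state. `hidden` has shape (T, d).

    Hypothetical sketch of the STP idea, not the paper's actual loss.
    """
    h0, hT = hidden[0], hidden[-1]
    chord = hT - h0
    u = chord / (np.linalg.norm(chord) + 1e-8)   # unit direction of the chord
    rel = hidden - h0                            # states relative to the start
    along = rel @ u                              # scalar projection onto the chord
    perp = rel - np.outer(along, u)              # component perpendicular to the chord
    dist = np.linalg.norm(perp, axis=1)          # distance from the geodesic line
    # Zero penalty inside the tube; quadratic growth outside it.
    return float(np.mean(np.maximum(0.0, dist - radius) ** 2))
```

Under the Geodesic Hypothesis, trajectories that are locally linear incur zero penalty, so the term acts purely as a regularizer added to the language-modeling loss rather than replacing it.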
Problem

Research questions and friction points this paper is trying to address.

data efficiency
scaling laws
large language models
semantic manifold
training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Tube Prediction
JEPA
Geodesic Hypothesis
data efficiency
scaling laws