Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of discrete flow matching (DFM) for text generation, which can require hundreds of iterative steps. It argues that when a distilled student underperforms, the bottleneck is not student capacity but the poor quality of the intermediate trajectories the student is trained to imitate. To address this, the authors propose Trajectory-Shaped Discrete Flow Matching (TS-DFM), which uses a lightweight energy function during training to score candidate continuations at each intermediate step and select the most coherent one. Because the shaping is applied only during training, it adds no inference overhead. On 170M-parameter language modeling, an 8-step student trained with TS-DFM achieves 32% lower perplexity than its 1,024-step teacher while being 128× faster at inference (8 steps instead of 1,024), and it achieves the best perplexity among the compared discrete-generation baselines, including methods trained on 6× more data or using 5× larger models.

📝 Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
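
To make the contrast between blind jumps and guided navigation concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names, the resampling rule in `blind_jump`, the repetition-counting `toy_energy`, and the candidate count are all hypothetical stand-ins chosen for illustration. The only idea taken from the abstract is the selection step: draw several candidate continuations at each midpoint and keep the one the energy function rates as most coherent.

```python
import random

def blind_jump(tokens, vocab, rate, rng):
    """Hypothetical stochastic jump: resample each token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

def toy_energy(tokens):
    """Placeholder energy (lower is better): counts adjacent repeated tokens.
    A real energy function would score sequence coherence; this is an assumption."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)

def shaped_jump(tokens, vocab, rate, rng, energy, num_candidates=4):
    """Draw several candidate jumps and keep the lowest-energy one."""
    candidates = [blind_jump(tokens, vocab, rate, rng) for _ in range(num_candidates)]
    return min(candidates, key=energy)

def build_trajectory(start, vocab, num_midpoints, rate, rng,
                     energy=None, num_candidates=4):
    """Build a training trajectory; blind if `energy` is None, shaped otherwise."""
    trajectory = [list(start)]
    for _ in range(num_midpoints):
        current = trajectory[-1]
        if energy is None:
            nxt = blind_jump(current, vocab, rate, rng)
        else:
            nxt = shaped_jump(current, vocab, rate, rng, energy, num_candidates)
        trajectory.append(nxt)
    return trajectory

if __name__ == "__main__":
    vocab = list("abcd")
    start = [random.Random(0).choice(vocab) for _ in range(12)]
    blind = build_trajectory(start, vocab, 5, 0.5, random.Random(1))
    shaped = build_trajectory(start, vocab, 5, 0.5, random.Random(1), energy=toy_energy)
    print("blind  final energy:", toy_energy(blind[-1]))
    print("shaped final energy:", toy_energy(shaped[-1]))
```

In this sketch the selection happens only while the trajectory is being built, which mirrors the abstract's point that shaping is a training-time cost: a student distilled from the shaped trajectory would run the same number of steps at inference either way.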

Problem

Research questions and friction points this paper is trying to address.

Discrete Flow Matching

Trajectory Quality

Distillation

Text Generation

Stochastic Jumps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Flow Matching

Trajectory Distillation

Energy-Guided Navigation