🤖 AI Summary
This work addresses the limitations of existing reward-based post-training methods for text-to-video generation, which suffer from either high annotation costs or misaligned vision-language embeddings, leading to suboptimal supervision signals. The authors propose an annotation-free post-training approach that, for the first time, leverages optimal transport theory to design rewards without human annotations. Their method establishes a dual-alignment mechanism, combining a distribution-level quality reward with a token-level semantic reward, to jointly optimize text-video alignment at both the global distributional and fine-grained semantic levels. The reward module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement-learning fine-tuning, and the resulting approach significantly outperforms both annotation-based and annotation-free state-of-the-art methods on the VBench benchmark. Human preference evaluations further confirm its superiority in generating videos with enhanced quality and semantic consistency.
📝 Abstract
Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present $\texttt{PISCES}$, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, $\texttt{PISCES}$ uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, $\texttt{PISCES}$ is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that $\texttt{PISCES}$ outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
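To make the token-level idea concrete, the sketch below computes an entropic optimal transport (Sinkhorn) plan between text-token and video-token embeddings, using the negative transport cost as a semantic-alignment score. This is an illustrative approximation only: the cosine-distance cost, uniform marginals, the `eps` regularization value, and the function name `sinkhorn_alignment` are all assumptions for exposition, not the paper's actual reward configuration.

```python
import numpy as np

def sinkhorn_alignment(text_emb, video_emb, eps=0.1, n_iters=200):
    """Entropic OT between text tokens and video tokens.

    Illustrative sketch: cost choice, marginals, and eps are assumptions,
    not the paper's configuration.

    text_emb:  (n_text, d) token embeddings
    video_emb: (n_video, d) token embeddings
    Returns the transport plan P and the transport cost <P, C>.
    """
    # Cosine-distance cost between L2-normalized token embeddings.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    C = 1.0 - t @ v.T  # entries in [0, 2]

    # Uniform marginals: every token carries equal mass.
    a = np.full(C.shape[0], 1.0 / C.shape[0])
    b = np.full(C.shape[1], 1.0 / C.shape[1])

    # Sinkhorn iterations on the Gibbs kernel K = exp(-C / eps).
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        u = a / (K @ (b / (K.T @ u)))
    v_scale = b / (K.T @ u)

    P = u[:, None] * K * v_scale[None, :]  # soft token matching
    cost = float(np.sum(P * C))
    return P, cost  # a semantic reward could be, e.g., -cost
```

A smaller transport cost indicates that each text token can be matched to visually similar video tokens, which is the intuition behind using OT to align reward signals with fine-grained semantic correspondence.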