Tiny Inference-Time Scaling with Latent Verifiers

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of using multimodal large language models (MLLMs) as verifiers for diffusion-generated images, which traditionally requires decoding latent representations into pixel space and re-encoding them, introducing redundant computation and high overhead. To overcome this limitation, the authors propose Verifier on Hidden States (VHS), the first lightweight verification mechanism that operates directly on the intermediate hidden states of a single-step Diffusion Transformer generator. By performing verification entirely in latent space, without pixel-level decoding, VHS significantly reduces computational cost: with a minimal candidate set, it achieves a 63.3% reduction in combined generation and verification time, 51% fewer FLOPs, and 14.5% lower GPU memory usage, while improving GenEval scores by 2.7%. This approach breaks from the conventional paradigm of pixel-space verification.

📝 Abstract
Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduces substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on the intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while matching or improving the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
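The best-of-N loop the abstract describes, generate several candidates, score each with a verifier, keep the highest-scoring one, can be sketched as below. This is a minimal illustration, not the paper's implementation: the generator and the linear scoring head are hypothetical stand-ins, and the point is only that the verifier reads the generator's hidden states directly, with no VAE decode or MLLM re-encoding step.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidates(n_candidates, hidden_dim=16):
    """Stand-in for a single-step DiT generator (hypothetical interface).
    Each candidate yields a latent image plus the intermediate hidden
    states that the latent verifier scores."""
    hidden = rng.normal(size=(n_candidates, hidden_dim))
    latents = rng.normal(size=(n_candidates, 4, 8, 8))  # latent-space images
    return latents, hidden

def latent_verifier(hidden_states, w):
    """Score candidates directly from hidden states: no pixel-space decode,
    no re-encoding. A linear head stands in for the learned verifier."""
    return hidden_states @ w

latents, hidden = generate_candidates(n_candidates=4)
w = rng.normal(size=16)              # toy verifier weights (illustrative)
scores = latent_verifier(hidden, w)  # one score per candidate
best = int(np.argmax(scores))        # best-of-N selection in latent space
chosen_latent = latents[best]        # only the winner would be decoded
```

Only the selected candidate ever needs decoding to pixel space, which is where the per-candidate savings in time, FLOPs, and VRAM come from.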
Problem

Research questions and friction points this paper is trying to address.

inference-time scaling
verifier
diffusion models
computational cost
latent space
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time scaling
latent verifiers
diffusion models
hidden state verification
efficient generation