You Only Forward Once: An Efficient Compositional Judging Paradigm

📅 2025-11-20
🤖 AI Summary
Existing multimodal large language models (MLLMs) used as evaluators face a fundamental trade-off: outputting a single scalar score conflicts with their generative nature and hinders fine-grained assessment, while autoregressive, step-by-step analysis severely limits inference throughput in large-scale evaluation scenarios. Method: We propose YOFO, a structured-template-based, efficient multidimensional evaluation paradigm that reformulates evaluation as parallel binary classification across multiple criteria, resolved in a single forward pass. Under template guidance, YOFO reads the logits of the final token of each dimension to make simultaneous yes/no decisions, and it additionally supports dependency-aware analysis and post-hoc chain-of-thought augmentation. Results: On recommendation evaluation benchmarks, YOFO achieves state-of-the-art performance with inference speedups of several orders of magnitude, combining high efficiency, strong interpretability, and full alignment with generative modeling principles.

📝 Abstract
Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis, where subsequent judgments are conditioned on previous ones, and further benefits from post-hoc CoT.
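The core judging step, reading a yes/no decision for every requirement from the logits of one forward pass, can be sketched as follows. This is a minimal illustration, not YOFO's implementation: the token ids and answer-slot positions are hypothetical (in practice they come from the model's tokenizer and the structured requirement template), and the logits array stands in for the output of a real causal LM.

```python
import numpy as np

# Hypothetical vocabulary ids for the tokens "yes" and "no".
# A real system would look these up in the model's tokenizer.
YES_ID, NO_ID = 9891, 2360

def judge_requirements(logits: np.ndarray, answer_positions: list[int]) -> list[bool]:
    """Read one binary judgment per requirement from a single forward pass.

    logits: (seq_len, vocab_size) next-token logits over the templated prompt.
    answer_positions: index of the final token of each requirement's answer slot.
    Returns True (yes) where the "yes" logit exceeds the "no" logit.
    """
    return [bool(logits[p, YES_ID] > logits[p, NO_ID]) for p in answer_positions]
```

Because every requirement is resolved by a logit comparison at its own position, adding more requirements adds template tokens but no extra decoding steps, which is the source of the claimed speedup over autoregressive judging.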
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs struggle with efficient fine-grained requirement verification
Autoregressive judging analysis generation is too slow for high-throughput settings
Existing methods face trade-off between speed and detailed requirement understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Template-conditioned judging in single forward pass
Binary decisions via final token logits reading
Dependency-aware analysis with post-hoc CoT support
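A structured requirement template of the kind described above might look like the following sketch. The format here is invented for illustration (the paper defines its own template): each requirement ends in an answer slot, and because all slots live in one sequence, later slots can attend to earlier requirements, which is what makes dependency-aware analysis possible in a single pass.

```python
def build_template(item_desc: str, requirements: list[str]) -> str:
    """Build a hypothetical YOFO-style prompt: each numbered requirement
    ends with an "Answer:" slot whose final-token logits yield yes/no."""
    lines = [
        f"Item: {item_desc}",
        "Judge each requirement with yes or no.",
    ]
    for i, req in enumerate(requirements, start=1):
        lines.append(f"{i}. {req} Answer:")
    return "\n".join(lines)
```

One forward pass over this prompt then yields one yes/no logit comparison per numbered line.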
Tianlong Zhang
Harbin Institute of Technology, Shenzhen
Hongwei Xue
University of Science and Technology of China
Shilin Yan
Fudan University
Di Wu
Accio, Alibaba Group
Chen Xu
Accio, Alibaba Group
Yunyun Yang
Harbin Institute of Technology, Shenzhen