🤖 AI Summary
This work addresses two limitations of existing personalization research: it is largely confined to vision–language settings, with no unified benchmark spanning text, image, and audio, and it does not systematically handle queries for which no persona is provided or study cross-modal grounding. To bridge this gap, the authors introduce Omni-Persona, the first comprehensive omnimodal personalization benchmark, comprising 18 fine-grained tasks in four task groups (approximately 750 items). Personalization is formalized as a cross-modal routing problem over a Persona Modality Graph, and a Calibrated Accuracy (Cal) metric jointly evaluates whether a model grounds its responses correctly and abstains appropriately when no persona is provided. Experiments reveal critical issues in open-source models, including a consistent gap between audio and visual grounding and hallucination on absent-persona queries even when recall on answerable queries is high; larger models do not always achieve higher Cal. Furthermore, reinforcement learning with verifiable rewards (RLVR), trained with rule-based feedback, generalizes more consistently than supervised fine-tuning (SFT), although it drifts toward conservative behavior and lower generation quality under the reward design used, offering key insights for post-training and reward design.
📝 Abstract
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarks that jointly cover all three modalities are still limited, and existing ones lack the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}; the benchmark comprises 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. Our experiments yield three diagnostic findings: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
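The abstract does not give the exact formula for $\mathrm{Cal}$, so the following is only an illustrative sketch of a metric that jointly rewards correct grounding and appropriate abstention. It assumes, hypothetically, that each item has either a gold answer or no persona at all, that abstention is encoded as `None`, and that correctness is a case-insensitive exact match. The names `Item`, `is_hit`, `answerable_recall`, and `calibrated_accuracy` are invented for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Item:
    gold: Optional[str]        # None: absent-persona query, abstaining is the right behavior
    prediction: Optional[str]  # None: the model abstained (assumed encoding)


def _match(a: str, b: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match after trimming whitespace.
    return a.strip().lower() == b.strip().lower()


def is_hit(item: Item) -> bool:
    """An item counts as correct if the model abstains when no persona exists,
    or answers with the gold value when one does."""
    if item.gold is None:
        return item.prediction is None
    return item.prediction is not None and _match(item.prediction, item.gold)


def answerable_recall(items: Sequence[Item]) -> float:
    """Accuracy restricted to queries that do have a persona."""
    answerable = [it for it in items if it.gold is not None]
    if not answerable:
        return 0.0
    return sum(is_hit(it) for it in answerable) / len(answerable)


def calibrated_accuracy(items: Sequence[Item]) -> float:
    """Accuracy over answerable and absent-persona queries together."""
    if not items:
        return 0.0
    return sum(is_hit(it) for it in items) / len(items)


# A model that always answers: perfect on answerable items, hallucinates otherwise.
always_answers = [
    Item(gold="Alice", prediction="alice"),
    Item(gold="Bob", prediction="Bob"),
    Item(gold=None, prediction="Carol"),
    Item(gold=None, prediction="Dave"),
]
print(answerable_recall(always_answers), calibrated_accuracy(always_answers))
```

Under these assumptions, the always-answering model reaches perfect answerable recall while scoring only half on the joint metric, which mirrors finding (ii): recall alone can hide absent-persona hallucination.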