🤖 AI Summary
Existing multimodal large language models (MLLMs) rely heavily on costly, human-annotated preference data for alignment, which severely limits scalability. To address this, we propose OrdMoE, a self-supervised, zero-annotation framework that leverages the implicit token-level quality ordering encoded in Mixture-of-Experts (MoE) routing scores to construct preference sequences over model responses. By grouping experts into hierarchical tiers and activating them progressively, OrdMoE generates multimodal responses of incrementally increasing quality, eliminating any dependence on external human preference labels. Evaluated across multiple multimodal benchmarks, OrdMoE significantly improves alignment and overall performance, matching or exceeding state-of-the-art methods trained on human-annotated preferences. Our core contribution lies in repurposing internal MoE routing signals as transferable, self-supervised preference supervision, enabling efficient and scalable multimodal alignment without manual annotation.
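To make "hierarchical expert grouping and progressive activation" concrete, the sketch below partitions each token's router-ranked experts into tiers and routes through one tier at a time. This is a minimal illustration under assumptions, not the paper's implementation: the class name `TieredMoELayer`, the linear experts, and the convention `num_experts = num_tiers * top_k` are all hypothetical.

```python
# Minimal sketch (not OrdMoE's released code): experts are ranked per token
# by router score and partitioned into tiers; activating tier 0 routes each
# token through its best-scoring experts, tier 1 through the next group, etc.
# Assumes num_experts == num_tiers * top_k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TieredMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8,
                 num_tiers: int = 4, top_k: int = 2):
        super().__init__()
        assert num_experts == num_tiers * top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.num_tiers, self.top_k = num_tiers, top_k

    def forward(self, x: torch.Tensor, tier: int) -> torch.Tensor:
        # x: (tokens, dim). Router scores induce a per-token expert ranking.
        scores = self.router(x)                                  # (tokens, E)
        order = scores.argsort(dim=-1, descending=True)          # best first
        start = tier * self.top_k
        chosen = order[:, start:start + self.top_k]              # tier's experts
        weights = F.softmax(scores.gather(-1, chosen), dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx, w = chosen[:, k], weights[:, k:k + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Under this scheme, decoding once per tier yields a set of responses whose expected quality decreases monotonically with the tier index, which is exactly the ordering the abstract below turns into preference supervision.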
📝 Abstract
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that entirely bypasses reliance on external human preferences by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e., higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses of increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demonstrate that OrdMoE significantly enhances both the alignment and the overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
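The abstract says the tier-ordered responses "can be directly optimized using standard preference learning objectives." As one hedged example, the sketch below pairs adjacent tiers and applies the standard DPO loss; the helper names (`dpo_loss`, `tier_pairs`, `ordinal_preference_loss`) are hypothetical, and OrdMoE's exact objective may differ.

```python
# Sketch of turning tier-ordered generations into preference pairs and a
# DPO-style loss (illustrative only; not OrdMoE's published objective).
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective for one (preferred, dispreferred) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def tier_pairs(num_tiers: int):
    """Adjacent tiers form pairs: tier t is preferred over tier t + 1."""
    return [(t, t + 1) for t in range(num_tiers - 1)]

def ordinal_preference_loss(logps, ref_logps, beta: float = 0.1):
    # logps, ref_logps: per-tier sequence log-probs under the policy and a
    # frozen reference model; index 0 corresponds to the best expert tier.
    losses = [dpo_loss(logps[w], logps[l], ref_logps[w], ref_logps[l], beta)
              for w, l in tier_pairs(len(logps))]
    return torch.stack(losses).mean()
```

In use, one would generate a response with each expert tier active, score every response under the policy and the frozen reference model, and average the pairwise losses over adjacent tiers, so the self-supervised ordering plugs into preference optimization with no human labels.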