🤖 AI Summary
Existing multimodal large language models (MLLMs) rely heavily on costly, human-annotated preference data for alignment, which severely limits scalability. To address this, we propose OrdMoE, a self-supervised, zero-annotation framework that leverages the implicit token-level quality ordering encoded in Mixture-of-Experts (MoE) routing scores to construct preference sequences over model responses. By grouping experts into hierarchical tiers and activating them progressively, OrdMoE generates multimodal responses of incrementally increasing quality, eliminating any dependence on external human preference labels. Evaluated across multiple multimodal benchmarks, OrdMoE significantly improves alignment and overall performance, matching or exceeding state-of-the-art methods trained on human-annotated preferences. Our core contribution lies in repurposing internal MoE routing signals as transferable, self-supervised preference supervision, enabling efficient and scalable multimodal alignment without manual annotation.
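To make "hierarchical expert grouping and progressive activation" concrete, the sketch below partitions each token's router-ranked experts into tiers and routes through one tier at a time. This is a minimal illustration under assumptions, not the paper's implementation: the class name `TieredMoELayer`, the linear experts, and the convention `num_experts = num_tiers * top_k` are all hypothetical.

```python
# Minimal sketch (not OrdMoE's released code): experts are ranked per token
# by router score and partitioned into tiers; activating tier 0 routes each
# token through its best-scoring experts, tier 1 through the next group, etc.
# Assumes num_experts == num_tiers * top_k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TieredMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8,
                 num_tiers: int = 4, top_k: int = 2):
        super().__init__()
        assert num_experts == num_tiers * top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.num_tiers, self.top_k = num_tiers, top_k

    def forward(self, x: torch.Tensor, tier: int) -> torch.Tensor:
        # x: (tokens, dim). Router scores induce a per-token expert ranking.
        scores = self.router(x)                                  # (tokens, E)
        order = scores.argsort(dim=-1, descending=True)          # best first
        start = tier * self.top_k
        chosen = order[:, start:start + self.top_k]              # tier's experts
        weights = F.softmax(scores.gather(-1, chosen), dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx, w = chosen[:, k], weights[:, k:k + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Under this scheme, decoding once per tier yields a set of responses whose expected quality decreases monotonically with the tier index, which is exactly the ordering the abstract below turns into preference supervision.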
📝 Abstract
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that entirely bypasses reliance on external human preferences by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e., higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses of increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demonstrate that OrdMoE significantly enhances both the alignment and the overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
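The abstract says the tier-ordered responses "can be directly optimized using standard preference learning objectives." As one hedged example, the sketch below pairs adjacent tiers and applies the standard DPO loss; the helper names (`dpo_loss`, `tier_pairs`, `ordinal_preference_loss`) are hypothetical, and OrdMoE's exact objective may differ.

```python
# Sketch of turning tier-ordered generations into preference pairs and a
# DPO-style loss (illustrative only; not OrdMoE's published objective).
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective for one (preferred, dispreferred) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def tier_pairs(num_tiers: int):
    """Adjacent tiers form pairs: tier t is preferred over tier t + 1."""
    return [(t, t + 1) for t in range(num_tiers - 1)]

def ordinal_preference_loss(logps, ref_logps, beta: float = 0.1):
    # logps, ref_logps: per-tier sequence log-probs under the policy and a
    # frozen reference model; index 0 corresponds to the best expert tier.
    losses = [dpo_loss(logps[w], logps[l], ref_logps[w], ref_logps[l], beta)
              for w, l in tier_pairs(len(logps))]
    return torch.stack(losses).mean()
```

In use, one would generate a response with each expert tier active, score every response under the policy and the frozen reference model, and average the pairwise losses over adjacent tiers, so the self-supervised ordering plugs into preference optimization with no human labels.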