🤖 AI Summary
Multimodal large language models (MLLMs) often exacerbate social biases when enhanced for reasoning capabilities, yet it remains unclear whether a fundamental trade-off exists between reasoning performance and bias mitigation.
Method: We propose the first systematic framework for co-optimizing reasoning ability and social fairness in MLLMs, introducing a reinforcement learning (RL) approach guided by sample-ratio control that dynamically balances debiasing and reasoning samples, with the best mix found at a 1:4 ratio. Our method combines supervised fine-tuning, knowledge distillation, and rule-guided RL to comprehensively characterize the bias–reasoning Pareto frontier across multiple training paradigms.
Contribution/Results: Experiments demonstrate that our approach hits a "sweet spot": it retains 88% of the original reasoning accuracy while reducing stereotype scores by 10%. This yields a reproducible, scalable optimization pathway toward fairer and more capable MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models' outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation, and whether the two objectives inherently trade off, therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies under identical conditions, supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL), establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model's original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
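The 1:4 debias-to-reasoning sample mix described above can be sketched as a batch-composition routine. This is a minimal illustration under stated assumptions: the function name, pool structure, and batching scheme are hypothetical, not the authors' implementation.

```python
import random

def build_mixed_batch(debias_pool, reasoning_pool, batch_size=40, debias_ratio=0.2):
    """Compose one training batch with a fixed debias:reasoning mix.

    A debias_ratio of 0.2 corresponds to the 1:4 sweet spot reported
    in the paper (1 debias sample per 4 reasoning samples). The pools
    here are illustrative; in practice they would hold prompt/response
    pairs for the RL objective.
    """
    n_debias = round(batch_size * debias_ratio)   # e.g. 8 of 40
    n_reason = batch_size - n_debias              # e.g. 32 of 40
    batch = (random.sample(debias_pool, n_debias)
             + random.sample(reasoning_pool, n_reason))
    random.shuffle(batch)  # interleave so each update sees both objectives
    return batch

# Toy usage: a 1:4 mix drawn from labeled string pools
debias = [f"debias_{i}" for i in range(100)]
reason = [f"reason_{i}" for i in range(400)]
batch = build_mixed_batch(debias, reason)
```

Keeping the ratio fixed per batch (rather than sampling each item independently with probability 0.2) guarantees every update step carries the intended objective balance.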