Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

📅 2024-12-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal instruction alignment suffers from a lack of human preference data, unified alignment methodologies, and reliable evaluation frameworks. Method: We propose the first full-modality (text/image/audio/video) instruction alignment framework, featuring (i) a 200K-sample cross-modal human preference dataset; (ii) a language-feedback-driven unified alignment paradigm integrating RL with Language Feedback (RL-LF), multimodal preference modeling, a unified instruction encoder, and cross-modal reward modeling; and (iii) Eval-Anything—the first comprehensive multimodal capability benchmark. Results: Our framework significantly improves instruction-following performance across arbitrary input-output modality combinations, achieving an average +23.6% gain on Eval-Anything. All datasets, models, and code are publicly released, establishing foundational resources for multimodal alignment research.

📝 Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions -- such as instruction following -- becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models that take input and produce output in any modality, also known as any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains unexplored. Finally, there is no systematic framework to evaluate the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes 200k meticulously annotated all-modality human preference samples. We then introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework -- eval-anything. All data, models, and code frameworks have been open-sourced for the community. For more details, please refer to https://github.com/PKU-Alignment/align-anything.
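The binary-preference RLHF that the abstract contrasts with typically trains a reward model on pairwise comparisons under the Bradley-Terry model. A minimal sketch of that pairwise loss (function and variable names are illustrative, not taken from the paper's codebase):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(r_chosen - r_rejected), so the loss is -log of that probability.
    """
    margin = reward_chosen - reward_rejected
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# A larger margin in favor of the chosen response yields a smaller loss.
low = bradley_terry_loss(2.0, -1.0)   # annotator preference respected
high = bradley_terry_loss(-1.0, 2.0)  # annotator preference violated
```

Language-feedback approaches such as the one proposed here augment this scalar signal with natural-language critiques, which the paper argues better capture modality-specific preferences.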
Problem

Research questions and friction points this paper is trying to address.

Multi-modal Command Understanding
Human-like Behavior
Cross-modal Feedback Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

align-anything
human feedback learning
eval-anything
👥 Authors
Jiaming Ji (Institute for AI, Peking University)
Jiayi Zhou (Institute for AI, Peking University)
Hantao Lou (Peking University)
Boyuan Chen (Institute for AI, Peking University)
Donghai Hong (Peking University)
Xuyao Wang (Institute for AI, Peking University)
Wenqi Chen (Institute for AI, Peking University)
Kaile Wang (Peking University)
Rui Pan (Institute for AI, Peking University)
Jiahao Li (Institute for AI, Peking University)
Mohan Wang (Institute for AI, Peking University)
Josef Dai (Zhejiang University)
Tianyi Qiu (Institute for AI, Peking University)
Hua Xu (Institute for AI, Peking University)
Dong Li (Huawei Noah's Ark Lab)
Weipeng Chen (Baichuan Inc.)
Jun Song (Shenzhen University)
Bo Zheng (Taobao & Tmall Group of Alibaba)
Yaodong Yang (Institute for AI, Peking University)