Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models

📅 2025-01-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Efficient multimodal large language models (EMLLMs) struggle to produce high-quality chain-of-thought reasoning and to self-evaluate under tight parameter budgets. Method: The paper proposes Cas-SEAT, a cascaded self-evaluation augmented training framework with three parts: (1) a cascaded prompt decomposition that breaks lengthy prompts into shorter, task-specific prompts; (2) a self-evaluation data synthesis pipeline built on open-source 7B-parameter EMLLMs, plus a small dataset annotated with short prompts; and (3) the Cas-SEAT Dataset, offered as a resource for future work on EMLLM self-evaluation. Contribution/Results: After training on the synthesized data, Cas-SEAT improves performance by 19.68%, 55.57%, and 46.79% on MathVista, Math-V, and We-Math, respectively, strengthening the model's self-evaluation ability while reducing costs in resource-limited settings.

📝 Abstract

Efficient Multimodal Large Language Models (EMLLMs) have rapidly advanced recently. Incorporating Chain-of-Thought (CoT) reasoning and step-by-step self-evaluation has improved their performance. However, limited parameters often hinder EMLLMs from effectively using self-evaluation during inference. Key challenges include synthesizing evaluation data, determining its quantity, optimizing training and inference strategies, and selecting appropriate prompts. To address these issues, we introduce Self-Evaluation Augmented Training (SEAT). SEAT uses more powerful EMLLMs for CoT reasoning, data selection, and evaluation generation, then trains EMLLMs with the synthesized data. However, handling long prompts and maintaining CoT reasoning quality are problematic. Therefore, we propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which breaks down lengthy prompts into shorter, task-specific cascaded prompts and reduces costs for resource-limited settings. During data synthesis, we employ open-source 7B-parameter EMLLMs and annotate a small dataset with short prompts. Experiments demonstrate that Cas-SEAT significantly boosts EMLLMs' self-evaluation abilities, improving performance by 19.68%, 55.57%, and 46.79% on the MathVista, Math-V, and We-Math datasets, respectively. Additionally, our Cas-SEAT Dataset serves as a valuable resource for future research in enhancing EMLLM self-evaluation.
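As a rough illustration of the cascaded-prompt idea described in the abstract, the sketch below chains several short, task-specific prompts instead of sending one long prompt. The stage names (`reason`, `evaluate`, `correct`), the prompt templates, and the `echo_model` stand-in are all assumptions made for this example; the abstract does not give the actual prompts or stages used by Cas-SEAT, and image inputs are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage templates (assumption: not taken from the paper).
# Each stage is a short prompt that reuses the outputs of earlier stages.
STAGES: List[Tuple[str, str]] = [
    ("reason", "Question: {question}\nSolve step by step."),
    ("evaluate", "Question: {question}\nSolution: {reason}\nIs each step correct?"),
    ("correct", "Question: {question}\nSolution: {reason}\n"
                "Evaluation: {evaluate}\nGive the final answer."),
]


def run_cascade(question: str, model: Callable[[str], str]) -> Dict[str, str]:
    """Run the short prompts in order, feeding each output into later stages."""
    context: Dict[str, str] = {"question": question}
    for name, template in STAGES:
        prompt = template.format(**context)  # only earlier outputs are referenced
        context[name] = model(prompt)        # store this stage's output by name
    return context


def echo_model(prompt: str) -> str:
    """Stand-in for an EMLLM call; reports how many lines the prompt had."""
    return f"<{len(prompt.splitlines())} lines>"


if __name__ == "__main__":
    result = run_cascade("2+2?", echo_model)
    for stage, output in result.items():
        print(stage, "->", output)
```

Replacing `echo_model` with a real model call would leave the control flow unchanged; the point of the sketch is only that each stage sees a short prompt plus the specific earlier outputs it needs.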

Problem

Research questions and friction points this paper is trying to address.

Multimodal Reasoning

Large Language Models

Resource-constrained Environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded Self-Evaluation Augmented Training

Efficient Multimodal Large Language Models

Task-Specific Cascaded Prompts

🔎 Similar Papers

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

2024-08-21 | arXiv.org | Citations: 3
