One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning

📅 2024-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) are predominantly trained on unimodal corpora, which limits their generalization to multimodal tasks; moreover, prevailing fine-tuning paradigms are task-specific and scale poorly. To address these limitations, we propose a unified multimodal multi-task framework: heterogeneous signals, including visual and linguistic inputs, are jointly encoded with task instructions into a single token sequence and processed end to end by a shared LLM. We further introduce a neuro-inspired sparse fine-tuning mechanism, motivated by sparse distributed representation in the brain, that dynamically activates task-specific parameter subsets. We also construct MMUD, a fine-grained multimodal multi-task benchmark. On MMUD, our framework concurrently performs four distinct tasks (reasoning segmentation, referring segmentation, image captioning, and text-to-image generation), achieving a 37% improvement in parameter efficiency and a 2.1× increase in inference throughput.
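To make the unified-token idea concrete, here is a minimal PyTorch sketch of how visual features and instruction tokens could be merged into a single sequence for a shared LLM. The stub vision encoder, the dimensions (`d_model`, `n_patches`), and the vocabulary size are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch, assuming a frozen vision encoder whose patch features are
# projected into the LLM's embedding space and concatenated with text tokens.
import torch
import torch.nn as nn

d_model = 4096    # assumed LLM hidden size
n_patches = 256   # assumed number of visual patch features

class VisionStub(nn.Module):
    """Stand-in for a frozen vision encoder (e.g. a ViT) emitting patch features."""
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return torch.randn(images.shape[0], n_patches, 1024)  # (batch, patches, vision_dim)

vision = VisionStub()
project = nn.Linear(1024, d_model)    # maps vision features into the LLM token space
embed = nn.Embedding(32000, d_model)  # the LLM's own token embedding table

images = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32000, (2, 16))  # tokenized task instruction + prompt

visual_tokens = project(vision(images))      # (2, 256, 4096)
text_tokens = embed(text_ids)                # (2, 16, 4096)

# One unified sequence: a shared LLM would process this end to end.
sequence = torch.cat([visual_tokens, text_tokens], dim=1)  # (2, 272, 4096)
print(sequence.shape)
```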

📝 Abstract
Large-scale models have exhibited remarkable capabilities across diverse domains, including automated medical services and intelligent customer support. However, as most large models are trained on single-modality corpora, enabling them to effectively process and understand multimodal signals remains a significant challenge. Current research often focuses on designing task-specific or scenario-specific tuning strategies, which limits their scalability and versatility. To address this limitation, we propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. To enable efficient multitask processing, we introduce a novel tuning strategy termed neural tuning, inspired by the concept of sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Furthermore, to advance research in multimodal and multitask learning, we present a new benchmark, MMUD, in which each sample is annotated with multiple task labels spanning reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. By applying neural tuning to pretrained large models on the MMUD benchmark, we demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner. All models, code, and datasets will be released publicly upon publication, fostering further research and innovation in this field.
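The abstract describes neural tuning only at a high level, so the following is a hedged PyTorch sketch of one plausible reading: a fixed sparse binary mask per task gates the hidden units of a tuned feed-forward block, so each task activates only a small subset of "neurons". The mask granularity, the sparsity level (`density`), and the placement inside the network are assumptions rather than the paper's specification.

```python
# Sketch of per-task sparse activation, assuming binary masks over FFN hidden units.
import torch
import torch.nn as nn

class NeuralTunedFFN(nn.Module):
    """Feed-forward block whose hidden units are gated by a per-task sparse mask."""
    def __init__(self, d_model=512, d_hidden=2048, n_tasks=4, density=0.1):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        # One fixed random mask per task: roughly `density` of the units stay active.
        masks = (torch.rand(n_tasks, d_hidden) < density).float()
        self.register_buffer("masks", masks)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        h = torch.relu(self.up(x))
        h = h * self.masks[task_id]  # sparse, task-specific activation
        return self.down(h)

ffn = NeuralTunedFFN()
x = torch.randn(2, 16, 512)              # (batch, sequence, d_model)
out_seg = ffn(x, task_id=0)              # e.g. reasoning segmentation
out_caption = ffn(x, task_id=2)          # e.g. image captioning
print(out_seg.shape, out_caption.shape)  # both (2, 16, 512)
```

Under this reading, only the units selected by a task's mask (or the mask parameters themselves) would need training while the backbone stays frozen, which is one way a single model could serve several tasks efficiently.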
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal tasks with a single framework
Enabling large models to process multimodal signals effectively
Overcoming limitations of task-specific tuning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal framework with consistent token representation
Neural tuning strategy inspired by sparse brain activation
MMUD benchmark for multitask multimodal learning evaluation (a hypothetical sample record is sketched after this list)
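Since this page does not show the MMUD schema, the record below is a purely hypothetical Python illustration of how one sample annotated with all four task labels might be organized; every field name and file path is invented for illustration.

```python
# Hypothetical MMUD-style record: one image carrying annotations for all four
# tasks. The schema, field names, and paths are assumptions, not the real dataset.
sample = {
    "image": "images/000123.jpg",
    "caption": "A dog leaping to catch a frisbee in a park.",  # image captioning
    "referring_segmentation": {
        "phrase": "the brown dog",
        "mask": "masks/000123_ref.png",
    },
    "reasoning_segmentation": {
        "question": "Which object is the animal most likely to catch?",
        "mask": "masks/000123_reason.png",
    },
    "generation_prompt": "a dog leaping for a frisbee, golden hour",  # text-to-image
}
```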
👥 Authors
Hao Sun
College of Computer Science and Technology, Zhejiang University
Yu Song
College of Information Science and Engineering, Ritsumeikan University
Jiaqing Liu
Renmin University of China
Jihong Hu
College of Information Science and Engineering, Ritsumeikan University
Yen-Wei Chen
College of Information Science and Engineering, Ritsumeikan University
Lanfen Lin
College of Computer Science and Technology, Zhejiang University