Olympus: A Universal Task Router for Computer Vision Tasks

📅 2024-12-12
🏛️ arXiv.org
🤖 AI Summary
To unify support for more than twenty heterogeneous vision tasks spanning images, videos, and 3D objects within multimodal large language models (MLLMs), this paper proposes an instruction-driven multimodal task routing framework. At its core is a lightweight controller that decouples the perceptual understanding of MLLMs from the generative capabilities of specialized executors, enabling zero-shot cross-modal, multi-task routing and chained action planning without training heavy generative models. Its key innovations are: (1) the first instruction-based multimodal task routing mechanism; and (2) plug-and-play compatibility with existing MLLMs and dynamic task extensibility. Experiments show an average routing accuracy of 94.75% and a chained-action precision of 91.82%, comparable to task-specific models.
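The routing idea described above can be sketched as a simple dispatch loop: the controller MLLM emits one routing token per requested task, and each token is looked up in a registry of specialized executor modules. All names and tokens below are illustrative assumptions, not Olympus's actual API.

```python
# Minimal sketch of instruction-based task routing. The controller is assumed
# to have already translated a user instruction into a sequence of routing
# tokens; each token maps to a specialized executor module.

TASK_REGISTRY = {
    "<image_gen>": lambda prompt: f"[image for: {prompt}]",
    "<video_gen>": lambda prompt: f"[video for: {prompt}]",
    "<3d_gen>":    lambda prompt: f"[3D object for: {prompt}]",
}

def route(routing_tokens, prompt):
    """Dispatch each routing token to its executor module, in order.

    A chained action is simply a sequence of tokens; unknown tokens are
    skipped, so new modules can be registered without retraining the
    controller's generative components.
    """
    outputs = []
    for token in routing_tokens:
        module = TASK_REGISTRY.get(token)
        if module is not None:
            outputs.append(module(prompt))
    return outputs

# Chained action: generate an image, then lift the result to 3D.
results = route(["<image_gen>", "<3d_gen>"], "a red vintage car")
```

Because the controller only selects and sequences tokens, the heavy generative work stays inside the executors, which matches the paper's claim of routing without training heavy generative models.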

📝 Abstract
We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: http://yuanze-lin.me/Olympus_page/
Problem

Research questions and friction points this paper is trying to address.

Transforms MLLMs into a unified computer vision framework
Delegates 20+ specialized tasks to dedicated modules
Enables complex workflows without training heavy models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-based universal framework for vision tasks
Controller delegates tasks to specialized modules
Instruction-based routing enables complex workflows
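The plug-and-play compatibility and dynamic task extensibility claimed above suggest a registry pattern: a new executor is added by registering a routing token, so the controller never retrains a generative model. This is a hypothetical sketch; the class and token names are illustrative, not from the paper.

```python
# Hypothetical sketch of dynamic task extensibility: executors are
# registered under routing tokens at runtime.

class TaskRouter:
    """Maps routing tokens emitted by a controller MLLM to executor modules."""

    def __init__(self):
        self._modules = {}

    def register(self, token, module):
        # Adding a task is a dictionary insert; no model training involved.
        self._modules[token] = module

    def dispatch(self, token, prompt):
        if token not in self._modules:
            raise KeyError(f"no module for routing token {token!r}")
        return self._modules[token](prompt)

router = TaskRouter()
router.register("<edit>", lambda p: f"[edited image: {p}]")
result = router.dispatch("<edit>", "remove the background")
```

Raising on an unregistered token keeps routing failures explicit, which is one plausible way a controller could detect that a requested task has no executor.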