Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

📅 2025-11-18
🤖 AI Summary
Traditional vision-language models are limited to generating descriptive text and struggle with multi-step visual reasoning tasks. To address this, we propose Orion, a visual agent framework that integrates neural perception (large foundation models) with symbolic execution (tool calling), enabling multimodal input and output and autonomous visual reasoning. The framework exposes a set of specialized vision tools (object detection, keypoint localization, panoptic segmentation, OCR, and geometric analysis) through a standardized interface, so that multi-step tasks can be orchestrated and executed end to end. On the multimodal benchmarks MMMU, MMBench, DocVQA, and MMLongBench, the approach reports competitive performance. The work aims to move visual intelligence toward production-ready systems capable of compositional visual understanding and action.

📝 Abstract
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal perception and generation through a visual agent framework
Orchestrating specialized computer vision tools for complex visual workflows
Transitioning from passive visual understanding to active tool-driven reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal agent framework with tool-calling capabilities
Orchestration of specialized computer vision tools for complex workflows
Combination of neural perception with symbolic execution for reasoning
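
The idea of orchestrating specialized tools behind one interface can be illustrated with a small sketch. Everything below is a hypothetical assumption for illustration, not Orion's actual API: the `ToolRegistry` class, the dict-in, dict-out tool signature, the stub tools, and the fixed plan are all invented here. In a real agent the plan would presumably be produced by the foundation model rather than hard-coded.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical standardized tool signature: every tool reads a shared context
# dict and returns a dict of new results. This is an assumption for the
# sketch, not the interface described in the paper.
Tool = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, name: str, tool: Tool) -> None:
        self.tools[name] = tool

    def run_plan(self, plan: List[str], context: Dict[str, Any]) -> Dict[str, Any]:
        """Run the named tools in order, merging each result into the context."""
        ctx = dict(context)  # copy so the caller's dict is not mutated
        for name in plan:
            if name not in self.tools:
                raise KeyError(f"unknown tool: {name}")
            ctx.update(self.tools[name](ctx))
        return ctx


# Stub tools standing in for real detection, OCR, and geometry components.
def detect_objects(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {"boxes": [{"label": "sign", "xyxy": (10, 20, 110, 80)}]}


def run_ocr(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Pretends to read text inside each box found by the detection step.
    return {"texts": [f"text in {box['label']}" for box in ctx["boxes"]]}


def box_areas(ctx: Dict[str, Any]) -> Dict[str, Any]:
    areas = []
    for box in ctx["boxes"]:
        x1, y1, x2, y2 = box["xyxy"]
        areas.append((x2 - x1) * (y2 - y1))
    return {"areas": areas}


registry = ToolRegistry()
registry.register("detect", detect_objects)
registry.register("ocr", run_ocr)
registry.register("geometry", box_areas)

# A fixed three-step plan; later steps consume the output of earlier ones.
result = registry.run_plan(["detect", "ocr", "geometry"], {"image": "street.jpg"})
print(result["texts"], result["areas"])
```

Because each tool only reads from and writes to the shared context, a new tool can be added by registering one more function, and the multi-step workflow is simply an ordered list of tool names.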