🤖 AI Summary
Traditional vision-language models are limited to generating descriptive text and struggle with multi-step visual reasoning tasks. To address this, we propose Orion, a vision agent framework that combines neural perception (large foundation models) with symbolic execution (tool calling), enabling multimodal input and output and autonomous visual reasoning. The framework exposes diverse specialized vision tools, including object detection, keypoint localization, panoptic segmentation, OCR, and geometric analysis, through a standardized interface, so that multi-step tasks can be orchestrated and executed end to end. Evaluated on major multimodal benchmarks (MMMU, MMBench, DocVQA, and MMLongBench), the approach achieves competitive performance while automating complex multi-step visual workflows. This work moves vision intelligence toward production-ready systems capable of robust, compositional visual understanding and action.
📝 Abstract
We introduce Orion, a visual agent framework that accepts input in any modality and generates output in any modality. Built as an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models, which produce only descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
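The idea of orchestrating specialized vision tools behind one calling convention can be illustrated with a minimal sketch. Nothing below is taken from Orion: `ToolRegistry`, `Step`, `run_plan`, and both stub tools are hypothetical names, the detector returns fixed data instead of running a model, and the plan is hard-coded where an agent would presumably have the foundation model produce it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# All names here are illustrative assumptions, not Orion's actual API.
ToolFn = Callable[..., Any]


class ToolRegistry:
    """Maps tool names to callables that share one keyword-argument convention."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str) -> Callable[[ToolFn], ToolFn]:
        def decorator(fn: ToolFn) -> ToolFn:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)


@dataclass
class Step:
    """One symbolic step: which tool to call, with which arguments."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)
    save_as: str = ""  # context key under which the result is stored


def run_plan(registry: ToolRegistry, plan: List[Step],
             context: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order; string arguments starting with '$' refer to context keys."""
    ctx = dict(context)
    for step in plan:
        resolved = {}
        for key, value in step.args.items():
            if isinstance(value, str) and value.startswith("$"):
                resolved[key] = ctx[value[1:]]
            else:
                resolved[key] = value
        result = registry.call(step.tool, **resolved)
        if step.save_as:
            ctx[step.save_as] = result
    return ctx


registry = ToolRegistry()


@registry.register("detect_objects")
def detect_objects(image: str) -> List[Dict[str, Any]]:
    # Stub: a real tool would run an object detector on the image.
    return [{"label": "cat", "box": (10, 20, 110, 220)}]


@registry.register("box_areas")
def box_areas(detections: List[Dict[str, Any]]) -> List[int]:
    # Simple geometric analysis: pixel area of each (x1, y1, x2, y2) box.
    return [(d["box"][2] - d["box"][0]) * (d["box"][3] - d["box"][1])
            for d in detections]


# Hard-coded two-step plan: detect objects, then measure their boxes.
plan = [
    Step("detect_objects", {"image": "$image"}, save_as="detections"),
    Step("box_areas", {"detections": "$detections"}, save_as="areas"),
]
result = run_plan(registry, plan, {"image": "photo.jpg"})
print(result["areas"])
```

Keeping every tool behind the same `call(name, **kwargs)` entry point means a planner only needs to emit tool names and arguments, and intermediate results flow between steps through the shared context.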