🤖 AI Summary
The lack of public training data and methodology for CLIP models hinders reproducibility and progress in multimodal research. Method: This paper introduces OpenVision, a fully open, low-cost family of vision encoders trained with released code and complete recipes, building on the publicly available Recap-DataComp-1B dataset and the CLIPS training framework. The family spans a wide range of model sizes, offering a tunable capacity-efficiency trade-off that enables flexible deployment from edge devices to high-performance systems. Contribution/Results: OpenVision matches or exceeds OpenAI CLIP's performance when integrated into mainstream multimodal frameworks such as LLaVA. Models spanning 5.9M to 632.1M parameters are released, substantially lowering barriers to entry and providing an open, reproducible, and scalable foundation for vision encoders in multimodal research.
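OpenVision encoders are trained with a CLIP-style contrastive objective. As a minimal illustrative sketch (not the paper's actual implementation, which builds on the CLIPS framework), the symmetric InfoNCE loss over a batch of paired image and text embeddings can be written as:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive (InfoNCE) loss.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matching pair.
    Illustrative numpy sketch; real training uses learned encoders and a
    learnable temperature.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(img))             # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs (identical embeddings) yield a loss near zero, while mismatched pairings drive it up, which is what pushes matching image-text pairs together during training.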
📝 Abstract
OpenAI's CLIP, released in early 2021, has long been the go-to vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills the gap with OpenVision, a fully open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works -- e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data -- while revealing several key insights for enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency: larger models deliver stronger multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.