SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

📅 2025-02-20
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address insufficient semantic understanding, object localization, and dense feature representation in multilingual vision-language encoders, this paper proposes a unified training recipe that integrates captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. It is the first to combine these previously independent techniques in one recipe; it offers variants supporting multiple resolutions and native aspect-ratio inputs; and it trains on a more diverse, de-biased data mixture to improve multilingual understanding and fairness. Within a ViT architecture, the sigmoid contrastive objective is jointly optimized with a caption-generation objective. Experiments show that SigLIP 2 outperforms SigLIP at every model scale on zero-shot classification, image-text retrieval, and visual-representation transfer to vision-language models (VLMs), with significant gains on localization and dense prediction tasks. Four open-source checkpoints spanning 86M to 1B parameters are released.

📝 Abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
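The image-text objective the abstract builds on is SigLIP's pairwise sigmoid loss, which scores every image-text pair in a batch as an independent binary classification (matching pairs are positives, all others negatives); SigLIP 2 then adds the captioning and self-supervised losses on top. A minimal NumPy sketch of that sigmoid loss, for illustration only: the temperature `t`, bias `b`, and the mean-over-pairs normalization are placeholder choices here, not the paper's learned or reported values.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sketch of a SigLIP-style pairwise sigmoid image-text loss.

    Each of the B*B image-text pairs is an independent binary
    classification: diagonal (matching) pairs are positives, all
    off-diagonal pairs are negatives. `t` (temperature) and `b`
    (bias) are illustrative constants; in SigLIP both are learned.
    """
    # L2-normalize so logits are scaled cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b                 # [B, B]
    labels = 2.0 * np.eye(img.shape[0]) - 1.0    # +1 on diagonal, -1 off
    # -log(sigmoid(z)) == logaddexp(0, -z), computed stably;
    # averaged over all pairs (the paper normalizes by batch size).
    return np.mean(np.logaddexp(0.0, -labels * logits))
```

As a sanity check, a batch whose images and texts are embedded identically (perfectly matched pairs) should score a lower loss than the same batch with the text embeddings shuffled against the images.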
Problem

Research questions and friction points this paper is trying to address.

Enhance multilingual vision-language encoders
Improve semantic understanding and localization
Optimize dense features and multilingual fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual vision-language encoders
Captioning-based pretraining
Self-supervised losses integration
Authors

Michael Tschannen (Google DeepMind)
Alexey Gritsenko (Google DeepMind, *core contributor)
Xiao Wang (Google DeepMind, *core contributor)
Muhammad Ferjad Naeem (Research Scientist, Google)
Ibrahim M. Alabdulmohsin (Google DeepMind, *core contributor)
Nikhil Parthasarathy (Google DeepMind)
Talfan Evans (◦work done while at Google DeepMind)
Lucas Beyer (Google DeepMind)
Ye Xia (Google DeepMind)
Basil Mustafa (Google DeepMind)
Olivier Hénaff (◦work done while at Google DeepMind)
Jeremiah Harmsen (Google DeepMind)
A. Steiner (Google DeepMind)
Xiaohua Zhai (◦work done while at Google DeepMind, †project lead)