Xray-Visual Models: Scaling Vision models on Industry Scale Data

📅 2026-02-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work proposes a unified and efficient visual-model training framework to address the challenges of general-purpose image and video understanding on industrial-scale multimodal social media data. Built on the Vision Transformer architecture, the framework leverages 15 billion image–text pairs and 10 billion video–label pairs through a three-stage joint training paradigm that integrates Masked Autoencoder (MAE) self-supervision, hashtag-based semi-supervised classification, and CLIP-style contrastive learning. The approach introduces an Efficient Vision Transformer (EViT) with dynamic token reorganization and, for the first time, employs a large language model as the text encoder in CLIP-style retrieval (LLM2CLIP). The resulting model achieves state-of-the-art performance on benchmarks including ImageNet, Kinetics, HMDB51, and MSCOCO, with significant improvements in generalization, deployment efficiency, robustness to perturbations, and cross-modal retrieval.
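The page itself includes no code, but the third training stage is a standard objective that can be sketched concretely. Below is a minimal PyTorch sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over a batch of paired image and text embeddings; the function name, shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (B, D) outputs of the vision and text towers.
    The 0.07 temperature is a common default, assumed here for illustration.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In the paper's pipeline, this objective would follow the MAE and hashtag-classification stages, aligning the jointly trained image/video encoder with the text encoder.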

📝 Abstract
We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding, trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, using robust data curation pipelines that incorporate balancing and noise-suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE pretraining, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize the image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization, particularly in real-world settings. Xray-Visual sets a new benchmark for scalable multimodal vision models while maintaining high accuracy and computational efficiency.
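For readers unfamiliar with EViT, the token-reorganization idea (keep the patch tokens the [CLS] token attends to most, and fuse the inattentive remainder into a single token) can be sketched as below. Shapes, the keep ratio, and the fusion rule are assumptions drawn from the published EViT recipe, not Xray-Visual's internal code.

```python
import torch

def reorganize_tokens(tokens: torch.Tensor,
                      cls_attn: torch.Tensor,
                      keep_ratio: float = 0.7) -> torch.Tensor:
    """EViT-style token reorganization inside a ViT block.

    tokens:   (B, N, D) patch tokens, excluding [CLS].
    cls_attn: (B, N) attention from [CLS] to each patch token,
              averaged over heads (an assumed pooling choice).
    Returns (B, n_keep + 1, D): the attentive tokens plus one fused token.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Keep the tokens [CLS] attends to most, per example.
    topk = cls_attn.topk(n_keep, dim=1).indices                  # (B, n_keep)
    keep = torch.gather(tokens, 1, topk.unsqueeze(-1).expand(-1, -1, D))

    # Fuse the rest into one token, weighted by their attention mass.
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, topk, 0.0)                 # zero out the kept positions
    rest_attn = (cls_attn * mask).unsqueeze(-1)                  # (B, N, 1)
    fused = (tokens * rest_attn).sum(1, keepdim=True) \
            / rest_attn.sum(1, keepdim=True).clamp_min(1e-6)

    return torch.cat([keep, fused], dim=1)
```

Dropping and fusing inattentive tokens shrinks the sequence length the remaining transformer blocks must process, which is where the deployment-efficiency gains of this kind of design come from.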
Problem

Research questions and friction points this paper is trying to address.

large-scale vision models
image-text understanding
video-hashtag alignment
multimodal learning
industrial-scale data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Xray-Visual
three-stage training pipeline
efficient token reorganization (EViT)
LLM2CLIP
industry-scale multimodal data
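The LLM2CLIP idea tagged above (swapping CLIP's text tower for a large language model) reduces, in its simplest form, to pooling the LLM's hidden states and projecting them into the shared embedding space. The sketch below is a hypothetical minimal version: the checkpoint name, mean pooling, and projection width are assumptions, and the published LLM2CLIP recipe additionally adapts the LLM's output space with a caption-contrastive objective.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LLMTextEncoder(nn.Module):
    """LLM2CLIP-style text tower: a frozen LLM whose pooled hidden
    states are projected into the shared image-text embedding space."""

    def __init__(self,
                 model_name: str = "meta-llama/Llama-3.2-1B",  # illustrative choice
                 embed_dim: int = 768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            # Decoder-style LLM tokenizers often ship without a pad token.
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = AutoModel.from_pretrained(model_name)
        self.llm.requires_grad_(False)   # freeze the LLM; train only the head
        self.proj = nn.Linear(self.llm.config.hidden_size, embed_dim)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        hidden = self.llm(**batch).last_hidden_state             # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp_min(1)  # mean pool
        return self.proj(pooled)                                 # (B, embed_dim)
```

Embeddings from this tower and the vision backbone could then be aligned with the contrastive loss sketched under the AI summary above.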
👥 Authors

Shlok Mishra
Meta-AI

Tsung-Yu Lin
Research Scientist, Meta
Computer Vision, Machine Learning

Linda Wang
Meta, Lyft, University of Waterloo
Computer Vision, Deep Learning, Autonomous Vehicles, Medical Imaging

Hongli Xu
University of Science and Technology of China
Software Defined Network, Cooperative Communication, Sensor Networks

Yimin Liu
Meta-AI

Michael Hsu
Meta-AI

Chaitanya Ahuja
Meta AI
Multimodal Machine Learning, Generative Modeling, Computer Vision, Natural Language Processing

Hao Yuan
Research Scientist, Meta Platforms, Inc.
Deep Learning

Jianpeng Cheng
Meta AI
Multimodal AI, Contextual AI

Hong-You Chen
Meta
Vision Foundation Models, Multimodal LLM, Personalization, Machine Learning

Haoyuan Xu
Meta-AI

Chao Li
Reality Labs Research, Meta
3D Reconstruction, Motion Capture, VR/AR, Egocentric Video, Image Super-Resolution

Abhijeet Awasthi
AI Research Scientist, Meta
Machine Learning, Natural Language Processing, Speech Recognition, Artificial Intelligence

Jihye Moon
Meta-AI

Don Husa
Meta-AI

Michael Ge
Meta-AI

Sumedha Singla
University Of Pittsburgh
Medical imaging, Computer Vision, ML, AI

Arkabandhu Chowdhury
Research Scientist, Meta AI
LLM, VLM, Computer Vision, Generative AI

Phong Dingh
Meta-AI

Satya Narayan Shukla
Meta AI
LLMs, Multimodal models, Embedding Models, Missing Data, Deep Learning

Yonghuan Yang
Meta-AI

David Jacobs
Meta-AI, University of Maryland (work done while at Meta AI)

Qi Guo
Meta-AI

Jun Xiao
Meta-AI

Xiangjun Fan
Meta-AI