Xray-Visual Models: Scaling Vision models on Industry Scale Data

📅 2026-02-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work proposes a unified and efficient visual-model training framework to address the challenges of general-purpose image and video understanding on industrial-scale multimodal social media data. Built on the Vision Transformer architecture, the framework leverages 15 billion image–text pairs and 10 billion video–label pairs through a three-stage joint training paradigm that integrates Masked Autoencoder (MAE) self-supervision, hashtag-based semi-supervised classification, and CLIP-style contrastive learning. The approach introduces an Efficient Vision Transformer (EViT) with dynamic token reorganization and, for the first time, employs a large language model as the text encoder in CLIP-style retrieval (LLM2CLIP). The resulting model achieves state-of-the-art performance on benchmarks including ImageNet, Kinetics, HMDB51, and MSCOCO, with significant improvements in generalization, deployment efficiency, robustness to perturbations, and cross-modal retrieval.
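The page itself includes no code, but the third training stage is a standard objective that can be sketched concretely. Below is a minimal PyTorch sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over a batch of paired image and text embeddings; the function name, shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (B, D) outputs of the vision and text towers.
    The 0.07 temperature is a common default, assumed here for illustration.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In the paper's pipeline, this objective would follow the MAE and hashtag-classification stages, aligning the jointly trained image/video encoder with the text encoder.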

📝 Abstract
We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding, trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, using robust data curation pipelines that incorporate balancing and noise-suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE pretraining, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize the image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization, particularly in real-world settings. Xray-Visual sets a new benchmark for scalable multimodal vision models while maintaining high accuracy and computational efficiency.
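For readers unfamiliar with EViT, the token-reorganization idea (keep the patch tokens the [CLS] token attends to most, and fuse the inattentive remainder into a single token) can be sketched as below. Shapes, the keep ratio, and the fusion rule are assumptions drawn from the published EViT recipe, not Xray-Visual's internal code.

```python
import torch

def reorganize_tokens(tokens: torch.Tensor,
                      cls_attn: torch.Tensor,
                      keep_ratio: float = 0.7) -> torch.Tensor:
    """EViT-style token reorganization inside a ViT block.

    tokens:   (B, N, D) patch tokens, excluding [CLS].
    cls_attn: (B, N) attention from [CLS] to each patch token,
              averaged over heads (an assumed pooling choice).
    Returns (B, n_keep + 1, D): the attentive tokens plus one fused token.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Keep the tokens [CLS] attends to most, per example.
    topk = cls_attn.topk(n_keep, dim=1).indices                  # (B, n_keep)
    keep = torch.gather(tokens, 1, topk.unsqueeze(-1).expand(-1, -1, D))

    # Fuse the rest into one token, weighted by their attention mass.
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, topk, 0.0)                 # zero out the kept positions
    rest_attn = (cls_attn * mask).unsqueeze(-1)                  # (B, N, 1)
    fused = (tokens * rest_attn).sum(1, keepdim=True) \
            / rest_attn.sum(1, keepdim=True).clamp_min(1e-6)

    return torch.cat([keep, fused], dim=1)
```

Dropping and fusing inattentive tokens shrinks the sequence length the remaining transformer blocks must process, which is where the deployment-efficiency gains of this kind of design come from.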
Problem

Research questions and friction points this paper is trying to address.

large-scale vision models
image-text understanding
video-hashtag alignment
multimodal learning
industrial-scale data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Xray-Visual
three-stage training pipeline
efficient token reorganization (EViT)
LLM2CLIP
industry-scale multimodal data
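The LLM2CLIP idea tagged above (swapping CLIP's text tower for a large language model) reduces, in its simplest form, to pooling the LLM's hidden states and projecting them into the shared embedding space. The sketch below is a hypothetical minimal version: the checkpoint name, mean pooling, and projection width are assumptions, and the published LLM2CLIP recipe additionally adapts the LLM's output space with a caption-contrastive objective.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LLMTextEncoder(nn.Module):
    """LLM2CLIP-style text tower: a frozen LLM whose pooled hidden
    states are projected into the shared image-text embedding space."""

    def __init__(self,
                 model_name: str = "meta-llama/Llama-3.2-1B",  # illustrative choice
                 embed_dim: int = 768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            # Decoder-style LLM tokenizers often ship without a pad token.
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = AutoModel.from_pretrained(model_name)
        self.llm.requires_grad_(False)   # freeze the LLM; train only the head
        self.proj = nn.Linear(self.llm.config.hidden_size, embed_dim)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        hidden = self.llm(**batch).last_hidden_state             # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp_min(1)  # mean pool
        return self.proj(pooled)                                 # (B, embed_dim)
```

Embeddings from this tower and the vision backbone could then be aligned with the contrastive loss sketched under the AI summary above.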
👥 Authors

Shlok Mishra
Meta-AI

Tsung-Yu Lin
Research Scientist, Meta
Computer Vision, Machine Learning

Linda Wang
Meta, Lyft, University of Waterloo
Computer Vision, Deep Learning, Autonomous Vehicles, Medical Imaging

Hongli Xu
University of Science and Technology of China
Software Defined Network, Cooperative Communication, Sensor Networks

Yimin Liu
Meta-AI

Michael Hsu
Meta-AI

Chaitanya Ahuja
Meta AI
Multimodal Machine Learning, Generative Modeling, Computer Vision, Natural Language Processing

Hao Yuan
Research Scientist, Meta Platforms, Inc.
Deep Learning

Jianpeng Cheng
Meta AI
Multimodal AI, Contextual AI

Hong-You Chen
Meta
Vision Foundation Models, Multimodal LLM, Personalization, Machine Learning

Haoyuan Xu
Meta-AI

Chao Li
Reality Labs Research, Meta
3D Reconstruction, Motion Capture, VR/AR, Egocentric Video, Image Super-Resolution

Abhijeet Awasthi
AI Research Scientist, Meta
Machine Learning, Natural Language Processing, Speech Recognition, Artificial Intelligence

Jihye Moon
Meta-AI

Don Husa
Meta-AI

Michael Ge
Meta-AI

Sumedha Singla
University Of Pittsburgh
Medical imaging, Computer Vision, ML, AI

Arkabandhu Chowdhury
Research Scientist, Meta AI
LLM, VLM, Computer Vision, Generative AI

Phong Dingh
Meta-AI

Satya Narayan Shukla
Meta AI
LLMs, Multimodal models, Embedding Models, Missing Data, Deep Learning

Yonghuan Yang
Meta-AI

David Jacobs
Meta-AI, University of Maryland (work done while at Meta AI)

Qi Guo
Meta-AI

Jun Xiao
Meta-AI

Xiangjun Fan
Meta-AI