PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

📅 2025-04-17
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Vision-language models (VLMs) are predominantly closed-source black boxes, severely hindering reproducibility and rigorous scientific evaluation—especially for fine-grained video understanding, where high-quality annotated data and standardized benchmarks remain scarce. Method: We introduce PLM—the first fully open-source, end-to-end reproducible perception-language model—eschewing knowledge distillation and instead systematically deconstructing the training paradigm. We release 2.8 million human-annotated, fine-grained video question-answer pairs and spatiotemporally aligned descriptive captions. We further propose PLM-VideoBench, a comprehensive benchmark enabling four-dimensional video reasoning evaluation (“what,” “where,” “when,” and “how”). Contribution/Results: All components—including data, code, model weights, and training recipes—are publicly released. Experiments demonstrate that PLM achieves state-of-the-art performance among open-source models on multiple fine-grained video understanding tasks, advancing transparent, verifiable multimodal research.

📝 Abstract
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
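The release described above bundles the annotated data with code and models. Below is a minimal sketch, assuming the video question-answer data is published on the Hugging Face Hub, of how it might be loaded and inspected with the `datasets` library; the dataset identifier and field names are hypothetical placeholders, not the confirmed layout of the release.

```python
# Minimal sketch, assuming the released video QA data lives on the Hugging Face Hub.
# The dataset ID and field names are hypothetical; consult the official PLM release
# for the actual identifiers.
from datasets import load_dataset

ds = load_dataset("facebook/PLM-Video-Human", split="train")  # hypothetical hub ID

# Peek at a few fine-grained question-answer pairs.
for example in ds.select(range(3)):
    print("video:   ", example.get("video"))     # field name assumed
    print("question:", example.get("question"))  # field name assumed
    print("answer:  ", example.get("answer"))    # field name assumed
    print("-" * 40)
```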
Problem

Research questions and friction points this paper is trying to address.

Open-source vision-language models for transparent research
Addressing data gaps in detailed video understanding
Providing reproducible tools for video reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-access Perception Language Model (PLM) framework
Large-scale synthetic data for video understanding
PLM-VideoBench for detailed video evaluation (see the sketch below)
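PLM-VideoBench organizes its tasks around reasoning about the "what", "where", "when", and "how" of a video. The snippet below is a rough sketch of how scores on such a benchmark could be summarized per reasoning dimension; the record fields and the exact-match scoring rule are assumptions for illustration, not the benchmark's official evaluation protocol.

```python
# Rough sketch: aggregate exact-match accuracy per reasoning dimension
# ("what", "where", "when", "how"). Field names and scoring are assumptions,
# not PLM-VideoBench's official protocol.
from collections import defaultdict

def accuracy_by_dimension(records):
    """records: dicts with 'dimension', 'prediction', and 'answer' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        dim = r["dimension"]
        total[dim] += 1
        correct[dim] += int(r["prediction"].strip().lower() == r["answer"].strip().lower())
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy usage with made-up records.
toy = [
    {"dimension": "what",  "prediction": "opening a jar",  "answer": "opening a jar"},
    {"dimension": "when",  "prediction": "after pouring",  "answer": "before pouring"},
    {"dimension": "where", "prediction": "on the counter", "answer": "on the counter"},
]
print(accuracy_by_dimension(toy))  # {'what': 1.0, 'when': 0.0, 'where': 1.0}
```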
🔎 Similar Papers
No similar papers found.
Authors
Jang Hyun Cho
Meta FAIR, UT Austin
Andrea Madotto
Research Scientist at FAIR
Multimodal LLMs, VLMs, NLP, Dialogue Systems, Conversational AI
E. Mavroudi
Meta FAIR
Triantafyllos Afouras
FAIR, Meta, University of Oxford
Computer Vision, Machine Learning, Artificial Intelligence
Tushar Nagarajan
FAIR, Meta
Computer Vision, Machine Learning
Muhammad Maaz
PhD Computer Vision at MBZUAI
Computer Vision, Deep Learning, Vision-Language, Generative AI
Yale Song
Google
Computer Vision, Multimodal Learning, Representation Learning
Tengyu Ma
Meta FAIR
Shuming Hu
Research Engineer, Meta
machine learning, physics
Suyog Jain
Meta FAIR
Miguel Martin
Meta FAIR
Huiyu Wang
Research Scientist, FAIR, Meta
Computer Vision
Hanoona Rasheed
MBZUAI
Peize Sun
Meta FAIR; HKU
Computer Vision, Deep Learning
Po-Yao Huang
Meta FAIR
Daniel Bolya
Meta, FAIR
Computer Vision, Machine Learning, Artificial Intelligence
Nikhila Ravi
Meta AI Research
Shashank Jain
Meta Reality Labs
Tammy Stark
Meta Reality Labs
Shane Moon
Meta Reality Labs
Babak Damavandi
Meta Reality Labs
Vivian Lee
Meta FAIR
Andrew Westbury
Meta FAIR
Salman Khan
MBZUAI
Philipp Krähenbühl
UT Austin
Piotr Dollár
FAIR
computer vision, deep learning, machine learning, artificial intelligence
Lorenzo Torresani
Northeastern University
Computer Vision, Machine Learning
Kristen Grauman
Professor of Computer Science, University of Texas at Austin
Computer Vision, Machine Learning
Christoph Feichtenhofer
Meta
Computer Vision, Machine Learning, Artificial Intelligence