P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in physics olympiad problems where critical constraints are embedded in diagrams rather than stated in the text, creating a disconnect between visual perception and physical reasoning. To bridge this gap, the authors propose P1-VL, an open-source vision-language model that integrates curriculum-based reinforcement learning—featuring progressively increasing problem difficulty—with an agent-driven self-verification mechanism at inference time, enabling end-to-end reasoning from multimodal perception to high-level physical inference. Evaluated on the HiPhO benchmark, P1-VL achieves state-of-the-art performance among open-source vision-language models, securing 12 gold medals and ranking second overall, behind only Gemini-3-Pro, while also demonstrating strong generalization across diverse STEM domains.
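The curriculum-based training described above can be pictured as staging problems into progressively harder training pools. The sketch below is purely illustrative — the function name, the `(difficulty, item)` representation, and the cumulative staging rule are assumptions, not the paper's actual pipeline:

```python
# Illustrative sketch of curriculum scheduling: problems are sorted by an
# assumed scalar difficulty score and released in progressively harder
# cumulative stages. Not the paper's actual training code.

def curriculum_stages(problems, n_stages=3):
    """Yield cumulative training pools of increasing difficulty.

    `problems` is a list of (difficulty, item) pairs; each stage keeps
    everything seen so far and adds the next harder slice.
    """
    ordered = sorted(problems, key=lambda p: p[0])
    stage_size = max(1, len(ordered) // n_stages)
    for stage in range(1, n_stages + 1):
        cutoff = min(len(ordered), stage * stage_size)
        yield [item for _, item in ordered[:cutoff]]

# Toy pool: easier items enter training first, olympiad problems last.
pool = [(0.9, "olympiad"), (0.2, "textbook"), (0.5, "contest")]
for stage_pool in curriculum_stages(pool):
    print(stage_pool)
# → ['textbook']
# → ['textbook', 'contest']
# → ['textbook', 'contest', 'olympiad']
```

Expanding the pool cumulatively, rather than swapping slices in and out, is one common way to keep reinforcement-learning post-training stable as harder problems arrive.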

📝 Abstract
The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No. 2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models on STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence that better aligns visual perception with abstract physical laws for machine scientific discovery.
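The Agentic Augmentation described in the abstract amounts to a draft-verify-revise loop at inference time. The sketch below is a minimal toy version under stated assumptions — `draft_fn`, `verify_fn`, and `revise_fn` are hypothetical placeholders for the model call, the verifier agent (e.g., checking dimensional consistency or boundary conditions), and the revision step; none of these names come from the paper:

```python
# Hypothetical sketch of iterative self-verification at inference: draft a
# solution, have a verifier agent critique it, and revise until the check
# passes or the verification budget runs out. All names are illustrative.

def solve_with_self_verification(problem, draft_fn, verify_fn, revise_fn,
                                 max_rounds=3):
    """Iteratively draft, verify, and revise a candidate solution."""
    solution = draft_fn(problem)
    for _ in range(max_rounds):
        ok, critique = verify_fn(problem, solution)
        if ok:
            return solution
        solution = revise_fn(problem, solution, critique)
    return solution  # best effort after exhausting the budget

# Toy instantiation: "solve" 2 + 3 with a deliberately wrong first draft
# and a verifier that rejects incorrect sums.
draft = lambda p: 4
verify = lambda p, s: (s == 5, f"sum is off by {5 - s}")
revise = lambda p, s, critique: s + 1  # nudge toward the correct answer
print(solve_with_self_verification((2, 3), draft, verify, revise))  # → 5
```

The key design point is that verification feedback (the `critique`) conditions the next draft, so each round can correct a specific failure rather than resampling blindly.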
Problem

Research questions and friction points this paper is trying to address.

scientific reasoning
visual perception
physics Olympiads
multimodal reasoning
physical consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Scientific Reasoning
Curriculum Reinforcement Learning
Agentic Augmentation
Physical Intelligence
👥 Authors
Yun Luo — Shanghai AI Lab — natural language processing, graph neural network
Futing Wang — Shanghai AI Laboratory
Qianjia Cheng — Shanghai AI Lab
Fangchen Yu — Ph.D. Candidate, The Chinese University of Hong Kong, Shenzhen — Statistical Machine Learning, Optimization, AI for Science, MLLM
Haodi Lei — Shanghai AI Laboratory
Jianhao Yan — Shanghai AI Laboratory
Chenxi Li — Shanghai AI Laboratory
Jiacheng Chen — Shanghai AI Laboratory
Yufeng Zhao — Shanghai AI Laboratory
Haiyuan Wan — Shanghai AI Laboratory
Yuchen Zhang — Shanghai AI Laboratory
Shenghe Zheng — Harbin Institute of Technology — Large Language Model, Efficient AI, Neural Architecture Search
Junchi Yao — University of Electronic Science and Technology of China; Shanghai AI Lab — XAI, LLM Agents, LLM4Science
Qingyang Zhang — PhD student, Tianjin University — Large Reasoning Models, Out-of-Distribution, Multimodal Fusion
Haonan He — Shanghai AI Laboratory
Wenxuan Zeng — Peking University — Efficient Deep Learning, Large Language Model
Li Sheng — Shanghai AI Laboratory
Chengxing Xie — Shanghai AI Laboratory
Yuxin Zuo — Shanghai AI Laboratory
Yizhuo Li — The University of Hong Kong
Yulun Wu — Shanghai AI Laboratory
Rui Huang — Shanghai AI Laboratory
Dongzhan Zhou — Researcher at Shanghai AI Lab — AI4Science, computer vision, deep learning
Kai Chen — Shanghai AI Laboratory — LLM, VLM, Computer Vision
Yu Qiao — Professor of Shanghai AI Laboratory; Shenzhen Institutes of Advanced Technology, CAS — Computer Vision, Pattern Recognition, Large Multimodal Model, Large Language Model