AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

📅 2026-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the bottleneck in manual vetting caused by the surge of multimodal data in astronomical observations by introducing AstroAlertBench, a novel benchmark based on 1,500 real transient alerts from the Zwicky Transient Facility (ZTF). For the first time, “honesty”—defined as a model’s ability to self-assess the reliability of its own reasoning—is incorporated as an evaluation dimension. The framework systematically evaluates 13 state-of-the-art multimodal large language models through a three-stage logical pipeline comprising metadata anchoring, scientific reasoning, and five-tier hierarchical classification, all leveraging combined image and metadata inputs. The findings reveal that high accuracy does not necessarily imply high reliability, prompting the proposal of a human–AI collaborative evaluation protocol. This work establishes the first empirical framework and benchmark for developing calibrated, interpretable AI assistants in astronomy.
📝 Abstract
Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs that support visual input. Our results reveal that high accuracy does not always align with model “honesty,” defined as the ability to self-evaluate its reasoning, which affects its reliability as a real-world assistant. We further initialize a human-in-the-loop evaluation protocol as a precursor to future community-scale participation. Together, AstroAlertBench provides a framework for developing calibrated and interpretable astronomical assistants.
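The gap between accuracy and "honesty" described in the abstract can be illustrated with a minimal sketch. This is not the paper's metric: the `AlertResult` record, the `self_reported_correct` field, the agreement score, and the placeholder class labels are all assumptions made for illustration. The sketch only shows how a model can score well on accuracy while its self-assessment tracks correctness poorly.

```python
from dataclasses import dataclass


@dataclass
class AlertResult:
    """One evaluated alert (hypothetical record layout, not from the paper)."""
    true_label: str
    predicted_label: str
    # Assumed field: the model's own verdict on whether its answer is reliable.
    self_reported_correct: bool


def accuracy(results: list[AlertResult]) -> float:
    """Fraction of alerts whose predicted label matches the true label."""
    return sum(r.predicted_label == r.true_label for r in results) / len(results)


def self_assessment_agreement(results: list[AlertResult]) -> float:
    """Fraction of alerts where the model's self-verdict matches actual correctness.

    This is an illustrative stand-in for an 'honesty' score, not the benchmark's definition.
    """
    agree = sum(
        (r.predicted_label == r.true_label) == r.self_reported_correct
        for r in results
    )
    return agree / len(results)


# Placeholder labels "A".."C"; the paper's five categories are not listed here.
results = [
    AlertResult("A", "A", True),   # correct, and the model says so
    AlertResult("B", "B", True),   # correct, and the model says so
    AlertResult("C", "C", False),  # correct, but the model doubts itself
    AlertResult("A", "B", True),   # wrong, but the model claims reliability
]

print(f"accuracy = {accuracy(results):.2f}")
print(f"self-assessment agreement = {self_assessment_agreement(results):.2f}")
```

In this toy sample the model is right on three of four alerts, yet its self-verdict matches reality on only two, which is the kind of mismatch the abstract warns about.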
Problem

Research questions and friction points this paper is trying to address.

multimodal LLMs
astronomical classification
model honesty
scientific reasoning
transient events
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal LLMs
astronomical classification
reasoning honesty
AstroAlertBench
human-in-the-loop evaluation
🔎 Similar Papers
Claire Chen
PhD student, Stanford University
contact-rich manipulation, robot learning, multi-modal sensing
Jiabao Sean Xiao
California Institute of Technology
Shuze Daniel Liu
Massachusetts Institute of Technology
Facundo Perez Paolino
California Institute of Technology
Luke Handley
California Institute of Technology
Theophile Jegou du Laz
California Institute of Technology
Ricky Nilsson
California Institute of Technology
Alice Zou
California Institute of Technology
Matthew Graham
U.S. Census Bureau
Ashish Mahabal
California Institute of Technology