🤖 AI Summary
This study addresses the manual-vetting bottleneck created by the surge of multimodal data in astronomical observations. It introduces AstroAlertBench, a benchmark built on 1,500 real transient alerts from the Zwicky Transient Facility (ZTF). The benchmark adds “honesty,” defined as a model’s ability to self-assess the reliability of its own reasoning, as an evaluation dimension alongside accuracy. Using combined image and metadata inputs, the framework evaluates 13 state-of-the-art multimodal large language models along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. The findings show that high accuracy does not necessarily imply high reliability, and the authors initiate a human-in-the-loop evaluation protocol as a follow-up. The work offers an empirical framework and benchmark for developing calibrated, interpretable AI assistants in astronomy.
📝 Abstract
Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs that support visual input. Our results reveal that high accuracy does not always align with model “honesty,” defined as a model's ability to self-evaluate its own reasoning, which affects its reliability as a real-world assistant. We further initiate a human-in-the-loop evaluation protocol as a precursor to future community-scale participation. Together, these components make AstroAlertBench a framework for developing calibrated and interpretable astronomical assistants.
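To illustrate why accuracy and “honesty” can diverge, the following is a minimal sketch, not the benchmark's actual scoring code. It assumes each model response carries a predicted label plus a boolean self-assessment of its own reasoning; the `AlertResult` type, the `self_assessment_agreement` metric, and the placeholder labels `class_a` and `class_b` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class AlertResult:
    """One model response to one alert (hypothetical schema)."""
    predicted: str          # label the model chose
    truth: str              # reference label
    self_assessed_ok: bool  # model's own verdict on its reasoning


def accuracy(results):
    """Fraction of alerts where the predicted label matches the reference."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.predicted == r.truth for r in results) / len(results)


def self_assessment_agreement(results):
    """Fraction of alerts where the self-assessment matches actual correctness.

    This is an assumed stand-in for 'honesty', not the paper's definition.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(
        r.self_assessed_ok == (r.predicted == r.truth) for r in results
    ) / len(results)


# Model A: more accurate, but always claims its reasoning is sound.
model_a = [
    AlertResult("class_a", "class_a", True),
    AlertResult("class_b", "class_b", True),
    AlertResult("class_a", "class_a", True),
    AlertResult("class_a", "class_b", True),   # wrong, yet self-assessed as fine
]

# Model B: less accurate, but flags exactly the answers it gets wrong.
model_b = [
    AlertResult("class_a", "class_a", True),
    AlertResult("class_b", "class_b", True),
    AlertResult("class_a", "class_b", False),  # wrong and flagged
    AlertResult("class_b", "class_a", False),  # wrong and flagged
]

for name, results in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(results), self_assessment_agreement(results))
```

Under these assumptions, model A scores higher on accuracy while model B scores higher on self-assessment agreement. That is the kind of mismatch the abstract describes: a reviewer can tell which of model B's answers to double-check, but gets no such signal from model A.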