BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the absence of public datasets for conformance classification of Digital Battery Passports (DBPs) by formally defining the task and introducing BatteryPass-12K, the first benchmark for it, generated synthetically from real-world pilot samples. The authors evaluate 22 language models in a zero-shot setting, spanning small language models (SLMs), sparse mixture-of-experts models (MoEs), and dense large language models (LLMs), and run additional few-shot and prompt-injection evaluations. GPT-5.4 achieves average F1 scores of 0.98 and 0.71 on the validation and test sets, respectively. Few-shot examples improve performance significantly, some SLMs outperform larger LLMs, and prompt-injection attacks degrade performance. These findings challenge the assumption that model scale alone ensures better performance, and point instead to the advantage of thinking (reasoning-capable) models and the potential of compact models.
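To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch of how a classification prompt might be assembled. The function name `build_prompt`, the instruction wording, and the label strings `conformant` / `non_conformant` are assumptions for illustration only; the paper's actual prompts and label set are not given in this summary.

```python
def build_prompt(passport_text, examples=()):
    """Assemble a classification prompt (hypothetical format).

    With no examples this is a zero-shot prompt; passing labelled
    (text, label) pairs turns it into a few-shot prompt.
    """
    lines = [
        # Instruction wording and label names are assumed, not from the paper.
        "Classify the digital battery passport record as "
        "'conformant' or 'non_conformant'.",
        "",
    ]
    # Each in-context example is shown as a record followed by its label.
    for text, label in examples:
        lines += [f"Record: {text}", f"Label: {label}", ""]
    # The record to classify comes last, with the label left open.
    lines += [f"Record: {passport_text}", "Label:"]
    return "\n".join(lines)


zero_shot = build_prompt("capacity_kwh=75; chemistry=NMC")
few_shot = build_prompt(
    "capacity_kwh=75; chemistry=NMC",
    examples=[
        ("capacity_kwh=60; chemistry=LFP", "conformant"),
        ("capacity_kwh=; chemistry=LFP", "non_conformant"),
    ],
)
```

The only difference between the two settings in this sketch is whether labelled examples precede the query record, which is the variable the few-shot evaluation changes.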

📝 Abstract

We define a novel task, digital battery passport (DBP) conformance classification, and introduce its first public benchmark: BatteryPass-12K, created synthetically from real pilot samples. The work is motivated by the EU's battery regulation on DBPs, which comes into effect soon, and by the lack of any public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture-of-experts models (MoEs), and dense LLMs. Further analysis and additional evaluations of few-shot inference and prompt-injection attacks show that (1) thinking models perform best, with GPT-5.4 achieving average F1 scores (95% confidence interval in parentheses) of 0.98 (0.03) and 0.71 (0.22) on the validation and test sets, respectively, (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily improve performance, as SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g. lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).
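The reported figures pair an average F1 with a 95% confidence interval. A minimal sketch of how such a pair could be computed is shown below, assuming binary labels, several independent evaluation runs, and a normal-approximation interval (z = 1.96). The abstract does not state how the interval was obtained, so the interval method, the label names, and both function names are assumptions.

```python
import math
from statistics import mean, stdev


def f1_score(y_true, y_pred, positive="conformant"):
    """Binary F1 for the given positive label (label name is assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_with_ci95(scores):
    """Mean and half-width of a normal-approximation 95% interval.

    Requires at least two scores; uses the sample standard deviation.
    """
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width


# Illustrative values only, not results from the paper.
run_scores = [0.6, 0.8]
avg, half = mean_with_ci95(run_scores)
print(f"F1 = {avg:.2f} ({half:.2f})")
```

With few runs, a t-distribution interval would be wider than this approximation; the sketch only illustrates how a "score (interval)" pair of the kind quoted above can be read.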

Problem

Research questions and friction points this paper is trying to address.

digital battery passport

conformance classification

battery regulation

benchmark dataset

EU battery regulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

digital battery passport

conformance classification

BatteryPass-12K