Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries

📅 2025-11-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether large language models (LLMs) can implicitly encode physics-based summary statistics, grounded in astrophysical measurements, from natural-language descriptions, and examines how prompt design and linguistic structure influence that encoding. Method: The authors apply sparse autoencoders (SAEs), for the first time in this setting, to analyze LLM text embeddings, enabling an interpretable decomposition of physically meaningful features in the learned representations. Coupled with systematic prompt engineering, they quantify how linguistic factors (terminology precision, syntactic structure, and numerical expression) affect the encoding fidelity of physical quantities. Results: LLM embedding spaces robustly encode observable astrophysical quantities. Prompting strategies significantly modulate encoding quality, with numerical expression and consistent domain terminology emerging as the critical linguistic determinants. The work establishes a novel evaluation paradigm and provides an interpretable analytical toolkit for assessing the scientific knowledge representation capabilities of LLMs.
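The core analysis tool is a sparse autoencoder trained on LLM text embeddings. The paper's actual architecture and hyperparameters are not given here, so the following is only a minimal illustrative sketch: a single-layer autoencoder with a ReLU code and an L1 sparsity penalty, trained by plain gradient descent on synthetic stand-in "embeddings". The dimensions, learning rate, and penalty weight are all assumed for illustration.

```python
import numpy as np

# Minimal sparse-autoencoder (SAE) sketch on synthetic "embeddings".
# Illustrative only: the paper's real architecture/training are not specified.
rng = np.random.default_rng(0)

d_embed, d_code, n = 16, 32, 512            # embedding dim, overcomplete code dim, samples
X = rng.normal(size=(n, d_embed))           # stand-in for LLM text embeddings

W_enc = rng.normal(scale=0.1, size=(d_embed, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_embed))
b = np.zeros(d_code)

# Loss before training, for comparison
Z0 = np.maximum(X @ W_enc + b, 0.0)
loss0 = 0.5 * np.mean((Z0 @ W_dec - X) ** 2)

lr, l1 = 0.01, 0.001
for _ in range(200):
    Z = np.maximum(X @ W_enc + b, 0.0)      # ReLU code: nonnegative, sparse activations
    X_hat = Z @ W_dec
    err = X_hat - X                         # reconstruction error

    # Subgradients of 0.5*mean(err^2) + l1*|Z|, masked through the ReLU
    dZ = (err @ W_dec.T + l1 * np.sign(Z)) * (Z > 0)
    W_dec -= lr * (Z.T @ err) / n
    W_enc -= lr * (X.T @ dZ) / n
    b -= lr * dZ.mean(axis=0)

Z = np.maximum(X @ W_enc + b, 0.0)
loss = 0.5 * np.mean((Z @ W_dec - X) ** 2)
sparsity = float((Z > 0).mean())            # fraction of active code units
```

After training, reconstruction loss drops below its initial value while only a fraction of code units fire per input; in an SAE analysis, individual active units are then inspected for interpretable (here, physically meaningful) features.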

📝 Abstract
Large Language Models have demonstrated the ability to generalize well across domains and modalities, and have even shown in-context learning capabilities. This raises research questions about how they can be used to encode physical information that is usually available only from scientific measurements and is loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate whether LLM embeddings can codify physical summary statistics obtained from scientific measurements through two main questions: 1) Does prompting play a role in how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.
Problem

Research questions and friction points this paper is trying to address.

Investigating how LLMs encode astrophysical data from scientific measurements
Examining how prompting affects LLM codification of physical quantities
Identifying which language aspects best encode physics from measurements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLM embeddings to encode astrophysical measurement data
Investigating prompt influence on physics quantity codification
Applying sparse autoencoders for interpretable feature extraction
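The claim that embeddings "encode" a physical quantity is typically tested with a probe: fit a simple regressor from embeddings to the measured quantity and check held-out accuracy. The paper's exact protocol is not reproduced here; this is a generic hedged sketch using ridge regression on synthetic data, with all names and sizes assumed.

```python
import numpy as np

# Hedged linear-probe sketch: can a "physical summary statistic" be read out
# of text embeddings by ridge regression? Everything here is synthetic.
rng = np.random.default_rng(1)
n, d = 400, 64
E = rng.normal(size=(n, d))                  # stand-in LLM embeddings
w_true = rng.normal(size=d)
y = E @ w_true + 0.1 * rng.normal(size=n)    # "physical quantity" linearly encoded

# Closed-form ridge: w = (E^T E + alpha*I)^-1 E^T y, fit on the first 300 rows
alpha = 1.0
E_tr, y_tr = E[:300], y[:300]
w = np.linalg.solve(E_tr.T @ E_tr + alpha * np.eye(d), E_tr.T @ y_tr)

# Held-out R^2 on the remaining 100 rows
pred = E[300:] @ w
ss_res = np.sum((y[300:] - pred) ** 2)
ss_tot = np.sum((y[300:] - y[300:].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

A high held-out R² indicates the quantity is linearly recoverable from the embedding space; the paper goes further by using SAE features rather than raw dimensions, making the recovered structure interpretable.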