What Do Learned Models Measure?

📅 2026-01-26
🤖 AI Summary
When machine learning models are used as measurement instruments, it is unclear whether their outputs reflect stable, consistent latent constructs or merely achieve predictive performance. This work formally introduces "learned measurement functions" and proposes "measurement stability" as a distinct evaluation criterion. Through theoretical analysis and an empirical case study, it shows that conventional metrics (generalization error, calibration, robustness) do not guarantee measurement consistency: models with comparable predictive accuracy can implement systematically inequivalent measurement functions, and these discrepancies become pronounced under distribution shift, exposing a limitation of current evaluation frameworks.

📝 Abstract
In many scientific and data-driven applications, machine learning models are increasingly used as measurement instruments, rather than merely as predictors of predefined labels. When the measurement function is learned from data, the mapping from observations to quantities is determined implicitly by the training distribution and inductive biases, allowing multiple inequivalent mappings to satisfy standard predictive evaluation criteria. We formalize learned measurement functions as a distinct focus of evaluation and introduce measurement stability, a property capturing invariance of the measured quantity across admissible realizations of the learning process and across contexts. We show that standard evaluation criteria in machine learning, including generalization error, calibration, and robustness, do not guarantee measurement stability. Through a real-world case study, we show that models with comparable predictive performance can implement systematically inequivalent measurement functions, with distribution shift providing a concrete illustration of this failure. Taken together, our results highlight a limitation of existing evaluation frameworks in settings where learned model outputs are identified as measurements, motivating the need for an additional evaluative dimension.
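The abstract's central claim, that comparable predictive performance does not pin down the learned mapping, can be illustrated with a toy sketch. This example is not from the paper; the setup (two nearly collinear features, a ridge fit vs. a sparse single-feature fit as two inductive biases) is an illustrative assumption. Both models achieve nearly identical in-distribution error, yet implement different measurement functions whose disagreement surfaces once a shift breaks the feature correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features: many input->output mappings fit the
# training distribution almost equally well.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

# Model A: ridge regression; shrinkage spreads weight across both features.
lam = 1.0
w_a = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Model B: least squares on x1 alone; a sparse inductive bias.
w_b = np.array([float(x1 @ y / (x1 @ x1)), 0.0])

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Comparable in-distribution predictive performance.
print(f"in-distribution MSE: A={mse(w_a):.4f}  B={mse(w_b):.4f}")

# Under a shift that breaks the x1-x2 correlation, the two learned
# "measurement functions" disagree sharply.
X_shift = rng.normal(size=(200, 2))
gap_train = float(np.mean(np.abs(X @ (w_a - w_b))))
gap_shift = float(np.mean(np.abs(X_shift @ (w_a - w_b))))
print(f"mean disagreement: in-dist={gap_train:.3f}  shifted={gap_shift:.3f}")
```

If the model's output were being used as a measurement rather than a prediction, standard held-out error would not distinguish these two realizations; only a stability-style check across admissible fits and contexts would.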
Problem

Research questions and friction points this paper is trying to address.

learned measurement
measurement stability
machine learning evaluation
distribution shift
inductive bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

measurement stability
learned measurement functions
distribution shift
model evaluation
inductive bias