What Do Learned Models Measure?

📅 2026-01-26
🤖 AI Summary
When machine learning models are used as measurement instruments, it is unclear whether their outputs reflect stable, consistent latent constructs or merely achieve predictive performance. This work formally introduces "learned measurement functions" and proposes "measurement stability" as a distinct evaluation criterion. Through theoretical analysis and an empirical case study, it shows that conventional metrics (generalization error, calibration, robustness) do not guarantee measurement consistency: models with comparable predictive accuracy can implement systematically inequivalent measurement functions, and these discrepancies become pronounced under distribution shift, exposing a limitation of current evaluation frameworks.

📝 Abstract
In many scientific and data-driven applications, machine learning models are increasingly used as measurement instruments, rather than merely as predictors of predefined labels. When the measurement function is learned from data, the mapping from observations to quantities is determined implicitly by the training distribution and inductive biases, allowing multiple inequivalent mappings to satisfy standard predictive evaluation criteria. We formalize learned measurement functions as a distinct focus of evaluation and introduce measurement stability, a property capturing invariance of the measured quantity across admissible realizations of the learning process and across contexts. We show that standard evaluation criteria in machine learning, including generalization error, calibration, and robustness, do not guarantee measurement stability. Through a real-world case study, we show that models with comparable predictive performance can implement systematically inequivalent measurement functions, with distribution shift providing a concrete illustration of this failure. Taken together, our results highlight a limitation of existing evaluation frameworks in settings where learned model outputs are identified as measurements, motivating the need for an additional evaluative dimension.
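The abstract's central claim, that comparable predictive performance does not pin down the learned mapping, can be illustrated with a toy sketch. This example is not from the paper; the setup (two nearly collinear features, a ridge fit vs. a sparse single-feature fit as two inductive biases) is an illustrative assumption. Both models achieve nearly identical in-distribution error, yet implement different measurement functions whose disagreement surfaces once a shift breaks the feature correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features: many input->output mappings fit the
# training distribution almost equally well.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

# Model A: ridge regression; shrinkage spreads weight across both features.
lam = 1.0
w_a = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Model B: least squares on x1 alone; a sparse inductive bias.
w_b = np.array([float(x1 @ y / (x1 @ x1)), 0.0])

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Comparable in-distribution predictive performance.
print(f"in-distribution MSE: A={mse(w_a):.4f}  B={mse(w_b):.4f}")

# Under a shift that breaks the x1-x2 correlation, the two learned
# "measurement functions" disagree sharply.
X_shift = rng.normal(size=(200, 2))
gap_train = float(np.mean(np.abs(X @ (w_a - w_b))))
gap_shift = float(np.mean(np.abs(X_shift @ (w_a - w_b))))
print(f"mean disagreement: in-dist={gap_train:.3f}  shifted={gap_shift:.3f}")
```

If the model's output were being used as a measurement rather than a prediction, standard held-out error would not distinguish these two realizations; only a stability-style check across admissible fits and contexts would.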
Problem

Research questions and friction points this paper is trying to address.

learned measurement
measurement stability
machine learning evaluation
distribution shift
inductive bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

measurement stability
learned measurement functions
distribution shift
model evaluation
inductive bias