🤖 AI Summary
Data scientists frequently lack systematic guidance when operationalizing ambiguous concepts (e.g., “writing authenticity,” “healthcare need”) into model-ready proxy target variables. To address this, we conducted semi-structured interviews with 15 data scientists across education and healthcare domains, followed by cross-domain thematic coding. We characterize target variable construction as a “bricolage” process and identify five criteria that data scientists try to satisfy: validity, simplicity, predictability, portability, and resource requirements. Our analysis reveals an iterative practice driven by problem (re)formulation, in which target variables are negotiated between high-level measurement objectives and low-level practical constraints. This work characterizes the trade-offs involved in proxy target construction and outlines opportunities for future HCI, CSCW, and machine learning research to better support target variable construction, connecting abstract concepts with operational measurement.
📝 Abstract
Data scientists often formulate predictive modeling tasks involving fuzzy, hard-to-define concepts, such as the "authenticity" of student writing or the "healthcare need" of a patient. Yet the process by which data scientists translate fuzzy concepts into a concrete proxy target variable remains poorly understood. We interview fifteen data scientists in education (N=8) and healthcare (N=7) to understand how they construct target variables for predictive modeling tasks. Our findings suggest that data scientists construct target variables through a bricolage process, involving iterative negotiation between high-level measurement objectives and low-level practical constraints. Data scientists attempt to satisfy five major criteria for a target variable through bricolage: validity, simplicity, predictability, portability, and resource requirements. To achieve this, data scientists adaptively use problem (re)formulation strategies, such as swapping out one candidate target variable for another when the first fails to meet certain criteria (e.g., predictability), or composing multiple outcomes into a single target variable to capture a more holistic set of modeling objectives. Based on our findings, we present opportunities for future HCI, CSCW, and ML research to better support the art and science of target variable construction.