🤖 AI Summary
This study addresses the challenge of estimating long-term causal effects in digital platform experiments, where such effects must be inferred from many noisy short-term proxy outcomes. The authors formalize this as a latent variable problem in which the observed proxies are noisy measures of a low-dimensional set of unobserved surrogates that mediate treatment effects, and they propose regularized regression methods, such as ridge regression, to distill information from the high-dimensional proxies. Theoretical analysis shows that the bias of ridge regression decreases as more proxies are added, and it yields closed-form expressions for the bias-variance tradeoff. Theory and simulations indicate that regularized regression substantially outperforms naive proxy selection, and the method is illustrated with an empirical application to the California GAIN experiment.
📝 Abstract
We propose a method for estimating long-term treatment effects with many short-term proxy outcomes: a central challenge when experimenting on digital platforms. We formalize this challenge as a latent variable problem where observed proxies are noisy measures of a low-dimensional set of unobserved surrogates that mediate treatment effects. Through theoretical analysis and simulations, we demonstrate that regularized regression methods substantially outperform naive proxy selection. We show in particular that the bias of Ridge regression decreases as more proxies are added, with closed-form expressions for the bias-variance tradeoff. We illustrate our method with an empirical application to the California GAIN experiment.
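The claim that ridge bias shrinks as proxies are added can be illustrated with a minimal sketch. It assumes a toy model that is not taken from the paper: a single latent surrogate `s`, proxies `x_j = s + e_j` with independent noise, a long-term outcome `y = beta * s + noise`, and a treatment that shifts `s` by `delta`. The function name, all parameter values, and the choice to express the penalty `lam` on the population covariance scale are assumptions made for illustration. The code computes population ridge weights and compares the implied long-term effect with the true effect `beta * delta`.

```python
import numpy as np

def ridge_effect_ratio(p, sigma_s2=1.0, sigma_e2=4.0, lam=1.0, beta=2.0, delta=0.5):
    """Ratio of the population ridge-based effect estimate to the true effect
    in a toy one-surrogate model (hypothetical, not the paper's setup)."""
    ones = np.ones(p)
    # Proxy j is x_j = s + e_j, so Cov(x) = sigma_s2 * 11' + sigma_e2 * I.
    cov_x = sigma_s2 * np.outer(ones, ones) + sigma_e2 * np.eye(p)
    # Long-term outcome y = beta * s + noise, so Cov(x, y) = beta * sigma_s2 * 1.
    cov_xy = beta * sigma_s2 * ones
    # Population ridge weights: (Cov(x) + lam * I)^{-1} Cov(x, y).
    w = np.linalg.solve(cov_x + lam * np.eye(p), cov_xy)
    # Treatment shifts the surrogate by delta, hence every proxy mean by delta.
    estimated_effect = w @ (delta * ones)
    true_effect = beta * delta
    return estimated_effect / true_effect

# Ratio of estimated to true effect for a growing number of proxies.
ratios = {p: ridge_effect_ratio(p) for p in (1, 5, 25, 125)}
for p, r in ratios.items():
    print(f"p={p:4d}  estimated/true effect = {r:.3f}")
```

In this toy model the ratio equals `p * sigma_s2 / (sigma_e2 + lam + p * sigma_s2)`, so the estimate is attenuated toward zero for small `p` and approaches the true effect as `p` grows. Selecting a single proxy corresponds to `p = 1`, where the attenuation is largest. The paper's own expressions cover a more general setting and may differ in form.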