Date: Sunday, February 2, 2025
Hello, AEA365 community! Liz DiLuzio here, Lead Curator of the blog. This week is Individuals Week, which means we take a break from our themed weeks and spotlight the Hot Tips, Cool Tricks, Rad Resources and Lessons Learned from any evaluator interested in sharing. Would you like to contribute to future individuals weeks? Email me at AEA365@eval.org with an idea or a draft and we will make it happen.
My name is Travis Candieas and I’m a Ph.D. candidate in Policy, Program Evaluation, and Research Methods with an emphasis in Quantitative Methods in the Social Sciences at the University of California, Santa Barbara. I’m interested in how methods affect evaluation outcomes, and how evaluation outcomes affect public policy. Continuing from my presentation at the most recent AEA conference, this article focuses on how one might use AI and machine learning for evaluation in randomized trials.
Recent evaluation research has shown strong interest in artificial intelligence (AI) and its relevance for program evaluation. Today, people are using AI to plan trips, get help with coding tasks, and ask chatbots for assistance with online shopping. However, few know how AI can help us be more confident in our causal inferences. Machine learning for evaluation sounds promising, but how does it work, and will it work for my evaluation?
Evaluators involved in randomized trials or quasi-experimental designs may be aware of the basic assumptions underlying the analysis of experimental data. The Stable Unit Treatment Value Assumption (SUTVA) assumes no crossover of participants between treatment and control groups and no hidden versions of the treatment. Unfortunately, most know that maintaining SUTVA in ongoing real-world evaluations remains a challenge and is often infeasible.
Fortunately, researchers have developed tools such as the propensity score to reduce bias in treatment effect estimates. How does this work? Using propensity scores, we calculate the probability that an individual is assigned to the treatment group based on observable characteristics such as gender or parental education. These probabilities are used to stratify the population and estimate subgroup effects. Common estimation methods include sub-classification, matching, re-weighting, and regression adjustment. One hot tip for applying propensity score analysis is to consider at what level of the population you want to estimate the treatment effect (i.e., within a subgroup, between specific individuals, or across time-varying covariates such as age, education level, etc.).
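For readers who want to see what this looks like in practice, here is a minimal sketch in base R. The data file and variable names (evaluation_data.csv, treatment, outcome, parent_education, and so on) are hypothetical; it estimates propensity scores with logistic regression and then sub-classifies the sample into quintiles.

```r
# Minimal sketch: propensity scores via logistic regression, then sub-classification.
# Assumes a CSV with a 0/1 treatment column, an outcome, and observed covariates
# (all names below are hypothetical placeholders).
dat <- read.csv("evaluation_data.csv")

# Probability of treatment assignment given observable characteristics
ps_model <- glm(treatment ~ gender + parent_education + age,
                data = dat, family = binomial())
dat$pscore <- predict(ps_model, type = "response")

# Sub-classification: stratify on quintiles of the propensity score
dat$ps_stratum <- cut(dat$pscore,
                      breaks = quantile(dat$pscore, probs = seq(0, 1, 0.2)),
                      include.lowest = TRUE)

# Compare mean outcomes by treatment status within each stratum
aggregate(outcome ~ treatment + ps_stratum, data = dat, FUN = mean)
```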
So that’s great, but what happens when you can’t balance treatment and comparison groups? To date, applications of the propensity score have remained largely limited to traditional multivariate techniques for adjusting estimated treatment effects. Fortunately, researchers at Stanford University have started exploring machine learning applications that enable estimation of more specific, individualized treatment effects. Check out their rad resource on machine learning for causal inference for an in-depth implementation guide using R.
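Their guide works through these ideas in R, and one widely used package in that line of work is grf, which implements causal forests. Below is a minimal sketch, assuming a data frame dat (as in the previous example) with a numeric outcome, a 0/1 treatment indicator, and numeric covariates; all variable names are hypothetical.

```r
# Minimal sketch of a causal forest with the grf package.
# install.packages("grf")
library(grf)

# Covariate matrix must be numeric; these column names are hypothetical
X <- as.matrix(dat[, c("age", "female", "parent_education_yrs")])
Y <- dat$outcome      # observed outcome
W <- dat$treatment    # 0/1 treatment indicator

cf <- causal_forest(X, Y, W)

# Individualized (conditional average) treatment effect estimates
tau_hat <- predict(cf)$predictions

# Doubly robust estimate of the average treatment effect
average_treatment_effect(cf)
```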
So how does machine learning in causal inference work? The basis of machine learning is the bias-variance trade-off. Essentially, the root mean squared error (RMSE; the square root of the average squared difference between observed and predicted values) can be obtained from any model, not just the ones traditionally used to derive the propensity score. In machine learning, the data are partitioned into a larger training set and a smaller testing set. Using the larger training set, models are trained and their performance compared using RMSE (for continuous outcomes) or accuracy and specificity (for categorical outcomes). The best-performing model is then selected and tested on the smaller, held-out dataset. One cool trick is using tidymodels in R to conduct your machine learning analysis (and using ChatGPT, sparingly, to assist in solving errors).
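As an illustration, here is a minimal tidymodels sketch of that train/test workflow, again assuming the hypothetical dat data frame used above. Because the treatment indicator is categorical, the model is evaluated with accuracy rather than RMSE.

```r
# Minimal tidymodels sketch: split the data, fit on the training set,
# and evaluate on the held-out test set (variable names are hypothetical).
library(tidymodels)

dat$treatment <- factor(dat$treatment)        # categorical outcome: treated vs. not

set.seed(123)
split    <- initial_split(dat, prop = 0.8)    # larger training set, smaller testing set
train_df <- training(split)
test_df  <- testing(split)

# A simple logistic regression specification (any parsnip model could be swapped in)
spec <- logistic_reg() %>% set_engine("glm")

fit_wf <- workflow() %>%
  add_formula(treatment ~ age + gender + parent_education) %>%
  add_model(spec) %>%
  fit(data = train_df)

# Evaluate the fitted model on the held-out test set
predict(fit_wf, test_df) %>%
  bind_cols(test_df) %>%
  accuracy(truth = treatment, estimate = .pred_class)
```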
Overall, lessons learned from implementing propensity scores in real-world evaluations illustrate that statistical adjustments cannot always correct enough bias to recover valid treatment effects. However, if you find the propensity score is not working, alternative models may perform better at predicting the probability that an individual will participate in treatment. In short, machine learning for evaluation and causal inference can help reduce bias and explore program effects in heterogeneous populations.
Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org . aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.