Tutorial for Clone-Censor-Weighting Analyses
Sections
- Data: Discusses the creation of the synthetic data used in these examples.
- Estimation: Describes a step-by-step approach to CCW estimation.
- Inference: Discussion and practical guidance on obtaining confidence intervals.
- Additional Topics: Various related topics, such as how to speed up computation.
- Appendix: Notes on the inference and weighting approaches that may be helpful in adapting this tutorial to a specific project.
For production notes and future efforts see About.
Troubleshooting your own project
If you are having issues with your own CCW project, I recommend downloading the synthetic data for this project, found here: Example Data. Then follow the code on the estimation page (Estimation) and see if you can replicate my results. Once you can do that, compare the data, modeling, and weighting steps to your own project and try to find where the differences lie.
Background reading
The “Clone-Censor-Weight” analytical approach is popular in the target trial emulation (TTE) framework. In brief, each eligible person is cloned into every treatment strategy, clones are artificially censored when their observed data deviate from their assigned strategy, and the remaining observations are weighted to account for that censoring. A TTE need not use CCW, but the approach is popular due to its flexibility. This website is a tutorial meant to guide researchers and others in how to execute a CCW analysis.
I provide some personal insight into the process and the different approaches that can be taken, but I mainly focus on the analytical methods. In order to actually emulate a target trial, a researcher needs to understand the fundamentals of randomized controlled trial (RCT) design, causal inference, probability and statistics, and their practical application with a statistical programming language. I do not go into detail on these topics.
The following readings are recommended for that background and as preparation for this tutorial.
For an in-depth review of causal inference methods, I recommend Hernan and Robins’ Causal Inference: What If.
Randomized Controlled Trials by Emily Zabor et al. provides background on RCT design. Randomization is a key feature that allows a defensible assumption of no confounding. However, another important but often overlooked advantage of RCTs is a well-defined intervention, which allows clear causal contrasts to be made (e.g., enroll eligible persons and then give treatment A versus give treatment B). In observational analyses, the intervention is not always well defined, and the timing of treatment assignment and eligibility assessment are not always aligned in a logical fashion.
Miguel Hernan is a leading expert on TTE and provides a short description of it here: JAMA 2022. It is a two-step process: the first step is defining the ideal RCT and how it can be emulated in the data; the second step is using TTE methods to perform that emulation (the focus of this guide).
Statistical analysis: depending on the approach used, the researcher will need to understand and execute logistic regression models, matching algorithms, and/or failure-time (“survival”) analyses. Additionally, most of these methods do not have well-characterized statistical properties in the TTE setting (due to duplicated data, weighting, etc.), so most applied researchers use bootstrapping to generate uncertainty intervals. See Causal survival analysis (Chapter 17 of the “What If” book linked above).
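Because the cloning and weighting steps invalidate the usual model-based standard errors, the bootstrap resamples patients, reruns the whole pipeline on each resample, and takes percentile limits. Below is a minimal sketch of that resample-refit-percentile loop in Python; the column names and the fit_ccw_estimate function are hypothetical stand-ins for the full pipeline described on the Estimation page, not the tutorial's actual code.

```python
# Minimal sketch of a patient-level nonparametric bootstrap for a CCW analysis.
# All names (df columns, fit_ccw_estimate) are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)

def fit_ccw_estimate(data: pd.DataFrame) -> float:
    """Placeholder for the full clone-censor-weight pipeline:
    clone, apply the censoring rules, fit the weight models, and
    return the effect estimate (e.g., a risk difference)."""
    return data["outcome"].mean()  # stand-in so the sketch runs

def bootstrap_ci(data: pd.DataFrame, n_boot: int = 500, alpha: float = 0.05):
    """Resample patients (not rows) with replacement, rerun the whole
    pipeline on each resample, and take percentile limits."""
    ids = data["id"].unique()
    estimates = []
    for _ in range(n_boot):
        sampled_ids = rng.choice(ids, size=len(ids), replace=True)
        # keep every row belonging to each sampled patient
        boot_df = pd.concat(
            [data[data["id"] == i] for i in sampled_ids], ignore_index=True
        )
        estimates.append(fit_ccw_estimate(boot_df))
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

# toy data so the sketch is self-contained (100 patients, 3 rows each)
df = pd.DataFrame({"id": np.repeat(np.arange(100), 3),
                   "outcome": rng.integers(0, 2, size=300)})
print(bootstrap_ci(df))
```

The key design choice is resampling at the patient level before cloning, so that each bootstrap replicate repeats every step of the analysis (cloning, censoring, weight estimation, and outcome modeling) rather than resampling the already-weighted rows.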
Acknowledgements
The tutorial represents my gathered and organized notes from research projects and didactic training. Collaborators and mentors include: Issa Dahabreh, Kaley Hayes, Daniel Harris, Donald Miller and Andrew Zullo.