Simulate data for CCW project

Causal Directed Acyclic Graph

We assume a generic treatment, A, has no causal effect on generic outcome, Y. However, confounding is present through another random variable. The DAG includes age, X a confounder, gender defined through a binary variable F for female. X is both a baseline variable, and a time-varying confounder. Below we assign effects of these variables that will lead to identification of a causal effect between A and Y, if no controlling for the confounding is done.

Figure 1. Causal DAG of treatment and outcome

If a naive analysis (no adjustment) was performed the estimate will be biased, because treatment and outcome are d-connected.

Figure 2. D-connected treatment and outcome

Figure 3. D-separated treatment and outcome after adjustment

Adjustment for X should lead to identification of a NULL effect of treatment on outcome.

Simulation Parameters

set.seed(100)
1n = 10000L
fup = 90

2df = tibble(id = 1:n,
            age = round(rnorm(n, mean = 75, sd = 10), 1),
            female = sample(c(1, 0), size = n, replace = TRUE, prob = c(0.66, 0.34)),
            fup = fup,
            X = rnorm(n, 0, sd=1),
            )
  
setDT(df) 
3d_panel = df[rep(seq(.N), fup)]
d_panel[, exit := (seq_len(.N)), by = list(id)]
d_panel[, enter := exit-1]
d_panel[, time := seq_len(.N), by = list(id)]

1: Specify 10000 individuals with 90 observation periods of follow-up.
2: Specify covariates, age, female and X.
3: Expand dataset, each observation period is an interval of length = 1.

For X, further include X_t which varies across time. A change in X_t is triggered by intermediate variable X_shift which is a random binomial draw (meant to represent a triggering event for a change in X). So for each person, there is a baseline X, then random timepoints in follow-up where X changes.

d_panel[, X_shift := rbinom(.N, 1, 0.05)]
d_panel[, X_shift := cumsum(X_shift), by = list(id)]
d_panel[, X_t := X+(X_shift*(rnorm(1, 0, 1)))]

Figure 4. Distribution of Time-varying Confounder

Time to treatment and outcome is simulated in a longitudinal dataset, to allow for time-varying effect of X. Then the dataset is modified so that persons follow-up ends at the time of outcome:

  set.seed(101)
  
  d_panel_2 = d_panel %>% 
1    # Treat event
    mutate(log_odds = -4.1 + 0.0001*time + -0.00001*(time^2) + 0.2*X + 0.2*X_t,
           p = exp(log_odds) / (1 + exp(log_odds)),
           treat = rbinom(n(), size = 1, prob = p),
           treat = if_else(treat==1, 1, NA_integer_)
    ) %>%
    group_by(id) %>%
      fill(treat, .direction = 'down') %>%
    ungroup %>%
    mutate(treat = coalesce(treat, 0L)) %>%
2    # Outcome event
    mutate(log_odds = -5 + 0.009*time + 0.0003*(time^2) + 0.4*X + 0.3*X_t + 
                                  -0.2*female + 0.07*age + -0.001*(age^2),
           p = exp(log_odds) / (1 + exp(log_odds)),
           outcome = rbinom(n(), size = 1, prob = p),
           outcome = if_else(outcome==1, 1, NA_integer_)
    ) %>%
    group_by(id) %>%
      fill(outcome, .direction = 'down') %>% 
    ungroup %>%
    mutate(outcome = coalesce(outcome, 0L)) %>%
3    group_by(id) %>%
      mutate(t_outcome = coalesce(min(time[outcome==1]), Inf),
             t_treat   = coalesce(min(time[treat==1]), Inf)) %>%
    ungroup %>%
    dplyr::filter(outcome  == 0 | time == t_outcome) %>%
    dplyr::filter(time <= 60) %>%
    mutate(t_treat = if_else(t_outcome < t_treat, Inf, t_treat),
           t_treat = if_else(60 < t_treat, Inf, t_treat))

1: Simulation of treatment event times. Treatment is only a function of time, X, and X_t.
2: Simulation of outcome event times. Outcome is a function of time, X, and X_t, F, Age.
3: Modify dataset so that follow-up times coincide with new treatment and outcome variables.

Simulated treatment times

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1      19      49     Inf     Inf     Inf

Simulated outcome times

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1      58      77     Inf     Inf     Inf

Figure 5. Treatment and Outcome Incidence Across Follow-up

Figure 6. Correlation plot of covariates, treatment, outcome

Simple correlation statistics support the simulated dataset has target associations.

Summary of survival dataset

Rows: 486,944
Columns: 16
$ id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ age       <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ female    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fup       <dbl> 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, …
$ X         <dbl> -0.7311102, -0.7311102, -0.7311102, -0.7311102, -0.7311102, …
$ exit      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ enter     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ time      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ X_shift   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ X_t       <dbl> -0.7311102, -0.7311102, -0.7311102, -0.7311102, -0.7311102, …
$ log_odds  <dbl> -5.502477, -5.492577, -5.482077, -5.470977, -5.459277, -5.44…
$ p         <dbl> 0.004060109, 0.004100338, 0.004143439, 0.004189494, 0.004238…
$ treat     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ outcome   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ t_outcome <dbl> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, …
$ t_treat   <dbl> Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, …

Cloning procedure

The cloning procedure is very project specific, tied to how the treatment strategy is defined so it is hard to give general recommendations here. For this tutorial, we describe an arbitrary window in which treatment is expected to initiate and outline strategies around this:

Treatment Strategies:

Do not ever receive treatment
Initiate treatment within 12 time periods (days, weeks etc.) and if not then treatment will initiate on week 12

Note

The grace window is a funny concept when you first consider it. This does not reflect any trial I have ever heard of actually being done but may identify an interesting or important effect. It is essentially outlying a treatment “protocol” and saying what if everyone adhered to this protocol counterfactual to what was observed.

To clone, you make two copies of the data, and make a new artifical censoring variable, which is a censoring time for the period when clones observed data are no longer consistent with their assigned strategy.

1data_cloned = bind_rows(
                     d_panel_2 %>%
                       mutate(assign = 0,
                              censor = if_else(t_treat <= time, 1L, 0L),
                              event = if_else(censor==0 & t_outcome<=time, 1L, 0L),
                       ),
2                     d_panel_2 %>%
                       mutate(assign = 1,
                              censor = if_else(t_treat > 12 & time>=12, 1L, 0L),
                              event  = if_else(censor==0 & t_outcome<=time, 1L, 0L)
                              )
                       ) %>%
  arrange(id, assign, time) %>%
  group_by(id, assign) %>%
  mutate(t_censor = min(time[censor==1])) %>%
3  dplyr::filter(censor == 0 | time == min(time[censor==1])) %>%
  ungroup %>%
  select(id, time, t_outcome, t_treat, t_censor, censor, event, assign, enter, exit, everything(), -fup, -X_shift, -log_odds, -p)

1: Clones assigned to strategy 1 (No treatment)
2: Clones assigned to strategy 2 (grace window for treatment)
3: Update failure times and events counting for artificial censoring

Summary of simulated, cloned data-set


assign	0	1
Unique persons	10000	10000
Median (Follow-up)	30	12
Total treat (ever)	5116	5116
Total treat (t<=12)	1768	1768
Total events	2380	1364
Total censored	5116	7659

A person-level version of the data is also created (excluded time-varying X) for comparison:

data_cloned_p = data_cloned %>%
  group_by(id, assign) %>%
    mutate(t_censor = min(time[censor==1])) %>%
  ungroup %>%
  mutate(time = pmin(t_censor, t_outcome, 60),
         event = case_when(
           time==t_censor ~ 0, 
           time==t_outcome ~ 1,
           T ~ 0)) %>%
  select(id, assign, t_treat, t_outcome, t_censor, time, event, X, female, age) %>%
  distinct()