Additional Topics
Reporting CCW results and troubleshooting
Evaluation of Target Trial Design Elements
Evaluation of Grace Period
The selection of a grace period is ideally based on knowledge of how the treatment is used in the real world. For example, if treatment guidelines recommend that an intervention take place within a specific time frame (6 weeks from hospital discharge, 1 hour from presentation in the ED, etc.), then it makes sense to follow that recommendation when selecting the grace period. However, practical limitations may change this for a given project: even if 6 weeks is the recommended window, very few patients may initiate treatment that early, and a 12-week window may capture more treated cases.
Whatever period is selected, a sound practice is to evaluate how the results change under different grace periods. The simplest case is when the choice of window does not change the estimated effect very much, but a variety of things can affect this. For example, if the probability of treatment initiation varies widely over time due to confounding or a general time trend, then different grace windows could substantially change the results.
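As a rough sketch of such a sensitivity check, assume a hypothetical wrapper fit_ccw() (not defined in this document) that runs the full clone-censor-weight pipeline on the raw data dta for a given grace period and returns the estimated risk difference:

# Hypothetical sensitivity check over candidate grace periods (in weeks);
# fit_ccw() and dta are illustrative placeholders, not objects defined above.
grace_periods = c(6, 8, 12)
sens = sapply(grace_periods, function(g) fit_ccw(dta, grace = g))
data.frame(grace = grace_periods, cid = sens)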
Computational issues
Because the pooled logistic regression is fit to a person-time dataset, large sample sizes and long follow-up periods can produce very large datasets and make estimation very time consuming.
Benchmarked Estimation Step
Speeding up GLM ML procedure
There are two main things that can be done to speed up GLM: 1) initialize the parameters from a prior estimation step, and 2) use efficient matrix routines and parallel computation.
Initialization hack
This is a simple trick: either 1) run the GLM once and store the estimated coefficients, or 2) run the GLM on a 10% sample. Either way, the stored coefficients are then used as starting values for subsequent fits.
d_fit = glm(event==0 ~ poly(time, 2, raw=T)*assign,            # 1
            data=dta_c_panel, family=binomial())

d_fit_2 = glm(event==0 ~ poly(time, 2, raw=T)*assign,
              data=dta_c_panel, family=binomial(),
              start = d_fit$coefficients)                       # 2

1. GLM procedure with the automatic (default) initialization step.
2. GLM with the `start=` option, using coefficients from the prior step.
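The second variant, fitting first on a 10% subsample to obtain starting values, would look something like the following sketch (object names are illustrative; `dta_c_panel` and the model formula are from above):

# Variant 2 (sketch): starting values from a 10% subsample of the person-time data
idx      = sample(nrow(dta_c_panel), size = round(0.1 * nrow(dta_c_panel)))
d_init   = glm(event==0 ~ poly(time, 2, raw=T)*assign,
               data = dta_c_panel[idx, ], family = binomial())
d_fit_2b = glm(event==0 ~ poly(time, 2, raw=T)*assign,
               data = dta_c_panel, family = binomial(),
               start = d_init$coefficients)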
BLAS/LAPACK and Parallel computation
The `parglm` package provides much faster computation (though it can be somewhat less stable).
library(parglm)

d_fit_3 = parglm(event==0 ~ poly(time, 2, raw=T)*assign,
                 family=binomial(), data=dta_c_panel,
                 start = d_fit$coefficients,
                 control = parglm.control(method = "FAST", nthreads = 8L))   # 1

1. The `parglm()` function works mostly like `glm()`; `parglm.control()` allows some additional options for parallel computation and the QR decomposition.
Benchmarking GLM methods
Unit: seconds

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
base GLM | 1.830 | 1.930 | 1.920 | 1.930 | 1.940 | 1.980 | 5 |
GLM with inits | 0.639 | 0.750 | 0.761 | 0.764 | 0.784 | 0.868 | 5 |
PARGLM | 0.377 | 0.378 | 0.388 | 0.382 | 0.393 | 0.413 | 5 |
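The timing table above is in the format produced by the `microbenchmark` package; a comparable benchmark could be set up roughly as follows, reusing the fits defined above (a sketch, not necessarily the exact code used here):

library(microbenchmark)

# Time each estimation approach 5 times
microbenchmark(
  "base GLM"       = glm(event==0 ~ poly(time, 2, raw=T)*assign,
                         data=dta_c_panel, family=binomial()),
  "GLM with inits" = glm(event==0 ~ poly(time, 2, raw=T)*assign,
                         data=dta_c_panel, family=binomial(),
                         start = d_fit$coefficients),
  "PARGLM"         = parglm(event==0 ~ poly(time, 2, raw=T)*assign,
                            data=dta_c_panel, family=binomial(),
                            start = d_fit$coefficients,
                            control = parglm.control(method = "FAST", nthreads = 8L)),
  times = 5
)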
Even in this small example, `parglm()` significantly outperforms base `glm()`, and the advantage will scale considerably with more cores and larger datasets. `parglm()` may be less stable (convergence issues), but it should be sufficient for most problems.
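One quick stability check is to compare the point estimates from the two fits directly; a minimal sketch, assuming `d_fit_2` and `d_fit_3` from above:

# Side-by-side comparison of coefficients from the glm() and parglm() fits
round(cbind(glm = d_fit_2$coefficients, parglm = d_fit_3$coefficients), 5)

The output corresponds to the comparison table below.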
Coefficient | glm() | parglm() |
---|---|---|
(Intercept) | 5.13909 | 5.13909 |
poly(time, 2, raw = T)1 | -0.00057 | -0.00057 |
poly(time, 2, raw = T)2 | -0.00020 | -0.00020 |
assign | 0.14231 | 0.14231 |
poly(time, 2, raw = T)1:assign | -0.01979 | -0.01979 |
poly(time, 2, raw = T)2:assign | 0.00020 | 0.00020 |
Bootstrapping procedure
I consider bootstrapping to be the standard for inference in this type of analysis; the statistical properties of the estimators are not well described, though some use influence-function-based statistics instead.
The typical bootstrap procedure resamples the entire dataset at each iteration. Depending on how it is set up, this can be very inefficient, because it may require holding the original dataset, the new bootstrapped dataset, and potentially a model matrix for the regression optimization step in memory at the same time. However, some clever use of matrices and shortcuts can work around this.
The bootstrap procedure must sample at the person level to account for the cloning.
Inefficient Bootstrap
boot_it_1 = function(x) {
  setDT(x)

  d_ids  = distinct(x, id)                                    # 1
  d_boot = slice_sample(d_ids, prop=1, replace=T)             # 2
  d_panel_outc_2 = left_join(d_boot, x,                       # 3
                             by = join_by(id),
                             relationship = "many-to-many")

  d_glm_pe_1 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=d_panel_outc_2[d_panel_outc_2$assign==1, ],
                   family=binomial())
  d_glm_pe_0 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=d_panel_outc_2[d_panel_outc_2$assign==0, ],
                   family=binomial())

  d_panel_outc_2$pr_1 = predict(d_glm_pe_1, newdata = d_panel_outc_2,
                                type='response')
  d_panel_outc_2$pr_0 = predict(d_glm_pe_0, newdata = d_panel_outc_2,
                                type='response')

  d_panel_outc_2[, `:=`(pr_cum_1 = cumprod(pr_1)), by=list(id, assign)]
  d_panel_outc_2[, `:=`(pr_cum_0 = cumprod(pr_0)), by=list(id, assign)]

  d_res = d_panel_outc_2 %>%
    group_by(time) %>%
    summarize(pr_ev_1 = mean(1-pr_cum_1),
              pr_ev_0 = mean(1-pr_cum_0),
              .groups = 'drop') %>%
    ungroup() %>%
    mutate(cid = pr_ev_1 - pr_ev_0,
           cir = pr_ev_1 / pr_ev_0)

  return(d_res$cid[d_res$time==60])
}
1. Generate a list of unique person IDs.
2. Sample from the list with replacement.
3. Left-join the resampled IDs back onto the dataset.
More efficient bootstrap
Rather than resampling rows of the dataset with replacement, an alternative is to approximate the resampling with a frequency weight. If you randomly assign every person (the independent sampling unit here) a value drawn from a Poisson distribution with mean 1 and use this value as a frequency weight in the estimators, you will closely approximate the full bootstrap procedure as long as the overall sample size is greater than about 100 (Hanley and MacGibbon 2006). This is very computationally efficient because you do not need to know the dataset size before assigning the frequency weights, and you do not need to join or work with multiple large matrices.
boot_it_2 = function(x) {
  setDT(x)

  x[, freqwt := rpois(n=1, lambda=1), by = factor(id)]        # 1

  d_glm_pe_1 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=x[x$assign==1, ],
                   family=binomial(), weights = freqwt)
  d_glm_pe_0 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=x[x$assign==0, ],
                   family=binomial(), weights = freqwt)

  x$pr_1 = predict(d_glm_pe_1, newdata = x, type='response')
  x$pr_0 = predict(d_glm_pe_0, newdata = x, type='response')

  x[, `:=`(pr_cum_1 = cumprod(pr_1)), by=list(id, assign)]
  x[, `:=`(pr_cum_0 = cumprod(pr_0)), by=list(id, assign)]

  d_res = x %>%
    group_by(time) %>%
    summarize(pr_ev_1 = weighted.mean(1-pr_cum_1, freqwt),    # 2
              pr_ev_0 = weighted.mean(1-pr_cum_0, freqwt),
              .groups = 'drop') %>%
    ungroup() %>%
    mutate(cid = pr_ev_1 - pr_ev_0,
           cir = pr_ev_1 / pr_ev_0)

  return(d_res$cid[d_res$time==60])
}
1. Generate a frequency weight for each person ID from a Poisson distribution with mean 1.
2. Note the use of `weighted.mean()` here versus `mean()` in the earlier code.
Benchmarking bootstrap methods
Unit: seconds

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
Resampling bootstrap | 1.77 | 1.80 | 1.88 | 1.89 | 1.95 | 2.00 | 5 |
Poisson bootstrap | 1.78 | 1.84 | 1.83 | 1.84 | 1.85 | 1.87 | 5 |
Parallel computation
The next step is to parallelize the bootstrap iterations. The efficiency gain depends on how the data are set up and which steps run in parallel; it is not efficient to hold several copies of a very large dataset in memory at once just so a CPU worker can be assigned to each.
If you don’t believe the Poisson approximation provides similar coverage, the intervals from 50 bootstraps with both procedures are shown in the table at the end of this section.
Here is a simple example, run on a Dell laptop, using the `furrr` package with 10 workers:
library(furrr)

boots = 200

plan(multisession, workers = 10)                                    # 1

resampling_res = future_map_dbl(1:boots,
                                function(x) boot_it_1(dta_c_panel),
                                .options = furrr_options(seed=T))   # 2

poisson_res = future_map_dbl(1:boots,
                             function(x) boot_it_2(dta_c_panel),
                             .options = furrr_options(seed=T))
1. Assign the number of CPU workers for the job.
2. The `seed=T` option is important because the bootstrap function uses random number generation.
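From the two vectors of bootstrap estimates, summaries and percentile intervals like those in the table below can be computed along these lines (a sketch using the objects defined above):

# Summary statistics and percentile 95% intervals for each bootstrap procedure
summ = function(v) c(min = min(v),
                     quantile(v, probs = c(0.025, 0.5, 0.975)),
                     mean = mean(v), max = max(v), sd = sd(v))
rbind(Resampling = summ(resampling_res),
      Poisson    = summ(poisson_res))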
Method | Min | Lower CI | Mean | Median | Upper CI | Max | SD |
---|---|---|---|---|---|---|---|
Poisson | 0.0386 | 0.0526 | 0.0767 | 0.0769 | 0.0994 | 0.1043 | 0.0119 |
Resampling | 0.0398 | 0.0533 | 0.0806 | 0.0802 | 0.1096 | 0.1165 | 0.0146 |