Additional Topics
Reporting CCW results and troubleshooting
Evaluation of Target Trial Design Elements
Evaluation of Grace Period
The selection of a grace period is ideally based on knowledge of how the treatment is used in the real world. For example, if treatment guidelines recommend that an intervention take place within a specific time frame (6 weeks from hospital discharge, 1 hour from presentation in the ED, etc.), then it makes sense to follow that recommendation when selecting the grace period. However, practical limitations may change this for a given project: even if 6 weeks is the recommended window, very few patients may initiate treatment that early, and a 12-week window may capture more treated cases.
Whatever period is selected, a sound practice is to evaluate how the results change under different grace periods. The simplest case is when the choice of window does not change the estimated effect very much, but a variety of things can affect this. For example, if the probability of treatment initiation varies widely over time due to confounding or a general time trend, then different grace windows could substantially change the results.
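As a rough sketch of such a sensitivity check, assume a hypothetical wrapper fit_ccw() (not defined in this document) that runs the full clone-censor-weight pipeline on the raw data dta for a given grace period and returns the estimated risk difference:

# Hypothetical sensitivity check over candidate grace periods (in weeks);
# fit_ccw() and dta are illustrative placeholders, not objects defined above.
grace_periods = c(6, 8, 12)
sens = sapply(grace_periods, function(g) fit_ccw(dta, grace = g))
data.frame(grace = grace_periods, cid = sens)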
Computational issues
Because the pooled logistic regression is fit to a person-time dataset, large sample sizes and long follow-up periods can produce very large datasets and make estimation very time consuming.
Benchmarked Estimation Step
Speeding up GLM ML procedure
There are two main things that can be done to speed up GLM: 1) initialize the parameters from a prior estimation step, and 2) use efficient matrix routines and parallel computation.
Initialization hack
This is a simple trick: either 1) run the GLM once and store the estimated coefficients, or 2) run the GLM on a 10% sample. Either way, the stored coefficients are then used as starting values for subsequent fits.
d_fit = glm(event==0 ~ poly(time, 2, raw=T)*assign,            # 1
            data=dta_c_panel, family=binomial())

d_fit_2 = glm(event==0 ~ poly(time, 2, raw=T)*assign,
              data=dta_c_panel, family=binomial(),
              start = d_fit$coefficients)                       # 2

1. GLM procedure with the automatic (default) initialization step.
2. GLM with the `start=` option, using coefficients from the prior step.
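The second variant, fitting first on a 10% subsample to obtain starting values, would look something like the following sketch (object names are illustrative; `dta_c_panel` and the model formula are from above):

# Variant 2 (sketch): starting values from a 10% subsample of the person-time data
idx      = sample(nrow(dta_c_panel), size = round(0.1 * nrow(dta_c_panel)))
d_init   = glm(event==0 ~ poly(time, 2, raw=T)*assign,
               data = dta_c_panel[idx, ], family = binomial())
d_fit_2b = glm(event==0 ~ poly(time, 2, raw=T)*assign,
               data = dta_c_panel, family = binomial(),
               start = d_init$coefficients)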
BLAS/LAPACK and Parallel computation
The `parglm` package provides much faster computation (though it can be somewhat less stable).
library(parglm)

d_fit_3 = parglm(event==0 ~ poly(time, 2, raw=T)*assign,
                 family=binomial(), data=dta_c_panel,
                 start = d_fit$coefficients,
                 control = parglm.control(method = "FAST", nthreads = 8L))   # 1

1. The `parglm()` function works mostly like `glm()`; `parglm.control()` allows some additional options for parallel computation and the QR decomposition.
Benchmarking GLM methods
Unit: seconds

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
base GLM | 1.830 | 1.930 | 1.920 | 1.930 | 1.940 | 1.980 | 5 |
GLM with inits | 0.639 | 0.750 | 0.761 | 0.764 | 0.784 | 0.868 | 5 |
PARGLM | 0.377 | 0.378 | 0.388 | 0.382 | 0.393 | 0.413 | 5 |
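The timing table above is in the format produced by the `microbenchmark` package; a comparable benchmark could be set up roughly as follows, reusing the fits defined above (a sketch, not necessarily the exact code used here):

library(microbenchmark)

# Time each estimation approach 5 times
microbenchmark(
  "base GLM"       = glm(event==0 ~ poly(time, 2, raw=T)*assign,
                         data=dta_c_panel, family=binomial()),
  "GLM with inits" = glm(event==0 ~ poly(time, 2, raw=T)*assign,
                         data=dta_c_panel, family=binomial(),
                         start = d_fit$coefficients),
  "PARGLM"         = parglm(event==0 ~ poly(time, 2, raw=T)*assign,
                            data=dta_c_panel, family=binomial(),
                            start = d_fit$coefficients,
                            control = parglm.control(method = "FAST", nthreads = 8L)),
  times = 5
)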
Even in this small example, `parglm()` significantly outperforms base `glm()`, and the advantage will scale considerably with more cores and larger datasets. `parglm()` may be less stable (convergence issues), but it should be sufficient for most problems.
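One quick stability check is to compare the point estimates from the two fits directly; a minimal sketch, assuming `d_fit_2` and `d_fit_3` from above:

# Side-by-side comparison of coefficients from the glm() and parglm() fits
round(cbind(glm = d_fit_2$coefficients, parglm = d_fit_3$coefficients), 5)

The output corresponds to the comparison table below.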
Coefficient | glm() | parglm() |
---|---|---|
(Intercept) | 5.13909 | 5.13909 |
poly(time, 2, raw = T)1 | -0.00057 | -0.00057 |
poly(time, 2, raw = T)2 | -0.00020 | -0.00020 |
assign | 0.14231 | 0.14231 |
poly(time, 2, raw = T)1:assign | -0.01979 | -0.01979 |
poly(time, 2, raw = T)2:assign | 0.00020 | 0.00020 |
Bootstrapping procedure
I consider bootstrapping to be the standard for inference in this type of analysis; the statistical properties of the estimators are not well described, though some use influence-function-based statistics instead.
The typical bootstrap procedure resamples the entire dataset at each iteration. Depending on how it is set up, this can be very inefficient, because it may require holding the original dataset, the new bootstrapped dataset, and potentially a model matrix for the regression optimization step in memory at the same time. However, some clever use of matrices and shortcuts can work around this.
The bootstrap procedure must sample at the person level to account for the cloning.
Inefficient Bootstrap
boot_it_1 = function(x) {
  setDT(x)

  d_ids  = distinct(x, id)                                    # 1
  d_boot = slice_sample(d_ids, prop=1, replace=T)             # 2
  d_panel_outc_2 = left_join(d_boot, x,                       # 3
                             by = join_by(id),
                             relationship = "many-to-many")

  d_glm_pe_1 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=d_panel_outc_2[d_panel_outc_2$assign==1, ],
                   family=binomial())
  d_glm_pe_0 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=d_panel_outc_2[d_panel_outc_2$assign==0, ],
                   family=binomial())

  d_panel_outc_2$pr_1 = predict(d_glm_pe_1, newdata = d_panel_outc_2,
                                type='response')
  d_panel_outc_2$pr_0 = predict(d_glm_pe_0, newdata = d_panel_outc_2,
                                type='response')

  d_panel_outc_2[, `:=`(pr_cum_1 = cumprod(pr_1)), by=list(id, assign)]
  d_panel_outc_2[, `:=`(pr_cum_0 = cumprod(pr_0)), by=list(id, assign)]

  d_res = d_panel_outc_2 %>%
    group_by(time) %>%
    summarize(pr_ev_1 = mean(1-pr_cum_1),
              pr_ev_0 = mean(1-pr_cum_0),
              .groups = 'drop') %>%
    ungroup() %>%
    mutate(cid = pr_ev_1 - pr_ev_0,
           cir = pr_ev_1 / pr_ev_0)

  return(d_res$cid[d_res$time==60])
}
1. Generate a list of unique person IDs.
2. Sample from the list with replacement.
3. Left-join the resampled IDs back onto the dataset.
More efficient bootstrap
Rather than resampling rows of the dataset with replacement, an alternative is to approximate the resampling with a frequency weight. If you randomly assign every person (the independent sampling unit here) a value drawn from a Poisson distribution with mean 1 and use this value as a frequency weight in the estimators, you will closely approximate the full bootstrap procedure as long as the overall sample size is greater than about 100 (Hanley and MacGibbon 2006). This is very computationally efficient because you do not need to know the dataset size before assigning the frequency weights, and you do not need to join or work with multiple large matrices.
boot_it_2 = function(x) {
  setDT(x)

  x[, freqwt := rpois(n=1, lambda=1), by = factor(id)]        # 1

  d_glm_pe_1 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=x[x$assign==1, ],
                   family=binomial(), weights = freqwt)
  d_glm_pe_0 = glm(event==0 ~ poly(time, 2, raw=T),
                   data=x[x$assign==0, ],
                   family=binomial(), weights = freqwt)

  x$pr_1 = predict(d_glm_pe_1, newdata = x, type='response')
  x$pr_0 = predict(d_glm_pe_0, newdata = x, type='response')

  x[, `:=`(pr_cum_1 = cumprod(pr_1)), by=list(id, assign)]
  x[, `:=`(pr_cum_0 = cumprod(pr_0)), by=list(id, assign)]

  d_res = x %>%
    group_by(time) %>%
    summarize(pr_ev_1 = weighted.mean(1-pr_cum_1, freqwt),    # 2
              pr_ev_0 = weighted.mean(1-pr_cum_0, freqwt),
              .groups = 'drop') %>%
    ungroup() %>%
    mutate(cid = pr_ev_1 - pr_ev_0,
           cir = pr_ev_1 / pr_ev_0)

  return(d_res$cid[d_res$time==60])
}
1. Generate a frequency weight for each person ID from a Poisson distribution with mean 1.
2. Note the use of `weighted.mean()` here versus `mean()` in the earlier code.
Benchmarking bootstrap methods
Unit: seconds

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
Resampling bootstrap | 1.77 | 1.80 | 1.88 | 1.89 | 1.95 | 2.00 | 5 |
Poisson bootstrap | 1.78 | 1.84 | 1.83 | 1.84 | 1.85 | 1.87 | 5 |
Parallel computation
The next step is to parallelize the bootstrap iterations. The efficiency gain depends on how the data are set up and which steps run in parallel; it is not efficient to hold several copies of a very large dataset in memory at once just so a CPU worker can be assigned to each.
If you don’t believe the Poisson approximation provides similar coverage, the intervals from 50 bootstraps with both procedures are shown in the table at the end of this section.
Here is a simple example, run on a Dell laptop, using the `furrr` package with 10 workers:
library(furrr)

boots = 200

plan(multisession, workers = 10)                                    # 1

resampling_res = future_map_dbl(1:boots,
                                function(x) boot_it_1(dta_c_panel),
                                .options = furrr_options(seed=T))   # 2

poisson_res = future_map_dbl(1:boots,
                             function(x) boot_it_2(dta_c_panel),
                             .options = furrr_options(seed=T))
1. Assign the number of CPU workers for the job.
2. The `seed=T` option is important because the bootstrap function uses random number generation.
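From the two vectors of bootstrap estimates, summaries and percentile intervals like those in the table below can be computed along these lines (a sketch using the objects defined above):

# Summary statistics and percentile 95% intervals for each bootstrap procedure
summ = function(v) c(min = min(v),
                     quantile(v, probs = c(0.025, 0.5, 0.975)),
                     mean = mean(v), max = max(v), sd = sd(v))
rbind(Resampling = summ(resampling_res),
      Poisson    = summ(poisson_res))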
Method | Min | Lower CI | Mean | Median | Upper CI | Max | SD |
---|---|---|---|---|---|---|---|
Poisson | 0.0386 | 0.0526 | 0.0767 | 0.0769 | 0.0994 | 0.1043 | 0.0119 |
Resampling | 0.0398 | 0.0533 | 0.0806 | 0.0802 | 0.1096 | 0.1165 | 0.0146 |