vignettes/01-mahalanobis.Rmd
Welcome to jumble! This program was written as a companion to academic work performed by researchers at Brown University to conduct randomization procedures for cluster-randomized nursing home trials.
In this vignette we will explain how to use Mahalanobis distance as a measure of covariate balance between two groups.
Imagine a randomized controlled trial of patients assigned to treatment A or B. The primary goal is to have a single statistic that describes how closely ‘balanced’ the individuals in group A are with those in group B. Morgan and Rubin describe the desirable qualities of this statistic (Morgan KL 2015). Namely, it should be ‘affinely invariant’, which means that an affine transformation of the covariates would lead to the same re-randomization acceptance decision. A metric of this kind reflects the joint distribution of the covariates and provides equal balance for any linear combination of them. Mahalanobis distance meets these criteria.
Mahalanobis distance, or M-distance, measures the distance between a point P and a distribution D. It is generally a measure of how many standard deviations P is from the mean of D.
Mahalanobis distance can be used as a scalar measure of multivariate balance using the following formula.(Morgan KL 2015)
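\[ M \equiv \frac{n_t n_c}{n} (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) \] where \(n\) is the total sample size, \(n_t\) and \(n_c\) are the numbers of treated and control units, \(\bar{X_t}\) and \(\bar{X_c}\) are the vectors of covariate means in each group, and \(cov(X)\) is the sample covariance matrix of the covariates.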
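The affine invariance property can be checked numerically. The following is a minimal sketch on simulated data, not the package's implementation: `m_dist`, the matrix `A`, and the shift `b` are hypothetical names chosen for illustration, and the arm coding (1 = treated, 0 = control) is an assumption.

```r
# Hypothetical helper: scaled Mahalanobis distance between two arms,
# following the formula above (arm == 1 treated, arm == 0 control).
m_dist <- function(X, arm) {
  n_t <- sum(arm == 1)
  n_c <- sum(arm == 0)
  delta <- colMeans(X[arm == 1, , drop = FALSE]) -
    colMeans(X[arm == 0, , drop = FALSE])
  as.numeric((n_t * n_c / nrow(X)) * t(delta) %*% solve(cov(X)) %*% delta)
}

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # simulated covariates
arm <- rep(c(0, 1), each = 100)        # simulated assignment

# Affine transformation X %*% A + b, with A invertible (det = 27)
A <- matrix(c(2, 0, 1,
              0, 3, 0,
              1, 1, 5), nrow = 3, byrow = TRUE)
b <- c(10, -4, 0.5)
X_aff <- sweep(X %*% A, 2, b, "+")

m_orig <- m_dist(X, arm)
m_aff <- m_dist(X_aff, arm)

# The two distances agree up to floating point error
isTRUE(all.equal(m_orig, m_aff))
```

Because the distance is unchanged, any acceptance rule based on M would make the same decision before and after rescaling or recombining the covariates.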
Alternatively, for stratified randomization by a distance measure, you can compute each unit’s distance from the overall group mean.
\[ M \equiv (x_i - \bar{X}) ' cov(X)^{-1} (x_i - \bar{X}) \]
The ACT trial was a single randomized trial. Here are the observed mean differences between arms, in standard deviation units, for that single randomization.
## [,1]
## age -0.0005
## race 0.0643
## gender -0.0506
## symptom -0.0437
## wtkg 0.0887
## hemo -0.0126
## msm -0.0459
## drugs -0.0639
## karnof -0.0177
## oprior 0.0843
These are the standardized mean differences in selected covariates. With large treatment effects or sample sizes, small differences like these may not be important, but when evaluating small effects they could be an important source of confounding.
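As a rough illustration of how a standardized mean difference can be computed, here is a sketch on simulated data. The data frame `df`, its columns, and the choice of a pooled standard deviation as the denominator are assumptions for this example; the vignette does not state which standard deviation was used.

```r
set.seed(2)
n <- 200
# Simulated stand-in for trial data (not the trial data itself)
df <- data.frame(
  arm  = rep(c(0, 1), each = n / 2),
  age  = rnorm(n, mean = 35, sd = 9),
  wtkg = rnorm(n, mean = 75, sd = 13)
)
vars <- c("age", "wtkg")

# Difference in arm means divided by the pooled standard deviation
smd <- sapply(vars, function(v) {
  x_t <- df[df$arm == 1, v]
  x_c <- df[df$arm == 0, v]
  sd_pooled <- sqrt((var(x_t) + var(x_c)) / 2)
  (mean(x_t) - mean(x_c)) / sd_pooled
})
round(smd, 4)
```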
To compute the Mahalanobis distance, you must do some matrix algebra.
\[ \bar{X_t} - \bar{X_c}\]
X_t <- colMeans(df_mdist[df_mdist$arms==0, var_nms])
X_c <- colMeans(df_mdist[df_mdist$arms==1, var_nms])
X_delta <- X_t - X_c
## Difference in covariate means by treatment:
## -0.004 0.029 -0.019 -0.017 1.191 -0.003 -0.022 -0.021 -0.104 0.013
## age race gender symptom wtkg hemo msm drugs karnof oprior
## age 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0
## race 0.0 5.9 0.0 0.2 0.0 2.6 2.3 0.1 0.0 -0.9
## gender 0.0 0.0 14.9 0.0 -0.1 -9.1 -9.2 -0.9 0.0 -1.1
## symptom 0.0 0.2 0.0 7.1 0.0 0.2 -0.6 -0.2 0.0 -0.4
## wtkg 0.0 0.0 -0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## hemo 0.1 2.6 -9.1 0.2 0.0 24.0 10.5 3.0 0.0 -0.8
## msm 0.0 2.3 -9.2 -0.6 0.0 10.5 12.5 2.9 0.0 0.2
## drugs 0.0 0.1 -0.9 -0.2 0.0 3.0 2.9 10.0 0.0 0.9
## karnof 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1
## oprior 0.0 -0.9 -1.1 -0.4 0.0 -0.8 0.2 0.9 0.1 43.8
\[ M \equiv \frac{n_t n_c}{n} (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) \]
## Manually computed M-distance: 8.542155
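The manual steps can be sketched end to end as follows. This runs on simulated data rather than `df_mdist`, so it will not reproduce the value printed above; the covariate names and the arm coding (1 = treated, 0 = control) are assumptions for illustration.

```r
set.seed(3)
n <- 120
# Simulated covariates standing in for the trial data
X <- cbind(age    = rnorm(n, 35, 9),
           wtkg   = rnorm(n, 75, 13),
           karnof = rnorm(n, 95, 6))
arm <- sample(rep(c(0, 1), each = n / 2))

n_t <- sum(arm == 1)
n_c <- sum(arm == 0)

# Step 1: difference in covariate means between arms
X_delta <- colMeans(X[arm == 1, ]) - colMeans(X[arm == 0, ])

# Step 2: inverse of the covariance matrix of all units
S_inv <- solve(cov(X))

# Step 3: quadratic form, scaled by n_t * n_c / n
M_manual <- as.numeric((n_t * n_c / n) * t(X_delta) %*% S_inv %*% X_delta)
M_manual
```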
The jumble package function mdis_grps:
M_jumb <- mdis_grps(df_mdist[df_mdist$arms==0, 2:length(df_mdist)],
                    df_mdist[df_mdist$arms==1, 2:length(df_mdist)]) * ssc
## Jumble M-distance: 8.542155
Base R also provides a function, mahalanobis(), which can compute M-distance.
## Base R Mahalanobis: 8.542155
There is also an Rfast function using C++:
## Rfast Mahalanobis: 8.542155
Check that the manual calculation gives the same result as base R and Rfast:
## [1] TRUE
All functions provide nearly equal answers.
## Unit: microseconds
## expr min lq mean median uq max neval cld
## jumble 910.0 980.45 1253.4432 1023.65 1192.00 71895.0 1000 b
## rbase 72.7 88.10 116.5636 101.70 121.95 2249.4 1000 a
## rfast 9.5 14.00 24.7162 20.50 30.45 194.4 1000 a
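Rfast blows away the competition! Its median time (about 20 microseconds) is roughly 5 times faster than base R and 50 times faster than the jumble function.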
Typically you would take the square root of this distance, but that is not necessary for our purposes.
When performing a stratified analysis, M-distance can be used to construct the strata and perform a permuted block randomization.
In this case, you are evaluating each person's or unit's distance from the mean of the whole group, then doing a stratified randomization by groups of like distance. Note: M-distance is an unsigned measure, so a stratum does not group units with similar covariates. Rather, it groups units whose covariates are jointly about equally far from the mean values. A large distance could therefore mean covariates in either the upper or the lower quartile.
The formula is now:
\[ M \equiv ({X_i} - \bar{X}) ' cov(X)^{-1} ({X_i} - \bar{X}) \]
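The stratified approach described above might look like the following sketch. It is an illustration under stated assumptions, not the jumble procedure: the data are simulated, the strata are quartiles of the distance, and the blocks have size 4 with two units per arm.

```r
set.seed(4)
n <- 80
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))  # simulated covariates

# Squared distance of each unit from the overall mean
d <- mahalanobis(X, colMeans(X), cov(X))

# Four strata of like distance, cut at the quartiles (assumed choice)
strata <- cut(d, breaks = quantile(d, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE, labels = FALSE)

# Permuted blocks of size 4 (two per arm) within each stratum
arm <- integer(n)
for (s in unique(strata)) {
  idx <- which(strata == s)
  n_blocks <- ceiling(length(idx) / 4)
  blocks <- unlist(lapply(seq_len(n_blocks),
                          function(i) sample(c(0L, 0L, 1L, 1L))))
  arm[idx] <- blocks[seq_along(idx)]
}

# Arms are balanced within every distance stratum
table(strata, arm)
```

Cutting at quartiles is only one option; finer strata group units of more similar distance at the cost of smaller blocks per stratum.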
Test that all methods give equal answers:
## [1] TRUE
All functions provide nearly equal answers.
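A check of this kind can be sketched with base R alone. Here `tst` is a simulated matrix (an assumption; the vignette's own `tst` is not shown), and the manual version applies the per-unit formula row by row before comparing it with `mahalanobis()`.

```r
set.seed(5)
tst <- matrix(rnorm(50 * 4), ncol = 4)  # simulated stand-in for tst

mu <- colMeans(tst)
S_inv <- solve(cov(tst))

# Manual: apply the quadratic form to each row
d_manual <- apply(tst, 1, function(x_i) {
  dev <- x_i - mu
  as.numeric(t(dev) %*% S_inv %*% dev)
})

# Base R: squared Mahalanobis distance of each row from the mean
d_base <- mahalanobis(tst, mu, cov(tst))

# all.equal() compares within a numerical tolerance, hence "nearly equal"
isTRUE(all.equal(d_manual, as.numeric(d_base)))
```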
Because a limitation of re-randomization is computation time, it is important to have a function which can compute M-distance quickly.
library(microbenchmark)
microbenchmark(
  mine = mdis_chrt(tst),
  rfast = Rfast::mahala(tst, colMeans(tst), sigma=cov(tst)),
  rbase = {mahalanobis(tst, colMeans(tst), cov=cov(tst))},
  times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## mine 5055.2 5720.80 9031.9160 6165.15 9549.75 126747.7 1000 b
## rfast 203.7 219.45 262.9880 243.20 288.95 1198.2 1000 a
## rbase 308.3 340.65 442.6604 389.50 463.30 9109.7 1000 a
My handwritten code is very inefficient (median of about 6.2 milliseconds), and Rfast outperforms the base R code (median of about 243 versus 390 microseconds).
In the re-randomization procedures, the Rfast computation is used for speed. Unit testing is in place to ensure that the Mahalanobis calculations are consistent across Rfast, base R, and the manual computations, for version control and reproducibility.
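To show why speed matters, here is a minimal sketch of a re-randomization loop. It is not the jumble procedure: `rerandomize` is a hypothetical function, the data are simulated, base R `mahalanobis()` is used in place of Rfast, and the acceptance threshold (the 10th percentile of a chi-square distribution with one degree of freedom per covariate, an approximation to the distribution of M) is an assumption for this example.

```r
set.seed(6)
n <- 100
X <- matrix(rnorm(n * 5), ncol = 5)  # simulated covariates
S <- cov(X)

# Assumed rule: accept roughly the best-balanced 10% of allocations
threshold <- qchisq(0.10, df = ncol(X))

# Hypothetical helper: redraw the allocation until M is small enough
rerandomize <- function(X, S, threshold, max_tries = 10000) {
  n <- nrow(X)
  for (i in seq_len(max_tries)) {
    arm <- sample(rep(c(0, 1), length.out = n))
    n_t <- sum(arm == 1)
    n_c <- n - n_t
    m <- (n_t * n_c / n) *
      mahalanobis(colMeans(X[arm == 1, , drop = FALSE]),
                  colMeans(X[arm == 0, , drop = FALSE]), S)
    if (m <= threshold) {
      return(list(arm = arm, m = as.numeric(m), tries = i))
    }
  }
  stop("no acceptable allocation found within max_tries")
}

res <- rerandomize(X, S, threshold)
res$m      # M-distance of the accepted allocation
res$tries  # number of allocations drawn before acceptance
```

Every rejected draw costs one distance computation, so a stricter threshold multiplies the number of calls and the per-call time becomes the bottleneck.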
Hammer SM, et al. 1996. “A Trial Comparing Nucleoside Monotherapy with Combination Therapy in HIV-Infected Adults with CD4 Cell Counts from 200 to 500 Per Cubic Millimeter.” N Engl J Med 335: 1081–90.
Morgan KL, Rubin DB. 2015. “Rerandomization to Balance Tiers of Covariates.” J Am Stat Assoc 110 (512): 1412–21.