Welcome to jumble! This program was written as a companion to academic work performed by researchers at Brown University to conduct randomization procedures for cluster-randomized nursing home trials.

Introduction

In this vignette we will explain how to use Mahalanobis distance as a measure of covariate balance between two groups.

Example dataset

The example dataset used in this package is freely available and comes from a clinical HIV therapy trial conducted in the 1990s (Hammer SM 1996). See: ?jumble::ACTG175

Load dataset

Measure of covariate balance

Imagine a randomized controlled trial of patients assigned to treatment A or B. The primary goal is to have a single statistic that describes how closely ‘balanced’ the individuals in group A are with those in group B. Morgan and Rubin describe the desirable qualities of this statistic (Morgan KL 2015). Namely, it should be ‘affinely invariant’, which means that an affine transformation of the covariates leads to the same re-randomization acceptance decision. A metric of this kind reflects the joint distribution of the covariates and provides equal balance for any linear combination of them. Mahalanobis distance meets these criteria.
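The affine invariance property can be checked numerically. The following is a minimal sketch on simulated data, not the package's own code: the covariate values, the helper `m_groups()`, and the matrix `A` and shift `b` are all illustrative assumptions. It computes the between-group distance before and after the transformation Z = XA + b and compares the two.

```r
# Simulated covariates; names and values are illustrative only.
set.seed(1)
n <- 200
X <- cbind(age = rnorm(n, 50, 10), wtkg = rnorm(n, 75, 12), karnof = rnorm(n, 95, 5))
arm <- rep(c(0, 1), each = n / 2)[sample(n)]

# Hypothetical helper: between-group Mahalanobis distance,
# (n_t * n_c / n) * delta' cov(X)^{-1} delta
m_groups <- function(X, arm) {
  n_t <- sum(arm == 1)
  n_c <- sum(arm == 0)
  delta <- colMeans(X[arm == 1, , drop = FALSE]) - colMeans(X[arm == 0, , drop = FALSE])
  as.numeric((n_t * n_c / nrow(X)) * t(delta) %*% solve(cov(X)) %*% delta)
}

# Affine transformation Z = X A + b, with A invertible (arbitrary choice).
A <- matrix(c(2, 0.5, 0,
              0, 1,   3,
              1, 0,   1), nrow = 3, byrow = TRUE)
b <- c(10, -5, 100)
Z <- sweep(X %*% A, 2, b, "+")

m_x <- m_groups(X, arm)
m_z <- m_groups(Z, arm)
all.equal(m_x, m_z)   # the two distances should agree up to rounding error
```

Because the two values agree, any acceptance rule based on this distance would make the same decision for the original and the transformed covariates.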

Mahalanobis distance

Mahalanobis distance, or M-distance, measures the distance between a point P and a distribution D. It generalizes the idea of measuring how many standard deviations P is from the mean of D.
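With a single covariate, this reduces to the familiar z-score. The sketch below uses simulated values (the distribution `d` and the point `p` are arbitrary) and base R's `mahalanobis()`, which returns the squared distance, so its square root is compared with the absolute z-score.

```r
set.seed(2)
d <- rnorm(500, mean = 10, sd = 3)   # simulated distribution D
p <- 16                              # a point P (arbitrary)

# stats::mahalanobis() returns the squared distance
m_sq <- mahalanobis(p, center = mean(d), cov = matrix(var(d)))

# Number of standard deviations between P and the mean of D
z <- (p - mean(d)) / sd(d)

all.equal(sqrt(m_sq), abs(z))
```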

Mahalanobis distance can be used as a scalar measure of multivariate balance using the following formula.(Morgan KL 2015)

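\[ M \equiv \frac{n_tn_c}n (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) \]

Here n is the total sample size, n_t and n_c are the numbers of treated and control units, \(\bar{X_t}\) and \(\bar{X_c}\) are the vectors of covariate means in each group, and cov(X) is the sample covariance matrix of the covariates.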

Alternatively, for stratified randomization by a distance measure, you can compute each unit’s distance from the overall group mean.

\[ M \equiv (x_i - \bar{X}) ' cov(X)^{-1} (x_i - \bar{X}) \]

Trial Covariate Balance

The ACTG 175 trial used a single randomization. Here are the observed standardized mean differences between arms in that randomization.

##            [,1]
## age     -0.0005
## race     0.0643
## gender  -0.0506
## symptom -0.0437
## wtkg     0.0887
## hemo    -0.0126
## msm     -0.0459
## drugs   -0.0639
## karnof  -0.0177
## oprior   0.0843

These are the standardized mean differences in selected covariates. With large treatment effects or sample sizes, small differences like these may not be significant, but when evaluating small effects they could be an important source of confounding.

What is the Mahalanobis distance for these groups?

To compute the Mahalanobis formula, you must do some matrix algebra.

Reduce dataset to needed covariates, format

  1. Only include treatment indicator and k covariates
  2. Ensure all covariates are numeric
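The two preparation steps above could look something like the following. This is only a sketch on a small made-up data frame: `raw`, `df_prep`, and the columns `id` and `site` are hypothetical and are not taken from the ACTG175 data.

```r
# Hypothetical raw data; column names are for illustration only.
raw <- data.frame(
  id     = 1:6,
  arms   = c(0, 1, 0, 1, 1, 0),
  age    = c(34, 45, 29, 51, 38, 42),
  gender = factor(c("F", "M", "M", "F", "M", "M")),
  wtkg   = c(62.1, 80.4, 75.0, 58.3, 90.2, 77.7),
  site   = c("a", "b", "a", "c", "b", "a")
)

var_nms <- c("age", "gender", "wtkg")

# 1. Keep only the treatment indicator and the k covariates
df_prep <- raw[, c("arms", var_nms)]

# 2. Recode non-numeric covariates as 0/1 indicators
df_prep$gender <- as.numeric(df_prep$gender == "M")

# Confirm every remaining column is numeric
all(vapply(df_prep, is.numeric, logical(1)))
```

Covariates with more than two levels would need a set of indicator columns instead of a single 0/1 recode.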

Sample size constant

\[ \frac{n_tn_c}n\]

This is easy to compute

ssc <- (nrow(df_mdist[df_mdist$arms==0, ]) * nrow(df_mdist[df_mdist$arms==1, ])) / (nrow(df_mdist))
## Sample size constant:  263.476

Covariate means

\[ \bar{X_t} - \bar{X_c}\]

X_t <- colMeans(df_mdist[df_mdist$arms==0, var_nms])

X_c <- colMeans(df_mdist[df_mdist$arms==1, var_nms])

X_delta <- X_t - X_c 
## Difference in covariate means by treatment: 
##   -0.004 0.029 -0.019 -0.017 1.191 -0.003 -0.022 -0.021 -0.104 0.013

Inverse of covariance of covariate matrix

##         age race gender symptom wtkg hemo  msm drugs karnof oprior
## age     0.0  0.0    0.0     0.0  0.0  0.1  0.0   0.0    0.0    0.0
## race    0.0  5.9    0.0     0.2  0.0  2.6  2.3   0.1    0.0   -0.9
## gender  0.0  0.0   14.9     0.0 -0.1 -9.1 -9.2  -0.9    0.0   -1.1
## symptom 0.0  0.2    0.0     7.1  0.0  0.2 -0.6  -0.2    0.0   -0.4
## wtkg    0.0  0.0   -0.1     0.0  0.0  0.0  0.0   0.0    0.0    0.0
## hemo    0.1  2.6   -9.1     0.2  0.0 24.0 10.5   3.0    0.0   -0.8
## msm     0.0  2.3   -9.2    -0.6  0.0 10.5 12.5   2.9    0.0    0.2
## drugs   0.0  0.1   -0.9    -0.2  0.0  3.0  2.9  10.0    0.0    0.9
## karnof  0.0  0.0    0.0     0.0  0.0  0.0  0.0   0.0    0.0    0.1
## oprior  0.0 -0.9   -1.1    -0.4  0.0 -0.8  0.2   0.9    0.1   43.8
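A matrix like the one above can be obtained with `cov()` followed by `solve()`. The sketch below uses simulated covariates, not the trial data, and the object names `S` and `S_inv` are illustrative assumptions.

```r
set.seed(3)
X <- cbind(age = rnorm(100, 35, 9), wtkg = rnorm(100, 75, 13), karnof = rnorm(100, 95, 6))

S <- cov(X)          # k x k sample covariance matrix
S_inv <- solve(S)    # its inverse; solve() fails if covariates are perfectly collinear

# Sanity check: S %*% S_inv should be (numerically) the identity matrix
all.equal(S %*% S_inv, diag(ncol(X)), check.attributes = FALSE)

round(S_inv, 4)
```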

Bring formula together

\[ M \equiv \frac{n_tn_c}n (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) \]

# df_cov is taken to hold the inverse covariance matrix shown above
M_man <- ssc * as.vector(t(X_delta) %*% df_cov %*% X_delta)
## Manually computed M-distance:  8.542155

jumble package function mdis_grps:

M_jumb <- mdis_grps(df_mdist[df_mdist$arms==0, 2:length(df_mdist)], 
          df_mdist[df_mdist$arms==1, 2:length(df_mdist)]) * ssc 
## Jumble M-distance:  8.542155

Base R also provides a function, mahalanobis(), which can compute M-distance.

## Base R Mahalanobis:  8.542155
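As a rough illustration of the base R route, the sketch below applies `mahalanobis()` to the two group mean vectors and multiplies by the sample size constant. It runs on simulated data, and the `_sim` object names are assumptions made for this example so they do not collide with the objects used elsewhere in this vignette.

```r
set.seed(4)
n <- 120
X_sim <- cbind(age = rnorm(n, 35, 9), wtkg = rnorm(n, 75, 13))
arm_sim <- sample(rep(c(0, 1), n / 2))

mean_t <- colMeans(X_sim[arm_sim == 1, ])
mean_c <- colMeans(X_sim[arm_sim == 0, ])
ssc_sim <- sum(arm_sim == 1) * sum(arm_sim == 0) / n

# mahalanobis(x, center, cov) returns (x - center)' cov^{-1} (x - center)
M_base_sim <- ssc_sim * mahalanobis(mean_t, center = mean_c, cov = cov(X_sim))

# Same quantity by direct matrix algebra
delta <- mean_t - mean_c
M_man_sim <- ssc_sim * as.numeric(t(delta) %*% solve(cov(X_sim)) %*% delta)

all.equal(unname(M_base_sim), M_man_sim)
```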

There is also an Rfast function using C++:

## Rfast Mahalanobis:  8.542155

The manual, jumble, base R and Rfast calculations agree:

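# all.equal() compares only its first two arguments, so check each result against M_man
all(sapply(list(M_jumb, M_base, M_fast), function(m) isTRUE(all.equal(M_man, m, check.attributes = FALSE))))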
## [1] TRUE

All functions provide nearly equal answers.

## Unit: microseconds
##    expr   min     lq      mean  median      uq     max neval cld
##  jumble 910.0 980.45 1253.4432 1023.65 1192.00 71895.0  1000   b
##   rbase  72.7  88.10  116.5636  101.70  121.95  2249.4  1000  a 
##   rfast   9.5  14.00   24.7162   20.50   30.45   194.4  1000  a

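Rfast is clearly the fastest: its median time is roughly 5 times shorter than base R and roughly 50 times shorter than the jumble function.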

Typically you would take the square root of this distance, but that is not necessary for our purposes.

When performing a stratified analysis, M-distance can be used to construct the strata and perform a permuted block randomization.

In this case, you evaluate each person’s or unit’s distance from the mean of the whole group, then perform a stratified randomization within groups of similar distance. Note: M-distance is an absolute measure, so strata do not group units with similar covariate values; they group units whose covariates are, jointly, about equally far from the mean values. A large distance could therefore reflect covariates in either the upper or the lower quartile.
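One way this could be carried out is sketched below. It is not the jumble implementation: the simulated covariates, the use of quartiles as strata, and the block size of 2 are all assumptions chosen to keep the example small.

```r
set.seed(5)
n <- 40
X <- cbind(age = rnorm(n, 80, 7), adl = rnorm(n, 15, 4))   # simulated unit covariates

# Per-unit squared distance from the overall mean
d <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Strata of similar distance: quartiles here (an arbitrary choice)
stratum <- cut(d, breaks = quantile(d, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE, labels = FALSE)

# Permuted blocks of size 2 within each stratum
arm <- integer(n)
for (s in unique(stratum)) {
  idx <- which(stratum == s)
  n_blocks <- ceiling(length(idx) / 2)
  blocks <- as.vector(replicate(n_blocks, sample(c(0L, 1L))))
  arm[idx] <- blocks[seq_along(idx)]
}

table(stratum, arm)
```

With 40 units and no ties, each quartile holds 10 units, so each stratum contributes 5 units to each arm.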

The formula is now:

\[ M \equiv ({X_i} - \bar{X}) ' cov(X)^{-1} ({X_i} - \bar{X}) \]
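For illustration, the per-unit formula can be evaluated row by row with a little matrix algebra and compared with base R's `mahalanobis()`. This sketch uses simulated data, and the names `M_unit_man` and `M_unit_base` are assumptions made for this example.

```r
set.seed(6)
X <- cbind(age = rnorm(50, 35, 9), wtkg = rnorm(50, 75, 13), karnof = rnorm(50, 95, 6))

x_bar <- colMeans(X)
S_inv <- solve(cov(X))

# Manual: apply the quadratic form to each centered row
centered <- sweep(X, 2, x_bar)                       # subtract column means
M_unit_man <- rowSums((centered %*% S_inv) * centered)

# Base R equivalent (also returns squared distances, one per row)
M_unit_base <- mahalanobis(X, center = x_bar, cov = cov(X))

all.equal(M_unit_man, M_unit_base)
```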

Test all methods give equal answers:

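# all.equal() compares only its first two arguments, so check each result against M_man
all(sapply(list(M_base, M_fast), function(m) isTRUE(all.equal(M_man, m, check.attributes = FALSE))))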
## [1] TRUE

All functions provide nearly equal answers.

Benchmarking

Because a limitation of re-randomization is computation time, it is important to have a function which can compute M-distance quickly.
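To see why speed matters, consider a bare-bones re-randomization loop, sketched here on simulated data. This is not the jumble procedure. The acceptance rule is an assumption: it relies on M being approximately chi-squared with k degrees of freedom under pure randomization in large samples, and keeps only allocations below the 10th percentile. The helper `m_alloc()` and the cap `max_tries` are likewise hypothetical.

```r
set.seed(7)
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))   # simulated covariates
S_inv <- solve(cov(X))              # computed once, reused for every draw
ssc <- (n / 2) * (n / 2) / n        # sample size constant for equal arms

# Hypothetical helper: between-group M for a given allocation vector
m_alloc <- function(arm) {
  delta <- colMeans(X[arm == 1, , drop = FALSE]) - colMeans(X[arm == 0, , drop = FALSE])
  ssc * as.numeric(t(delta) %*% S_inv %*% delta)
}

# Assumed acceptance rule: 10th percentile of a chi-squared with k df
threshold <- qchisq(0.10, df = ncol(X))

max_tries <- 10000
tries <- 0
M <- Inf
while (M > threshold && tries < max_tries) {
  tries <- tries + 1
  arm <- sample(rep(c(0, 1), each = n / 2))   # one candidate randomization
  M <- m_alloc(arm)
}

c(tries = tries, M = M, threshold = threshold)
```

Each rejected draw costs one distance calculation, so a stricter threshold multiplies the number of calculations needed before an allocation is accepted.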

## Unit: microseconds
##   expr    min      lq      mean  median      uq      max neval cld
##   mine 5055.2 5720.80 9031.9160 6165.15 9549.75 126747.7  1000   b
##  rfast  203.7  219.45  262.9880  243.20  288.95   1198.2  1000  a 
##  rbase  308.3  340.65  442.6604  389.50  463.30   9109.7  1000  a

My handwritten code is very inefficient, and Rfast outperforms the base R code.
In the re-randomization procedures, the Rfast computation is used for speed. Unit tests are in place to ensure that the Mahalanobis calculations stay consistent across Rfast, base R and the manual computations, for version control and reproducibility.

Contacting authors

The primary author of the package was Kevin W. McConeghy.

References

Hammer SM, et al. 1996. “A Trial Comparing Nucleoside Monotherapy with Combination Therapy in HIV-Infected Adults with CD4 Cell Counts from 200 to 500 Per Cubic Millimeter.” N Engl J Med 335: 1081–90.

Morgan KL, Rubin DB. 2015. “Rerandomization to Balance Tiers of Covariates.” J Am Stat Assoc 110 (512): 1412–21.