The metafor Package

This is an old revision of the document!

Assembling Data for a Meta-Analysis of (Log) Odds Ratios

Suppose the goal of a meta-analysis is to aggregate the results from studies contrasting two groups (e.g., treatment versus control) and each study measured a dichotomous outcome of interest (e.g., treatment success versus failure). A commonly used effect size measure used to quantify the size of the group difference is then the odds ratio.

As an example, consider the data reported in Colditz et al. (1994) on the effectiveness of the Bacillus Calmette-Guerin (BCG) vaccine against tuberculosis (for this illustration, we will remove some variables that are not further needed):

library(metafor)
dat.bcg <- dat.bcg[,c(2:7)]
dat.bcg

                 author year tpos  tneg cpos  cneg
1               Aronson 1948    4   119   11   128
2      Ferguson & Simes 1949    6   300   29   274
3       Rosenthal et al 1960    3   228   11   209
4     Hart & Sutherland 1977   62 13536  248 12619
5  Frimodt-Moller et al 1973   33  5036   47  5761
6       Stein & Aronson 1953  180  1361  372  1079
7      Vandiviere et al 1973    8  2537   10   619
8            TPT Madras 1980  505 87886  499 87892
9      Coetzee & Berjak 1968   29  7470   45  7232
10      Rosenthal et al 1961   17  1699   65  1600
11       Comstock et al 1974  186 50448  141 27197
12   Comstock & Webster 1969    5  2493    3  2338
13       Comstock et al 1976   27 16886   29 17825

Variables tpos and tneg indicate the number of TB positive and TB negative cases in the treated (vaccinated) group, while variables cpos and cneg indicate the number of TB positive and TB negative cases in the control (non-vaccinated) group. The data can be arranged in terms of a 2×2 table of the form:

        |  TB+   TB-
--------+------------
Treated |  tpos  tneg
Control |  cpos  cneg

With this information, we can compute the log odds ratios (and corresponding sampling variance) for each study with:

dat1 <- escalc(measure="OR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
dat1

                 author year tpos  tneg cpos  cneg      yi     vi
1               Aronson 1948    4   119   11   128 -0.9387 0.3571
2      Ferguson & Simes 1949    6   300   29   274 -1.6662 0.2081
3       Rosenthal et al 1960    3   228   11   209 -1.3863 0.4334
4     Hart & Sutherland 1977   62 13536  248 12619 -1.4564 0.0203
5  Frimodt-Moller et al 1973   33  5036   47  5761 -0.2191 0.0520
6       Stein & Aronson 1953  180  1361  372  1079 -0.9581 0.0099
7      Vandiviere et al 1973    8  2537   10   619 -1.6338 0.2270
8            TPT Madras 1980  505 87886  499 87892  0.0120 0.0040
9      Coetzee & Berjak 1968   29  7470   45  7232 -0.4717 0.0570
10      Rosenthal et al 1961   17  1699   65  1600 -1.4012 0.0754
11       Comstock et al 1974  186 50448  141 27197 -0.3408 0.0125
12   Comstock & Webster 1969    5  2493    3  2338  0.4466 0.5342
13       Comstock et al 1976   27 16886   29 17825 -0.0173 0.0716

Note that the escalc() function directly computes the log-transformed odds ratios, as these are the values we need for the meta-analysis. A negative log odds ratio therefore indicates that the odds of a TB infection were lower in the treated group compared to the control group in a particular study.

A random-effects model can then be fitted to these data with:

res1 <- rma(yi, vi, data=dat1)
res1

Random-Effects Model (k = 13; tau^2 estimator: REML)
 
tau^2 (estimated amount of total heterogeneity): 0.3378 (SE = 0.1784)
tau (square root of estimated tau^2 value):      0.5812
I^2 (total heterogeneity / total variability):   92.07%
H^2 (total variability / sampling variability):  12.61
 
Test for Heterogeneity:
Q(df = 12) = 163.1649, p-val < .0001
 
Model Results:
 
estimate       se     zval     pval    ci.lb    ci.ub
 -0.7452   0.1860  -4.0057   <.0001  -1.1098  -0.3806      ***
 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Therefore, the estimated average log odds ratio is equal to $\hat{\mu} = -0.75$ (with 95% CI: $-1.11$ to $-0.38$). For easier interpretation, we can back-transform the results with:

predict(res1, transf=exp, digits=2)

 pred ci.lb ci.ub pi.lb pi.ub
 0.47  0.33  0.68  0.14  1.57

The odds of a TB infection are therefore estimated to be approximately half as large on average in vaccinated groups (i.e., an odds ratio of $0.47$ with 95% CI: $0.33$ to $0.68$), or put differently, we can say that the odds of infection are on average 53% lower (i.e., $1 - 0.47 = 0.53$) in vaccinated groups. However, there is a considerable amount of heterogeneity in the findings (as indicated by the large estimate of $\tau^2$, the large $I^2$ value, and the significant $Q$-test).

Now suppose that the 2×2 table data are not reported in all studies, but that the following dataset could be assembled based on information reported in the studies:

dat2 <- data.frame(summary(dat1, transf=exp))
names(dat2)[which(names(dat2) == "yi")] <- "or"
dat2$pval <- 2 * pnorm(abs(dat2$zi), lower.tail=FALSE)
dat2[,c("or","ci.lb","ci.ub","pval")] <- round(dat2[,c("or","ci.lb","ci.ub","pval")], 2)
dat2$vi <- dat2$sei <- dat2$zi <- NULL
dat2[c(1,12),c(3:6,8:9)] <- NA
dat2[c(4,9), c(3:6,10)] <- NA
dat2[c(2,3,5:8,10,11,13),c(7:10)] <- NA
dat2

                 author year tpos  tneg cpos  cneg   or ci.lb ci.ub pval
1               Aronson 1948   NA    NA   NA    NA 0.39    NA    NA 0.12
2      Ferguson & Simes 1949    6   300   29   274   NA    NA    NA   NA
3       Rosenthal et al 1960    3   228   11   209   NA    NA    NA   NA
4     Hart & Sutherland 1977   NA    NA   NA    NA 0.23  0.18  0.31   NA
5  Frimodt-Moller et al 1973   33  5036   47  5761   NA    NA    NA   NA
6       Stein & Aronson 1953  180  1361  372  1079   NA    NA    NA   NA
7      Vandiviere et al 1973    8  2537   10   619   NA    NA    NA   NA
8            TPT Madras 1980  505 87886  499 87892   NA    NA    NA   NA
9      Coetzee & Berjak 1968   NA    NA   NA    NA 0.62  0.39  1.00   NA
10      Rosenthal et al 1961   17  1699   65  1600   NA    NA    NA   NA
11       Comstock et al 1974  186 50448  141 27197   NA    NA    NA   NA
12   Comstock & Webster 1969   NA    NA   NA    NA 1.56    NA    NA 0.54
13       Comstock et al 1976   27 16886   29 17825   NA    NA    NA   NA

In particular, in studies 1 and 12, authors reported only the odds ratio and the corresponding p-value (based on a Wald-type test whether the odds ratio differs significantly from 1) and in studies 4 and 9, authors reported only the odds ratio and the corresponding 95% confidence interval bounds. Given only this information, it is possible to reconstruct the full dataset for the meta-analysis (after each step below, I would recommend examining the contents of the dat2 object to better appreciate what the code is doing).

For the studies reporting 2×2 data, we can again use the escalc() function to obtain the log odds ratios and corresponding sampling variances with:
```
dat2 <- escalc(measure="OR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat2)
```
For studies not reporting 2×2 data, the values for the yi and vi variables are missing.
For the studies reporting the odds ratios directly, we can easily transform these values to log odds ratios with:
```
dat2$yi <- replmiss(dat2$yi, log(dat2$or))
```
The replmiss() function is a little helper function in the metafor package that can be used to only replace the missing values.
The p-values can be converted into the corresponding z-values with:
```
dat2$zi <- sign(dat2$yi) * qnorm(dat2$pval/2, lower.tail=FALSE)
```
Note that the two-sided p-values need to be turned into one-sided p-values first (i.e., by dividing by 2) before using the qnorm() function. Also, the p-values do not contain information on the sign of the test statistics, but that information is available via the log odds ratios themselves.
Together with the log odds ratios, the test statistics can be converted into the corresponding standard errors with:
```
dat2$sei <- dat2$yi / dat2$zi
```
The confidence interval bounds can be converted into the standard errors with:
```
dat2$sei <- replmiss(dat2$sei, with(dat2, (log(ci.ub) - log(ci.lb))/(2*1.96)))
```
Again, only the missing values should be replaced where possible.
Finally, any missing values for the vi variable (the sampling variances) can now be replaced with:
```
dat2$vi <- replmiss(dat2$vi, dat2$sei^2)
```
Finally, since we won't need these variables any further, we can remove the test statistics and standard errors with:
```
dat2$zi <- dat2$sei <- NULL
```

We can now examine the contents of the dataset:

dat2

                 author year tpos  tneg cpos  cneg   or ci.lb ci.ub pval      yi     vi
1               Aronson 1948   NA    NA   NA    NA 0.39    NA    NA 0.12 -0.9416 0.3668
2      Ferguson & Simes 1949    6   300   29   274   NA    NA    NA   NA -1.6662 0.2081
3       Rosenthal et al 1960    3   228   11   209   NA    NA    NA   NA -1.3863 0.4334
4     Hart & Sutherland 1977   NA    NA   NA    NA 0.23  0.18  0.31   NA -1.4697 0.0192
5  Frimodt-Moller et al 1973   33  5036   47  5761   NA    NA    NA   NA -0.2191 0.0520
6       Stein & Aronson 1953  180  1361  372  1079   NA    NA    NA   NA -0.9581 0.0099
7      Vandiviere et al 1973    8  2537   10   619   NA    NA    NA   NA -1.6338 0.2270
8            TPT Madras 1980  505 87886  499 87892   NA    NA    NA   NA  0.0120 0.0040
9      Coetzee & Berjak 1968   NA    NA   NA    NA 0.62  0.39  1.00   NA -0.4780 0.0577
10      Rosenthal et al 1961   17  1699   65  1600   NA    NA    NA   NA -1.4012 0.0754
11       Comstock et al 1974  186 50448  141 27197   NA    NA    NA   NA -0.3408 0.0125
12   Comstock & Webster 1969   NA    NA   NA    NA 1.56    NA    NA 0.54  0.4447 0.5266
13       Comstock et al 1976   27 16886   29 17825   NA    NA    NA   NA -0.0173 0.0716

Any differences compared to dat1 are purely a result of the rounding of the or, ci.lb, ci.ub, and pval variables. However, the differences are negligible.

We can then fit a random-effects model to these data with:

res2 <- rma(yi, vi, data=dat2)
res2

Random-Effects Model (k = 13; tau^2 estimator: REML)
 
tau^2 (estimated amount of total heterogeneity): 0.3408 (SE = 0.1798)
tau (square root of estimated tau^2 value):      0.5838
I^2 (total heterogeneity / total variability):   92.18%
H^2 (total variability / sampling variability):  12.80
 
Test for Heterogeneity:
Q(df = 12) = 167.4533, p-val < .0001
 
Model Results:
 
estimate       se     zval     pval    ci.lb    ci.ub
 -0.7472   0.1867  -4.0015   <.0001  -1.1132  -0.3812      ***
 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

predict(res2, transf=exp, digits=2)

 pred ci.lb ci.ub pi.lb pi.ub
 0.47  0.33  0.68  0.14  1.57

The results are essentially the same as obtained earlier.

Some final notes:

The conversion of a p-value to the corresponding test statistic as shown above assumes that the exact p-value is reported. If authors only report that the p-value fell below a certain threshold (e.g., $p < .01$ or if authors only state that the test was significant – which typically implies $p < .05$), then a common approach is to use the value of the cutoff reported (e.g., if $p < .01$ is reported, then assume $p = .01$), which is conservative (since the actual p-value was below that assumed value by some unknown amount). The conversion will therefore tend to be much less accurate.

The conversion of the p-value shown above is only applicable if authors actually conducted a Wald-type test of the null hypothesis that the true odds ratio is equal to 1 (or equivalently, that the true log odds ratio is equal to 0). Similarly, the conversion of the confidence interval bounds to the standard error assumes that a Wald-type CI is reported.

The same approach can be used when working with (log) risk ratios. Just change measure="OR" above to measure="RR".

References

Colditz, G. A., Brewer, T. F., Berkey, C. S., Wilson, M. E., Burdick, E., Fineberg, H. V., & Mosteller, F. (1994). Efficacy of BCG vaccine in the prevention of tuberculosis: Meta-analysis of the published literature. Journal of the American Medical Association, 271(9), 698–702.