The metafor Package

A Meta-Analysis Package for R

User Tools

Site Tools


tips:model_selection_with_glmulti

Model Selection using the glmulti Package

Information-theoretic approaches provide methods for model selection and (multi)model inference that differ quite a bit from more traditional methods based on null hypothesis testing (e.g., Anderson, 2007; Burnham & Anderson, 2002). These methods can also be used in the meta-analytic context when model fitting is based on likelihood methods. Below, I illustrate how to use the metafor package in combination with the glmulti package that provides the necessary functionality for model selection and multimodel inference using an information-theoretic approach.

Data Preparation

For the example, I will use data from the meta-analysis by Bangert-Drowns et al. (2004) on the effectiveness of school-based writing-to-learn interventions on academic achievement (help(dat.bangertdrowns2004) provides a bit more background). The data can be loaded with:

library(metafor)
dat <- dat.bangertdrowns2004

(I copy the dataset into 'dat', which is a bit shorter and therefore easier to type further below). We can look at the first 10 and the last 10 rows of the dataset with:

rbind(head(dat, 10), tail(dat, 10))
   id          author year grade length minutes wic feedback info pers imag meta                subject  ni    yi    vi
1   1        Ashworth 1992     4     15      NA   1        1    1    1    0    1                Nursing  60  0.65 0.070
2   2           Ayers 1993     2     10      NA   1       NA    1    1    1    0          Earth Science  34 -0.75 0.126
3   3          Baisch 1990     2      2      NA   1        0    1    1    0    1                   Math  95 -0.21 0.042
4   4           Baker 1994     4      9      10   1        1    1    0    0    0                Algebra 209 -0.04 0.019
5   5          Bauman 1992     1     14      10   1        1    1    1    0    1                   Math 182  0.23 0.022
6   6          Becker 1996     4      1      20   1        0    0    1    0    0             Literature 462  0.03 0.009
7   7     Bell & Bell 1985     3      4      NA   1        1    1    1    0    1                   Math  38  0.26 0.106
8   8         Brodney 1994     1     15      NA   1        1    1    1    0    1                   Math 542  0.06 0.007
9   9          Burton 1986     4      4      NA   0        1    1    0    0    0                   Math  99  0.06 0.040
10 10       Davis, BH 1990     1      9      10   1        0    1    1    0    0         Social Studies  77  0.12 0.052
39 39 Ross & Faucette 1994     4     15      NA   0        1    1    0    0    0                Algebra  16  0.70 0.265
40 40           Sharp 1987     4      2      15   1        0    0    1    0    1                Biology 105  0.49 0.039
41 41         Shepard 1992     2      4      NA   0        1    1    0    0    0                   Math 195  0.20 0.021
42 42         Stewart 1992     3     24       5   1        1    1    1    0    1                Algebra  62  0.58 0.067
43 43         Ullrich 1926     4     11      NA   0        1    1    0    0    0 Educational Psychology 289  0.15 0.014
44 44 Weiss & Walters 1980     4     15       3   1        0    1    0    0    0             Statistics  25  0.63 0.168
45 45           Wells 1986     1      8      15   1        0    1    1    0    1                   Math 250  0.04 0.016
46 46          Willey 1988     3     15      NA  NA        0    1    1    0    1                Biology  51  1.46 0.099
47 47          Willey 1988     2     15      NA  NA        0    1    1    0    1         Social Studies  46  0.04 0.087
48 48       Youngberg 1989     4     15      10   1        1    1    0    0    0                Algebra  56  0.25 0.072

Variable yi contains the effect size estimates (standardized mean differences) and vi the corresponding sampling variances. There are 48 rows of data in this dataset.

For illustration purposes, the following variables will be examined as potential moderators of the treatment effect (i.e., the size of the treatment effect may vary in a systematic way as a function of one or more of these variables):

  • length: treatment length (in weeks)
  • wic: writing in class (0 = no; 1 = yes)
  • feedback: feedback (0 = no; 1 = yes)
  • info: writing contained informational components (0 = no; 1 = yes)
  • pers: writing contained personal components (0 = no; 1 = yes)
  • imag: writing contained imaginative components (0 = no; 1 = yes)
  • meta: prompts for metacognitive reflection (0 = no; 1 = yes)

More details about the meaning of these variables can be found in Bangert-Drowns et al. (2004). For the purposes of this illustration, it is sufficient to understand that we have 7 variables that are potentially (and a priori plausible) predictors of the size of the treatment effect. As we will fit various models to these data (containing all possible subsets of these 7 variables), and we need to keep the data being included in the various models the same across models, we will remove rows where at least one of the values of these 7 moderator variables is missing. We can do this with:

dat <- dat[!apply(dat[,c("length", "wic", "feedback", "info", "pers", "imag", "meta")], 1, anyNA),]

The dataset now includes 41 rows of data (nrow(dat)), so we have lost 7 data points for the analyses. One could consider methods for imputation to avoid this problem, but this would be the topic for another day. So, for now, we will proceed with the analysis of the 41 estimates.

Model Selection

We will now examine the fit and plausibility of various models, focusing on models that contain none, one, and up to seven (i.e., all) of these moderator variables. For this, we load the glmulti package and define a function that takes a model formula and a dataset as input and then fits a random/mixed-effects meta-regression model to the given data using maximum likelihood estimation:

library(glmulti)
 
rma.glmulti <- function(formula, data, ...)
   rma(formula, vi, data=data, method="ML", ...)

Next, we can fit all possible models with:

res <- glmulti(yi ~ length + wic + feedback + info + pers + imag + meta, data=dat,
               level=1, fitfunction=rma.glmulti, crit="aicc", confsetsize=128)

With level = 1, we stick to models with main effects only. This implies that there are $2^7 = 128$ possible models in the candidate set to consider. Since I want to keep the results for all these models (the default is to only keep up to 100 model fits), I set confsetsize=128 (or I could have set this to some very large value). With crit="aicc", we select the information criterion (in this example: the AICc or corrected AIC) that we would like to compute for each model and that should be used for model selection and multimodel inference. For more information about the AIC (and AICc), see, for example, the entry for the Akaike Information Criterion on Wikipedia. As the function runs, you should receive information about the progress of the model fitting. Fitting the 128 models should only take a few seconds.

A brief summary of the results can be obtained with:

print(res)
glmulti.analysis
Method: h / Fitting: rma.glmulti / IC used: aicc
Level: 1 / Marginality: FALSE
From 128 models:
Best IC: 13.5022470234854
Best model:
[1] "yi ~ 1 + imag"
Evidence weight: 0.0670568941723726
Worst IC: 27.6766820239261
10 models within 2 IC units.
77 models to reach 95% of evidence weight.

We can obtain a plot of the AICc values for all 128 models with:

plot(res)

The horizontal red line differentiates between models whose AICc is less versus more than 2 units away from that of the "best" model (i.e., the model with the lowest AICc). The output above shows that there are 10 models whose AICc is less than 2 units away from that of the best model. Sometimes this is taken as a cutoff, so that models with values more than 2 units away are considered substantially less plausible than those with AICc values closer to that of the best model. However, we should not get too hung up about such (somewhat arbitrary) divisions (and there are critiques of this rule; e.g., Anderson, 2007).

But let's take a look at the top 10 models:

tmp <- weightable(res)
tmp <- tmp[tmp$aicc <= min(tmp$aicc) + 2,]
tmp
                             model     aicc    weights
1                    yi ~ 1 + imag 13.50225 0.06705689
2             yi ~ 1 + imag + meta 13.50451 0.06698091
3         yi ~ 1 + feedback + imag 14.00315 0.05220042
4           yi ~ 1 + length + imag 14.13095 0.04896916
5  yi ~ 1 + feedback + imag + meta 14.43433 0.04207684
6       yi ~ 1 + wic + imag + meta 14.74833 0.03596327
7  yi ~ 1 + feedback + pers + imag 14.98665 0.03192348
8             yi ~ 1 + pers + imag 15.05819 0.03080169
9    yi ~ 1 + length + imag + meta 15.21061 0.02854148
10             yi ~ 1 + wic + imag 15.41073 0.02582384

We see that the "best" model is the one that only includes imag as a moderator. The second best includes imag and meta. And so on. The values under weights are the model weights (also called "Akaike weights"). From an information-theoretic perspective, the Akaike weight for a particular model can be regarded as the probability that the model is the best model (in a Kullback-Leibler sense of minimizing the loss of information when approximating full reality by a fitted model). So, while the "best" model has the highest weight/probability, its weight in this example is not substantially larger than that of the second model (and also the third, fourth, and so on). So, we shouldn't be all too certain here that we have really found the best model. Several models are almost equally plausible (in other examples, one or two models may carry most of the weight, but not here).

So, we could now examine the "best" model in closer detail with:

summary(res@objects[[1]])
Mixed-Effects Model (k = 41; tau^2 estimator: ML)
 
  logLik  deviance       AIC       BIC      AICc
 -3.4268   56.9171   12.8536   17.9943   13.5022
 
tau^2 (estimated amount of residual heterogeneity):     0.0088 (SE = 0.0084)
tau (square root of estimated tau^2 value):             0.0937
I^2 (residual heterogeneity / unaccounted variability): 20.92%
H^2 (unaccounted variability / sampling variability):   1.26
R^2 (amount of heterogeneity accounted for):            66.58%
 
Test for Residual Heterogeneity:
QE(df = 39) = 57.8357, p-val = 0.0265
 
Test of Moderators (coefficient(s) 2):
QM(df = 1) = 9.4642, p-val = 0.0021
 
Model Results:
 
         estimate      se    zval    pval   ci.lb   ci.ub
intrcpt    0.1439  0.0347  4.1477  <.0001  0.0759  0.2119  ***
imag       0.4437  0.1442  3.0764  0.0021  0.1610  0.7264   **
 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And here, we see that imag is indeed a significant predictor of the treatment effect (and since it is a dummy variables, it can change the treatment effect from $.1439$ to $.1439 + .4437 = .5876$, which is also practically relevant (for standardized mean differences, some would interpret this as changing a small effect into at least a medium-sized one). However, now I am starting to mix the information-theoretic approach with classical null hypothesis testing, and I will probably go to hell for all eternity if I do so. Also, other models in the candidate set have model probabilities that are almost as large as the one for this model, so why only focus on this one model?

Variable Importance

Instead, it may be better to ask about the relative importance of the various predictors more generally, taking all of the models into consideration. We can obtain a plot of the relative importance of the various model terms with:

plot(res, type="s")

The importance value for a particular predictor is equal to the sum of the weights/probabilities for the models in which the variable appears. So, a variable that shows up in lots of models with large weights will receive a high importance value. In that sense, these values can be regarded as the overall support for each variable across all models in the candidate set. The vertical red line is drawn at .80, which is sometimes used as a cutoff to differentiate between important and not so important variables, but this is again a more or less arbitrary division.

Multimodel Inference

If we do want to make inferences about the various predictors, we may want to do so not in the context of a single model that is declared to be "best", but across all possible models (taking their relative weights into consideration). Multimodel inference can be used for this purpose. To get the glmulti package to handle rma.uni objects, we must register a getfit method for such objects. This can be done with:

setOldClass("rma.uni")
setMethod('getfit', 'rma.uni', function(object, ...) {
   if (object$test=="z") {
      cbind(estimate=coef(object), se=sqrt(diag(vcov(object))), df=Inf)
   } else {
      cbind(estimate=coef(object), se=sqrt(diag(vcov(object))), df=object$k-object$p)
   }
})

This method properly works with models that are fit with or without the Knapp and Hartung method (the default for rma() is test="z", but this could be set to test="knha", in which case standard errors are computed in a slightly different way, and tests and confidence intervals are based on the t-distribution).

Now we can carry out the computations for multimodel inference with:

round(coef(res), 4)
         Estimate Uncond. variance Nb models Importance +/- (alpha=0.05)
info      -0.0183           0.0027        64     0.2416           0.1012
wic       -0.0170           0.0015        64     0.2643           0.0771
pers       0.0132           0.0020        64     0.2913           0.0876
length     0.0023           0.0000        64     0.3255           0.0081
feedback   0.0366           0.0037        64     0.3671           0.1190
meta       0.0512           0.0059        64     0.4244           0.1503
imag       0.3512           0.0359        64     0.8478           0.3715
intrcpt    0.1084           0.0091       128     1.0000           0.1874

I rounded the results to 4 digits to make the results easier to interpret. Note that the table again includes the importance values. In addition, we get unconditional estimates of the model coefficients (first column). These are model-averaged parameter estimates, which are weighted averages of the model coefficients across the various models (with weights equal to the model probabilities). These values are called "unconditional" as they are not conditional on any one model (but they are still conditional on the 128 models that we have fitted to these data; but not as conditional as fitting a single model and then making all inferences conditional on that one single model). Moreover, we get estimates of the unconditional variances of these model-averaged values. These variance estimates take two sources of uncertainty into account: (1) uncertainty within a given model (i.e., the standard error of a particular model coefficient shown in the output when fitting a model; as an example, see the output from the "best" model shown earlier) and (2) uncertainty with respect to which model is actually the best approximation to reality (so this source of variability examines how much the size of a model coefficient varies across the set of candidate models). The model-averaged parameter estimates and the unconditional variances can be used for multimodel inference. For example, adding and subtracting the values in the last column from the model-averaged parameter estimates yields approximate 95% confidence intervals for each coefficient that are based not on any one model, but all models in the candidate set.

Multimodel Predictions

We can also use multimodel methods for computing a predicted value and corresponding confidence interval. Again, we do not want to base our inference on a single model, but all models in the candidate set. Doing so requires a bit more manual work, as I have not (yet) found a way to use the predict() function from the glmulti package in combination with metafor for this purpose. So, we have to loop through all models, compute the predicted value based on each model, and then we can compute a weighted average (using the model weights) of the predicted values across all models.

Let's consider an example. Suppose we want to compute the predicted value for studies with a treatment length of 15 weeks that used in class writing, where feedback was provided, where the writing did not contain informational or personal components, but the writing did contain imaginative components, and prompts for metacognitive reflection were given:

x <- c("length"=15, "wic"=1, "feedback"=1, "info"=0, "pers"=0, "imag"=1, "meta"=1)

Then we can obtain the predicted value for this combination of moderator values based on all 128 models with:

preds <- list()
 
for (j in 1:res@nbmods) {
 
   model <- res@objects[[j]]
   vars <- names(coef(model))[-1]
 
   if (length(vars) == 0) {
      preds[[j]] <- predict(model)
   } else {
      preds[[j]] <- predict(model, newmods=x[vars])
   }
 
}

The multimodel predicted value is then given by:

weights <- weightable(res)$weights
yhat <- sum(weights * sapply(preds, function(x) x$pred))
round(yhat, 3)
[1] 0.564

We can then compute the corresponding (unconditional) standard error with:

se <- sqrt(sum(weights * sapply(preds, function(x) x$se^2 + (x$pred - yhat)^2)))

Finally, a 95% confidence interval for the predicted average standardized mean difference is then given by:

round(yhat + c(-1,1)*qnorm(.975)*se, 3)
[1] 0.128 1.001

Models with Interactions

One can of course also include models with interactions in the candidate set. However, when doing so, the number of possible models quickly explodes (or even more so than when only considering main effects), especially when fitting all possible models that could be generated based on various combinations of main effects and interaction terms. In the present example, we can figure out how many models that would be with:

glmulti(yi ~ length + wic + feedback + info + pers + imag + meta, data=dat,
        level=2, method="d", fitfunction=rma.glmulti, crit="aicc")
Initialization...
TASK: Diagnostic of candidate set.
Sample size: 41
0 factor(s).
7 covariate(s).
0 f exclusion(s).
0 c exclusion(s).
0 f:f exclusion(s).
0 c:c exclusion(s).
0 f:c exclusion(s).
Size constraints: min =  0 max = -1
Complexity constraints: min =  0 max = -1
Your candidate set contains 268435456 models.
[1] 268435456

So, over $2 \times 10^8$ possible models. Fitting all of these models would not only test our patience (and would be a waste of valuable CPU cycles), it would also be a pointless exercise (even fitting the 128 models above could be critiqued by some as a mindless hunting expedition – although if one does not get too fixated on the best model, but considers all the models in the set as part of a multimodel inference approach, this critique loses some of its force). So, I won't consider this any further in this example.

Other Model Types

The same principle can of course be applied when fitting other types of models, such as those that can be fitted with the rma.mv() or rma.glmm() functions. One just has to write an appropriate rma.glmulti function and, for multimodel inference, a corresponding getfit method.

For multivariate/multilevel models fitted with the rma.mv() function, one can also consider model selection with respect to the random effects structure. Making this work would require a bit more work. Time permitting, I might write up an example illustrating this at some point in the future.

References

Anderson, D. R. (2007). Model based inference in the life sciences: A primer on evidence. New York: Springer.

Bangert-Drowns, R. L., Hurley, M. M., & Wilkinson, B. (2004). The effects of school-based writing-to-learn interventions on academic achievement: A meta-analysis. Review of Educational Research, 74(1), 29–58.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.

tips/model_selection_with_glmulti.txt · Last modified: 2018/12/09 23:10 by Wolfgang Viechtbauer