Logistic regression: model building
Exercise 11.3.2.
Statistics for Ecologists (Edition 2) Exercise 11.3.2
This exercise is concerned with how to build a logistic regression model (Section 11.3.2). The processes are very similar to those described in Section 11.1.2, where you saw a stepwise building process used with regular multiple linear regression. The data for this exercise are available as a separate file newt regression.RData. The file will open in R and create a data object, gcn.
Model building in logistic regression
Introduction
When you have several (or indeed many) predictor variables you want to find a regression model that best describes the relationship between the variables. You should not simply incorporate every variable that you've got: with enough explanatory predictors you will eventually "explain" all the variability in your response variable simply because you have so many of them, not because the model is meaningful.
The process of model-building allows you to select the “best” variable to add to your current regression model. In the book you see how to carry out stepwise model building using a regular multiple regression (Section 11.1.2). In this exercise you can have a go at building a logistic regression model. The process is much the same as described in Section 11.1.2.
The example data
The data you’ll use comes from a survey of an amphibian (great crested newt) in southern England. You can get the data separately as newt regression.RData, which will load the gcn object to your console.
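If you have downloaded the file, the load() command will bring the gcn object into your workspace. A minimal sketch, assuming the file sits in your current working directory (adjust the path if not):

load("newt regression.RData")   # creates the data object gcn
ls()                            # confirm that gcn is now listed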
head(gcn)
  presence area dry water shade bird fish other.ponds land macro  HSI
1        0  100   1     2    10    2    1           2    1    30 0.53
2        0   40   1     2    90    1    3           1    1     0 0.47
3        0  150   1     3    20    2    3           4    1    80 0.69
4        0   25   1     2    90    1    3           4    1     0 0.51
5        1  800   1     2    90    1    4           5    3     0 0.70
6        0  275   1     3     5    2    1           2    4    50 0.72
The first column, presence, gives the presence or absence of newts as a binary variable (0 = absence, 1 = presence). The researcher used standard survey methods to detect the presence (or otherwise) of newts at 200 ponds. The other variables are habitat factors:
- area – the pond area in square metres.
- dry – pond seasonality (1-4 with 1 being non-seasonal and 4 being most seasonal).
- water – a subjective measure of water quality (1-4, 1 = bad).
- shade – the shadiness of the pond as a %.
- bird – presence of waterfowl (1-3, 1 = absent).
- fish – presence of fish (1-4, 4 = absent).
- other.ponds – the number of other ponds within 1 km.
- land – terrestrial habitat quality (1-4, 4 = good).
- macro – cover of macrophytes as a %.
- HSI – habitat suitability index (0-1). This is a standard measure compiled from other habitat measures.
In this kind of survey, each habitat factor is converted to an index, and the indices are then combined to make a final HSI (habitat suitability index). The HSI makes it easier to assess waterbodies for their potential to support populations of the great crested newt and gives surveys a measure of reproducibility.
The example datafile is a “real” survey conducted on 200 ponds in Buckinghamshire in the UK. I’ve cut down the variables a little (the original data had the indices for each habitat variable) but essentially the dataset is unmodified.
The research task
The research task is twofold:
- To find the most important (and significant) habitat factors affecting the presence or absence of the great crested newt.
- To compare the “predictive power” of a habitat factor-based model against one using the HSI alone.
The first task will involve building a logistic regression model, which will show which habitat factors are "most influential".
The second task will be to make a simple regression model using the HSI variable as a single predictor. The multiple-factor model and the HSI model can then be compared to see if there is any substantial difference in deviance explained.
Building a multiple regression model
To build the model you'll need to use the glm() command, which carries out the logistic regression (when you set family = binomial). You'll also need the add1() command, which takes an existing regression model and tries adding each of the variables you specify, one at a time, then displays the results.
You then use the AIC values to select the next best variable to add to your model: AIC balances goodness of fit against model complexity, so lower values are better and you pick the variable with the lowest AIC. With many potential variables the list can of course be long, so it also helps to see which variables are significant. You can tell the add1() command to carry out a significance test at the same time (a short sketch follows the list below).
- For linear models (family = gaussian) you use test = "F".
- For logistic models (family = binomial) you use test = "Chisq" (or test = "LRT").
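As a minimal sketch of the two forms (my.lm, my.glm and my.data are hypothetical objects, not part of this exercise):

add1(my.lm, scope = my.data, test = "F")       # linear model: F-test for each candidate term
add1(my.glm, scope = my.data, test = "Chisq")  # logistic model: likelihood-ratio (chi-squared) test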
Start with a blank model
The starting point for your regression model should be an intercept-only model, with no predictor terms; the only variable you specify is the response. In our example the response is the presence variable.
gcn.glm <- glm(presence ~ 1, data = gcn, family = binomial)
You could use the summary() command to look at the result but it would not be very informative at this stage as you have not incorporated any variables (model terms).
Add the first term
Now that you have a blank model (intercept only), you need to look at the effect of adding each variable in turn. The variable with the lowest AIC value is the one to incorporate (replace the 1 in the formula with that variable).
The add1() command will “interrogate” the regression model and return a list of variables (from a source you specify, usually the original dataset). To make the selection process easier you can carry out a significance test at the same time. You’ll want to include only variables that are significant.
add1(gcn.glm, scope = gcn, test = "Chisq")
Single term additions

Model:
presence ~ 1
            Df Deviance    AIC    LRT  Pr(>Chi)    
<none>           261.37 263.37                     
area         1   258.88 262.88  2.492 0.1144364    
dry          1   261.31 265.31  0.054 0.8159993    
water        1   255.07 259.07  6.295 0.0121104 *  
shade        1   248.52 252.52 12.844 0.0003386 ***
bird         1   261.03 265.03  0.338 0.5611190    
fish         1   257.16 261.16  4.207 0.0402523 *  
other.ponds  1   252.05 256.05  9.317 0.0022700 ** 
land         1   260.80 264.80  0.568 0.4508771    
macro        1   246.55 250.55 14.818 0.0001184 ***
HSI          1   228.67 232.67 32.701 1.075e-08 ***
Note that the HSI variable has the lowest AIC value. Ignore that for the moment (we’ll return to it later). For now, just focus on the habitat variables.
The macro variable has the lowest of the other AIC values (250.55) and is significant too. So you can use the up arrow on the R console to recall the earlier command and edit it to change the regression model, replacing the 1 in the formula with the macro variable.
gcn.glm <- glm(presence ~ macro, data = gcn, family = binomial)
Now you’ve got a model with a single term.
Add subsequent terms
To add more terms, you:
- Run add1() on your existing model.
- Choose the variable with the lowest AIC.
- If the variable is significant then use it. If the variable is not significant then stop.
- Edit the regression formula to include the new variable (or use update(), as sketched after this list).
- Repeat the steps until no significant variables remain.
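As a sketch of one iteration (update() is a shorthand for editing the formula; judging by the order of terms in the final model, other.ponds is the next variable selected, but check the add1() output yourself):

add1(gcn.glm, scope = gcn, test = "Chisq")       # which remaining variable gives the lowest AIC?
gcn.glm <- update(gcn.glm, . ~ . + other.ponds)  # add that variable to the current model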
If you follow the process, you’ll end up with a model like so:
gcn.glm <- glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)
The water variable shows up as significant if you run add1() on this model, but we'll stop here because its p-value (0.044) is only marginally below 0.05. When building a multiple-variable model it is not a bad idea to set a more rigorous cut-off value.
The final model can be summarised:
summary(gcn.glm)

Call:
glm(formula = presence ~ macro + other.ponds + fish + shade, 
    family = binomial, data = gcn)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8022  -0.9099  -0.5447   0.9757   2.0516  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.909742   0.704410  -4.131 3.62e-05 ***
macro        0.018963   0.006368   2.978  0.00290 ** 
other.ponds  0.241031   0.078685   3.063  0.00219 ** 
fish         0.494789   0.172173   2.874  0.00406 ** 
shade       -0.014619   0.005284  -2.767  0.00566 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 261.37  on 199  degrees of freedom
Residual deviance: 222.57  on 195  degrees of freedom
AIC: 232.57

Number of Fisher Scoring iterations: 4
Explained deviance
Logistic regression does not return an R-squared value. What you see instead is the Null deviance and the Residual deviance. You can use these values to get a figure that is analogous to R-squared; let's call it the D-squared value:
1-(model$deviance/model$null.deviance)
Where model is your regression model result.
1 - (gcn.glm$deviance / gcn.glm$null.deviance)
[1] 0.1484474
You can see the result is about 0.15, which is not really great when you consider that there are 4 explanatory variables in the model!
D-squared calculator
You can make a simple custom command to calculate the D-squared value. Type the following into your console to set up a custom command called d2:
d2 <- function(model, digits = 4) {
  round(1 - (model$deviance / model$null.deviance), digits = digits)
}
Now you can use d2(model) to carry out the calculations simply and easily. Just give the command the name of your regression model (you can also use the digits parameter to view different numbers of decimal places).
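For example, with the four-variable model built above, the result matches the calculation shown earlier:

d2(gcn.glm)
[1] 0.1484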
Compare models
You can compare models in two main ways:
- Are two models significantly different? Use the anova() command.
- Compare the D-squared values. Is the explanatory power of one model better than another?
First of all, let’s make some alternative models to compare:
mod1 <- glm(presence ~ macro, data = gcn, family = binomial)
mod2 <- glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)
mod3 <- glm(presence ~ HSI, data = gcn, family = binomial)
The first model uses the single predictor macro (cover of macrophytes). This was the variable identified as the “best” of the habitat-based variables.
The second model is the multiple variable model using the 4 best habitat-based variables.
The third model uses the HSI variable alone. Recall that this is a composite index that uses information from several habitat factors in its construction.
Use anova() to compare models
The anova() command can compare models. The name is slightly misleading, as this is not really analysis of variance; rather, it compares the deviance of models built from the same set of variables. You can also get a significance test: with logistic regression you need test = "Chisq", as you did with the add1() command.
You can only really compare models that are built from the same pool of variables, where one model's terms overlap with the other's. In particular, you cannot get meaningful results by comparing two models that each contain only a single (different) predictor variable. Look at the logistic regression models for the newts:
anova(mod1, mod2, test = "Chisq")
Analysis of Deviance Table

Model 1: presence ~ macro
Model 2: presence ~ macro + other.ponds + fish + shade
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1       198     246.55                         
2       195     222.57  3   23.981 2.52e-05 ***
You can see that there is a significant difference between the two models. We would expect this of course, since we’ve been adding significant variables to the original model. Now look at the habitat-based model and the overall index-based one:
anova(mod2, mod3, test = "Chisq")
Analysis of Deviance Table

Model 1: presence ~ macro + other.ponds + fish + shade
Model 2: presence ~ HSI
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       195     222.57                     
2       198     228.67 -3  -6.0985   0.1069
This time there is no significant difference. Neither model is “better” than the other. This is interesting because the first model is based on only 4 habitat variables but the HSI index is formed from 9 different habitat variables, each of which is itself converted to an index.
You cannot really use anova() to compare models with single variables:
anova(mod1, mod3, test = "Chisq")
Analysis of Deviance Table

Model 1: presence ~ macro
Model 2: presence ~ HSI
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55                     
2       198     228.67  0   17.883
You can get a “result” but there is no significance value as the models are not directly comparable.
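If you do want a rough comparison of non-nested models like these, one option (not part of the original exercise, just a suggestion) is to compare their AIC values, as you did during model building; the model with the lower AIC is preferred:

AIC(mod1, mod3)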
Use D-squared values
You can explore differences between models using a different approach: their explanatory power. You can calculate a D-squared value (equivalent to the R-squared value of a regular regression). This approach is not strictly a statistical test, but it gives you an idea of "how much" better one model is than another.
Look at the D-squared values of each of the newt models:
d2(mod1) ; d2(mod2) ; d2(mod3)
[1] 0.0567
[1] 0.1484
[1] 0.1251
The first model (mod1, macrophytes only) shows a D-squared a bit over 5%. The second habitat-based model (mod2) shows a value of nearly 15%, which is not great but is nearly 3 times better than mod1.
The index-based model was not statistically different from the habitat-based model. The D-squared value of mod3 is 12.5%, which is a little smaller than the 15% of mod2.
Summary
The more variables you add to a regression model, the greater its "explanatory power": the D-squared value will keep increasing. However, it is rarely smart to keep adding variables just because you can measure them. The point of regression is to help you explain variability in the data. Variables that are not significant will only add a small fraction to the overall D-squared.
The best regression models explain as much of the variability as possible with the fewest variables. In this dataset it seems clear that the habitat-based model is the best of the options. It explains the most variability in the data with the fewest variables. The HSI model looks simpler but the HSI is calculated from 9 other variables.
As an additional exercise you might look at a regression model containing all the habitat variables. You will get a D-squared value of 17% and the model is not significantly “better” than the 4-variable model or the HSI one.
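A sketch of that full habitat-based model (gcn.full is just an illustrative name; the D-squared and anova() comparisons follow the same pattern as before):

gcn.full <- glm(presence ~ area + dry + water + shade + bird + fish + other.ponds + land + macro, data = gcn, family = binomial)
d2(gcn.full)                                # about 0.17 according to the text above
anova(gcn.glm, gcn.full, test = "Chisq")    # compare with the 4-variable model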
As a final note: the regression models here are based on newt presence or absence. Effectively a model “success” is the presence of a newt. It might be better to think of the model as a better predictor of the absence of newts, rather than their presence. A pond may be perfectly suitable for newts but they are not there simply because they have not found it. This historical component probably explains the low D-squared values.