Logistic regression: model prediction
Exercise 11.3.1.
Statistics for Ecologists (Edition 2) Exercise 11.3.1
These notes are about how to use the results of a regression model to predict the value of the response variable when you supply certain values of the predictor.
Predicting values from logistic regression and other models
Introduction
When you carry out a regression you are looking to describe the data in terms of the variables that form the relationships. When you’ve got your regression model you are able to describe the relationship using a mathematical model (which is what the regression model is).
You can use the regression model to make predicted values, which is where you use “new” values of the predictor (that is ones not observed in the original dataset) to predict the response variable.
These predicted values are especially important in logistic regression, where your response is binary, that is it only has two possibilities. The result you get when you “predict” response values in a logistic regression is a probability; the likelihood of getting a “positive” result when the predictor variable is set to a particular value.
These notes will show you how to use the predict() command in R to produce predicted values for regression models, particularly logistic regression.
The predict() command
The predict() command is used to compute predicted values from a regression model. The general form of the command is:
predict(model, newdata, type) |
|
model | A regression model, usually the result of lm() or glm(). |
newdata | A data.frame giving the values of the predictor(s) to use in the prediction of the response variable. |
type | The type of prediction, usually you want type = “response”. |
The model is simply the result of a regression model. You need to specify type = “response” so that your prediction predicts the response variable (note that this is not necessarily the default, so you must specify it).
The newdata parameter is where you specify the values of the predictor(s) that you want to use to predict the response variable. You can make a data.frame object containing the variables (with variable names to match the original dataset), or can specify it “on the fly”. You’ll see examples shortly.
Prediction in logistic regression
Logistic regression is carried out in cases where your response variable can take one of only two forms (i.e. it is binary). There are two general forms your response variable can take:
- Presence/absence, that is, 0 or 1 (or some other binary form).
- A success/failure matrix, where you have two frequencies for each observation (the “successes” and the “failures”).
The way your data are arranged does not make much difference to how you carry out the predictions but because of the different forms it is useful to see an example of each.
Logisitc regression and presence/absence
When you have the response variable as presence/absence (1 or 0) it is easy to see that your response is in binary form. The logistic regression converts the 1s and 0s to a likelihood (under the various levels of your predictor variables), so your result is in that form. So, when you use the predict() command you can get the probability of getting a success (a presence: 1).
The example file (Predict regressions.RData) contains a dataset that looks at the presence or absence of an amphibian (great crested newt) in ponds. The data are called gcn. Various habitat factors were measured. One factor is the percentage cover of macrophytes. You can use the logistic regression to explore the relationship between the presence (or absence) of newts and the cover of macrophytes.
Once you have the regression model you can use predict() to predict the likelihood of finding a newt given any value for the cover of macrophytes.
Make a regression model
Make a regression model result:
gcn.glm <- glm(presence ~ macro, data = gcn, family = binomial)
The result is that the macro variable is statistically significant:
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.135512 0.217387 -5.223 1.76e-07 *** macro 0.022095 0.005901 3.744 0.000181 ***
You can present a graph of the model (see the book text for the code):
Predict outcomes
Now you could look at the line on the plot (made using fitted values) and estimate the probability of finding a newt (presence = 1) for any percentage cover of macrophyte. Alternatively you can use the predict() command.
nd = data.frame(macro = c(40, 50, 60)) nd macro 1 40 2 50 3 60 predict(gcn.glm, newdata = nd, type = “response”) 1 2 3 0.4374069 0.4923162 0.5474114
See that you first make a data.frame with a variable called macro, to match the original data. This new data is used to predict the response. However, the value you get is not exactly a response but it is the likelihood of “success”, that is finding a newt. You could have set the original response variable (presence) to have been 1 for absence and 0 for presence. Apart from being a little odd you’d be assuming that a “success” was the absence of the newts.
In any event your prediction is a probability of success. You might have chosen to create the prediction newdata data.frame directly in the predict() command, which is what you’ll see in the following example.
Logistic regression and success/failure matrix
Your data may be in a subtly different form to the 0 or 1 model. In the following example you have a predictor variable, latitude, and a response but the response is split into two frequencies:
cbh latitude Mpi90 Mpi100 p.Mpi100 1 48.1 47 139 0.7473118 2 45.2 177 241 0.5765550 3 44.0 1087 1183 0.5211454 4 43.7 187 175 0.4834254 5 43.5 397 671 0.6282772 6 37.8 40 14 0.2592593 7 36.6 39 17 0.3035714 8 34.3 30 0 0.0000000
Here you have two versions of an allele in the Californian beach hopper. The data are in the file (Predict regressions.RData) and are called cbh. For each sampling latitude a number of animals were collected. They can have one form of the allele or the other. So your response variable is split into the frequency of each. If you take those two variables you have a success/failure matrix.
The logistic regression can constrict the likelihood from the success/failure. It does not matter which allele you choose to be the “success” but here I’ve used the Mpi100 allele (because the proportion increases with latitude). The proportion of the “success” is shown in the final column. It won’t be used in the regression except to draw a plot.
Make a regression model
Start by making a regression model. You need to make a success/failure matrix but this can be done as part of the glm() command formula.
cbh.glm <- glm(cbind(cbh$Mpi100, cbh$Mpi90) ~ latitude, data = cbh, family = binomial)
The result shows that latitude is a significant variable:
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -7.64686 0.92487 -8.268 <2e-16 *** latitude 0.17864 0.02104 8.490 <2e-16 ***
You can draw this as a plot (because of the replicates we can add error bars):
Predict outcomes
As before you could use the plot to predict the likelihood of “success”, the Mpi100 allele. You can also use the predict() command:
predict(cbh.glm, newdata = data.frame(latitude = c(35, 40, 45)), type = "response") 1 2 3 0.1986945 0.3772411 0.5967457
Note that this time the newdata parameter used a data.frame created on the fly.
Prediction with multiple predictors
When you have multiple predictor variables you have to specify them all in the newdata parameter of the predict() command. It is easiest to make a new data.frame object to hold these “new” data unless you only have one or two values.
In the following data example there are four predictors and one response (Length).
names(mf) [1] "Length" "Speed" "Algae" "NO3" "BOD"
The data give the lengths of a freshwater invertebrate (a mayfly species) and some habitat conditions at the sampling locations where each observation took place. The data are in the file (Predict regressions.RData) and are called mf.
Your new data.frame has to have the same variable names but they do not have to be in any order. You only have to include variables that are in the regression model.
Make the regression model
Make the model:
mf.lm <- lm(Length ~ BOD + Algae, data = mf)
The result shows the coefficients (one variable is not significant but we’ll leave it in).
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 22.34681 4.09862 5.452 1.78e-05 *** BOD -0.03779 0.01517 -2.492 0.0207 * Algae 0.04809 0.03504 1.373 0.1837
Predict the result
Because there are two predictors it is not practical to draw a plot (apart from diagnostics, see Exercise 11.1). So the predict() command is the way to go.
predict(mf.lm, newdata = data.frame(BOD = 20, Algae = 10), type = “response”) 1 22.07199
You can of course make a separate data.frame containing BOD and Algae columns and populate it with values for which you want to get a prediction.
Data examples
The data examples used here are available as part of the support data for the book. I’ve also added a file Predict regressions.RData, which contains just the three datasets mentioned on this page.
Comments are closed.