Manipulating an R formula
The formula is an integral part of the R language. You use a formula to specify relationships between variables for tasks such as graphics and analysis (e.g. linear modelling). Manipulating an R formula is a useful skill.
When you’ve created an analysis model in R you will have specified the variables in a formula. R “recognises” formula objects, which have their own class, "formula". If, for example, you used the lm() command to create a regression result, you can extract the formula from the result.
mod <- lm(Fertility ~ ., data = swiss)
formula(mod)
Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality
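As a quick check of the class mentioned above, the following minimal sketch inspects the extracted formula (the object name f is only illustrative):

```r
mod <- lm(Fertility ~ ., data = swiss)
f <- formula(mod)       # f is an illustrative name
class(f)                # the class of the extracted object
inherits(f, "formula")  # TRUE for a formula object
length(f)               # a two-sided formula has 3 parts: ~, response, predictors
```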
It can be useful to be able to extract the components of the model formula. For example, you may want to examine how the R² value alters as you add variables to the model.
Extract the predictor variables
To access the parts of a formula you need the terms() command:
terms(formula(mod))
The result contains various components, stored as attributes; you want the term.labels attribute.
attr(terms(formula(mod)), which = "term.labels")
[1] "Agriculture" "Examination" "Education" "Catholic"
[5] "Infant.Mortality"
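If you want to see which components are available before picking one, a sketch along these lines may help (tt is an illustrative name, not part of the original example):

```r
mod <- lm(Fertility ~ ., data = swiss)
tt <- terms(formula(mod))      # tt is an illustrative name
names(attributes(tt))          # lists the available components
attr(tt, which = "response")   # position of the response within "variables"
attr(tt, which = "intercept")  # 1 if the model includes an intercept
```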
You now have the predictor variables from the formula. The next step is to get the response variable.
Extract the response variable
The response variable can be seen using the terms() command and the variables component, like so:
attr(terms(formula(mod)), which = "variables")
list(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)
The result looks slightly odd: it is an unevaluated call to list() rather than an ordinary list. Its first element is the function name list, so the 2nd element is the response.
vv <- attr(terms(formula(mod)), which = "variables")
rr <- as.character(vv[[2]]) # The response variable name
rr
[1] "Fertility"
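There are other routes to the response name. The sketch below shows two alternatives, indexing the formula directly and using all.vars(); the object names are illustrative. Note that deparse() is used instead of as.character() because a transformed response such as log(y) is a call, which as.character() would split into several strings.

```r
mod <- lm(Fertility ~ ., data = swiss)
f <- formula(mod)

# The 2nd element of a two-sided formula is its left-hand side
resp1 <- deparse(f[[2]])

# all.vars() returns the variable names in order, response first
resp2 <- all.vars(f)[1]

resp1
resp2
```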
Now you have the response variable, and the predictors from earlier, which you can use to “build” a formula.
Building a formula
In practice, a formula can be written as a character string that “conforms” to the formula syntax, y ~ x + z for example. You can build such a string with the paste() command by joining the response, a ~ character and the predictors you want (these themselves separated by + characters).
The following example uses the swiss dataset, which is built into base R.
mod <- lm(Fertility ~ ., data = swiss)
# Get the (predictor) variables
vars <- attr(terms(formula(mod)), which = "term.labels")
# Get the response
vv <- attr(terms(formula(mod)), which = "variables")
rr <- as.character(vv[[2]]) # The response variable name
# Now the predictors: pick one of the following
# pp <- paste(vars[1], collapse = " + ") # 1st only
# pp <- paste(vars[1:3], collapse = " + ") # 1st, 2nd and 3rd
pp <- paste(vars, collapse = " + ") # All (used below)
# Build a formula; paste() already puts a space between its arguments
fml <- paste(rr, "~", pp)
fml
[1] "Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality"
Once you have your formula as a character object you can use it in place of a regular formula in modelling commands such as lm().
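If you would rather end up with a formula object than a character string, base R also provides reformulate(), which assembles a formula from a vector of term labels and a response. The following is a sketch of that alternative, not the method used in this article; f3 is an illustrative name.

```r
mod <- lm(Fertility ~ ., data = swiss)
vars <- attr(terms(formula(mod)), which = "term.labels")
rr <- deparse(formula(mod)[[2]])  # response name as a string

# Build a formula object from the first three predictors
f3 <- reformulate(vars[1:3], response = rr)
f3
class(f3)
```

The paste() approach is easier to inspect and print; reformulate() saves you from assembling the separators yourself.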
Using a “built” formula
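The character string representing a formula can be passed to lm() just as you would a “regular” formula, because lm() converts it to a formula for you. Not every command does this, so converting the string with as.formula() first is the safer general choice.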
lm(fml, data = swiss)
Call:
lm(formula = fml, data = swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality
          0.1041            1.0770
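For commands that insist on a real formula object, you can convert the string explicitly. The sketch below uses a shorter, hand-written string purely for illustration; f and mod3 are illustrative names.

```r
# A formula held as a character string (three predictors, for illustration)
fml <- "Fertility ~ Agriculture + Examination + Education"

f <- as.formula(fml)  # convert the string to a formula object
class(f)

mod3 <- lm(f, data = swiss)
formula(mod3)         # the stored formula is now a proper formula object
```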
One use for building a formula is in model testing. For example, you might create a regression model containing five predictors when only the first three are really necessary. You can re-build the formula term by term and extract the R² value at each step. This would show you how the explained variance alters as you add more variables.
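The term-by-term idea can be sketched as a short loop. This is one possible way to do it, assuming the predictors are added in the order they appear in the original formula; r2 and the helper variables are illustrative names.

```r
mod <- lm(Fertility ~ ., data = swiss)
vars <- attr(terms(formula(mod)), which = "term.labels")
rr <- deparse(formula(mod)[[2]])  # response name as a string

# Fit a model with the first i predictors and keep its R-squared
r2 <- vapply(seq_along(vars), function(i) {
  fml <- paste(rr, "~", paste(vars[1:i], collapse = " + "))
  summary(lm(as.formula(fml), data = swiss))$r.squared
}, numeric(1))

names(r2) <- vars  # label each value by the last predictor added
round(r2, 3)
```

Each value is the R² of the model containing that predictor and all those before it, so the final value matches the full model.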