Dr. Mark Gardener


On this page...

Regression Introduction

Regression models

Regression coefficients

Beta coefficients

R squared values

Graphing the regression

Regression step-by-step


Using R for statistical analyses - Multiple Regression

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page you can learn about multiple regression analysis, including how to set up models, extract the coefficients, and calculate beta coefficients and R squared values. There is a short section on graphing; see the main graph page for more detailed information about graphics in general.



I run courses in using R at various locations. If you are interested then see our Courses page or contact us for details.


My publications about R and Data Science

See my books about R and Data Science on my Publications page.

I have more projects in hand; visit my Publications page from time to time. You might also like my random essays on selected R topics in MonogRaphs. See also my Writer's Bloc page for details about my latest writing project, including R scripts developed for the book.



R is Open Source and free. Get R at the R Project page.

What is R?

R is an open-source (GPL) statistical environment modelled after S and S-Plus. The S language was developed at Bell Laboratories (AT&T) in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995, and it quickly gained a widespread audience. It is currently maintained by the R core development team, a hard-working, international group of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier, and you can copy and paste text into it from other applications (e.g. word processors). So, if you have a library of commands, it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.



Multiple regression (linear modelling) is carried out using the lm() command


Multiple Regression

R can perform multiple regression quite easily. The basic function is: lm(model, data)

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters, numbers, periods and underscores (but no other punctuation), and they should begin with a letter.

The second stage is to read your data file into memory and give it a sensible name.

Finally you need to define the model and run the analysis.
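Taken together, the three stages look something like this (a minimal sketch; the file name yield.csv is a hypothetical placeholder, and y, x1, x2 and x3 stand for your own column names):

> your.data = read.csv("yield.csv")
> field.lm = lm(y ~ x1 + x2 + x3, data = your.data)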



R has a powerful model syntax. The same syntax is used in other analyses too, e.g. ANOVA.

Linear Regression Models

The basic form of a linear regression is: y = m1x1 + m2x2 + m3x3... + c

Given a series of y values and series of x1, x2, etc. you can determine the coefficients (the m values) and the intercept (c). You can also determine the relative strengths of the factors and how well correlated each factor (or combination) is with y.

The general form of models in R is:

y ~ x1 + x2...

There are a number of options, depending upon your data set. Let's consider the situation where you have a single dependent variable (y) and 3 factors that you think are important in determining y; we'll call them x1, x2 and x3. In reality we'd give them more meaningful names. You can set up your model in a number of ways:

Model               Meaning
y ~ x1              y is modelled by x1 only, a simple regression
y ~ x1 + x2         y is modelled by x1 and x2, as in a multiple regression
y ~ x1 + x2 + x3    y is modelled by x1, x2 and x3, as in a multiple regression
y ~ x1 * x2         y is modelled by x1, x2 and also the interaction between them
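A note on the last row: y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2, where the colon denotes the interaction term alone, so these two commands (using the lm() function described next) fit the same model:

> lm(y ~ x1 * x2, data = your.data)
> lm(y ~ x1 + x2 + x1:x2, data = your.data)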

To run an analysis you use the lm() function on your data e.g.

> lm(y ~ x1 + x2 + x3, data = your.data)

It is good practice to assign the result of the analysis to a variable; you can then do further things with the result without having to re-run the analysis each time. For example:

> field.lm = lm(y ~ x1 + x2 + x3)

If you now type the name of your new variable you see the result; something like this:

> field.lm

Call:
lm(formula = y ~ x1 + x2 + x3)

Coefficients:
(Intercept)           x1           x2           x3
     4.8401       1.3196       0.8252       0.5266

The summary() command will produce a tidy output of the results of an lm() command


This is fine but the information is a bit thin. To get more detail you can use summary() on your lm() result. In this case you would see:

> summary(field.lm)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7190 -2.4540 -0.9873  2.9214  7.8078

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.8401     8.9407   0.541  0.59906
x1            1.3196     0.3346   3.944  0.00230 **
x2            0.8252     0.6320   1.306  0.21832
x3            0.5266     0.6999   0.752  0.46756
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.63 on 11 degrees of freedom
Multiple R-squared: 0.6581,  Adjusted R-squared: 0.5648
F-statistic: 7.056 on 3 and 11 DF,  p-value: 0.0065

This is much more useful; you can see the coefficients and which factors are significant, as well as the overall R-squared value for your model. The final statistic shows the significance of the overall model. In the example above only x1 proved to be significant.


You can use the basic coefficients from an lm() result to calculate beta coefficients, which are standardised against one another

With beta coefficients you can determine the R-squared value for each variable


Regression coefficients

Once you have the basic information you may wish to delve further and examine more components of your regression model. The basic lm() function gives you a list of the coefficients, but you may wish to use them individually. To see an individual coefficient you can use:

> variable = lm(model)
> variable$coef["factor"]

In the example above the basic model was called field.lm

To see the coefficient for x1 you would type

> field.lm$coeff["x1"]
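Incidentally, $coeff works even though the component is really called coefficients, because R matches list names partially after $. The standard accessor function coef() does the same job:

> coef(field.lm)          # all the coefficients, including the intercept
> coef(field.lm)["x1"]    # just the coefficient for x1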


Beta coefficients

You can now determine the beta coefficients; these are the coefficients standardised against one another to show the relative strengths of the factors. A beta coefficient is determined mathematically as:

beta = coeff * SD(x) / SD(y)

Where SD is the standard deviation.

First assign a variable for each coefficient e.g. > coeff.x1 = your.lm$coeff["factor"]

Next determine the beta coefficient e.g. > beta.x1 = coeff.x1 * sd(x1) / sd(y)
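Calculating these one at a time is tedious if you have many factors. Here is a minimal sketch that computes all the beta coefficients at once; it assumes (hypothetically) that your data frame is called your.data, that the response column is y, and that the model is field.lm from above:

> cf = coef(field.lm)[-1]     # the coefficients, minus the intercept
> beta = cf * sapply(your.data[names(cf)], sd) / sd(your.data$y)
> beta                        # all the beta coefficients in one go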


R squared

If you use summary(your.lm) you can see extra information such as the R squared value, which tells you how strong the fit is (the proportion of the variance explained). However, R only shows you the value for the overall model. You can find the individual R-squared values once you know the beta coefficients. Mathematically each R-squared value is: R2 = beta * r, where r is the correlation between y and the x factor. To get r you use cor(y, x).

So, in R you type:

> R2.x1 = beta.x1 * cor(y, x1)

You merely alter the names of the variables to suit your data.
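Continuing the sketch above, you can compute every component in one go. For an ordinary least-squares fit the components sum to the overall Multiple R-squared, which makes a handy check:

> r = sapply(your.data[names(cf)], function(x) cor(your.data$y, x))
> R2 = beta * r               # R-squared component for each factor
> sum(R2)                     # should match summary(field.lm)$r.squared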


Top

Navigation Index

 

The plot() command can accept input in two forms:


plot(x, y)
plot(y ~ x, data = your.data)

Graphing the regression

Now you have a basic numeric summary of the regression; a graphical summary would also be nice. There are two basic graphs that you can call on quickly to summarize your data.

> pairs(your.data)
> plot(x.var, y.var)

The first graph draws a scatter plot for each pair of variables. This is a useful quick summary but can be rather messy if you have lots of factors. You run the pairs plot on the original data, not on the linear model.

The second plot will produce a scatter graph of any pair of variables; a tidier method is to use the formula syntax:

> plot(y.var ~ x.var, data = your.data)

You might want to add a best-fit line to the scatter plot.

> abline(lm(y.var ~ x.var))
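Putting the plotting commands together (the axis labels are hypothetical placeholders; substitute your own):

> plot(y ~ x1, data = your.data, xlab = "Factor x1", ylab = "Response y")
> abline(lm(y ~ x1, data = your.data))    # add the line of best fit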

When you draw a graph in R the graph appears in a separate window. You can resize this and copy it to the clipboard for use in another program. You can also print directly from R.



Regression step-by-step

Here is a step by step guide to performing a regression. Just copy the commands you need (one at a time) and paste into R. Edit as required for your data set and variable names.

Step-by-step Regression

First create your data file. Use a spreadsheet and make each column a variable; each row is a replicate. The first row should contain the variable names. Save this as a .CSV file.
Read the data into R and give it a name: your.data = read.csv(file.choose())
Allow the variables within the data to be accessible to R: attach(your.data)
Have a first look at the data as a pairs graph (plots all combinations as scatter plots): pairs(your.data)
Decide on the model, run it and assign the result to a new variable: your.lm = lm(y.var ~ x1.var + x2.var + x3.var)
See the basic coefficients of your regression: your.lm
Get a more detailed summary of your regression: summary(your.lm)
Examine an individual coefficient: your.lm$coeff["x1.var"]
Calculate the beta coefficients (you will need one for each x factor): beta.x1 = your.lm$coeff["x1.var"] * sd(x1.var) / sd(y.var)
Display all your beta coefficients: cat(beta.x1, beta.x2, beta.x3)
Calculate the R-squared components (you will need one for each x factor): R2.x1 = beta.x1 * cor(y.var, x1.var)
Display all your R-squared values: cat(R2.x1, R2.x2, R2.x3)
Plot a graph of two variables from your regression: plot(x.var, y.var, xlab = "x-label", ylab = "y-label")
Add a line of best fit: abline(lm(y.var ~ x.var))
Detach the data when you are finished: detach(your.data)

The same commands are gathered into a single script below.
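Here is the whole sequence as one copy-and-paste sketch (y.var, x1.var, x2.var and x3.var are placeholders for your own column names):

> your.data = read.csv(file.choose())    # pick your .CSV file
> attach(your.data)                      # make the columns accessible
> pairs(your.data)                       # scatter plots of all pairs
> your.lm = lm(y.var ~ x1.var + x2.var + x3.var)
> summary(your.lm)                       # coefficients, R-squared, p-values
> beta.x1 = your.lm$coeff["x1.var"] * sd(x1.var) / sd(y.var)
> R2.x1 = beta.x1 * cor(y.var, x1.var)   # repeat for x2.var and x3.var
> plot(x1.var, y.var, xlab = "x1", ylab = "y")
> abline(lm(y.var ~ x1.var))             # add the line of best fit
> detach(your.data)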

 