Dr. Mark Gardener

Using R for statistical analyses - ANOVA

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (the Courses page has details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page you can learn how to conduct analysis of variance, including one-way anova, post-hoc testing and more complex anova models.

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T Bell Laboratories, starting in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. There you will find directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.


ANOVA - analysis of variance

The analysis of variance is a commonly used method to determine differences between several samples. R provides a function to conduct ANOVA so: aov(model, data)

  1. The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers, but the only punctuation allowed is a period or an underscore. You need to set out your data file so that each column represents a variable in your analysis. Usually the 1st column will be your dependent variable (i.e. what you are actually measuring) and subsequent columns will be the independent factors (e.g. site, treatment).
  2. The second stage is to read your data file into memory and give it a sensible name.
  3. The next stage is to attach your data set so that the individual variables are read into memory.
  4. Next you need to define the model and run the analysis.
  5. Finally detach the data set.
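
As a rough sketch, the five stages might look like this in practice. The data are the eight rows shown in the one-way example on this page; the temporary file and the names fly and csv.path are placeholders for illustration only, and in real use you would point read.csv() at your own file.

```r
# The eight example rows from the one-way anova section of this page.
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63),
  sugar  = c("C", "C", "C", "F", "F", "F", "S", "S")
)

# Stage 1: the data live in a .CSV file (a temporary file stands in for yours).
csv.path <- tempfile(fileext = ".csv")
write.csv(fly, csv.path, row.names = FALSE)

# Stage 2: read the file into memory and give it a sensible name.
your.data <- read.csv(csv.path)

# R 4.0 and later read text columns as character strings, so make the
# grouping column a factor explicitly.
your.data$sugar <- factor(your.data$sugar)

# Stage 3: attach the data set so the columns can be used by name.
attach(your.data)

# Stage 4: define the model and run the analysis.
your.aov <- aov(growth ~ sugar)

# Stage 5: detach the data set.
detach(your.data)

print(summary(your.aov))
```

attach() is convenient for interactive work. A common alternative is to skip stages 3 and 5 and pass the data frame directly, as in aov(growth ~ sugar, data = your.data).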

R uses a powerful model syntax e.g. y ~ x1 * x2 that allows you to specify complex analyses.

ANOVA One-way

Analysis of variance and regression have much in common. Both examine a dependent variable and determine the variability of this variable in response to various factors. The simplest ANOVA would be where you have a single dependent variable and one single factor. For example, you may have raised broods of flies on various sugars. You measure the size of the individual flies and record the diet for each. Your data file would consist of two columns; one for growth and one for sugar. e.g.

Example data file for 1-way anova
growth  sugar
75      C
72      C
73      C
61      F
67      F
64      F
62      S
63      S

... and so on. In this case you have a column for the dependent variable (growth) and a column for the independent factor (sugar). The first column contains numeric data but the second contains letters. You could assign a number to each diet but it is more meaningful to assign a character string. If you do use numbers, convert the column with factor() before running the analysis; otherwise R treats the codes as a continuous variable rather than as groups. Either way, the results are easier to interpret if you use meaningful names. As with variable names, it is safest to keep any punctuation in these labels to a period.
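
One point worth checking for yourself: when diets are coded as numbers, aov() fits a single slope unless the column is turned into a factor. The sketch below uses the eight growth values from the example table; the numeric coding (1 = C, 2 = F, 3 = S) is an assumption made up for this illustration.

```r
growth    <- c(75, 72, 73, 61, 67, 64, 62, 63)
diet.code <- c(1, 1, 1, 2, 2, 2, 3, 3)  # hypothetical codes: 1 = C, 2 = F, 3 = S

# Numeric codes are treated as a continuous predictor (one degree of freedom).
aov.numeric <- aov(growth ~ diet.code)

# Converting to a factor with meaningful labels gives one group per diet.
sugar      <- factor(diet.code, labels = c("C", "F", "S"))
aov.factor <- aov(growth ~ sugar)

# Compare the degrees of freedom (model term first, then residuals).
df.numeric <- summary(aov.numeric)[[1]]$Df
df.factor  <- summary(aov.factor)[[1]]$Df
print(df.numeric)
print(df.factor)
```

With three diets the factor version should report two degrees of freedom for the diet term, which is what a one-way anova needs.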

The next step is to run the analysis. It is always a good idea to assign the result of the analysis to a variable so:

> your.aov = aov(growth ~ sugar)

Notice here the funny symbol (a tilde) in the model. This means take growth as the dependent variable, which depends on sugar. You will see more complex models later but the form is similar to that used in multiple regression.

To see the result of the analysis type in the name of the variable you gave it e.g.

> your.aov

Call:
aov(formula = growth ~ sugar)

Terms:
                    sugar Residuals
Sum of Squares  1146.0539  264.9286
Deg. of Freedom         5        51

Residual standard error: 2.279184
Estimated effects may be unbalanced
>

The basic result does not give a great deal of information. You need to view the summary so try:

> summary(your.aov)

            Df  Sum Sq Mean Sq F value    Pr(>F)
sugar        5 1146.05  229.21  44.124 < 2.2e-16 ***
Residuals   51  264.93    5.19

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

This is rather more useful as we can now see the F-value and the level of significance.
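
If you want the F-value and p-value as numbers rather than as printed text, they can be pulled out of the summary table. This is a minimal sketch using only the eight rows shown in the example data file (so the figures differ from the full output above); the names fly, aov.table, f.value and p.value are chosen here for illustration.

```r
# The eight example rows; sugar is made a factor explicitly.
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63),
  sugar  = factor(c("C", "C", "C", "F", "F", "F", "S", "S"))
)

# Passing data = avoids the need to attach the data frame.
your.aov <- aov(growth ~ sugar, data = fly)

# summary() returns a list; its first element is the anova table.
aov.table <- summary(your.aov)[[1]]

# Row 1 is the sugar term, row 2 is the residuals.
f.value <- aov.table[["F value"]][1]
p.value <- aov.table[["Pr(>F)"]][1]

print(f.value)
print(p.value)
```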



Tukey HSD is the most commonly used post-hoc test.

Post-hoc testing

So far you have conducted a simple one-way anova. In this instance you see that there is a significant effect of diet upon growth. However, there are 6 treatments. You would like to know which of these treatments are significantly different from the controls and from other treatments. You need a post-hoc test. R provides a simple function to carry out the Tukey HSD test.

> TukeyHSD(your.aov)

This will show all the paired comparisons like so:

> TukeyHSD(your.aov)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = growth ~ sugar)

$sugar
               diff        lwr        upr     p adj
F-C      -11.900000 -14.917873  -8.882127 0.0000000
F.G-C    -12.100000 -15.117873  -9.082127 0.0000000
G-C      -10.800000 -13.817873  -7.782127 0.0000000
S-C       -6.000000  -9.017873  -2.982127 0.0000045
test-C    -4.814286  -8.139820  -1.488751 0.0010850
F.G-F     -0.200000  -3.217873   2.817873 0.9999573
G-F        1.100000  -1.917873   4.117873 0.8873810
S-F        5.900000   2.882127   8.917873 0.0000064
test-F     7.085714   3.760180  10.411249 0.0000010
G-F.G      1.300000  -1.717873   4.317873 0.7968101
S-F.G      6.100000   3.082127   9.117873 0.0000032
test-F.G   7.285714   3.960180  10.611249 0.0000005
S-G        4.800000   1.782127   7.817873 0.0002701
test-G     5.985714   2.660180   9.311249 0.0000322
test-S     1.185714  -2.139820   4.511249 0.8963383

>

The output shows you the difference in means for each pair (diff), the lower and upper bounds of the 95% confidence interval (lwr and upr) and the adjusted p-value (p adj) of each pairwise comparison. All we need to know!
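
With many treatments the list of comparisons gets long, so it can help to keep only the pairs that differ significantly. The sketch below again uses just the eight rows from the example data file; the 0.05 cut-off and the names comparisons and significant are assumptions for illustration.

```r
# The eight example rows; TukeyHSD() needs the grouping column to be a factor.
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63),
  sugar  = factor(c("C", "C", "C", "F", "F", "F", "S", "S"))
)

your.aov <- aov(growth ~ sugar, data = fly)
tukey    <- TukeyHSD(your.aov)

# The result is a list with one matrix per factor in the model.
comparisons <- tukey$sugar

# Keep only the rows whose adjusted p-value is below 0.05.
significant <- comparisons[comparisons[, "p adj"] < 0.05, , drop = FALSE]
print(round(significant, 4))
```

The drop = FALSE argument keeps the result as a matrix even when only one pair passes the cut-off.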



Use the model syntax to specify complex analyses in R

ANOVA models

So far you have only considered a simple one-way analysis. However, you will often have a more complex situation with several factors. The interaction between factors may also be important. Fortunately R has a model syntax that works for many sorts of analysis. Look at the section on Linear Regression Models for examples. When conducting an anova you have a single dependent variable and a number of explanatory factors.

You set up your anova in a general way: dependent ~ explanatory1... explanatory2...

The model can take a variety of forms:

Model               Meaning
y ~ x1              y is explained by x1 only, a one-way anova
y ~ x1 + x2         y is explained by x1 and x2, a two-way anova
y ~ x1 + x2 + x3    y is explained by x1, x2 and x3, a 3-way anova
y ~ x1 * x2         y is explained by x1, x2 and also by the interaction between them

In reality you would give the variables more meaningful names. However, you can see that it is pretty simple to alter your basic model to cope with more complex analyses.
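
One way to check what a model formula will fit, before running any analysis, is to ask R to list its terms. This needs no data at all. The helper name term.names is made up for this sketch; terms() and its "term.labels" attribute are part of base R.

```r
# Hypothetical helper: list the terms that a model formula expands to.
term.names <- function(model) attr(terms(model), "term.labels")

print(term.names(y ~ x1))            # "x1"
print(term.names(y ~ x1 + x2))       # "x1" "x2"
print(term.names(y ~ x1 + x2 + x3))  # "x1" "x2" "x3"

# The * operator adds the interaction, written x1:x2, to the main effects.
print(term.names(y ~ x1 * x2))       # "x1" "x2" "x1:x2"

# So y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2.
same.model <- identical(term.names(y ~ x1 * x2),
                        term.names(y ~ x1 + x2 + x1:x2))
print(same.model)                    # TRUE
```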


ANOVA Step by Step

  1. First create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate. The first row should contain the variable names. Save this as a .CSV file.
  2. Read your data into R and assign it to a variable. This opens a window in which you select your file: your.data = read.csv(file.choose())
  3. Allow R to read the variables within the data file: attach(your.data)
  4. Decide on the anova model and run the analysis: your.aov = aov(dependent ~ explanatory)
  5. View the result: summary(your.aov)
  6. Carry out pairwise post-hoc testing using the Tukey HSD test: TukeyHSD(your.aov)
  7. Close your data file for tidiness: detach(your.data)

 