
On this page...

Kruskal-Wallis test
Kruskal-Wallis Stacked
Kruskal Post-Hoc test
Studentized Range Q
Selecting sub-sets
Friedman test
Friedman post-hoc
Rank data ANOVA


Using R for statistical analyses - Non-parametric statistics

This page is intended to help you get to grips with the powerful statistical program called R. It is not intended as a course in statistics (see the Courses page for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page you can learn how to perform some non-parametric statistical tests on multiple samples. Routines covered include the Kruskal-Wallis and Friedman tests. Learn also how to carry out a post-hoc analysis on the Kruskal-Wallis test, and about the studentized range (Q), a useful statistic in many post-hoc testing situations.



What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed in the late 1980s at AT&T labs. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995, and it has quickly gained a widespread audience. It is currently maintained by the R core development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R; at this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier, and you can copy and paste text into it from other applications (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.



The kruskal.test() command carries out the Kruskal-Wallis test


Kruskal-Wallis test

When you have more than two samples to compare you would usually attempt to use analysis of variance (see the section on anova). However, if the data are not normally distributed then a non-parametric alternative must be sought. This is where the Kruskal-Wallis test comes in. It is designed to test for significant differences in population medians when you have more than two samples (otherwise you would use the U-test). You can think of the K-W test as a non-parametric version of one-way anova.

In order to carry out a Kruskal-Wallis test we need some data. The simplest arrangement is a file with several columns, one for each sample. Usually it is best to create your data in a spreadsheet and then save it as a CSV file for reading into R. See the sections on creating data and reading data into R in the introduction for more details.
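For example, if you had saved your spreadsheet as a CSV file called carbs.csv (a hypothetical file name used here for illustration), you could read it in with the read.csv() command:

> carbs = read.csv("carbs.csv")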

Here is a data file. The numbers represent the growth of an insect fed upon a variety of sugar diets. Each column is a sample of a single diet.

> carbs

    C  G  F F.G  S test
1  75 57 58  58 62   63
2  67 58 61  59 66   64
3  70 60 56  58 65   66
4  75 59 58  61 63   65
5  65 62 57  57 64   67
6  71 60 56  56 62   68
7  67 60 61  58 65   64
8  67 57 60  57 65   NA
9  76 59 57  57 62   NA
10 68 61 58  59 67   NA

The Kruskal-Wallis test is carried out using the kruskal.test() function. In this case you type:

> carbs.kw = kruskal.test(carbs)
> carbs.kw

Kruskal-Wallis rank sum test

data: carbs
Kruskal-Wallis chi-squared = 46.0217, df = 5, p-value = 8.99e-09

You can see from the result that there is a significant difference in growth with diet. However, at this stage you cannot tell which of the treatments is different from which. You can get an overview by creating a simple boxplot of the data:

> boxplot(carbs)


You can specify the data as a formula of the form y ~ groups, or simply give the data frame itself if the samples are in separate columns


Kruskal-Wallis and stacked data

The Kruskal-Wallis test can be run on data that are arranged by sample in columns. However, your data may be arranged in a different configuration. The most useful configuration for our data would be two columns, one containing the numeric values and the other containing the group (or factor). This configuration is usually used for analysis of variance.

It is generally best to create your data in a spreadsheet and then save as a CSV file for reading into R. See the sections on creating data and reading data into R in the introduction for more details.

Here is an example of part of a data file (it is the same as the one used above). The data represent the growth of an insect when fed upon a range of diets containing different sugars. The first column gives the growth whilst the second gives the grouping.

> carbs

  growth sugar
      75     C
      72     C
      73     C
      61     F
      67     F
      64     F
      62     S
      63     S

The Kruskal-Wallis test is carried out using the kruskal.test() function as before. You can type the command in one of two forms. For the above data you might use:

> attach(carbs)
> carbs.kw = kruskal.test(growth, sugar)

Alternatively you can use the model syntax that is also used for anova and regression amongst others.

> carbs.kw = kruskal.test(growth ~ sugar, data= carbs)

The second method is to be preferred as it is not necessary to attach the data file; the variables are found via the data= carbs parameter in the model syntax, which is not available with the first method. Whichever method is chosen the output is the same:

> carbs.kw

Kruskal-Wallis rank sum test

data: growth by sugar
Kruskal-Wallis chi-squared = 46.0217, df = 5, p-value = 8.99e-09

Of course you get an identical result to that obtained before. If you want to create a boxplot of the data this time you use a slightly different syntax:

> boxplot(growth ~ sugar, data= carbs)


You can carry out a post-hoc test using the Studentized Range (Q) in a method similar to that of Tukey


Post-hoc test for Kruskal-Wallis

Opinions vary amongst statisticians about the best way to conduct a post-hoc test on non-parametric data. Here I will describe an approach based upon the Tukey method and described in Sokal and Rohlf (1995).

There is no built-in function in R that will conduct this post-hoc analysis; you will have to do it 'long-hand'. The basic idea is that you conduct a pairwise U-test and then compare the result to a critical value based upon the studentized range (the statistic used in 'regular' post-hoc analyses, e.g. Tukey HSD).

After you have carried out the pairwise U-test you calculate a critical U value using the studentized range (Q). For samples of equal size n the formula (following Sokal and Rohlf 1995) is:

U.crit = n^2/2 + Q * sqrt( n^2 * (2*n + 1) / 24 )

If your value of U is greater than this calculated critical value then the pairwise comparison is significant. If you have unequal sample sizes then you must calculate the harmonic mean of the sample sizes (n) like so:

n = 2 * n1 * n2 / (n1 + n2)
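For example, with sample sizes of 10 and 7 (made-up values) the harmonic mean is easily found in R:

> n1 = 10; n2 = 7
> n = 2 * n1 * n2 / (n1 + n2)   # harmonic mean of the sample sizes, approx. 8.24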

Of course you may easily re-arrange the equation to work out the value of Q that corresponds to a given U:

Q = (U - n^2/2) / sqrt( n^2 * (2*n + 1) / 24 )

Now that you know the maths to use you need to work out how to obtain values of Q.



Studentized Range - Q

The studentized range statistic is commonly used in post-hoc analyses. The distribution function is built into R and you can access it in one of two ways:

ptukey(q, nmeans, df)
qtukey(p, nmeans, df)

In the first case you input a Q value and get the corresponding cumulative probability (a confidence level). In the second case you input a probability and get the corresponding Q value. In both cases the nmeans parameter is a numerical value that corresponds to the number of samples that were in the original analysis. The df parameter is the degrees of freedom; for this post-hoc test you use infinity (i.e. df= Inf).
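As a quick check that the two functions are inverses of one another, try the following (with nmeans= 6, matching the six samples of the carbs data):

> Q = qtukey(0.95, nmeans= 6, df= Inf)   # critical Q for 6 samples, approx. 4.03
> ptukey(Q, nmeans= 6, df= Inf)          # returns 0.95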

With respect to our Kruskal-Wallis post-hoc test the easiest way to proceed would be to calculate the value of Q that results from the U value given by the pairwise U-test (using the re-arranged formula shown above).

Next enter the values into the ptukey() command using:

q= the value of Q you just found.
nmeans= the number of samples in the original K-W test (e.g. 6 for our carbs data set)
df= Inf

The result is a cumulative probability (a confidence level), not a p-value. You need to use 1 - your result to get a p-value.
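Here is a minimal sketch of the whole calculation, assuming equal sample sizes of n = 10 and supposing that the larger U value from a pairwise comparison came out as 95 (a made-up value):

> n = 10                                         # sample size in each group
> U = 95                                         # larger U from the pairwise U-test (hypothetical)
> Q = (U - n^2/2) / sqrt(n^2 * (2*n + 1) / 24)   # convert U to Q, approx. 4.81
> 1 - ptukey(Q, nmeans= 6, df= Inf)              # p-value for this comparison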


An alternative method would be to work out the critical value of Q first of all. Use:

qtukey(CI, nmeans, df= Inf)

Where CI is your chosen confidence level (e.g. 0.95) and nmeans= the number of samples in the original K-W test (e.g. 6 for our carbs data).

Then you would calculate the critical value of U using the equation as shown above.

Finally you would carry out the pairwise U-test and compare the U value to your critical value (of course you would ignore the actual result of the U-test as you are only interested in the U value).
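As a sketch of this alternative route, again assuming n = 10 per sample and the six samples of the carbs data:

> n = 10
> Q.crit = qtukey(0.95, nmeans= 6, df= Inf)              # critical Q, approx. 4.03
> U.crit = n^2/2 + Q.crit * sqrt(n^2 * (2*n + 1) / 24)   # critical U, approx. 87.7

Any pairwise comparison whose larger U value exceeds this critical value is significant at the 5% level.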

Which U-value?

However, there is a potential problem. When a U-test is carried out there are two possible U values. When calculating U by hand you ignore the larger one and compare the smaller against a table of critical values. For your post-hoc test you actually want the larger of the two U values. However, R only displays one value, and whether it is the larger or the smaller depends upon the order in which you entered the variables for comparison.

If you used a data frame with multiple sample columns then you will have to repeat the test with the variables in the reverse order so that you can decide which is the larger U value (alternatively you can work it out from the fact that U1 + U2 = n1 * n2).

If, however, your data are in the stacked form then wilcox.test() displays the U value that arises from the order of the variables in the data set. Since you cannot repeat the test with the variables in any other order, you will have to work out the other U value and select the larger (see the sketch below). Since the stacked data set contains all the samples you will also have to select which ones to include in the pairwise comparison; how to do this is detailed below.
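A short sketch of recovering the larger U value, assuming two samples of 10 observations each and that wilcox.test() reported W = 5 (a made-up value):

> n1 = 10; n2 = 10
> W = 5                  # the U value reported by wilcox.test()
> max(W, n1 * n2 - W)    # the larger of the two U values, here 95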


The subset instruction allows you to select subsets from a larger dataset


Selecting two samples from a larger data set

If you have a stacked data set containing several samples you may wish to analyze only two of them. You need to select a subset for analysis e.g. in the carbs example above we had 6 samples but to perform post-hoc tests we need to carry out U-tests on pairs of samples.

The basic wilcox.test() command allows you to select a subset quite easily. For the carbs example you might use something like the following:

> wilcox.test(growth ~ sugar, data= carbs, subset= sugar %in% c("test", "C"))

To get the subset you use the subset= parameter. Here you have used a subset of the variable sugar. The list of samples follows the %in% part and is in the c(item1, item2) format that you have met before. The sample names must be in quotes (double quotes are usual but single quotes are okay).

The subset= parameter works with other functions too, for example you may wish to run our Kruskal-Wallis test on all the monosaccharides e.g.

> kruskal.test(growth ~ sugar, data= carbs, subset= sugar %in% c("G", "F", "F.G"))

This command would run the test and analyze differences between the three samples (G, F and F.G) only.


The friedman.test() command carries out the Friedman test for replicated block designs


Friedman test (in lieu of two-way anova)

The Friedman test is essentially a two-way analysis of variance used on non-parametric data. The test only works when you have a completely balanced design (one observation for each combination of group and block). Here is an example of a data file.

> survey

   count month year
1      2     1 2004
2     48     1 2005
3     40     1 2006
4      3     2 2004
5    120     2 2005
6     81     2 2006
7      2     3 2004
8     16     3 2005
9     36     3 2006
10     7     4 2004
11    21     4 2005
12    17     4 2006
13     2     5 2004
14    14     5 2005
15    17     5 2006

What you have here are data from surveys of amphibians. The first column (count) represents the number of individuals captured. The final column is the year in which the survey was conducted. The middle column (month) shows that there were 5 survey events in each year. This is a replicated block design: each year is a "treatment" (or "group") whilst the month variable represents a "block". This is a common sort of experimental design; the blocks are set up to take care of any possible variation and to provide replication for the treatment. In this instance you wish to know if there is any significant difference due to year.

The Friedman test allows you to carry out a test on these data. You need to determine which variable is the group and which is the block. The friedman.test() function allows you to perform the test; there are two ways to specify it:

> attach(survey)
> friedman.test(count, year, month)

Friedman rank sum test

data: count, year and month
Friedman chi-squared = 7.6, df = 2, p-value = 0.02237

> detach(survey)

Alternatively you can use a model syntax:

> friedman.test(count ~ year | month, data= survey)

Friedman rank sum test

data: count and year and month
Friedman chi-squared = 7.6, df = 2, p-value = 0.02237


In the first case you had to attach() the data so that the variables could be read. In the second case this is not necessary: you specify the formula as data ~ groups | blocks (here count ~ year | month) and name the data frame with data= at the end.



Post-hoc testing for Friedman tests

There is a real drop in statistical power when using a Friedman test compared to anova. Although there are methods that enable post-hoc tests (similar to the K-W post-hoc test discussed above), the power is such that obtaining significance is well nigh impossible. The best you can do is to present a boxplot of the data (dependent ~ group), as shown below.
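For the survey data used above that would be:

> boxplot(count ~ year, data= survey)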



Ranked data anova

There may be occasions when you simply need to run an anova but the data don't quite fit with a Friedman or Kruskal-Wallis test. What you may consider is to replace the original data with their ranks instead. Simply create a new dependent variable using new.variable= rank(old.variable) and then perform your analysis of variance on that. This is far from ideal but may be the only thing you can do. Watch this space...
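As a minimal sketch using the stacked carbs data from earlier (columns growth and sugar):

> carbs$growth.rank = rank(carbs$growth)   # replace the original data with ranks
> carbs.aov = aov(growth.rank ~ sugar, data= carbs)
> summary(carbs.aov)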

