Basic statistical tests Using R

Basic statistics

Really simple summary stats

R has a range of functions for carrying out summary statistics. The following table shows a few of the functions that operate on single variables.

Statistic Function
Sum sum(x)
Mean mean(x)
Median median(x)
Largest max(x)
Smallest min(x)
Standard Deviation sd(x)
Variance var(x)
Number of elements length(x)

 

Some functions produce more than one result. The quantile() function for example:

> data1
[1] 3 5 7 5 3 2 6 8 5 6 9

> quantile(data1)
 0%  25%  50%  75% 100%
2.0  4.0  5.0  6.5  9.0

The functions operate over a one-dimensional object (called a vector in R-speak). If your data are a column of a larger dataset then you’ll need to use attach() or the $ so that R can “find” your data.

> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120

> mean(Speed)
Error in mean(Speed) : object ‘Speed’ not found

> mean(mf$Speed)
[1] 15.8

> attach(mf)
> quantile(Algae)
0%  25%  50%  75% 100%
25   40   65   75   85
> detach(mf)

If your data contain NA elements, the result may show as NA. You can overcome this in many (but not all) commands by adding na.rm = TRUE as a parameter.

> nad
[1] NA  3  5  7  5  3  2  6  8  5  6  9 NA NA
> mean(nad)
[1] NA
> mean(nad, na.rm = TRUE)
[1] 5.363636

T-test

Student’s t-test is a classic method for comparing mean values of two samples that are normally distributed (i.e. they have a Gaussian distribution). Such samples are described as being parametric and the t-test is a parametric test. In R the t.test() command will carry out several versions of the t-test.

t.test(x, y, alternative, mu, paired, var.equal, …)
  • x – a numeric sample.
  • y – a second numeric sample (if this is missing the command carries out a 1-sample test).
  • alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
  • mu – the true value of the mean (or mean difference). The default is 0.
  • paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
  • equal – the default is var.equal = FALSE. This treats the variance of the two samples separately. If you set var.equal = TRUE you conduct a classic t-test using pooled variance.
  • … – there are additional parameters that we aren’t concerned with here.

In most cases you want to compare two independent samples:

> mow
[1] 12 15 17 11 15
> unmow
[1] 8 9 7 9

> t.test(mow, unmow)

Welch Two Sample t-test

data:  mow and unmow
t = 4.8098, df = 5.4106, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.745758 8.754242
sample estimates:
mean of x mean of y
14.00      8.25

If you specify a single variable you can carry out a 1-sample test by specifying the mean to compare against:

> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> t.test(data1, mu = 5)

One Sample t-test

data:  data1
t = 0.55902, df = 10, p-value = 0.5884
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.914249 6.813024
sample estimates:
mean of x
5.363636

If you have matched pair data you can specify paired = TRUE as a parameter (see more in 14.4).

U-test

The U-test is used for comparing the median values of two samples. You use it when the data are not normally distributed, so it is described as a non-parametric test. The U-test is often called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum test), hence in R the command is wilcox.test().

wilcox.test(x, y, alternative, mu, paired, …)

  • x – a numeric sample.
  • y – a second numeric sample (if this is missing the command carries out a 1-sample test).
  • alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
  • mu – the true value of the median (or median difference). The default is 0.
  • paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
  • … – there are additional parameters that we aren’t concerned with here.
> wilcox.test(Grass, Heath)

Wilcoxon rank sum test with continuity correction

data:  Grass and Heath
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0

If there are tied ranks you may get a warning message about exact p-values. You can safely ignore this!

Paired tests

The t-test and the U-test can both be used when your data are in matched pairs. Sometimes this kind of test is also called a repeated measures test (depending on circumstance). You can run the test by adding paired = TRUE to the appropriate command.

Here is an example where the data show the effectiveness of greenhouse sticky traps in catching whitefly. Each trap has a white side and a yellow side. To compare white and yellow we can use a matched pair.

> mpd
white yellow
1     4      4
2     3      7
3     4      2
4     1      2
5     6      7
6     4     10
7     6      5
8     4      8

> attach(mpd)
> t.test(white, yellow, paired = TRUE, var.equal = TRUE)

Paired t-test

data:  white and yellow
t = -1.6567, df = 7, p-value = 0.1415
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.9443258  0.6943258
sample estimates:
mean of the differences
-1.625

> detach(mpd)

You can do a similar thing with the wilcox.test().

Chi Squared tests

Tests for association are easily carried out using the chisq.test() command. Your data need to be arranged as a contingency table. Here is an example:

> bird
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

In this dataset the columns form one set of categories (habitats) and the rows form another set (bird species). In the original spreadsheet (CSV file) the first column contains the bird species names; these are “converted” to row names when the data are imported:

> bird = read.csv(file = “birds.csv”, row.names = 1)

You can carry out a test for association with the chisq.test() function. This time though, you should assign the result to a named variable.

> cs = chisq.test(bird)
Warning message:
In chisq.test(bird) : Chi-squared approximation may be incorrect

> cs

Pearson’s Chi-squared test
data:  bird
X-squared = 78.274, df = 20, p-value = 7.694e-09

In this instance you get a warning message (this is because there are expected values < 5). The basic “result” shows the overall significance but there are other components that may prove useful:

> names(cs)
[1] “statistic” “parameter” “p.value”   “method”    “data.name” “observed”
[7] “expected”  “residuals” “stdres”

You can view the components using the $ like so:

> cs$expected
                 Garden  Hedgerow  Parkland   Pasture  Woodland
Blackbird     59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch     17.203390  3.145763  6.783051 1.2779661 0.5898305
Great Tit     39.745763  7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305  8.027119 17.308475 3.2610169 1.5050847
Robin          8.305085  1.518644  3.274576 0.6169492 0.2847458
Song Thrush    5.932203  1.084746  2.338983 0.4406780 0.2033898

Now you can see the expected values. Other useful components are $residuals and $stdres, which are the Pearson residuals and the standardized residuals respectively.

You can also use square brackets to get part of the result e.g.

> cs$stdres[1, ]
    Garden   Hedgerow   Parkland    Pasture   Woodland
-3.2260031 -0.3771774  4.7468743 -1.4651818 -0.0471460

Now you can see the standardized residuals for the Blackbird row.

Yates’ correction

When using a 2 x 2 contingency table it is common practice to reduce the Observed – Expected differences by 0.5. To do this add correct=TRUE to the original function e.g.

> your.chi = chisq.test(your.data, correct=TRUE)

If your table is larger than 2 x 2 then the correction will not be applied (the basic test will run instead) even if you request it.

Goodness of Fit test

A goodness of fit test is a special kind of test of association. You use it when you have a set of data in categories that you want to compare to a “known” standard. A classic example is in genetics, where the theory suggests a particular ratio of phenotypes:

> pea
[1] 116  40  31  13

> ratio
[1] 9 3 3 1

Here you have counts of pea plants (there are 4 phenotypes) from an experiment in cross-pollination. The genetic theory suggests that your results should be in the ratio 9:3:3:1. Are these results in that ratio? The goodness of fit test will tell you.

> gfit = chisq.test(pea, p = ratio, rescale.p = TRUE)
> gfit

Chi-squared test for given probabilities
data:  pea
X-squared = 1.4222, df = 3, p-value = 0.7003

Notice that the rescale.p = TRUE parameter was used to ensure that the expected probabilities sum to one. The final result is not significant, meaning that the peas observed are not statistically significantly different from the expected ratio.

The result contains the same components as the regular chi squared test:

> gfit$expected
[1] 112.5  37.5  37.5  12.5
> gfit$stdres
[1]  0.4988877  0.4529108 -1.1775681  0.1460593
Comments are closed.