Basic statistical tests Using R
R can carry out a wide range of statistical analyses. Some of the simpler ones include:
- Summary statistics (e.g. mean, standard deviation).
- Two-sample differences tests (e.g. t-test).
- Non-parametric tests (e.g. U-test).
- Matched pairs tests (e.g. Wilcoxon).
- Association tests (e.g. Chi squared).
- Goodness of Fit tests.
These are the subject of this section.
Basic statistics
Really simple summary stats
R has a range of functions for carrying out summary statistics. The following table shows a few of the functions that operate on single variables.
Statistic | Function |
Sum | sum(x) |
Mean | mean(x) |
Median | median(x) |
Largest | max(x) |
Smallest | min(x) |
Standard Deviation | sd(x) |
Variance | var(x) |
Number of elements | length(x) |
Some functions produce more than one result. The quantile() function for example:
> data1 [1] 3 5 7 5 3 2 6 8 5 6 9 > quantile(data1) 0% 25% 50% 75% 100% 2.0 4.0 5.0 6.5 9.0
The functions operate over a one-dimensional object (called a vector in R-speak). If your data are a column of a larger dataset then you’ll need to use attach() or the $ so that R can “find” your data.
> mf Length Speed Algae NO3 BOD 1 20 12 40 2.25 200 2 21 14 45 2.15 180 3 22 12 45 1.75 135 4 23 16 80 1.95 120 > mean(Speed) Error in mean(Speed) : object ‘Speed’ not found > mean(mf$Speed) [1] 15.8 > attach(mf) > quantile(Algae) 0% 25% 50% 75% 100% 25 40 65 75 85 > detach(mf)
If your data contain NA elements, the result may show as NA. You can overcome this in many (but not all) commands by adding na.rm = TRUE as a parameter.
> nad [1] NA 3 5 7 5 3 2 6 8 5 6 9 NA NA > mean(nad) [1] NA > mean(nad, na.rm = TRUE) [1] 5.363636
T-test
Student’s t-test is a classic method for comparing mean values of two samples that are normally distributed (i.e. they have a Gaussian distribution). Such samples are described as being parametric and the t-test is a parametric test. In R the t.test() command will carry out several versions of the t-test.
t.test(x, y, alternative, mu, paired, var.equal, …)
- x – a numeric sample.
- y – a second numeric sample (if this is missing the command carries out a 1-sample test).
- alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
- mu – the true value of the mean (or mean difference). The default is 0.
- paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
- equal – the default is var.equal = FALSE. This treats the variance of the two samples separately. If you set var.equal = TRUE you conduct a classic t-test using pooled variance.
- … – there are additional parameters that we aren’t concerned with here.
In most cases you want to compare two independent samples:
> mow [1] 12 15 17 11 15 > unmow [1] 8 9 7 9 > t.test(mow, unmow) Welch Two Sample t-test data: mow and unmow t = 4.8098, df = 5.4106, p-value = 0.003927 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 2.745758 8.754242 sample estimates: mean of x mean of y 14.00 8.25
If you specify a single variable you can carry out a 1-sample test by specifying the mean to compare against:
> data1 [1] 3 5 7 5 3 2 6 8 5 6 9 > t.test(data1, mu = 5) One Sample t-test data: data1 t = 0.55902, df = 10, p-value = 0.5884 alternative hypothesis: true mean is not equal to 5 95 percent confidence interval: 3.914249 6.813024 sample estimates: mean of x 5.363636
If you have matched pair data you can specify paired = TRUE as a parameter (see more in 14.4).
U-test
The U-test is used for comparing the median values of two samples. You use it when the data are not normally distributed, so it is described as a non-parametric test. The U-test is often called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum test), hence in R the command is wilcox.test().
wilcox.test(x, y, alternative, mu, paired, …)
- x – a numeric sample.
- y – a second numeric sample (if this is missing the command carries out a 1-sample test).
- alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
- mu – the true value of the median (or median difference). The default is 0.
- paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
- … – there are additional parameters that we aren’t concerned with here.
> wilcox.test(Grass, Heath) Wilcoxon rank sum test with continuity correction data: Grass and Heath W = 20.5, p-value = 0.03625 alternative hypothesis: true location shift is not equal to 0
If there are tied ranks you may get a warning message about exact p-values. You can safely ignore this!
Paired tests
The t-test and the U-test can both be used when your data are in matched pairs. Sometimes this kind of test is also called a repeated measures test (depending on circumstance). You can run the test by adding paired = TRUE to the appropriate command.
Here is an example where the data show the effectiveness of greenhouse sticky traps in catching whitefly. Each trap has a white side and a yellow side. To compare white and yellow we can use a matched pair.
> mpd white yellow 1 4 4 2 3 7 3 4 2 4 1 2 5 6 7 6 4 10 7 6 5 8 4 8 > attach(mpd) > t.test(white, yellow, paired = TRUE, var.equal = TRUE) Paired t-test data: white and yellow t = -1.6567, df = 7, p-value = 0.1415 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.9443258 0.6943258 sample estimates: mean of the differences -1.625 > detach(mpd)
You can do a similar thing with the wilcox.test().
Chi Squared tests
Tests for association are easily carried out using the chisq.test() command. Your data need to be arranged as a contingency table. Here is an example:
> bird Garden Hedgerow Parkland Pasture Woodland Blackbird 47 10 40 2 2 Chaffinch 19 3 5 0 2 Great Tit 50 0 10 7 0 House Sparrow 46 16 8 4 0 Robin 9 3 0 0 2 Song Thrush 4 0 6 0 0
In this dataset the columns form one set of categories (habitats) and the rows form another set (bird species). In the original spreadsheet (CSV file) the first column contains the bird species names; these are “converted” to row names when the data are imported:
> bird = read.csv(file = “birds.csv”, row.names = 1)
You can carry out a test for association with the chisq.test() function. This time though, you should assign the result to a named variable.
> cs = chisq.test(bird) Warning message: In chisq.test(bird) : Chi-squared approximation may be incorrect > cs Pearson’s Chi-squared test data: bird X-squared = 78.274, df = 20, p-value = 7.694e-09
In this instance you get a warning message (this is because there are expected values < 5). The basic “result” shows the overall significance but there are other components that may prove useful:
> names(cs) [1] “statistic” “parameter” “p.value” “method” “data.name” “observed” [7] “expected” “residuals” “stdres”
You can view the components using the $ like so:
> cs$expected Garden Hedgerow Parkland Pasture Woodland Blackbird 59.915254 10.955932 23.623729 4.4508475 2.0542373 Chaffinch 17.203390 3.145763 6.783051 1.2779661 0.5898305 Great Tit 39.745763 7.267797 15.671186 2.9525424 1.3627119 House Sparrow 43.898305 8.027119 17.308475 3.2610169 1.5050847 Robin 8.305085 1.518644 3.274576 0.6169492 0.2847458 Song Thrush 5.932203 1.084746 2.338983 0.4406780 0.2033898
Now you can see the expected values. Other useful components are $residuals and $stdres, which are the Pearson residuals and the standardized residuals respectively.
You can also use square brackets to get part of the result e.g.
> cs$stdres[1, ] Garden Hedgerow Parkland Pasture Woodland -3.2260031 -0.3771774 4.7468743 -1.4651818 -0.0471460
Now you can see the standardized residuals for the Blackbird row.
Yates’ correction
When using a 2 x 2 contingency table it is common practice to reduce the Observed – Expected differences by 0.5. To do this add correct=TRUE to the original function e.g.
> your.chi = chisq.test(your.data, correct=TRUE)
If your table is larger than 2 x 2 then the correction will not be applied (the basic test will run instead) even if you request it.
Goodness of Fit test
A goodness of fit test is a special kind of test of association. You use it when you have a set of data in categories that you want to compare to a “known” standard. A classic example is in genetics, where the theory suggests a particular ratio of phenotypes:
> pea [1] 116 40 31 13 > ratio [1] 9 3 3 1
Here you have counts of pea plants (there are 4 phenotypes) from an experiment in cross-pollination. The genetic theory suggests that your results should be in the ratio 9:3:3:1. Are these results in that ratio? The goodness of fit test will tell you.
> gfit = chisq.test(pea, p = ratio, rescale.p = TRUE) > gfit Chi-squared test for given probabilities data: pea X-squared = 1.4222, df = 3, p-value = 0.7003
Notice that the rescale.p = TRUE parameter was used to ensure that the expected probabilities sum to one. The final result is not significant, meaning that the peas observed are not statistically significantly different from the expected ratio.
The result contains the same components as the regular chi squared test:
> gfit$expected [1] 112.5 37.5 37.5 12.5 > gfit$stdres [1] 0.4988877 0.4529108 -1.1775681 0.1460593
Comments are closed.