Analysis of Variance (ANOVA) using R

The analysis of variance (ANOVA) is a commonly used method to determine differences between several samples. R provides a method to conduct ANOVA via the aov() function. Your data need to be arranged in a particular manner for the aov() command to work (see 13): each column should represent a single variable, like so:

> head(fly)
  size diet
1   75    C
2   67    C
3   70    C
4   75    C
5   65    C
6   71    C

In this example the first column (size) is the response variable (dependent variable) and the second column (diet) is the predictor (independent) variable. ANOVA is a method of comparing samples that are grouped. The simplest arrangement is where you have a single grouping variable (predictor); this is called one-way ANOVA. If you have two predictor variables the analysis is called two-way ANOVA and so on.

The aov() command requires a formula that describes the experimental setup. In general terms you have:

response ~ predictor

This describes a one-way ANOVA.

ANOVA One-way

In one-way anova you have a response (dependent) variable and one predictor (grouping or independent) variable. It is usually best to assign a named object to “hold” the result of aov(), as there are multiple components you can make use of.

> mod = aov(size ~ diet, data = fly)
> summary(mod)
            Df Sum Sq Mean Sq F value   Pr(>F)
diet         4 1077.3  269.33   49.37 6.74e-16 ***
Residuals   45  245.5    5.46
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

The summary() command produces an anova table, a standard way to present such results. The table gives all the information you require.

Post hoc testing

The basic ANOVA produces an overall result. In the previous example you can see a significant result; the different diets produce different sizes of fly wings! However, this is not the complete story: are all the groups different from one another, or are some different and some not? What you need is a post hoc test. This carries out pair-by-pair analyses and takes into account the fact that you are running multiple tests (and so are more likely to get a significant result by chance).

There are a number of different methods but the Tukey HSD test is the most common, and the TukeyHSD() command is built into R. You simply run the command on the result of aov(). In this case I used ordered = TRUE to make the results more readable:

 

> TukeyHSD(mod, ordered = TRUE)
Tukey multiple comparisons of means
95% family-wise confidence level
factor levels have been ordered

Fit: aov(formula = size ~ diet, data = fly)

$diet
      diff       lwr       upr     p adj
F-F.G  0.2 -2.768072  3.168072 0.9996878
G-F.G  1.3 -1.668072  4.268072 0.7256157
S-F.G  6.1  3.131928  9.068072 0.0000052
C-F.G 12.1  9.131928 15.068072 0.0000000
G-F    1.1 -1.868072  4.068072 0.8291029
S-F    5.9  2.931928  8.868072 0.0000100
C-F   11.9  8.931928 14.868072 0.0000000
S-G    4.8  1.831928  7.768072 0.0003242
C-G   10.8  7.831928 13.768072 0.0000000
C-S    6.0  3.031928  8.968072 0.0000072

The ordered = TRUE parameter makes sure that the pairs are compared in an order that ensures the differences in means are always positive. So, for example, C-F gives 11.9 as a mean difference, whereas F-C would give -11.9. This simply makes the result easier to read.

The result shows the difference in mean value as well as 95% confidence intervals. The final column gives an adjusted p-value (the significance). You can see that some of the comparisons are not significant (values > 0.05), whilst others are (values < 0.05).
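
The result of TukeyHSD() also has a plot() method, which draws the pairwise differences with their confidence intervals; this gives a quick visual check of which comparisons cross zero:

> plot(TukeyHSD(mod, ordered = TRUE), las = 1)  # las = 1 turns the axis labels horizontal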

ANOVA models

The aov() command uses a formula to describe the setup. The formula is a powerful syntax that allows you to specify complicated anova models. Here are just a few examples:

 

Model Meaning
y ~ a A one-way anova where a is a predictor (grouping variable).
y ~ a + b A two-way anova where a and b are different predictors acting independently.
y ~ a * b A two-way anova with a and b as predictors but where there is also an interaction between them.
y ~ a + b + a:b The same as above but with the interaction written explicitly.
y ~ a + b + Error(c) A two-way anova with a and b as predictors. Another variable, c, forms the error term, that is, the within-group variation. This is the way to describe repeated measures anova models.
y ~ a * b + Error(c/b) A two-way repeated measures anova where b is nested within c and there is an interaction between the predictors a and b.

 

So, the formula syntax can describe quite complicated situations! The preceding table shows but a few of the options.
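
For example, if you had a dataset like fly but with a second grouping column, you could specify a two-way ANOVA with an interaction like this (fly2 and sex are made-up names, purely for illustration):

> mod2 = aov(size ~ diet * sex, data = fly2)  # fly2 and sex are hypothetical
> summary(mod2)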

Basic statistical tests using R

Basic statistics

Really simple summary stats

R has a range of functions for carrying out summary statistics. The following table shows a few of the functions that operate on single variables.

Statistic Function
Sum sum(x)
Mean mean(x)
Median median(x)
Largest max(x)
Smallest min(x)
Standard Deviation sd(x)
Variance var(x)
Number of elements length(x)

 

Some functions produce more than one result. The quantile() function for example:

> data1
[1] 3 5 7 5 3 2 6 8 5 6 9

> quantile(data1)
 0%  25%  50%  75% 100%
2.0  4.0  5.0  6.5  9.0

The functions operate over a one-dimensional object (called a vector in R-speak). If your data are a column of a larger dataset then you’ll need to use attach() or the $ so that R can “find” your data.

> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120

> mean(Speed)
Error in mean(Speed) : object ‘Speed’ not found

> mean(mf$Speed)
[1] 15.8

> attach(mf)
> quantile(Algae)
0%  25%  50%  75% 100%
25   40   65   75   85
> detach(mf)

If your data contain NA elements, the result may show as NA. You can overcome this in many (but not all) commands by adding na.rm = TRUE as a parameter.

> nad
[1] NA  3  5  7  5  3  2  6  8  5  6  9 NA NA
> mean(nad)
[1] NA
> mean(nad, na.rm = TRUE)
[1] 5.363636

T-test

Student’s t-test is a classic method for comparing mean values of two samples that are normally distributed (i.e. they have a Gaussian distribution). Such samples are described as being parametric and the t-test is a parametric test. In R the t.test() command will carry out several versions of the t-test.

t.test(x, y, alternative, mu, paired, var.equal, …)
  • x – a numeric sample.
  • y – a second numeric sample (if this is missing the command carries out a 1-sample test).
  • alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
  • mu – the true value of the mean (or mean difference). The default is 0.
  • paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
  • var.equal – the default is var.equal = FALSE. This treats the variance of the two samples separately. If you set var.equal = TRUE you conduct a classic t-test using pooled variance.
  • … – there are additional parameters that we aren’t concerned with here.

In most cases you want to compare two independent samples:

> mow
[1] 12 15 17 11 15
> unmow
[1] 8 9 7 9

> t.test(mow, unmow)

Welch Two Sample t-test

data:  mow and unmow
t = 4.8098, df = 5.4106, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.745758 8.754242
sample estimates:
mean of x mean of y
14.00      8.25
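
If you are willing to assume the two samples have equal variances you can run the classic (pooled variance) version of the test by adding var.equal = TRUE; the output is then labelled "Two Sample t-test" rather than "Welch Two Sample t-test":

> t.test(mow, unmow, var.equal = TRUE)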

If you specify a single variable you can carry out a 1-sample test by specifying the mean to compare against:

> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> t.test(data1, mu = 5)

One Sample t-test

data:  data1
t = 0.55902, df = 10, p-value = 0.5884
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.914249 6.813024
sample estimates:
mean of x
5.363636

If you have matched pair data you can specify paired = TRUE as a parameter (see more in 14.4).

U-test

The U-test is used for comparing the median values of two samples. You use it when the data are not normally distributed, so it is described as a non-parametric test. The U-test is often called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum test), hence in R the command is wilcox.test().

wilcox.test(x, y, alternative, mu, paired, …)

  • x – a numeric sample.
  • y – a second numeric sample (if this is missing the command carries out a 1-sample test).
  • alternative – how to compare means, the default is “two.sided”. You can also specify “less” or “greater”.
  • mu – the true value of the median (or median difference). The default is 0.
  • paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
  • … – there are additional parameters that we aren’t concerned with here.

For example, to compare two numeric samples called Grass and Heath:

> wilcox.test(Grass, Heath)

Wilcoxon rank sum test with continuity correction

data:  Grass and Heath
W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0

If there are tied ranks you may get a warning message about exact p-values. You can safely ignore this!

Paired tests

The t-test and the U-test can both be used when your data are in matched pairs. Sometimes this kind of test is also called a repeated measures test (depending on circumstance). You can run the test by adding paired = TRUE to the appropriate command.

Here is an example where the data show the effectiveness of greenhouse sticky traps in catching whitefly. Each trap has a white side and a yellow side. To compare white and yellow we can use a matched pair test.

> mpd
white yellow
1     4      4
2     3      7
3     4      2
4     1      2
5     6      7
6     4     10
7     6      5
8     4      8

> attach(mpd)
> t.test(white, yellow, paired = TRUE, var.equal = TRUE)

Paired t-test

data:  white and yellow
t = -1.6567, df = 7, p-value = 0.1415
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.9443258  0.6943258
sample estimates:
mean of the differences
-1.625

> detach(mpd)

You can do a similar thing with the wilcox.test().
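
For example, a paired U-test on the same sticky trap data (with mpd attached as before) would be:

> wilcox.test(white, yellow, paired = TRUE)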

Chi Squared tests

Tests for association are easily carried out using the chisq.test() command. Your data need to be arranged as a contingency table. Here is an example:

> bird
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

In this dataset the columns form one set of categories (habitats) and the rows form another set (bird species). In the original spreadsheet (CSV file) the first column contains the bird species names; these are “converted” to row names when the data are imported:

> bird = read.csv(file = "birds.csv", row.names = 1)

You can carry out a test for association with the chisq.test() function. This time though, you should assign the result to a named variable.

> cs = chisq.test(bird)
Warning message:
In chisq.test(bird) : Chi-squared approximation may be incorrect

> cs

Pearson’s Chi-squared test
data:  bird
X-squared = 78.274, df = 20, p-value = 7.694e-09

In this instance you get a warning message (this is because there are expected values < 5). The basic “result” shows the overall significance but there are other components that may prove useful:

> names(cs)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
[7] "expected"  "residuals" "stdres"

You can view the components using the $ like so:

> cs$expected
                 Garden  Hedgerow  Parkland   Pasture  Woodland
Blackbird     59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch     17.203390  3.145763  6.783051 1.2779661 0.5898305
Great Tit     39.745763  7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305  8.027119 17.308475 3.2610169 1.5050847
Robin          8.305085  1.518644  3.274576 0.6169492 0.2847458
Song Thrush    5.932203  1.084746  2.338983 0.4406780 0.2033898

Now you can see the expected values. Other useful components are $residuals and $stdres, which are the Pearson residuals and the standardized residuals respectively.

You can also use square brackets to get part of the result e.g.

> cs$stdres[1, ]
    Garden   Hedgerow   Parkland    Pasture   Woodland
-3.2260031 -0.3771774  4.7468743 -1.4651818 -0.0471460

Now you can see the standardized residuals for the Blackbird row.

Yates’ correction

When using a 2 x 2 contingency table it is common practice to reduce the Observed – Expected differences by 0.5 (Yates’ correction). The chisq.test() command applies this correction by default for 2 x 2 tables via the correct parameter, which you can state explicitly e.g.

> your.chi = chisq.test(your.data, correct=TRUE)

If your table is larger than 2 x 2 then the correction will not be applied (the basic test will run instead) even if you request it. For a 2 x 2 table you can set correct = FALSE if you want the uncorrected test.

Goodness of Fit test

A goodness of fit test is a special kind of test of association. You use it when you have a set of data in categories that you want to compare to a “known” standard. A classic example is in genetics, where the theory suggests a particular ratio of phenotypes:

> pea
[1] 116  40  31  13

> ratio
[1] 9 3 3 1

Here you have counts of pea plants (there are 4 phenotypes) from an experiment in cross-pollination. The genetic theory suggests that your results should be in the ratio 9:3:3:1. Are these results in that ratio? The goodness of fit test will tell you.

> gfit = chisq.test(pea, p = ratio, rescale.p = TRUE)
> gfit

Chi-squared test for given probabilities
data:  pea
X-squared = 1.4222, df = 3, p-value = 0.7003

Notice that the rescale.p = TRUE parameter was used to ensure that the expected probabilities sum to one. The final result is not significant, meaning that the peas observed are not statistically significantly different from the expected ratio.
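
The rescaling simply divides each value by the grand total (16 in this case), something you can check for yourself:

> ratio / sum(ratio)
[1] 0.5625 0.1875 0.1875 0.0625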

The result contains the same components as the regular chi squared test:

> gfit$expected
[1] 112.5  37.5  37.5  12.5
> gfit$stdres
[1]  0.4988877  0.4529108 -1.1775681  0.1460593

Working with data in R

In this section you’ll learn a bit about:

  • Making data items.
  • Importing data from disk.
  • Rearranging and managing datasets.

Combine values command

The c() command is used for joining things together to make longer things!

> data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
> data1
[1] 3 5 7 5 3 2 6 8 5 6 9

The items you want to join are listed in the parentheses and separated with commas. The items are joined in the order they appear in the command.

Typing in values using scan()

The c() command is useful but it can be a pain to type all the commas! The scan() command allows you to enter data from the keyboard without commas. Here are some of the possible parameters:

scan(what, sep, dec = ".")
  • what – the default is to expect numbers but you can also choose what = "character" for text.
  • sep – the default is to use whitespace to separate elements; use sep = "," for commas or sep = "\t" for tabs.
  • dec – the decimal point character defaults to a full stop (period).

To use scan() you start by assigning a name to “hold” the result. At the same time you tell the command the type of data to expect and if the elements are separated by anything other than spaces.

> mydata = scan()
1:

R now “waits” and you will see the 1: instead of the regular cursor. You can now type your data (in this example it will be expecting numbers).

1: 2 5 6.2 33 25 1.3 8

If you press <enter> then R continues on a new line but still waits:

8:

Type in more data if you like:

8: 111

To stop adding values you press <enter> on a blank line (in other words press <enter> twice).

9:
Read 8 items

You can now see your data by typing the name you assigned to it:

> mydata
[1]   2.0   5.0   6.2  33.0  25.0   1.3   8.0 111.0

If you want to type text values, then use what = “character” as a parameter:

> myscan = scan(what = "character")
1: jan feb mar apr
5: may jun
7:
Read 6 items
> myscan
[1] "jan" "feb" "mar" "apr" "may" "jun"

Note that you don’t have to use quotes around the items as you type (this is why you used what = "character").

Read the clipboard

The scan() command can read the clipboard, which means you can copy and paste items from other programs (or web pages). Use the sep parameter to match the separator.
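
For example, on Windows you can copy a column of cells from a spreadsheet and read it straight in (a sketch for Windows only; other systems access the clipboard differently):

> mydata = scan(file = "clipboard", sep = "\t")  # numbers copied from a spreadsheet column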

Importing data

The next step is to get your data into R. If you have saved your data in a .CSV file then you can use the read.csv(filename) command to get the information. You need to tell R where to store the data and to do this you assign it a name. All names must contain at least one letter (otherwise it would simply be a number of course!). Remember, you can use a period (e.g. test.1) or underscore (e.g. test_1) but no other punctuation marks. R is case sensitive so the variable test is different from Test or teSt.

The read.csv() command has various parameters that allow you control over how and what is imported, here are some of the essentials:

read.csv(file, header = TRUE, sep = ",", dec = ".", row.names)
  • file – the filename of the data you want to read in. Give this in quotes, with the full path e.g. "My Documents/mydata.csv". Alternatively you can use file.choose() to bring up a file explorer, allowing you to choose the file.
  • header – if set to TRUE (the default) the first row of the CSV is used as column labels. You can set this to FALSE, then R will assume all the data are data and assign standard names for the columns (e.g. V1, V2, V3).
  • sep – this sets the separator between "fields", given as a character in quotes. For a CSV this is of course a comma (this is the default). If you have tab separated data use sep = "\t".
  • dec – this sets the decimal point character. The default is a full stop (period) dec = "." but other characters are possible.
  • row.names – this allows a column of data to be used as labels (i.e. row names). Usually this would be the first column e.g. row.names = 1, but any column number is allowed.

Here are some examples:

To get a file into R with basic columns of data and their labels use:

> mydata = read.csv(file.choose(), header = TRUE)

To get a file into R with column headings and row headings use:

> mydata = read.csv(file.choose(), row.names = 1)

To read a CSV file from a European colleague where commas were used as decimal points and the data were separated by semicolons:

> mydata = read.csv(file.choose(), dec = ",", sep = ";")

On Linux machines file.choose() may not bring up a graphical file browser, so you may prefer to give the full file and path name (in quotes):

> mydata = read.csv(file = "My Documents/myfile.CSV")

N.B. There are occasions when R won’t like your data file! Check the file carefully. In some cases, adding an extra blank line (newline) at the end will sort out the issue. To do this open the file in a text editor, make sure that non-printing characters are displayed, add the extra newline at the end and save the file.

Seeing your data in R

To view your data once it is imported you simply type the name you assigned to it.

> bird = read.csv(file = file.choose(), row.names = 1)
> bird
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

In this case the first column of data was set to act as row names, so when you view the data you see those names. In other cases you do not have explicit row names, in which case R assigns a simple index number just to help you navigate the dataset.

> mf = read.csv(file = file.choose())
> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120
7     19    17    65 1.85  95
8     16    14    65 1.75 168

Notice that each column has a name (taken from the CSV) but that simple numbers are used as a “row index”.

View individual data columns

The variable name you assign to data that you import “covers” the overall data. However, R cannot pick out individual columns (variables) in your data without a bit of help.

> Algae
Error: object ‘Algae’ not found

There are several ways to overcome this, here are two of them:

  • Use the $ and specify the overall data and column.
  • Use the attach() command to “open up” the data.

The $ is a general part of the R language and allows you to specify a variable inside another named object like so:

> mf$Algae
[1] 40 45 45 80 75 65 65 65

The attach() command allows the names of the columns in your data to be “seen” by R:

> attach(mf)
> Speed
[1] 12 14 12 16 20 21 17 14

But beware! You may have other data with the same names. To avoid “confusion” you should use detach() after you are done:

> detach(mf)
> Algae
Error: object ‘Algae’ not found

View individual data rows

The $ syntax allows you to “pick out” named elements of a dataset (e.g. the columns) but cannot help you extract individual rows. Instead you can use square brackets to extract one or more elements from a larger item.

data[rows, columns]

If your data are two-dimensional (i.e. have rows and columns) then you always specify the rows first and then the columns. If you leave out one of these, you get “everything”.

> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120
> mf[1, ]
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200

If you want several rows, then indicate that in the []:

> mf[2:5, ]
  Length Speed Algae  NO3 BOD
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110

The c() command can be helpful here if you want non-sequential rows:

> mf[c(1,3,5), ]
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
3     22    12    45 1.75 135
5     21    20    75 1.95 110

You can also specify certain columns too, in a similar manner.
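
For example, to see just the Speed and Algae columns for the first three rows:

> mf[1:3, c("Speed", "Algae")]
  Speed Algae
1    12    40
2    14    45
3    12    45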

Transposing data frames

Sometimes you need to rotate your data so that the rows and columns are switched; the t() command can do this:

> fw
         count speed
Taw          9     2
Torridge    25     3
Ouse        15     5
Exe          2     9
Lyn         14    14
Brook       25    24
Ditch       24    29
Fal         47    34

 

> t(fw)
      Taw Torridge Ouse Exe Lyn Brook Ditch Fal
count   9       25   15   2  14    25    24  47
speed   2        3    5   9  14    24    29  34
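
Note that t() returns a matrix rather than a data frame, so if you need a data frame again you can convert the result (fw2 is just a name chosen for illustration):

> fw2 = as.data.frame(t(fw))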

Missing values

When you import data from a file (e.g. CSV) R “tidies up” and makes the resulting data a nice rectangular shape. If any data column is shorter than the others it will be padded with NA items.

> grass2
  mow unmow
1  12     8
2  15     9
3  17     7
4  11     9
5  15    NA

In this example the mow column has 5 elements but the unmow column contains only 4. R has “interpreted” the empty cell as a missing value and replaced it with NA (think of it as “not available”).

The NA item is important in data analysis. In this case the NA is obviously simply the result of a “short” column. In other datasets NA may well represent missing values (R will replace any missing cells of a spreadsheet with NA). Some statistical tests will return a result of NA if the data contains NA items. In most cases you may ignore the NA values by including the parameter na.rm = TRUE.
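
For example, using the grass2 data above:

> mean(grass2$unmow)
[1] NA
> mean(grass2$unmow, na.rm = TRUE)
[1] 8.25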

Stacking data

Sometimes you need to reorganize your data and convert from one kind of recording layout to another. For example:

> flies
    C  F F.G  G  S
1  75 58  58 57 62
2  67 61  59 58 66
3  70 56  58 60 65
4  75 58  61 59 63
5  65 57  57 62 64
6  71 56  56 60 62
7  67 61  58 60 65
8  67 60  57 57 65
9  76 57  57 59 62
10 68 58  59 61 67

This dataset shows five columns. Each column shows repeated measurements of the size of fly wings under a specific diet. If you want to compare the diets, you’ll need to carry out analysis of variance but R will need the data in a different form.

  size diet
1   75    C
2   67    C
3   70    C
4   75    C
5   65    C
6   71    C

In this layout the numerical data are all in one column, called size. This is the response variable (sometimes called dependent variable). The second column is headed diet, and is a grouping variable (each size measurement is related to a group), where the names of the groups are related to the column headings in the previous layout. This column is the predictor variable (also called independent variable).

The stack() command allows you to reassemble data from a multi-sample layout into a stacked layout.

> fly = stack(flies)
> head(fly)
   values ind
1      75   C
2      67   C
3      70   C
4      75   C
5      65   C
6      71   C

Notice that the numerical variable is called values and the grouping variable is called ind.

Selecting columns

You do not have to use all the columns when you stack() the data; use the select parameter to choose:

> stack(flies, select = c("F", "G", "S"))
   values ind
1      58   F
2      61   F
3      56   F
4      58   F
5      57   F
6      56   F

Naming the stacked columns

You can carry out your analyses using the names R assigned when it did the stack(). If you want to use different names then use the names() command:

> names(fly) = c("size", "diet")
   size diet
1   75    C
2   67    C
3   70    C

Note that you give the names wrapped in a c() command and the names themselves in quotes.

Unstack

The opposite of the stack() command is unstack(). This reassembles the data so that each grouping level is matched up with the appropriate values. If there are equal replicates in each grouping, then your result is a neat data frame:

> unstack(fly)
    C  F F.G  G  S
1  75 58  58 57 62
2  67 61  59 58 66
3  70 56  58 60 65
4  75 58  61 59 63

If the replication is unbalanced the result is several individual elements bundled together in a list.

> grass
  rich graze
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow
> grasses = unstack(grass)
> grasses
$mow
[1] 12 15 17 11 15

$unmow
[1] 8 9 7 9

In this example the grass dataset shows that there are 5 replicates for mow and 4 for unmow.

What data are loaded?

After a while you may well end up with quite a few items in R. It is easy to lose track of what data are loaded (in computer memory) into your R console (the name of the main window where you type stuff). The ls() command shows you what items are currently available:

> ls()
 [1] "algae"       "all.samples" "ans1"        "ans2"        "ans3"
 [6] "ans4"        "ans5"        "ans6"        "b"           "barley"
[11] "barplot.eb"  "bats"        "bbel"        "mf"

Note how the “index” on the left helps you to see how many things there are.

You can also see what items are contained in each object by using ls() and specifying an object:

> ls(mf)
[1] "Algae"  "BOD"    "Length" "NO3"    "Speed"

Removing data items

The rm() command allows you to remove items from the computer memory and so also from the R console. Just place the names of the items to “delete” in the parentheses, separated by commas:

rm(name1, name2, name3, …)

On Windows and Mac computers the GUI also has toolbar menus you can click to remove everything. Use this with caution as there is no undo button!
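
The command line equivalent of removing everything is shown below; treat it with the same caution, as there is no undo:

> rm(list = ls())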

Getting started with R

On this page you’ll learn some of the basics:

  • Getting started with R.
  • How to do basic maths and calculations.
  • How to store results.

Getting started with R – The basics

Once you have installed R and run the program you will see an opening window and a message along these lines:

R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type ‘contributors()’ for more information and
‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or
‘help.start()’ for an HTML browser interface to help.
Type ‘q()’ to quit R.

[Previously saved workspace restored]

>

The > is the “prompt”; this is the point where you type in commands (or paste them in from somewhere else). The window you see is part of the GUI and some operations are possible from the menus (including quit). When you quit you will generally be asked if you wish to save the workspace. R stores a list of commands and any data sets that are loaded, and it can be pretty useful to say “yes” and save the workspace.

The command history is available by using the up and down arrows: you can easily scroll back through previous commands and edit them if needed. You can copy items from previous commands (or in fact from any window on the screen) and paste them into the current command line. You can also use the left and right arrow keys to move through the current command.

Simple maths

You can think of R as a big calculator. You type in some maths and get the results.

> 1 + 2 * 3
[1] 7

Note that the calculations are carried out in the “standard” manner, with multiplications and divisions before additions or subtractions. Anything in brackets (parentheses) is computed before things outside:

> (1 + 2) * 3
[1] 9

Your result is displayed immediately. Often you will want to store the results to use later, so you need to assign a name to the result.

> result1 = 12 + 17 / 2
> result1
[1] 20.5

Notice that you do not see the result right away; you must type the name to view it. The result can be used for more calculations:

> result2 = result1 * 4 / 7
> result2
[1] 11.71429

Object names

You can use all the “regular” letters of the alphabet a-z (or A-Z) as well as numbers 0-9. R is case sensitive so the following names are different!

result1
Result1
RESULT1

You can also use the full stop (period) and underscore characters. You must start a name with a letter and not a number.

Assignment

In the examples here I used = to “define” a result. You will often see <- instead, which looks a bit like an arrow. This is the “preferred” way to assign names to results and you’ll see it in the R help entries and examples.

> name1 = 17 * 41 / 94
> name2 <- 17 * 41 / 94
> name1
[1] 7.414894
> name2
[1] 7.414894

Comparing diversity

Exercise 12.1.4.

Statistics for Ecologists (Edition 2) Exercise 12.1.4

This exercise is concerned with comparing diversity (Section 12.1.4) and in particular how to carry out a modified version of the t-test designed to compare the Shannon diversity index of two samples.

Comparing the Shannon diversity index of two samples using the Hutcheson t-test

Introduction

When you have two samples of community data you can calculate a diversity index for each one. The Shannon diversity index is a commonly used measure of diversity. However, you cannot compare the two index values using classic hypothesis tests because you do not have replicated data.

The Hutcheson t-test is a modified version of the classic t-test that provides a way to compare two samples. The key is the formula that determines the variance of the Shannon index. These notes will show you how to conduct the Hutcheson t-test and so assess the statistical significance of the difference in Shannon diversity between two samples. There is also a spreadsheet calculator that you can download and use for your own data.

The Hutcheson t-test

The Hutcheson t-test was developed as a method to compare the diversity of two community samples using the Shannon diversity index (Hutcheson 1970, J. Theor. Biol. 29 p.151). The basic formula is similar in appearance to the classic t-test formula.

Formula for the Hutcheson t-test.

In the formula H represents the Shannon diversity index for each of the two samples (subscripted a and b). The bottom of the formula refers to the variance of each of the samples.
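
Written out in plain text (a paraphrase rather than the book’s own typesetting, with S²H standing for the variance of the Shannon index in each sample), the formula is essentially:

t = (Ha - Hb) / √(S²Ha + S²Hb)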

Computing variance for the Shannon diversity index

Computing the variance of the Shannon diversity is done using the formula shown below.

Formula to calculate variance in Shannon diversity. S is the species richness; N is the total number of individuals.

In the formula S is the total number of species, whilst N is the total abundance. The p is the proportion that each species contributes towards the total. The formula looks fairly horrid but the components are easily computed.
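
Written out in plain text (again a paraphrase, using the column totals that appear in the spreadsheet example below):

S²H = [ ∑P*Ln(P)² - (∑P*Ln(P))² ] / N + (S - 1) / (2N²)

As a rough check, the totals in the example below give (2.934 - 1.606)/715 + 16/(2 × 715²) ≈ 0.002, which matches the S2H value shown in the spreadsheet.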

Getting a t-value

It is easiest to use a spreadsheet for the calculations. The computations are relatively straightforward. You can use the example spreadsheet Shannon diversity t-test calculator.xlsx rather than make your own! Here is an example of how the calculations appear in the spreadsheet.

Example of calculations in Shannon diversity t-test spreadsheet.

Site Name: E1
Taxon Count P Ln(P) P*Ln(P) P*Ln(P)^2
388 0.543 -0.611 -0.332 0.203
1 0.001 -6.572 -0.009 0.060
2 0.003 -5.879 -0.016 0.097
13 0.018 -4.007 -0.073 0.292
3 0.004 -5.474 -0.023 0.126
1 0.001 -6.572 -0.009 0.060
59 0.083 -2.495 -0.206 0.514
3 0.004 -5.474 -0.023 0.126
2 0.003 -5.879 -0.016 0.097
2 0.003 -5.879 -0.016 0.097
210 0.294 -1.225 -0.360 0.441
4 0.006 -5.186 -0.029 0.150
3 0.004 -5.474 -0.023 0.126
1 0.001 -6.572 -0.009 0.060
1 0.001 -6.572 -0.009 0.060
21 0.029 -3.528 -0.104 0.366
1 0.001 -6.572 -0.009 0.060
0
0
0
Total 715 1 -83.972 -1.267 2.934
Richness 17
SS 2.934
SQ 1.606
H 1.267
S2H 0.002

 

The species names are not essential but the spreadsheet has room for them. You need the total abundance in order to calculate the proportions. To compute the Shannon diversity you need the Ln(P) and P*Ln(P) columns (H is minus the sum of the P*Ln(P) column), and to assist in the variance calculations it is convenient to calculate the final P*Ln(P)^2 column, which you can also see in the table.

Using the spreadsheet for calculations

The spreadsheet Shannon diversity t-test calculator.xlsx will carry out the calculations for you. There are worksheets for two samples. You can enter a site name and your abundance values as well as species names if you like. The spreadsheet is protected (although there is no password) but you can add additional rows to accommodate extra data (row 23 is empty and that’s where you can start adding).

I advise that you don’t remove the worksheet protection, as it is easy to inadvertently change a formula! I’ve designed the spreadsheet so you can add extra rows without needing to remove the protection.

Enter the data for the two samples and the final t-test results will appear in the Results worksheet.

Calculating degrees of freedom

Once you have a value for t you need to determine if it is statistically significant. In order to do that you’ll need to work out the degrees of freedom. This is computed using the following formula.

Formula to calculate the degrees of freedom in the Hutcheson t-test.

In the formula you need the variance for each sample and the total abundance for each sample. The final value is close to the total abundance for the two samples added together.
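
Written out in plain text (again a paraphrase): df = (S²Ha + S²Hb)² / [ (S²Ha)²/Na + (S²Hb)²/Nb ], where Na and Nb are the total abundances of the two samples and S²H is the variance of the Shannon index as before.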

Assessing statistical significance

Once you have your value for t and you have the appropriate degrees of freedom you can determine if the result is statistically significant. You can use the degrees of freedom to look up a critical value. If your calculated t-value exceeds the critical value then your result is significant.

You can use Excel to calculate a critical value using the =TINV function. You need the level of significance (usually 0.05) and the degrees of freedom.

You can also determine the probability directly using the =TDIST function. You need your calculated t-value and the degrees of freedom.
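
If you prefer R to the spreadsheet, a minimal sketch of the whole calculation, based on the formulas above, might look like the following (this is my own illustration, not code from the book, and the function name hutcheson.t is made up):

hutcheson.t = function(a, b) {
  # a and b are vectors of abundance counts for the two samples
  a = a[a > 0]; b = b[b > 0]        # drop zero counts (log(0) is undefined)
  H = function(x) { p = x / sum(x); -sum(p * log(p)) }
  varH = function(x) {
    p = x / sum(x); N = sum(x); S = length(x)
    (sum(p * log(p)^2) - sum(p * log(p))^2) / N + (S - 1) / (2 * N^2)
  }
  t = (H(a) - H(b)) / sqrt(varH(a) + varH(b))
  df = (varH(a) + varH(b))^2 / (varH(a)^2 / sum(a) + varH(b)^2 / sum(b))
  c(t = t, df = df, p.value = 2 * pt(-abs(t), df))
}

You would call it with two vectors of counts, e.g. hutcheson.t(siteE1, siteE2), where siteE1 and siteE2 stand in for your own data.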

The Shannon diversity t-test calculator.xlsx spreadsheet will display all the t-test results in the Results worksheet. The spreadsheet will also calculate confidence intervals and produce a couple of graphs (which you can edit).

Computing confidence intervals

It is useful to be able to calculate confidence intervals to your results. The confidence intervals allow you to compare multiple samples and to visualize how “different” samples are to one another.

You can compute a reasonable confidence interval using the standard deviation of the Shannon index (that is the square root of the variance) and multiplying by 2. With most communities you will have fairly large degrees of freedom and the critical value (for t) will approach 2 (the critical value for t at infinity is 1.96).
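
In other words, for each sample the interval is roughly H ± 2 × √(S²H).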

Once you have your confidence interval(s) you can add them to a graph to show the variability and possible overlap between samples. The Shannon diversity t-test calculator.xlsx will produce two charts (a bar chart and a point chart) with error bars based on the confidence intervals. You can edit the charts.

Graphing the results

It is useful to chart your results; the Shannon diversity t-test calculator.xlsx will produce two graphs, which you can edit. The error bars are based on the confidence interval.

Two ways to represent the difference between Shannon diversity. Error bars are 95% confidence intervals.

The bar chart is a “classic” form of graph to show differences between samples. The point chart is a good alternative, as it allows you to see differences between samples more easily, especially if you customize the y-axis scale. The chart shown here would benefit from having the axis run from 1.0 to 1.6 for example.

Abundance-based dissimilarity metrics

Exercise 12.2.2.

Statistics for Ecologists (Edition 2) Exercise 12.2.2

In these notes you’ll see some of the more commonly used measures of dissimilarity used when you have abundance data.

Abundance-based dissimilarity metrics

Introduction

When your community data samples include abundance information (as opposed to simple presence-absence) you have a wider choice of metrics to use in calculating (dis)similarity. When you have presence-absence data you use the number of shared species (J) and the species richness of each sample (A & B). Measures of (dis)similarity obtained are therefore slightly “crude”.

When you have abundance data your measures of (dis)similarity are a bit more “refined” and you have the potential to pick up patterns in the data that you would otherwise not see using presence-absence data.

There are many metrics that you might use to explore (dis)similarity; in this exercise you’ll see four of the more commonly used ones:

  • Bray-Curtis
  • Canberra
  • Manhattan (City Block)
  • Euclidean

The exercise shows you how you can carry out the calculations using Excel (in the book you also see how to do this using R). You can get the sample spreadsheet here: Distance metrics.xlsx.

Notation

When you have presence-absence data you use the species richness of each sample, calling these values A and B. The number of shared species is designated J. When you have abundance data you use a slightly different notation. The abundance of an individual species is given as x. Now you are comparing two samples so you call one i and one j. Subscripts are used to differentiate the abundances between the two samples, so you get xi and xj.

Generally speaking it does not matter which sample is i and which is j.

You’ll also see the ∑ symbol (meaning sum, you add things together) and the vertical bars e.g. |x – y|, which indicate that you ignore any negative sign and simply use the magnitude of the result.

Bray-Curtis

The Bray-Curtis metric uses two components: |xi – xj| and (xi + xj).

In the first case you subtract the abundance of one species in a sample from its counterpart in the other sample but ignore the sign. The second component is the abundance of a species in one sample added to the abundance of its counterpart in the second sample. If a species is absent, then its abundance should be recorded as 0 (zero).

The equation for the Bray-Curtis dissimilarity is shown below:
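
Written in plain text (and consistent with the =SUM(column4) / SUM(column5) calculation described below), the dissimilarity is:

Bray-Curtis = ∑|xi – xj| / ∑(xi + xj)

with the sums taken over all the species.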

Bray-Curtis dissimilarity is easy to calculate in Excel but there are no shortcuts to working out the main components, |xi – xj| and (xi + xj).

You might have your data arranged in rows or columns; in the following example the data are in columns.

Species Si Sj |xi – xj| (xi + xj)
Spp1 6 4 2 10
Spp2 5 3 2 8
Spp3 7 4 3 11
Spp4 2 6 4 8
Spp5 3 0 3 3

 

So, the 4th column is easily calculated by subtracting values in the 3rd column from the 2nd. Use the =ABS function to get the absolute value (i.e. ignore the sign). So for example the 4th data row would be 2 – 6 = -4 but the ABS function ignores the -.

The last column is even easier, as it is the sum of a value in the 2nd column with its counterpart in the 3rd column.

You don’t need to compute the sum of the 4th or 5th columns because your final calculation can do that in one go:

=SUM(column4) / SUM(column5)

You can use the mouse to highlight the values in the columns in place of the column4 and column5 items in the formula (which represent the cell ranges).

If you calculate the Bray-Curtis dissimilarity for these data, you get a result of 0.350.

If your data were arranged in rows the calculations are more or less the same (the spreadsheet shows both methods).

Canberra

The Canberra dissimilarity uses the same components as Bray-Curtis but the components are summed differently:

In the Canberra metric each |xi – xj| result is divided by the corresponding (xi + xj) value.
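
Written in plain text (matching the worked example below), the metric is:

Canberra = ∑[ |xi – xj| / (xi + xj) ]

with one term for each species.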

In the following example the data are shown in rows but the calculations are more or less the same for data in columns (the spreadsheet shows both).

Calc Spp1 Spp2 Spp3 Spp4 Spp5
Si 6 5 7 2 3
Sj 4 3 4 6 0
|xi – xj| 2 2 3 4 3
(xi + xj) 10 8 11 8 3

 

This time you need to evaluate each |xi – xj| ÷ (xi + xj) and then add the results together.

This means you need to work out 2/10 + 2/8 + 3/11 + 4/8 + 3/3 for the current example. There is no shortcut for this (obviously you will use the cell references) so when you have a lot of species the formula is quite long. You can get around this by making a new row (or column, depending on how your data are arranged) for the |xi – xj| ÷ (xi + xj) values. You can then copy/paste the formula along that row and do a final =SUM to get the Canberra distance.

In this case the result works out to be 2.223.

Manhattan (City Block)

The Manhattan (or City Block) dissimilarity uses the |xi – xj| component only.

This is the simplest dissimilarity metric to compute:

Manhattan (City Block) dissimilarity.
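
Written in plain text it is simply:

Manhattan = ∑|xi – xj|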

You can use the =ABS function to ignore any negative signs (and retain the value only). Then the =SUM function can simply total them to give the final result. In this case you get: 2 + 2 + 3 + 4 + 3 = 14.

As you can see, City Block dissimilarity is easy to work out using Excel.

Euclidean

The Euclidean dissimilarity metric uses a different component (xi – xj) from the other metrics.

Here the abundance of a species from one sample is subtracted from its counterpart in the other sample. Instead of ignoring the sign, the result is squared (which gives a positive value):
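
Written in plain text (and consistent with the √42 worked example below):

Euclidean = √( ∑(xi – xj)² )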

In this case you can use the =SUMXMY2 function to eliminate any intermediate calculation steps (although you can of course do those intermediate steps if you like).

=SUMXMY2(range1, range2)

The function takes two ranges of cells, subtracts each value in one range from its counterpart in the other, squares the result and then totals the squares. This is exactly what you want for the Euclidean dissimilarity. All you need to do is use =SUMXMY2 and then take the square root (=SQRT) of the result. You can of course do this in one formula.

The spreadsheet gives an example of the calculation in action for data in columns and in rows (there is little difference).

If you have already calculated the |xi – xj| values, then you can use them. Simply square each one and add the squares together (the =SUMSQ function will do that), then take the square root. You might as well use the =SUMXMY2 function though!

In the example here the |xi – xj| values are 2, 2, 3, 4, 3.

Their squares would be 4, 4, 9, 16, 9 giving a total of 42.

The final Euclidean dissimilarity is √42 = 6.481.

Which metric to use?

Good question! There is no definite “correct” metric to use in most cases. As a rule of thumb you should use the metric that gives the best separation of samples. This runs a little counter to the idea in statistics that you decide on the proper test to apply before you begin. However, this is not a statistical “test” but a method of separating samples into meaningful groupings. Your distance values should be visualized using a dendrogram (see this exercise on drawing dendrograms in Excel). You choose the metric that gives the most useful results.

The choice of dissimilarity metrics is something that occasionally sparks low-level holy wars amongst ecologists. Hopefully you will use several and the results will all give broadly the same ecological meaning!

Example data and files

You can get examples from the book from the support pages. There are also some specific examples used for the exercises in addition to those shown in the book. You can also download the spreadsheet file Distance metrics.xlsx, which shows how to carry out the calculations using Excel.

Visualizing similarity

Exercise 12.2.1.

Statistics for Ecologists (Edition 2) Exercise 12.2.1

This exercise is concerned with looking at similarity between ecological communities (Section 12.2). This exercise shows you how to visualize the similarity between several communities using a dendrogram drawn using Excel.

Visualizing similarity. Using Excel to build a (dis)similarity dendrogram

Introduction

When you have two or more ecological communities you can use the presence-absence of species (or abundance information, if you have it) to determine measures of similarity (or the corollary, dissimilarity). In the case of presence-absence data you use the species richness of each sample and the number of shared species to calculate an index of (dis)similarity. Once you have a matrix of (dis)similarity you can visualize the relationship between the community samples using a dendrogram. Think of it as being like a family tree, with communities most similar being “near” one another in the diagram.

You can carry out calculations for (dis)similarity using Excel, although things can get rather tedious when you have more than a few samples. There is no built-in chart type that will create a dendrogram in Excel so you must use other drawing tools. R is able to carry out the calculations and draw the dendrogram rather easily, but it is a worthwhile exercise to use Excel as it helps you understand how the (dis)similarity is “converted” to a dendrogram and therefore helps you to understand more clearly what you are looking at.

You can get the sample data here: Dendrogram Exercise.xlsx. There are three worksheets:

  • Raw Data – the raw data as a list of plant species for several sites.
  • Pivot and Similarity – contains a Pivot Table showing the presence-absence of species in each sample as well as the species richness and the Jaccard similarity calculations, with results as a similarity matrix. There is also a dissimilarity matrix (1-Jaccard), which will be used to draw the dendrogram.
  • Dissimilarity – the dissimilarity matrix by itself, you will use this as the basis for the construction of the dendrogram.

Although this exercise focusses on the Jaccard dissimilarity (presence-absence data) the principles apply to any matrix of (dis)similarity.

The raw data

The raw data (Dendrogram Exercise.xlsx) are arranged simply in biological recording layout, with a column for the sample/site name and a column for the species name.

Raw data for the dendrogram exercise.

Site Species
ML1 Achillea millefolium
ML1 Centaurea nigra
ML1 Lathyrus pratensis
ML1 Leucanthemum vulgare
ML1 Lotus corniculatus
ML1 Plantago lanceolata
ML1 Prunella vulgaris
ML1 Ranunculus acris
ML1 Trifolium repens

 

In this example the data are simply lists of species, that is presence-absence. If you had abundance information this would be in a separate column.

Rearrange data into samples

Using Excel, the Pivot Table tool is an easy way to rearrange your data. In the example spreadsheet the Pivot Table has already been created for you (the Pivot and Similarity worksheet). If you want to have a go for yourself then:

  1. Click once in the block of data, there is no need to highlight/select any data, just click once.
  2. Click the Insert > Pivot Table button.
  3. Choose to place the Pivot Table in a new worksheet and click OK. This opens the Pivot Table Fields dialogue window/box.
  4. Drag the fields from the list at the top to the appropriate sections at the bottom to create the table.

The table in the example requires the fields to be arranged like so:

This will produce a table with a list of species (the Rows) down the left. Each column will be a separate sample/site (the Columns). If you had abundance data you’d use this in the Values section, but you don’t, so place the Species field in the Values section to get a count (frequency): this essentially places a 1 if the species is present and a blank if it is absent.

You can use the Pivot Table Tools menus to help reformat and present the Pivot Table. Column totals give the species richness for each sample.

In the example shown here some of the rows have been hidden so you can see the table structure. The totals for the columns give species richness. The row totals are less useful, essentially you get the frequency of occurrence for each species (so the first species occurs in 50% of the samples: 5/10). Where a species is present you see a 1. If a species is absent you see a blank (but you can alter the settings/options to display 0 if you prefer).

Similarity calculations

Now you have the data rearranged by sample you can calculate the similarity (and/or dissimilarity) for each pair of samples. This can get a bit tedious in Excel because you cannot simply drag a formula across rows or columns (you need both). However, it is possible and the example spreadsheet shows the results. Have a look at the example and see how the calculations were performed.

It is easiest to make a matrix (table) of the shared species first, then a second matrix of similarity values from that: recall that the Jaccard index is J/(A+B-J), where J = shared species and A, B are the species richness of the two samples being compared.

It is best to end up with a matrix of dissimilarity (1 - similarity) as this will form the y-axis of the dendrogram.

A dissimilarity matrix (Jaccard).

The final dissimilarity matrix is what you’ll use to construct the dendrogram. In the example spreadsheet there is a copy in the Dissimilarity worksheet.

Excel tools for drawing

There are no built-in charts that you can coerce Excel to use for your dendrogram. You must use the matrix of (dis)similarity and the drawing tools. This is somewhat tedious but the results can be very good. The exercise also gives you a good insight into what the relationships between samples are.

The tools you need are in the Shapes button of the Insert menu (Excel 2013 and 2010).

The Insert > Shapes button provides the tools you need to construct a dendrogram in Excel.

You are going to need some kind of text box or shape (that you can enter the sample name into) to display the sample names and elbow connectors to join them together. Because of the way the elbow connectors work you’ll also have to use a small shape as a “node”; a kind of joining block if you like.

The dendrogram

Now you are ready to start constructing the Dendrogram.

Joining the first two samples

You’ll need to identify the first pair of samples to join together. These will be the pair that have the smallest dissimilarity. In our current example this is 0.42, which is the Jaccard dissimilarity between ML2 & MU2.

  • Make a text box or shape for the ML2 sample and place it somewhere near the middle/bottom of the screen. If you choose a plain text box you can display the sample name only, otherwise you can have a shape with the name inside.
  • Click on the box you created to activate the Drawing Tools menu. You can then format the object as you like.
  • Once you are happy you can copy to the clipboard and paste, to make a duplicate.
  • Edit the name of the sample to MU2 (click in the shape or right-click and select Edit Text).

Now you need to make a small shape to use as a node. A diamond works well.

  • Use the Insert > Shapes button to make a small shape.
  • Use the Drawing Tools menu to format the shape how you want.
  • Drag the node shape to a spot between the first two boxes and a little above them.

Now you need to join the node and the two sample boxes.

  • Use the Insert > Shapes button and select a plain elbow connector.
  • Hover over the ML2 shape and you’ll see “edge markers”, click the top one and hold down the mouse button.
  • Drag the mouse to the node shape, when you “arrive” you’ll see the edge markers for the node. You can then release the mouse button to “fix” the connector.

The elbow connector should automatically take the shape you want (an inverted L shape) but if not you can use the mouse to tweak the shape. Repeat the process to join the node to the other (MU2) shape with a fresh elbow connector.

You’ll find that if you click the node you can use the arrow keys (or the mouse) to reposition the node; the elbow connectors move too so that the connection remains intact. You are going to use the “crossbar” through the node to indicate the index of (dis)similarity. This will be done later but it is worth keeping in mind as you build the dendrogram.

I chose a rounded rectangle as a shape in this example but you could use a text box and display the label only (i.e. without a rectangle). Once you have the first pair “entered” it is a good idea to “mark off” the dissimilarity in the matrix of values. You can simply click the 0.42 value in the spreadsheet and make the text a different colour or alter the format in some other way.

Second connection

Now you’ve got the first pair joined, you want to scan the dissimilarity matrix for the next value in ascending order (i.e. the smallest of the remaining dissimilarities). There are subtly different methods of deciding which items to join next; here you are using the simplest, which is called single linkage.

If the next value would join two samples that are not yet on the dendrogram, then proceed as before and join these items together as a new pair. You can place them to one side or the other of the first pair and the crossbar will be a little higher, because it represents a larger dissimilarity. One of the difficulties about dendrograms is that you do not know how much room each item will require, so you may well have to move items around. You’ll see this situation later…

If the next largest dissimilarity joins a new sample to an existing pair, then you make a new shape for the sample label and join that to the pair already joined. You will still need another node shape. In this case that is exactly what you need to do as the next largest dissimilarity is 0.46, which is the MU1 & ML2 pairing. The MU1 sample is “new” but ML2 is already in the dendrogram. So:

  • Copy one of the sample shapes to the clipboard and paste it to a convenient spot.
  • Edit the text to represent the MU1 sample.
  • Drag the new MU1 shape to the left of the existing pair (you could go the other side but I chose left). You can place the MU1 shape in-line with the other pair or slightly higher. It is easier to line them up.
  • Now make a new node shape; just copy an existing one and paste it to a convenient spot. You want to move the node so that it lies between the new shape and the other pair, and a bit higher than the other crossbar.

Now you need to add elbow connectors:

  • Use Insert > Shapes and select an elbow connector.
  • Hover over the MU1 box and then click and hold over the top edge marker.
  • Drag the mouse to the new node and connect.
  • Now make a new elbow connector and join the new node to the previous one.

You can click the new node and move it to get the lines how you want.

The height of the “crossbars” will represent the level of dissimilarity. A new sample item is joined to an existing pair via the joining node(s).

Hopefully, you will now see why the nodes are required; you cannot join an elbow connector to another elbow connector.

Now you can “mark off” the dissimilarity (the 0.46 item) in the matrix and carry on building.

Existing pairings

Now look for the next largest dissimilarity, it is the 0.48 between MU1 & MU2. However, these items are already on the dendrogram. So, you can mark off the 0.48 item and move on to another.

This is the so-called single linkage method and it is the simplest to operate. An alternative is to use the average dissimilarity between potential pairings; this generally has little effect except to alter the position of the crossbars. Another method is “complete” linkage, where you look for all potential joins, join pairs when you can, and then join these pairings using the “spare” dissimilarities. It is easy to make errors with these alternative methods, so if you are using Excel (like here) stick to single linkage. If you use R then “complete” linkage is the default, although you can easily specify single linkage.
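
For comparison, here is a minimal R sketch of the same idea. The file name diss.csv is just an assumption for a square matrix of dissimilarities with the sample names as row and column headings:

diss.mat <- as.matrix(read.csv("diss.csv", row.names = 1))   # assumed file of dissimilarities
diss <- as.dist(diss.mat)                                    # convert the square matrix to a "dist" object
hc.single <- hclust(diss, method = "single")                 # single linkage, as in this exercise
hc.complete <- hclust(diss, method = "complete")             # R's default linkage
plot(hc.single, hang = -1)                                   # draw the dendrogram

The crossbar heights in the resulting plot correspond to the dissimilarities you have been marking off.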

Third connection (fourth item)

The next largest dissimilarity is 0.5, the MU1 & ML1 pairing. Since ML1 is not on the dendrogram you can add it in the same way you added the MU1 item.

The fourth item extends the dendrogram. Note that the crossbars denote dissimilarity.

Now you can mark off the 0.5 dissimilarity and keep on building.

Making room

At some point it is almost inevitable that you’ll need to move items around to make space for new ones. You can click a shape with the mouse to select it. If you hold the control key, you can select multiple objects. The trick is to be systematic; select the sample name shapes first then all the nodes. Then go for the elbow connectors, as you select these you should see some small (probably green) selection “knobs” appear at the ends of the elbow lines, these help to spot that you’ve selected them. Once you have got all the items you want try an arrow key and the items should move together.

Fourth connection

The next dissimilarity is 0.55 but this is between ML1 and MU2, which are already represented. Similarly, 0.59 (ML2 & ML1) is represented. The next “available” dissimilarity is 0.61 between SU1 and SU2. This will make a completely new pair.

Make new shapes for the SU1 and SU2 samples and join them as you did for the very first pair. Place them to the right of the existing block of samples. The crossbar needs to be a bit higher, as the dissimilarity is larger.

The latest pair does not connect to any existing samples. Eventually the blocks will be connected.

You can now check off the 0.61 item from the dissimilarity matrix.

More connections

The next largest dissimilarity is 0.66 between SL1 & SU1. This means that you’ll need to make a new shape for SL1 and join it to the pair on the right.

You can now check-off the 0.66 dissimilarity.

The next largest dissimilarity is 0.69, between SL2 & SU1. You need to make a new shape for SL2 and join it to the right-hand block.

Now check-off the 0.69 dissimilarity.

The next largest dissimilarity is 0.72, between PU2 and MU1. Now, MU1 is already represented so you’ll need to make a new shape for PU2 and join it to the left-hand block.

Now check-off the 0.72 dissimilarity.

The next largest is 0.75, which is there twice (if you are looking at 2 decimal places). Both pairings are already represented. Keep going, you’ll check off 0.76 and 0.77.

The next largest is 0.79, between PL2 and SL1. This means you’ll need to make a new shape for PL2, which you can place on the right and join to the right-hand block.

PL2 is the final sample you need to add but you still have two blocks of samples in the dendrogram. Use the top two node shapes and join them with a single elbow connector. Alternatively, make a new node in the middle and two connectors. This is not necessary but makes it a little easier to shift things around if you need to tweak the positions.

Now the dendrogram is completed. You can check that the remaining dissimilarity values are all accounted for, and you can see that all 10 samples in the data appear in the dendrogram.

Final touches

The dendrogram is completed but it would be good to have an axis to show the dissimilarity. The simplest way is to draw a vertical line and add some text annotations to show the dissimilarity. The gridlines in the worksheet are useful aids to help you align the crossbars.

Rearranging branches

The important thing about the dendrogram is the connections between the sample blocks and the nodes, as well as the position of the crossbars. You could draw the dendrogram with the samples in a different order. Essentially you can “reflect” or rotate the branches around any of the nodes; as long as the objects have the “correct” joins, the appearance can alter without changing the result.

Look at the following dendrogram, drawn using R.

Dendrogram produced using R.

It is essentially the same as the example drawn with Excel, but the branches are arranged subtly differently, which makes it appear different at first glance.

Logistic regression: model building

Exercise 11.3.2.

Statistics for Ecologists (Edition 2) Exercise 11.3.2

This exercise is concerned with how to build a logistic regression model (Section 11.3.2). The processes are very similar to those described in Section 11.1.2, where you saw a stepwise building process used with regular multiple linear regression. The data for this exercise are available as a separate file newt regression.RData. The file will open in R and create a data object, gcn.

Model building in logistic regression

Introduction

When you have several (or indeed many) predictor variables you want to find a regression model that best describes the relationship between the variables. You should not incorporate every variable that you’ve got: add enough predictors and you will eventually “explain” all the variability in your response variable simply because you have so many terms, not because the model is a good one.

The process of model-building allows you to select the “best” variable to add to your current regression model. In the book you see how to carry out stepwise model building using a regular multiple regression (Section 11.1.2). In this exercise you can have a go at building a logistic regression model. The process is much the same as described in Section 11.1.2.

The example data

The data you’ll use comes from a survey of an amphibian (great crested newt) in southern England. You can get the data separately as newt regression.RData, which will load the gcn object to your console.

head(gcn)
  presence area dry water shade bird fish other.ponds land macro  HSI
1        0  100   1     2    10    2    1           2    1    30 0.53
2        0   40   1     2    90    1    3           1    1     0 0.47
3        0  150   1     3    20    2    3           4    1    80 0.69
4        0   25   1     2    90    1    3           4    1     0 0.51
5        1  800   1     2    90    1    4           5    3     0 0.70
6        0  275   1     3     5    2    1           2    4    50 0.72

The first column, presence, gives the presence or absence of newts as a binary variable (0 = absence, 1= presence). The researcher used standard survey methods to detect the presence (or otherwise) of newts at 200 ponds. The other variables are habitat factors:

  • area – the pond area in square metres.
  • dry – pond seasonality (1-4 with 1 being non-seasonal and 4 being most seasonal).
  • water – a subjective measure of water quality (1-4, 1 = bad).
  • shade – the shadiness of the pond as a %.
  • bird – presence of waterfowl (1-3, 1 = absent).
  • fish – presence of fish (1-4, 4 = absent).
  • other.ponds – number of other ponds within 1 km.
  • land – terrestrial habitat quality (1-4, 4 = good).
  • macro – cover of macrophytes as a %.
  • HSI – habitat suitability index (0-1). This is a standard measure compiled from other habitat measures.

In this kind of survey the various habitat factors are converted to an index, and the indices are combined to make a final HSI (habitat suitability index). The HSI is used to make it easier to assess waterbodies for their potential to support populations of the great crested newt and to give a measure of reproducibility to surveys.

The example datafile is a “real” survey conducted on 200 ponds in Buckinghamshire in the UK. I’ve cut down the variables a little (the original data had the indices for each habitat variable) but essentially the dataset is unmodified.

The research task

The research task is twofold:

  • To find the most important (and significant) habitat factors affecting the presence or absence of the great crested newt.
  • To compare the “predictive power” of a habitat factor-based model against one using the HSI alone.

The first task will involve building a logistic regression model, which will show which habitat factors are “most influential”.

The second task will be to make a simple regression model using the HSI variable as a single predictor. The multiple-factor model and the HSI model can then be compared to see if there is any substantial difference in deviance explained.

Building a multiple regression model

To build the model you’ll need to use the glm() command, which carries out the logistic regression (when you set family = binomial). You’ll also need to use the add1() command. The command takes an existing regression model and adds all the variables you specify, one at a time. The results are then displayed.

You then use the AIC values (a measure of model fit penalized for complexity; lower values indicate a better model) to select the next best variable to add to your model. You’ll want to pick the variable with the lowest AIC value. With many potential variables you can of course end up with a comprehensive list, so it is helpful to see which variables are significant. You can tell the add1() command to carry out a significance test too.

  • For linear models (family = gaussian) you use test = "F".
  • For logistic models (family = binomial) you use test = "Chisq" (or test = "LRT").
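
A brief sketch of the two forms; the model and variable names here are hypothetical placeholders:

add1(my.lm,  scope = ~ . + var1 + var2, test = "F")       # linear model, lm() or glm(family = gaussian)
add1(my.glm, scope = ~ . + var1 + var2, test = "Chisq")   # logistic model, glm(family = binomial)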

Start with a blank model

The starting point for your regression model should be one with an intercept only and no terms except the response. In our example the response is the presence variable.

gcn.glm <- glm(presence ~ 1, data = gcn, family = binomial)

You could use the summary() command to look at the result but it would not be very informative at this stage as you have not incorporated any variables (model terms).

Add the first term

Now you have a blank model (intercept only) you will need to look at the effect of adding each variable in turn. The variable that has the lowest AIC value will be the one you want to incorporate (replace the intercept with the variable).

The add1() command will “interrogate” the regression model and return a list of variables (from a source you specify, usually the original dataset). To make the selection process easier you can carry out a significance test at the same time. You’ll want to include only variables that are significant.

add1(gcn.glm, scope = gcn, test = "Chisq")
Single term additions
Model:
presence ~ 1
            Df Deviance    AIC    LRT  Pr(>Chi)
<none>           261.37 263.37
area         1   258.88 262.88  2.492 0.1144364
dry          1   261.31 265.31  0.054 0.8159993
water        1   255.07 259.07  6.295 0.0121104 *
shade        1   248.52 252.52 12.844 0.0003386 ***
bird         1   261.03 265.03  0.338 0.5611190
fish         1   257.16 261.16  4.207 0.0402523 *
other.ponds  1   252.05 256.05  9.317 0.0022700 **
land         1   260.80 264.80  0.568 0.4508771
macro        1   246.55 250.55 14.818 0.0001184 ***
HSI          1   228.67 232.67 32.701 1.075e-08 ***

Note that the HSI variable has the lowest AIC value. Ignore that for the moment (we’ll return to it later). For now, just focus on the habitat variables.

The macro variable has the lowest of the other AIC values (250.55) and is significant too. So you can use the up arrow on the R console to recall the earlier command and edit it to change the regression model. Replace the intercept with the macro variable.

gcn.glm <- glm(presence ~ macro, data = gcn, family = binomial)

Now you’ve got a model with a single term.

Add subsequent terms

To add more terms, you:

  1. Run add1() on your existing model.
  2. Choose the variable with the lowest AIC.
  3. If the variable is significant then use it. If the variable is not significant then stop.
  4. Edit the regression formula to include the new variable.
  5. Repeat the steps until no significant variables remain (a sketch of one round follows this list).
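
Here is a sketch of one round of that loop. The term added is only an illustration; you would add whichever variable has the lowest AIC in your own add1() output:

add1(gcn.glm, scope = gcn, test = "Chisq")        # inspect the candidate terms
gcn.glm <- update(gcn.glm, . ~ . + other.ponds)   # add the term with the lowest AIC (illustration only)
add1(gcn.glm, scope = gcn, test = "Chisq")        # ...and repeat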

If you follow the process, you’ll end up with a model like so:

gcn.glm <- glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)

The water variable shows up as being significant if you run add1() on this model but we’ll stop because the p-value is only 0.044. When building a multiple variable model, it is not a bad thing to set a more rigorous cut-off value.

The final model can be summarised:

summary(gcn.glm)

Call:
glm(formula = presence ~ macro + other.ponds + fish + shade,
family = binomial, data = gcn)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.8022  -0.9099  -0.5447   0.9757   2.0516

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.909742   0.704410  -4.131 3.62e-05 ***
macro        0.018963   0.006368   2.978  0.00290 **
other.ponds  0.241031   0.078685   3.063  0.00219 **
fish         0.494789   0.172173   2.874  0.00406 **
shade       -0.014619   0.005284  -2.767  0.00566 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 261.37  on 199  degrees of freedom
Residual deviance: 222.57  on 195  degrees of freedom
AIC: 232.57
Number of Fisher Scoring iterations: 4

Explained deviance

The logistic regression does not return an R-squared value. What you see is Null deviance and Residual deviance. You can use these values to get a value that is equivalent to the R-squared, let’s call it the D-squared value:

1-(model$deviance/model$null.deviance)

Where model is your regression model result.

1 - (gcn.glm$deviance / gcn.glm$null.deviance)
[1] 0.1484474

You can see the result is about 0.15, which is not really great when you consider that there are 4 explanatory variables in the model!

D-squared calculator

You can make a simple custom command (a function) to calculate the D-squared value. Type the following into your console to set up a custom command called d2:

d2 <- function(model, digits = 4) {
  # D-squared: the proportion of the null deviance explained by the model
  round(1 - (model$deviance / model$null.deviance), digits = digits)
}

Now you can use d2(model) to carry out the calculations simply and easily. Just give the command the name of your regression model (you can also use the digits parameter to view different numbers of decimal places).
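
For example, using the model built above (the first value matches the earlier calculation, and the second simply rounds it to two decimal places):

d2(gcn.glm)
[1] 0.1484
d2(gcn.glm, digits = 2)
[1] 0.15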

Compare models

You can compare models in two main ways:

  • Are two models significantly different? Use the anova() command.
  • Compare the D-squared values. Is the explanatory power of one model better than another?

First of all, let’s make some alternative models to compare:

mod1 <- glm(presence ~ macro, data = gcn, family = binomial)
mod2 <- glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)
mod3 <- glm(presence ~ HSI, data = gcn, family = binomial)

The first model uses the single predictor macro (cover of macrophytes). This was the variable identified as the “best” of the habitat-based variables.

The second model is the multiple variable model using the 4 best habitat-based variables.

The third model uses the HSI variable alone. Recall that this is a composite index that uses information from several habitat factors in its construction.

Use anova() to compare models

The anova() command can compare models. The name of the command is slightly misleading as it is not really analysis of variance here; rather, it compares the deviance of models built from the same set of variables. You can also get a “test” of significance; with logistic regression you need test = "Chisq", as you did with the add1() command.

You can only really compare nested models, where the terms of one model are a subset of the terms of the other. You cannot get meaningful results by comparing models that each contain a different single predictor variable. If you look at the logistic regression models for the newts:

anova(mod1, mod2, test = "Chisq")

Analysis of Deviance Table
Model 1: presence ~ macro
Model 2: presence ~ macro + other.ponds + fish + shade

  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55
2       195     222.57  3   23.981 2.52e-05 ***

You can see that there is a significant difference between the two models. We would expect this of course, since we’ve been adding significant variables to the original model. Now look at the habitat-based model and the overall index-based one:

anova(mod2, mod3, test = "Chisq")

Analysis of Deviance Table
Model 1: presence ~ macro + other.ponds + fish + shade
Model 2: presence ~ HSI

  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       195     222.57
2       198     228.67 -3  -6.0985   0.1069

This time there is no significant difference. Neither model is “better” than the other. This is interesting because the first model is based on only 4 habitat variables but the HSI index is formed from 9 different habitat variables, each of which is itself converted to an index.

You cannot really use anova() to compare models with single variables:

anova(mod1, mod3, test = "Chisq")

Analysis of Deviance Table
Model 1: presence ~ macro
Model 2: presence ~ HSI

  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55
2       198     228.67  0   17.883

You can get a “result” but there is no significance value as the models are not directly comparable.

Use D-squared values

You can explore differences between models using a different approach, their explanatory power. You can calculate a D-squared value (equivalent to the R-squared value of a regular regression). This approach is not strictly a statistical one but gives you an idea of “how much” better one model is than another.

Look at the D-squared values of each of the newt models:

d2(mod1) ; d2(mod2) ; d2(mod3)
[1] 0.0567
[1] 0.1484
[1] 0.1251

The first model (mod1, macrophytes only) shows a D-squared a bit over 5%. The second habitat-based model (mod2) shows a value of nearly 15%, which is not great but is nearly 3 times better than mod1.

The index-based model was not statistically different from the habitat-based model. The D-squared value of mod3 is 12.5%, which is a little smaller than the 15% of mod2.

Summary

The more variables you add to a regression model the greater the “explanatory power”, in other words the D-squared value will keep increasing. However, it is rarely smart to keep adding variables just because you can measure them. The point of regression is to help you explain variability in the data. If variables are not significant they will only add a small fraction to the overall D-squared.

The best regression models explain as much of the variability as possible with the fewest variables. In this dataset it seems clear that the habitat-based model is the best of the options. It explains the most variability in the data with the fewest variables. The HSI model looks simpler but the HSI is calculated from 9 other variables.

As an additional exercise you might look at a regression model containing all the habitat variables. You will get a D-squared value of 17% and the model is not significantly “better” than the 4-variable model or the HSI one.
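
One way to fit that model is sketched below; the formula shorthand "." means every other column in the data, so HSI has to be removed explicitly (the object name gcn.full is just for illustration):

gcn.full <- glm(presence ~ . - HSI, data = gcn, family = binomial)   # all habitat variables, HSI excluded
d2(gcn.full)                                                         # explained deviance of the full model
anova(gcn.glm, gcn.full, test = "Chisq")                             # compare with the 4-variable model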

As a final note: the regression models here are based on newt presence or absence. Effectively a model “success” is the presence of a newt. It might be better to think of the model as a better predictor of the absence of newts, rather than their presence. A pond may be perfectly suitable for newts but they are not there simply because they have not found it. This historical component probably explains the low D-squared values.

Preparing & managing community data

Exercise 12.0.0.

Statistics for Ecologists (Edition 2) Exercise 12.0.0

This exercise relates to all of Chapter 12 (community ecology), and is primarily aimed at helping you to prepare data and assemble it in a form that allows you to carry out further investigation. This follows on from an earlier exercise in Section 3.2.7, where you used an Excel Pivot Table.

Preparing & managing community data

Introduction

It is important that your data are arranged and set out in a manner that allows you to carry out the analyses that you require. In general, a scientific recording layout is a good starting point (see Section 2.2). In the scientific recording layout (a.k.a. biological recording layout) you have a column for each variable (e.g. site name, species name, abundance). For most purposes this is the most “robust” way to record your data, as you can use the data most flexibly. For community analyses however, you’ll generally want to have the data arranged by site and species, with the rows being site names and the columns the species (with the body of the table being the abundance).

This exercise shows you how to take a dataset that is in recording layout and convert to community layout. Get the datafile for this exercise: Preparing and managing community data exercise.xlsx.

The exercise data

The exercise data are in the file Preparing and managing community data exercise.xlsx. There are several worksheets, with the Data tab being the main data. The other worksheets contain a note about the Domin scale (an ordinal scale of plant abundance), a completed Pivot Table and a dataset ready for export as CSV.

The data show the abundance of some terrestrial plant species at ten sites in Shropshire. These data are part of a series of NVC surveys carried out by student groups. Here are the first few rows of the dataset:

The first few rows of the example dataset.

Site  Species                 Const  Cover  Imp
ML1   Achillea millefolium        5      6    6
ML1   Centaurea nigra             5      4    4
ML1   Lathyrus pratensis          5      5    5
ML1   Leucanthemum vulgare        5      5    5
ML1   Lotus corniculatus          5      7    7
ML1   Plantago lanceolata         5      5    5

The data are as follows:

  • Site = Site names as a simple label.
  • Species = Plant species names.
  • Const = Frequency of occurrence (a value from 0-5).
  • Cover = The maximum abundance in Domin scale (a value from 1-10) from the 5 quadrats at each site.
  • Imp = A combination of Const and Cover (Const * Cover ÷ 5) to give an abundance score.

These data are set out in scientific recording layout, that is, each column is a single variable and each row is a “record”. This gives the most flexibility and maximises the usefulness of the data. However, for community analysis we’ll need the data in community layout like so:

The example data laid out as a community dataset. Each column is a separate sample.

Species                  ML1  ML2  MU1  MU2  PL2  PU2  SL1  SL2  SU1  SU2
Achillea millefolium       6    3    5    4    0  3.2    0    0    0    0
Aegopodium podagraris      0    0    0    0    0    0    0  1.6    0    0
Agrostis capillaris        0    8    8  5.6    8    0    5    2  3.2    0
Agrostis stolonifera       0    0    0    0    0    5    0  0.4    0    0
Anthriscus sylvestris      0    0    0    0    0    0    0    0  1.2    3
Arctium minus              0    0    0    0    0    0    0    0    0  0.4

In this case the columns are the sites. It is easy to transpose the data, either in Excel or later on if you have exported the data to CSV.

Although this is the “ideal” layout for the purpose of community analysis there are other ways to manage the data, which you’ll see as you work through the exercise. The manipulations are illustrated using Excel 2013 (Windows) but you can follow along with most versions of Excel as well as Open Office or Libre Office.

Make a Pivot Table

The starting point for your data arrangement and management is a Pivot Table. This will allow you to arrange and rearrange your data in many (useful) ways. You can also use a Pivot Table to get summary information and to make exploratory graphs.

In Excel 2013 the Pivot Table button is in the Insert ribbon menu. In other spreadsheets you may find the appropriate function in the Data menu. Start by making a new blank Pivot Table in a fresh worksheet:

  1. Click once anywhere in the block of data; there is no need to highlight or select any data.
  2. Click the Insert > Pivot Table button.
  3. The Create Pivot Table dialogue box should open and you’ll see that the data are automatically selected. The default is to place the new pivot table in a new worksheet so you can simply click OK.

You should now see the new (empty) pivot table and the Pivot Table Fields dialogue box. In most versions of Excel you drag the items you want from the top (the list of variables) to the appropriate section at the bottom of the dialogue window (in some spreadsheets you drag items directly to the pivot table).

Now you are ready to build your dataset. There are many options and in the following exercise you’ll get to see several ways to present the data.

Make species lists

There are various simple lists you can make.

All species present

A good starting point would be to make a list of all the species present across the survey sites.

  1. Click on the pivot table (currently empty) to ensure that the Pivot Table Field dialogue box opens.
  2. Now drag the Species field from the list at the top to the area labelled Rows at the bottom of the dialogue box.

Now you have a simple overall species check-list. This is useful as a data validation tool, as you can look through it for spelling mistakes (if two entries are very similar, one of them is probably a mistake).
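
The same check-list is easy to produce in R once the recording-layout data have been imported; the file name here is just an assumption:

plants <- read.csv("plant survey.csv")    # recording-layout data: Site, Species, Const, Cover, Imp
sort(unique(plants$Species))              # overall species check-list, handy for spotting spelling variants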

Species listed by site

It would also be useful to have a list of the species at each site. There are two ways you could try:

  • List all sites with their species
  • List sites and their species sequentially

Listing all the sites seems the most obvious.

  1. Click the Pivot Table to open the Pivot Table Field dialogue box.
  2. Drag the Site field item from the top to the Rows area at the bottom. Make sure that the Site item is above the Species item (you can drag to rearrange them anytime).

Now you have a compact table showing the species lists for all the sampling sites:

Species lists per sample site.

Try dragging the Site field so that it is underneath the Species field in the Rows area. You now get a list of which sites each species was present in.

You might also want to display a single site, rather than all of them. There are two simple options:

  • Filter the Site field to display one (or more) of the sites.
  • Move the Site field to the Filters area (sometimes called Report Filter), where you can apply the filter.

The first option allows you to keep the species lists separate per site.

  1. Click the pivot table to activate the Pivot Table Field dialogue box.
  2. Move the mouse over the Site field at the top of the Pivot Table list. You’ll see a triangle at the right end of the highlight, click that to open a sort & filter dialogue box.
  3. Click the box(es) beside the items you want to view or hide. A ticked item will be displayed; an unticked one will be hidden.
  4. Click OK when you are done to apply the filter. Note that you’ll see a funnel icon by the Site field in the list area.

The second option allows you to combine lists across sites.

  1. If you applied a filter earlier clear it now. Click the Pivot Table then the triangle in the Site field to open the sort & filter dialogue.
  2. Click the Clear Filter from “Site” button to remove the filter.
  3. Now drag the Site field from the Row area at the bottom to the Filter area.
  4. Now you see the top of the pivot table read Site and to the right a label (All) and a grey triangle.
  5. Click the triangle to open the filter dialogue.
  6. If you click a single item, you’ll display the species for that site. If you click several sites, you’ll display a list encompassing all those sites.

This approach could be useful if you have “grouped” samples, perhaps from different habitats.

Presence-absence lists

Simple species lists are useful but for calculations of species richness or similarity you’ll need to make a presence-absence dataset. That is one where each species has a 1 for presence or 0 for absence.

This is easily done with the pivot table.

  1. If you have filters applied they will usually be cleared if you move a field to a new location. Alternatively, you can remove the fields and start afresh (drag the fields out of the lower part of the Pivot Table Field dialogue box or un-tick the fields in the list at the top).
  2. Drag the Species field to the Rows area.
  3. Drag the Sites field to the Columns area.
  4. Now you have a table arranged in a sensible fashion but no data! Drag the Species field from the list at the top to the Values area at the bottom (of the Pivot Table Field dialogue box).

Now you’ll have a table where the presence of species is shown by a 1. The absences are blank spaces. At the moment you’ll also have row and column totals.

  • Row totals show the frequency of each species across the samples.
  • Column totals show the species richness for each site.
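
In R the same totals come from rowSums() and colSums(). Here is a tiny self-contained sketch using a made-up presence-absence matrix:

pa <- matrix(c(1, 0, 1,
               0, 1, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Species.A", "Species.B"), c("ML1", "ML2", "MU1")))  # toy presence-absence data
rowSums(pa)   # frequency of each species across the sites
colSums(pa)   # species richness of each site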

If you need to carry out analysis, then you may well wish to remove the totals and also to replace the blank cells with 0. To do this you’ll need to use the Pivot Table Tools menus.

  1. Click once in the pivot table to activate the Pivot Table Tools ribbon menus. In Excel 2013 there are two: Analyze and Design. In older versions there were three menus but you should not have too much difficulty finding the right buttons.
  2. Remove the totals using the Design menu. Click Grand Totals > Off for Rows and Columns.
  3. To replace blanks with 0 you need the Analyze menu. Click Pivot Table > Options to open the Pivot Table Options dialogue box.
  4. On the Layout & Format tab (this should be in view by default) enter a 0 in the box labelled For empty cells show. The tick-box beside the label needs to be ticked but this is usually already activated.
  5. Now click OK to apply the changes.

Now your data are in a form that could be exported as CSV, although you’d need to remove the first three rows from the final CSV.

If you need to have the sites as rows, then you can simply rearrange the fields between the Rows and Columns areas. If you are going to use R then there is no need to do this as you can easily transpose data using the t() command.
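
As a sketch, assuming you saved the pivot table (species as rows, sites as columns) to a CSV file called community.csv:

comm <- read.csv("community.csv", row.names = 1)   # species as rows, sites as columns
comm.t <- as.data.frame(t(comm))                   # transposed: sites as rows, species as columns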

Abundance lists

Since these data are accompanied by abundance values it is most likely that you’ll want to use the abundance instead of presence-absence. In the dataset the Imp variable is a measure of abundance derived from the frequency and maximum (Domin) cover. Each sample site is a combination of 5 quadrats but the individual quadrat data are not available.

To make an abundance-based community dataset you simply rearrange the pivot table:

  1. Click once on the pivot table.
  2. Now rearrange the fields so that Species are in the Rows area.
  3. Make sure the Site field is in the Columns area.
  4. Remove the item in the Values area; just drag it away and it’ll disappear.
  5. Drag the Imp field from the list at the top to the Values area at the bottom.

The Values item should read Sum of Imp. Since there is only one value per species/site combination this is fine. If the field does not show Sum, then click on it and use Value Field Settings to alter the summary to Sum.

You can arrange the Site and Species fields in different ways, just as you did when making simple lists. Try it out and see what you get. The most useful arrangement for community analysis is where you have Sites as columns and Species as rows. If you need to have the sites as rows, then you can simply rearrange the fields between the Rows and Columns areas. If you are going to use R then there is no need to do this as you can easily transpose data using the t() command once you have the data in R.
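
You can also skip the pivot table entirely if the recording-layout data are in R; the xtabs() command cross-tabulates directly (the file and object names are assumptions, as in the earlier sketch):

plants <- read.csv("plant survey.csv")               # recording-layout data
comm <- xtabs(Imp ~ Species + Site, data = plants)   # abundance table; one record per combination, so the "sum" is just the value
comm.pa <- 1 * (comm > 0)                            # presence-absence version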

Constancy tables

A constancy table is used to display the results of a British NVC plant survey. For each site/sample the species are listed. The abundance data are given as a pair of values:

  • The constancy – that is the frequency of occurrence (1-5, absences are not shown).
  • The cover – that is, the maximum abundance (from the 5 sample quadrats) in Domin scale (1-10).

Even if you are not reporting an NVC survey the process will help to show you how flexible the pivot table can be.

  1. Start by removing any previous data from the Pivot Table.
  2. Drag the Species field to the Rows area.
  3. Drag the Site field to the Columns area.
  4. Now drag the Const field to the Values area. You now have a table showing the frequency of each species by site.
  5. Now drag the Cover field to the Values area; drop it underneath the Sum of Const item that’s already there.

Now you have a table showing the Constancy (Frequency) and Cover (max abundance) for each species and site. This is not the most “readable” of tables, so try some modifications.

Try dragging the Site item in the Columns area so that it is underneath the ∑Values item. Now you have two tables side by side, one showing constancy, the other cover.

Move the Site item again, this time place it above the Species item in the Rows area. Now you have a much more useable table, with the sites being shown atop one another.

With a bit of tweaking you can recreate the original data layout!

  1. Click once on the pivot table to activate the Pivot Table Tools menus.
  2. Click the Design > Report Layout button and select Show in Tabular Form.
  3. Now click the Report Layout button again and select Repeat All Item Labels.
  4. Click the Design > Subtotals button and select Do Not Show Subtotals.

Obviously it is a little silly to recreate the original dataset; this just goes to show how you can arrange and rearrange the data easily using a pivot table.

There is a lot more to Pivot Tables (as well as general data management) in my book Managing Data Using Excel!

Logistic regression: model prediction

Exercise 11.3.1.

Statistics for Ecologists (Edition 2) Exercise 11.3.1

These notes are about how to use the results of a regression model to predict the value of the response variable when you supply certain values of the predictor.

Predicting values from logistic regression and other models

Introduction

When you carry out a regression you are looking to describe the data in terms of the variables that form the relationships. When you’ve got your regression model you are able to describe the relationship using a mathematical model (which is what the regression model is).

You can use the regression model to make predicted values, which is where you use “new” values of the predictor (that is ones not observed in the original dataset) to predict the response variable.

These predicted values are especially important in logistic regression, where your response is binary, that is it only has two possibilities. The result you get when you “predict” response values in a logistic regression is a probability; the likelihood of getting a “positive” result when the predictor variable is set to a particular value.

These notes will show you how to use the predict() command in R to produce predicted values for regression models, particularly logistic regression.

The predict() command

The predict() command is used to compute predicted values from a regression model. The general form of the command is:

predict(model, newdata, type)

  • model – a regression model, usually the result of lm() or glm().
  • newdata – a data.frame giving the values of the predictor(s) to use in the prediction of the response variable.
  • type – the type of prediction; usually you want type = "response".

The model is simply the result of a regression model. You need to specify type = “response” so that your prediction predicts the response variable (note that this is not necessarily the default, so you must specify it).

The newdata parameter is where you specify the values of the predictor(s) that you want to use to predict the response variable. You can make a data.frame object containing the variables (with variable names to match the original dataset), or can specify it “on the fly”. You’ll see examples shortly.

Prediction in logistic regression

Logistic regression is carried out in cases where your response variable can take one of only two forms (i.e. it is binary). There are two general forms your response variable can take:

  • Presence/absence, that is, 0 or 1 (or some other binary form).
  • A success/failure matrix, where you have two frequencies for each observation (the “successes” and the “failures”).

The way your data are arranged does not make much difference to how you carry out the predictions but because of the different forms it is useful to see an example of each.

Logistic regression and presence/absence

When you have the response variable as presence/absence (1 or 0) it is easy to see that your response is in binary form. The logistic regression converts the 1s and 0s to a likelihood (under the various levels of your predictor variables), so your result is in that form. So, when you use the predict() command you can get the probability of getting a success (a presence: 1).

The example file (Predict regressions.RData) contains a dataset that looks at the presence or absence of an amphibian (great crested newt) in ponds. The data are called gcn. Various habitat factors were measured. One factor is the percentage cover of macrophytes. You can use the logistic regression to explore the relationship between the presence (or absence) of newts and the cover of macrophytes.

Once you have the regression model you can use predict() to predict the likelihood of finding a newt given any value for the cover of macrophytes.

Make a regression model

Make a regression model result:

gcn.glm <- glm(presence ~ macro, data = gcn, family = binomial)

The result is that the macro variable is statistically significant:

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.135512   0.217387  -5.223 1.76e-07 ***
macro        0.022095   0.005901   3.744 0.000181 ***

You can present a graph of the model (see the book text for the code):

Predict outcomes

Now you could look at the line on the plot (made using fitted values) and estimate the probability of finding a newt (presence = 1) for any percentage cover of macrophyte. Alternatively you can use the predict() command.

nd = data.frame(macro = c(40, 50, 60))
nd
  macro
1    40
2    50
3    60

predict(gcn.glm, newdata = nd, type = "response")
        1         2         3
0.4374069 0.4923162 0.5474114

Note that you first make a data.frame with a variable called macro, to match the original data. These new data are used to predict the response. However, the value you get is not exactly a response; it is the probability of “success”, that is, of finding a newt. You could have coded the original response variable (presence) as 1 for absence and 0 for presence but, apart from being a little odd, you’d then be treating a “success” as the absence of newts.

In any event your prediction is a probability of success. You might have chosen to create the prediction newdata data.frame directly in the predict() command, which is what you’ll see in the following example.

Logistic regression and success/failure matrix

Your data may be in a subtly different form to the 0 or 1 model. In the following example you have a predictor variable, latitude, and a response but the response is split into two frequencies:

cbh
  latitude Mpi90 Mpi100  p.Mpi100
1     48.1    47    139 0.7473118
2     45.2   177    241 0.5765550
3     44.0  1087   1183 0.5211454
4     43.7   187    175 0.4834254
5     43.5   397    671 0.6282772
6     37.8    40     14 0.2592593
7     36.6    39     17 0.3035714
8     34.3    30      0 0.0000000

Here you have two versions of an allele in the Californian beach hopper. The data are in the file (Predict regressions.RData) and are called cbh. For each sampling latitude a number of animals were collected. They can have one form of the allele or the other. So your response variable is split into the frequency of each. If you take those two variables you have a success/failure matrix.

The logistic regression can construct the likelihood from the success/failure matrix. It does not matter which allele you choose to be the “success” but here I’ve used the Mpi100 allele (because its proportion increases with latitude). The proportion of “successes” is shown in the final column. It won’t be used in the regression except to draw a plot.

Make a regression model

Start by making a regression model. You need to make a success/failure matrix but this can be done as part of the glm() command formula.

cbh.glm <- glm(cbind(Mpi100, Mpi90) ~ latitude, data = cbh, family = binomial)

The result shows that latitude is a significant variable:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.64686    0.92487  -8.268   <2e-16 ***
latitude     0.17864    0.02104   8.490   <2e-16 ***

You can draw this as a plot (because of the replicates we can add error bars):

Predict outcomes

As before you could use the plot to predict the likelihood of “success”, the Mpi100 allele. You can also use the predict() command:

predict(cbh.glm, newdata = data.frame(latitude = c(35, 40, 45)), type = "response")
        1         2         3
0.1986945 0.3772411 0.5967457

Note that this time the newdata parameter used a data.frame created on the fly.

Prediction with multiple predictors

When you have multiple predictor variables you have to specify them all in the newdata parameter of the predict() command. It is easiest to make a new data.frame object to hold these “new” data unless you only have one or two values.

In the following data example there are four predictors and one response (Length).

names(mf)
[1] "Length" "Speed"  "Algae"  "NO3"    "BOD"

The data give the lengths of a freshwater invertebrate (a mayfly species) and some habitat conditions at the sampling locations where each observation took place. The data are in the file (Predict regressions.RData) and are called mf.

Your new data.frame has to use the same variable names but they do not have to be in any particular order. You only have to include the variables that are in the regression model.

Make the regression model

Make the model:

mf.lm <- lm(Length ~ BOD + Algae, data = mf)

The result shows the coefficients (one variable is not significant but we’ll leave it in).

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.34681    4.09862   5.452 1.78e-05 ***
BOD         -0.03779    0.01517  -2.492   0.0207 *
Algae        0.04809    0.03504   1.373   0.1837

Predict the result

Because there are two predictors it is not practical to draw a plot (apart from diagnostics, see Exercise 11.1). So the predict() command is the way to go.

predict(mf.lm, newdata = data.frame(BOD = 20, Algae = 10), type = "response")
       1
22.07199

You can of course make a separate data.frame containing BOD and Algae columns and populate it with values for which you want to get a prediction.
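
For example, a sketch that predicts over a small grid of values (the values themselves are arbitrary):

nd <- expand.grid(BOD = c(20, 100, 200), Algae = c(10, 40))   # every combination of the supplied values
nd$predicted <- predict(mf.lm, newdata = nd)                  # predicted Length for each row
nd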

Data examples

The data examples used here are available as part of the support data for the book. I’ve also added a file Predict regressions.RData, which contains just the three datasets mentioned on this page.