Critical values for the Kruskal-Wallis test

Exercise 10.2.a.

Statistics for Ecologists (Edition 2) Exercise 10.2a

These notes are concerned with the Kruskal-Wallis test (Section 10.2). The notes give the critical values for the Kruskal-Wallis test under various scenarios.

Introduction

The Kruskal-Wallis test is appropriate when you have non-parametric data and one predictor variable (Section 10.2). It is analogous to a one-way ANOVA but uses ranks of items in various groups to determine the likely significance.

If there are at least 5 replicates in each group, the critical values are close to those of the Chi-squared distribution. Exact critical values have been computed for equal group sizes, and also for situations where the group sizes are unequal.

General critical values: sample sizes >= 5

When you have at least 5 observations in each group the Kruskal-Wallis critical value is approximately the same as that of the Chi-squared distribution. You need to determine the degrees of freedom, which are the number of groups minus 1. Reject the null hypothesis if your calculated value of H is larger than the tabulated value.
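
To see the statistic behind these tables, here is H computed from scratch. This is a quick sketch in Python rather than the book's Excel or R, with invented data:

```python
def midranks(values):
    """Ranks of values, with ties sharing the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average rank, 1-based
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), with no tie correction."""
    pooled = [v for g in groups for v in g]
    r = midranks(pooled)
    n = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        h += sum(r[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Three groups of 2: even complete separation only reaches H = 4.571,
# below the chi-squared 5% critical value for 2 df (5.991). This is why
# the exact tables below have no entries for such tiny samples.
h = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])
```

The chi-squared approximation is then just a matter of comparing H against the tabulated value for (number of groups minus 1) degrees of freedom.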

Critical values for H where all samples have at least 5 replicates.

Degrees of   Significance level
freedom      5%       2%       1%       0.1%
1 3.841 5.412 6.635 10.830
2 5.991 7.824 9.210 13.820
3 7.815 9.837 11.341 16.270
4 9.488 11.668 13.277 18.470
5 11.070 13.388 15.086 20.510
6 12.592 15.033 16.812 22.460
7 14.067 16.622 18.475 24.320
8 15.507 18.168 20.090 26.130
9 16.909 19.679 21.666 27.880
10 18.307 21.161 23.209 29.590
11 19.675 22.618 24.725 31.260
12 21.026 24.054 26.217 32.910
13 22.362 25.472 27.688 34.530
14 23.685 26.873 29.141 36.120
15 24.996 28.259 30.578 37.700
16 26.296 29.633 32.000 39.250
17 27.587 30.995 33.409 40.790
18 28.869 32.346 34.805 42.310
19 30.144 33.687 36.191 43.820
20 31.410 35.020 37.566 45.310
21 32.671 36.343 38.932 46.800
22 33.924 37.659 40.289 48.270
23 35.172 38.968 41.638 49.730
24 36.415 40.270 42.980 51.180
25 37.652 41.566 44.314 52.620
26 38.885 42.856 45.642 54.050
27 40.113 44.140 46.963 55.480
28 41.337 45.419 48.278 56.890
29 42.557 46.693 49.588 58.300
30 43.773 47.962 50.892 59.700

Exact values: Equal sample sizes

If you have the same number of replicates in each group you can use a table of exact values instead of the Chi-squared approximation. There are separate tables for different numbers of groups. Reject the null hypothesis if your calculated value of H is equal to or greater than the tabulated value for the appropriate sample size.

Groups = 3

Exact critical values for H for three groups of equal size.

Sample   Significance level
size     5%      2%      1%
2
3 5.600 6.489 7.200
4 5.692 6.962 7.654
5 5.780 7.220 8.000
6 5.801 7.240 8.222
7 5.819 7.332 8.378
8 5.805 7.355 8.465
9 5.831 7.418 8.529
10 5.853 7.453 8.607
11 5.885 7.489 8.648
12 5.872 7.523 8.712
13 5.901 7.551 8.735
14 5.896 7.566 8.754
15 5.902 7.582 8.821
16 5.909 7.596 8.822
17 5.915 7.609 8.856
18 5.932 7.622 8.865
19 5.923 7.634 8.887
20 5.926 7.641 8.905
21 5.930 7.652 8.918
22 5.932 7.657 8.928
23 5.937 7.664 8.947
24 5.936 7.670 8.964
25 5.942 7.682 8.975
infinity 5.991 7.824 9.210

Groups = 4

Exact critical values for H for four groups of equal size.

Sample   Significance level
size     5%      2%      1%
2 6.167 6.667 6.667
3 7.000 7.872 8.538
4 7.235 8.515 9.287
5 7.377 8.863 9.789
6 7.453 9.027 10.090
7 7.501 9.152 10.250
8 7.534 9.250 10.420
9 7.557 9.316 10.530
10 7.586 9.376 10.620
11 7.623 9.422 10.690
12 7.629 9.458 10.750
13 7.645 9.481 10.800
14 7.658 9.508 10.840
15 7.676 9.531 10.870
16 7.678 9.550 10.900
17 7.682 9.568 10.920
18 7.698 9.583 10.950
19 7.701 9.595 10.980
20 7.703 9.606 10.980
21 7.709 9.623 11.010
22 7.714 9.629 11.030
23 7.719 9.640 11.030
24 7.724 9.652 11.060
25 7.727 9.659 11.070
infinity 7.815 9.837 11.340

Groups = 5

Exact critical values for H for five groups of equal size.

Sample   Significance level
size     5%      2%      1%
2 7.418 8.073 8.291
3 8.333 9.467 10.200
4 8.685 10.130 11.070
5 8.876 10.470 11.570
6 9.002 10.720 11.910
7 9.080 10.870 12.140
8 9.126 10.990 12.290
9 9.166 11.060 12.410
10 9.200 11.130 12.500
11 9.242 11.190 12.580
12 9.274 11.220 12.630
13 9.303 11.270 12.690
14 9.307 11.290 12.740
15 9.302 11.320 12.770
16 9.313 11.340 12.790
17 9.325 11.360 12.830
18 9.334 11.380 12.850
19 9.342 11.400 12.870
20 9.353 11.410 12.910
21 9.356 11.430 12.920
22 9.362 11.430 12.920
23 9.368 11.440 12.940
24 9.375 11.450 12.960
25 9.377 11.460 12.960
infinity 9.488 11.670 13.280

Groups = 6

Exact critical values for H for six groups of equal size.

Sample   Significance level
size     5%      2%      1%
2 8.846 9.538 9.846
3 9.789 11.030 11.820
4 10.140 11.710 12.720
5 10.360 12.070 13.260
6 10.500 12.330 13.600
7 10.590 12.500 13.840
8 10.660 12.620 13.990
9 10.710 12.710 14.130
10 10.750 12.780 14.240
11 10.760 12.840 14.320
12 10.790 12.900 14.380
13 10.830 12.930 14.440
14 10.840 12.980 14.490
15 10.860 13.010 14.530
16 10.880 13.030 14.560
17 10.880 13.040 14.600
18 10.890 13.060 14.630
19 10.900 13.070 14.640
20 10.920 13.090 14.670
21 10.930 13.110 14.700
22 10.940 13.120 14.720
23 10.930 13.130 14.740
24 10.930 13.140 14.740
25 10.940 13.150 14.770
infinity 11.070 13.390 15.090

Exact values: Unequal sample sizes

When you have different sample sizes you can look up exact critical values in the following tables.

Max group size 1 to 5

The sample sizes are given in the first column, so 222110 is equivalent to 5 groups with replicates of 2, 2, 2, 1 and 1 (zeros are padding for absent groups). Reject the null hypothesis if your calculated value of H is equal to or greater than the tabulated value.
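
The digit codes can be unpacked mechanically. This small Python sketch (not part of the book) illustrates the convention, including the later tables where the largest group size occupies two digits:

```python
def group_sizes(code, first_two_digits=False):
    """Decode a sample-size code from the tables: each digit is one group's
    replicate count, and zeros are padding for absent groups. For the tables
    covering largest group sizes 10-17, the first TWO digits form one size."""
    if first_two_digits:
        return [int(code[:2])] + [int(c) for c in code[2:] if c != "0"]
    return [int(c) for c in code if c != "0"]
```

So group_sizes("222110") gives [2, 2, 2, 1, 1] (five groups), while group_sizes("1011000", first_two_digits=True) gives [10, 1, 1].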

Critical values of H for unequal group sizes (max replicates 5).

Sample Significance level
sizes 5% 2% 1%
222110 6.750
222111 7.600 7.800
222200 6.167 6.667 6.667
222210 5.679
222210 7.133 7.533 7.533
222211 8.018 8.455 8.618
222220 7.418 8.073 8.291
222221 8.455 9.000 9.227
222222 8.846 9.538 9.846
311100
311111
321100
321110 6.583
321111 7.467
322000 4.714
322100 5.833 6.500
322110 6.800 7.400 7.600
322111 7.954 8.345 8.509
322200 6.333 6.978 7.133
322210 7.309 7.836 8.127
322211 8.348 8.939 9.136
322220 7.682 8.303 8.682
322221 8.731 9.346 9.692
322222 9.033 9.813 10.220
331000 5.143
331100 6.333
331110 7.111 7.467
331111 7.909 8.564 8.564
332000 5.361 6.250
332100 6.244 6.689 7.200
332110 7.200 7.782 8.073
332111 8.303 8.803 9.045
332200 6.527 7.182 7.636
332210 7.591 8.258 8.576
332211 8.615 9.269 9.628
332220 7.910 8.667 9.115
332221 8.923 9.714 10.150
333000 5.600 6.489 7.200
333100 6.600 7.109 7.400
333110 7.576 8.242 8.424
333111 8.641 9.205 9.564
333200 6.727 7.636 8.015
333210 7.769 8.590 9.051
333211 8.835 9.670 10.080
333220 8.044 9.011 9.505
333300 7.000 7.872 8.538
333310 8.000 8.879 9.451
411000
411111 7.333
421000
421100 5.833
421110 6.733 7.267
421111 7.827 8.236 8.400
422000 5.333 6.000
422100 6.133 6.667 7.000
422110 7.145 7.636 7.936
422111 8.205 8.727 9.000
422200 6.545 7.091 7.391
422210 7.500 8.205 8.545
422211 8.558 9.192 9.538
422220 7.846 8.673 9.077
422221 8.868 9.643 10.070
431000 5.208
431100 6.178 6.711 7.067
431110 7.073 7.691 8.236
431111 8.053 8.758 9.023
432000 5.444 6.144 6.444
432100 6.309 7.018 7.455
432110 7.439 8.091 8.394
432111 8.429 9.115 9.506
432200 6.621 7.530 7.871
432210 7.679 8.545 8.962
432211 8.742 9.577 10.010
432220 7.984 8.918 9.429
433000 5.791 6.564 6.745
433100 6.545 7.485 7.758
433110 7.660 8.513 8.891
433111 8.654 9.495 9.934
433200 6.795 7.763 8.333
433210 7.874 8.830 9.374
433300 6.984 7.995 8.659
441000 4.967 6.667 6.667
441100 5.945 7.091 7.909
441110 7.114 8.182 8.636
441111 8.231 9.096 9.538
442000 5.455 6.600 7.036
442100 6.386 7.364 7.909
442110 7.500 8.385 8.885
442111 8.571 9.445 9.940
442200 6.731 7.750 8.346
442210 7.797 8.802 9.330
443000 5.598 6.712 7.144
443100 6.635 7.660 8.231
443110 7.714 8.742 9.247
443200 6.874 7.951 8.621
443300 7.038 8.181 8.876
444000 5.692 6.962 7.654
444100 6.725 7.879 8.588
444200 6.957 8.157 8.871
511000
511110 6.667
511111 7.909
521000 5.000
521100 5.960 6.600
521110 6.905 7.418 7.855
521111 7.891 8.473 8.682
522000 5.160 6.000 6.533
522100 6.109 6.927 7.276
522110 7.273 7.909 8.318
522111 8.308 8.938 9.362
522200 6.564 7.364 7.773
522210 7.600 8.408 8.831
522211 8.624 9.442 9.890
531000 4.960 6.044
531100 6.004 6.964 7.400
531110 7.925 8.782 9.316
531111 8.169 9.062 9.503
532000 5.251 6.124 6.909
532100 6.364 7.285 7.758
532110 7.164 7.939 8.303
532111 8.495 9.371 9.837
532200 6.664 7.626 8.203
532210 7.462 8.272 8.756
533000 5.648 6.533 7.079
533100 6.641 7.656 8.128
533110 7.756 8.705 9.251
533200 6.822 7.912 8.607
533300 7.019 8.124 8.848
541000 4.985 6.431 6.955
541100 6.041 7.182 7.909
541110 7.684 8.651 9.187
541110 7.154 8.235 8.831
541111 8.242 9.234 9.841
542000 5.273 6.505 7.205
542100 6.419 7.477 8.173
542110 7.520 8.492 9.152
542200 6.725 7.849 8.473
543000 5.656 6.676 7.445
543100 6.685 7.793 8.409
543200 6.926 8.069 8.802
544000 5.657 6.953 7.760
544100 6.760 7.986 8.726
551000 5.127 6.145 7.309
551100 6.077 7.308 8.108
551110 7.226 8.334 9.152
552000 5.338 6.446 7.338
552100 6.541 7.536 8.327
552200 6.777 7.943 8.634
553000 5.705 6.886 7.578
553100 6.745 7.857 8.611
554000 5.666 7.000 7.823
555000 5.780 7.220 8.000

Max group size 6 to 9

The sample sizes are given in the first column, so 611000 is equivalent to 3 groups with replicates of 6, 1 and 1. Reject the null hypothesis if your calculated value of H is equal to or greater than the tabulated value.

Critical values of H for unequal group sizes (largest group size 6 to 9).

Sample Significance level
sizes 5% 2% 1%
611000
611100 5.667
611110 7.091
611111 7.879 8.409
621000 4.822
621100 5.964 6.600 7.036
621110 6.909 7.682 8.000
621111 8.013 8.692 9.051
622000 5.345 6.182 6.665
622100 6.242 7.000 7.500
622110 7.308 8.051 8.628
622111 8.374 9.165 9.604
622200 6.538 7.513 7.923
622210 7.593 8.549 9.077
631000 4.855 6.236 6.873
631100 6.045 7.091 7.621
631110 7.051 8.064 8.590
631111 8.209 9.176 9.659
632000 5.348 6.227 6.970
632100 6.397 7.321 7.885
632110 7.505 8.407 9.000
632200 6.703 7.758 8.363
633000 5.615 6.590 7.410
633100 6.637 7.725 8.220
633200 6.876 7.962 8.695
641000 4.947 6.174 7.106
641100 6.071 7.250 8.000
641110 7.187 8.286 9.033
642000 5.340 6.571 7.340
642100 6.489 7.516 8.302
642200 6.743 7.929 8.610
643000 5.610 6.725 7.500
643100 6.710 7.819 8.538
644000 5.681 6.900 7.795
651000 4.990 6.138 7.182
651100 6.110 7.218 8.141
652000 5.338 6.585 7.376
652100 6.541 7.598 8.389
653000 5.602 6.829 7.590
654000 5.661 7.018 7.936
655000 5.729 7.110 8.028
661000 4.945 6.286 7.121
661100 6.133 7.276 8.181
662000 5.410 6.667 7.467
663000 5.625 6.900 7.725
664000 5.724 7.107 8.000
665000 5.765 7.152 8.124
666000 5.801 7.240 8.222
711000
711100 5.945
711110 6.831 7.455
711111 7.791 8.846 8.846
721000 4.706 5.891
721100 6.006 6.786 7.273
721110 6.984 7.753 8.231
721111 8.119 8.821 9.268
722000 5.143 6.058 7.000
722100 6.319 7.011 7.626
722110 7.356 8.143 8.689
722200 6.565 7.568 8.053
731000 4.952 6.043 7.030
731100 6.070 7.037 7.652
731110 7.152 8.119 8.779
732000 5.357 6.339 6.839
732100 6.466 7.383 8.005
732200 6.718 7.759 8.407
733000 5.620 6.656 7.228
733100 6.671 7.721 8.352
741000 4.986 6.319 6.986
741100 6.104 7.222 8.032
742000 5.376 6.447 7.321
742100 6.543 7.531 8.337
743000 5.623 6.780 7.550
744000 5.650 6.962 7.814
751000 5.064 6.194 7.061
751100 6.113 7.318 8.148
752000 5.393 6.477 7.450
753000 5.607 6.874 7.697
754000 5.733 7.084 7.931
755000 5.708 7.101 8.108
761000 5.067 6.214 7.254
762000 5.357 6.587 7.490
763000 5.689 6.930 7.756
764000 5.706 7.086 8.039
765000 5.770 7.191 8.157
766000 5.730 7.197 8.257
771000 4.986 6.300 7.157
772000 5.398 6.693 7.491
773000 5.688 7.003 7.810
774000 5.766 7.145 8.142
775000 5.746 7.247 8.257
811000
811100 6.182
811110 6.538 7.769
811111 7.788 8.712 9.231
821000 4.909 6.000
821100 5.933 6.692 7.423
821110 6.997 7.821 8.308
822000 5.356 5.962 6.663
822100 6.305 7.096 7.648
822200 6.571 7.600 8.207
831000 4.881 6.179 6.804
831100 6.099 7.154 7.788
832000 5.316 6.371 7.022
832100 6.464 7.455 8.114
833000 5.617 6.683 7.350
841000 5.044 6.140 6.973
841100 6.143 7.271 8.029
842000 5.393 6.536 7.350
843000 5.623 6.854 7.585
844000 5.779 7.075 7.853
851000 4.869 6.257 7.110
852000 5.415 6.571 7.440
853000 5.614 6.932 7.706
854000 5.718 7.051 7.992
855000 5.769 7.159 8.116
861000 5.015 6.336 7.256
862000 5.404 6.618 7.522
863000 5.678 6.980 7.796
864000 5.743 7.120 8.045
865000 5.750 7.221 8.226
871000 5.041 6.366 7.308
872000 5.403 6.619 7.571
873000 5.698 7.021 7.827
874000 5.759 7.153 8.118
881000 5.039 6.294 7.314
882000 5.408 6.711 7.654
883000 5.734 7.021 7.889
911000
911100 5.701 6.385
911110 6.755 7.458 8.044
921000 4.842 5.662 6.346
921100 5.919 6.725 7.326
922000 5.260 6.095 6.897
922100 6.292 7.130 7.692
931000 4.952 6.095 6.886
931100 6.105 7.171 7.768
932000 5.340 6.359 7.006
933000 5.589 6.800 7.422
941000 5.071 6.130 7.171
942000 5.400 6.518 7.364
943000 5.652 6.882 7.614
944000 5.704 6.990 7.910
951000 5.040 6.349 7.149
952000 5.396 6.596 7.447
953000 5.670 6.972 7.733
954000 5.713 7.121 8.025
955000 5.770 7.213 8.173
961000 5.049 6.255 7.248
962000 5.392 6.614 7.566
963000 5.671 6.975 7.823
964000 5.745 7.130 8.109
971000 5.042 6.397 7.282
972000 5.429 6.679 7.637
973000 5.656 6.998 7.861
981000 4.985 6.351 7.394
982000 5.420 6.679 7.642
991000 4.961 6.407 7.333

Max group size 10 to 17

The sample sizes are given in the first column (the first two digits are values 10-17), so 1011000 is equivalent to 3 groups with replicates of 10, 1 and 1. Reject the null hypothesis if your calculated value of H is equal to or greater than the tabulated value.

Critical values of H for unequal group sizes (largest group size 10 to 17).

Sample Significance level
sizes 5% 2% 1%
1011000 4.654
1011100 5.908 6.560
1021000 4.840 5.776 6.429
1021100 5.937 6.794 7.251
1022000 5.120 6.034 6.537
1031000 5.076 6.053 6.851
1032000 5.362 6.375 7.042
1033000 5.588 6.784 7.372
1041000 5.018 6.158 7.105
1042000 5.345 6.492 7.357
1043000 5.661 6.905 7.617
1044000 5.716 7.065 7.907
1051000 4.954 6.318 7.178
1052000 5.420 6.612 7.514
1053000 5.636 6.938 7.752
1054000 5.744 7.135 8.048
1061000 5.042 6.383 7.316
1062000 5.406 6.669 7.588
1063000 5.656 7.002 7.882
1071000 4.986 6.370 7.252
1072000 5.377 6.652 7.641
1081000 5.038 6.414 7.359
1111000 4.747
1111100 5.457 6.714
1121000 4.816 5.834 6.600
1122000 5.164 6.050 6.766
1131000 5.030 6.030 6.818
1132000 5.374 6.379 7.094
1133000 5.583 6.776 7.418
1141000 4.988 6.111 7.090
1142000 5.365 6.553 7.396
1143000 5.660 6.881 7.679
1144000 5.740 7.036 7.945
1151000 5.020 6.284 7.130
1152000 5.374 6.648 7.507
1153000 5.646 6.962 7.807
1161000 5.062 6.304 7.261
1162000 5.408 6.693 7.564
1171000 4.985 6.409 7.330
1211000 4.829
1221000 4.875 5.550 6.229
1222000 5.173 5.967 6.761
1231000 4.930 6.018 6.812
1232000 5.350 6.412 7.134
1233000 5.576 6.746 7.471
1241000 4.931 6.225 7.108
1242000 5.442 6.547 7.389
1243000 5.661 6.903 7.703
1251000 4.977 6.326 7.215
1252000 5.395 6.649 7.512
1261000 5.005 6.371 7.297
1311000 4.900
1321000 4.819 5.727 6.312
1322000 5.199 6.134 6.792
1331000 5.024 6.081 6.846
1332000 5.371 6.407 7.138
1333000 5.613 6.755 7.449
1341000 4.963 6.325 7.052
1342000 5.368 6.587 7.434
1351000 4.993 6.288 7.238
1411000 4.963
1421000 4.863 5.737 6.356
1422000 5.193 6.045 6.812
1431000 4.977 6.029 6.811
1432000 5.383 6.413 7.218
1441000 4.991 6.265 7.176
1511000 5.020
1521000 4.827 5.599 6.053
1522000 5.184 6.044 6.760
1531000 5.019 6.139 6.813
1611000 4.511 5.070
1621000 4.849 5.600 6.189
1711000 4.581 5.116

Use Excel for Two-way ANOVA

Exercise 10.1.5.

Statistics for Ecologists (Edition 2) Exercise 10.1.5

This exercise is concerned with analysis of variance (ANOVA) in Chapter 10, in particular with the situation where you have two predictor variables, that is, two-way ANOVA.

Using Excel for two-way ANOVA (analysis of variance)

Introduction

Excel can carry out the necessary calculations to conduct ANOVA and has several useful functions that can help you:

  • FDIST – calculates an exact p-value for a given F-value (and degrees of freedom).
  • VAR – computes the variance of a series of values.
  • COUNT – counts the number of items, useful for degrees of freedom.
  • AVERAGE – calculates the mean.

However, Excel is most suitable for one-way ANOVA, where you have a single predictor variable. When you have two predictor variables two-way ANOVA is possible, but can be tricky to arrange.

In order to carry out the calculations you need to have your data arranged in a particular layout, let’s call it sample layout or “on the ground” layout. This is not generally a good layout to record your results but it is the only way you can proceed sensibly using Excel. In this exercise you’ll see how to set out your data and have a go at the necessary calculations to perform a two-way ANOVA.

The data for this exercise are available as a file: Two Way online.xlsx. There are two worksheets, one with the bare data and one completed version so you can check your work.

You can use the Analysis ToolPak to carry out the computations for you but you’ll still need to arrange the data in a particular manner. The Analysis ToolPak is not “active” by default so you may need to go to the options/settings and look for Add-Ins.

The exercise data

The data you’ll use for this exercise are in the file Two Way online.xlsx and are presented in the following table:

Exercise data for two-way ANOVA.

Water vulgaris sativa
Lo 9 7
Lo 11 6
Lo 6 5
Mid 14 14
Mid 17 17
Mid 19 15
Hi 28 44
Hi 31 38
Hi 32 37


These data represent the growth of two different plant species under three different watering regimes. The first column shows the levels of the Water variable, this is one of the predictor variables and you can see that there are three levels: Lo, Mid and Hi. The next two columns show the growth results for two plant species, labelled vulgaris and sativa. These two columns form the second predictor variable (we’ll call that Plant, which seems suitable).

This layout is the only way that Excel can deal with the values but it is not necessarily the most useful general layout for your data. The scientific recording layout is more powerful and flexible.

The other thing to note is that there are an equal number of items (replicates) in each “block”. Here there are only 3 observations per block. This balance is important; the more unbalanced your design, the less reliable the result. In fact, if you use the Analysis ToolPak for two-way ANOVA you must have a completely balanced dataset (otherwise the routine refuses to run).

Calculate Column Sums of Squares

Start by opening the example datafile: Two Way online.xlsx. Make sure you go to the Data worksheet (the worksheet Completed is there for you to check your results). The data are set out like the previous table, in sample layout. You will need to calculate the various sums of squares and to help you the worksheet has some extra areas highlighted for you.

Start by calculating the column SS, that is the sums of squares for the Plant predictor.

For each column the sums of squares are SS = (mean of the column − overall mean)² × n, where n is the number of values in the column.

  1. Click in cell A12 and type a label, Col SS, for the sums of squares of the columns (the Plant predictor variable).
  2. In Cell B12 type a formula to calculate the SS for the vulgaris column: =(AVERAGE(B2:B10)-AVERAGE(B2:C10))^2*COUNT(B2:B10). You should get 7.11.
  3. In C12 type a similar formula to get the SS for the sativa column: =(AVERAGE(C2:C10)-AVERAGE(B2:C10))^2*COUNT(C2:C10). You should get 7.11.

You should now have the two sums of squares for the columns (the Plant predictor).

Calculate Row Sums of Squares

The row SS are calculated in a similar manner to the col SS.

  1. In cell E3 type a formula to compute the row SS for the Water block Lo: =(AVERAGE(B2:C4)-AVERAGE(B2:C10))^2*COUNT(B2:C4). You should get 880.07.
  2. In E6 compute row SS for the Mid block: =(AVERAGE(B5:C7)-AVERAGE(B2:C10))^2*COUNT(B5:C7). You should get 71.19.
  3. In E9 compute row SS for the Hi block: =(AVERAGE(B8:C10)-AVERAGE(B2:C10))^2*COUNT(B8:C10). You should get 1451.85.
  4. In E12 calculate the overall row SS: =SUM(E2:E10). You should get 2403.11.
  5. In F12 type a label, Row SS, to remind you what this value is.

You’ve now got the row and column SS.

Calculate Error Sums of Squares (within groups SS)

The next step is to determine the error sums of squares. You can get this by multiplying each block’s variance by the number of replicates in the block minus 1; this is simply the formula for variance rearranged to give the sum of squares:

  1. In H3 type a formula to calculate the SS for the Lo Water and vulgaris Plant block: =VAR(B2:B4)*2. You should get 12.67.
  2. Repeat the previous step for the rest of the blocks. You’ll need to highlight the appropriate values for each block. Note that the *2 part is the same for all blocks, as there are three replicates for each block.
  3. In H12 type a formula to calculate the overall error SS: =SUM(H2:I10). You should get 69.33.
  4. In I12 type a label, Error SS, to remind you what the value represents.

You now have the error term for the ANOVA. This is an important value, as you’ll need it to calculate the final F-values.

Calculate Interaction Sums of Squares

The final SS component is that for the interaction between the two variables (Water and Plant). For each block you use the block mean, the column and row means, the overall mean and the replication:

SS = (block mean − column mean − row mean + overall mean)² × n

The formula looks horrendous but in reality it is more tedious than hard. The first mean is the mean of a single block. The next is the column mean, and then the row mean. The final mean (the double overbar in the book) is the overall mean. The n is the number of replicates in each block.

  1. In K3 type a formula for the interaction SS for the first block: =(AVERAGE(B2:B4)-AVERAGE(B2:B10)-AVERAGE(B2:C4)+AVERAGE(B2:C10))^2*COUNT(B2:B4). You should get 14.81.
  2. In K6: =(AVERAGE(B5:B7)-AVERAGE(B2:B10)-AVERAGE(B5:C7)+AVERAGE(B2:C10))^2*COUNT(B5:B7). Gives 7.26.
  3. In K9: =(AVERAGE(B8:B10)-AVERAGE(B2:B10)-AVERAGE(B8:C10)+AVERAGE(B2:C10))^2*COUNT(B8:B10). Gives 42.81.
  4. In L3: =(AVERAGE(C2:C4)-AVERAGE(C2:C10)-AVERAGE(B2:C4)+AVERAGE(B2:C10))^2*COUNT(C2:C4). Gives 14.81.
  5. In L6: =(AVERAGE(C5:C7)-AVERAGE(C2:C10)-AVERAGE(B5:C7)+AVERAGE(B2:C10))^2*COUNT(C5:C7). Gives 7.26.
  6. In L9: =(AVERAGE(C8:C10)-AVERAGE(C2:C10)-AVERAGE(B8:C10)+AVERAGE(B2:C10))^2*COUNT(C8:C10). Gives 42.81.
  7. In K12 type a formula to get the total interaction SS: =SUM(K2:L10). You should get 129.78.
  8. In L12 type a label, Interact SS, to remind you what the value represents.

Now you have all the sums of squares values you need to complete the ANOVA.
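
As an independent check on the spreadsheet arithmetic, the same components can be recomputed from the raw data. This is a sketch in Python (not part of the exercise), using the growth data from the table above:

```python
from statistics import mean

# Growth data from the exercise: three watering levels x two species,
# three replicates per block.
data = {
    ("Lo", "vulgaris"):  [9, 11, 6],   ("Lo", "sativa"):  [7, 6, 5],
    ("Mid", "vulgaris"): [14, 17, 19], ("Mid", "sativa"): [14, 17, 15],
    ("Hi", "vulgaris"):  [28, 31, 32], ("Hi", "sativa"):  [44, 38, 37],
}
waters = ["Lo", "Mid", "Hi"]
plants = ["vulgaris", "sativa"]

pooled = [v for block in data.values() for v in block]
grand = mean(pooled)                       # overall (grand) mean
reps = 3                                   # replicates per block

col_mean = {p: mean([v for w in waters for v in data[(w, p)]]) for p in plants}
row_mean = {w: mean([v for p in plants for v in data[(w, p)]]) for w in waters}

# Each SS component follows the formulas used in the worksheet.
ss_plant = sum(len(waters) * reps * (col_mean[p] - grand) ** 2 for p in plants)
ss_water = sum(len(plants) * reps * (row_mean[w] - grand) ** 2 for w in waters)
ss_error = sum(sum((v - mean(b)) ** 2 for v in b) for b in data.values())
ss_inter = sum(reps * (mean(data[(w, p)]) - col_mean[p] - row_mean[w] + grand) ** 2
               for w in waters for p in plants)
ss_total = sum((v - grand) ** 2 for v in pooled)

# Degrees of freedom and the F-value for the Water treatment.
df_water, df_plant = len(waters) - 1, len(plants) - 1     # 2 and 1
df_inter = df_water * df_plant                            # 2
df_error = len(pooled) - len(data)                        # 18 - 6 = 12
ms_error = ss_error / df_error
f_water = (ss_water / df_water) / ms_error
```

Rounded to 2 decimal places these give 2403.11 (Water), 14.22 (Plant), 129.78 (interaction), 69.33 (error) and 2616.44 (total), matching the worksheet values.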

Compute total SS and degrees of freedom

The total sums of squares can be calculated by adding the component SS together, but it is good practice to check your maths by calculating it directly from the total variance and df.

You’ll also need the degrees of freedom for the various components before you can construct the final ANOVA table.

  1. In A13 type a label, Total SS, for the overall sums of squares.
  2. In B13 type a formula to calculate the total SS: =VAR(B2:C10)*(COUNT(B2:C10)-1). You should get 2616.44.
  3. In A14 type a label, df, for the degrees of freedom.
  4. In B14 type a formula for the column df: =COUNT(B12:C12)-1. The result should be 1.
  5. In E14 type a formula for row df: =COUNT(E2:E10)-1. You should get 2.
  6. In H14 work out the error df: =COUNT(B2:C10)-COUNT(K2:L10). You should get 12.
  7. In K14 work out the interaction df, which is the row df multiplied by the column df: =B14*E14. You should get 2.

Now you have everything except the total df, which you can place in the final ANOVA table shortly.

Make the final ANOVA table

You can now construct the final ANOVA table and compute the F-values and significance of the various components. You want your table to end up looking like the following:

Completed ANOVA table for two-way analysis of variance.

ANOVA
Source of Variation SS df MS F P-value F crit
Water 2403.1 2 1201.56 207.96 4.9E-10 3.89
Plant 14.22 1 14.22 2.46 0.1426 4.75
Water*Plant 129.78 2 64.89 11.23 0.0018 3.89
Error 69.33 12 5.78
Total 2616.44 17


  1. Start by typing a label, ANOVA, into cell A16.
  2. Type the rest of the labels for the ANOVA table as shown above.
  3. In the SS column you can place the sums of squares results, which you’ve already computed. So in B18: =E12. In B19: =SUM(B12:C12), and so on.
  4. The total SS can be =B13 or a sum of the individual SS components.
  5. In the df column you can place the values, which are already computed. The total df in C22 needs to be determined: =COUNT(B2:C10)-1. You should get 17.
  6. The MS are worked out by dividing the SS by the df. For example, in D18: =B18/C18.
  7. Determine an F-value by dividing the MS for a row by the Error MS. So in E18: =D18/D21. Gives 207.96.
  8. Work out an exact p-value for each of your F-values using the FDIST function. You need your F-value, the df for that row and the error df. In F18 type: =FDIST(E18,C18,C21). The result should be 4.9E-10, which is highly significant.
  9. In F19: =FDIST(E19,C19,C21).
  10. In F20: =FDIST(E20,C20,C21).
  11. You can use the FINV function to get a critical value for F. You’ll need a level of significance (0.05), the df for that row and the error df. In G18 type: =FINV(0.05,C18,C21). The result is 3.89.
  12. In G19: =FINV(0.05,C19,C21).
  13. In G20: =FINV(0.05,C20,C21).

Now you have the final ANOVA table completed. You can see that the interaction term is highly significant (p = 0.0018). The Water treatment is also significant but the Plant variable is not (p = 0.14).

Graphing the result

You should plot your result as a chart of some kind. There are several ways you might proceed. Chapter 6 gives details about constructing charts in Excel and in R for various scenarios.

One option would be to make a bar chart, showing the mean for each block, and with the blocks grouped by watering treatment or by plant species. You should give an impression of the variability using error bars. The following column chart was produced using 95% confidence intervals:

Visualisation of the two-way ANOVA result. Bars show sample means and error bars are 95% CI.

The critical value for t can be determined using the TINV function and the degrees of freedom: =TINV(0.05,16). Note that in this case df = 16 because we are splitting the blocks by plant (there are (9-1) + (9-1) = 16 degrees of freedom). The size of the error bars is then:

Standard Error * t-crit.
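
That arithmetic can be sketched for a single block (in Python rather than Excel; the value 2.120 is the two-tailed 5% critical t for 16 df, i.e. what TINV(0.05,16) returns):

```python
from math import sqrt
from statistics import stdev

block = [9, 11, 6]                    # e.g. vulgaris under Lo watering
se = stdev(block) / sqrt(len(block))  # standard error of the block mean
t_crit = 2.120                        # two-tailed 5% t with 16 df (TINV(0.05,16))
half_width = se * t_crit              # length of each 95% CI error bar
```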

Alter sample format to recording format

Exercise 10.1.4.

Statistics for Ecologists (Edition 2) Exercise 10.1.4

When you have data in the “wrong” layout you need to be able to rearrange them into a more “sensible” layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.

Altering data layout – switch from sample format to recording format

Data layout – stack()ing your data

There are two main ways you can layout your data. In sample-format each column is a separate sample, which forms some kind of logical sampling unit. This is a typical way that you layout data if you are using Excel because that’s how you have to have your data to be able to make charts and carry out most forms of analysis.

In scientific recording format each column is a variable; you have response variables and predictor variables. This recording-layout is a more powerful and ultimately flexible layout because you can add new variables or observations easily. In R this layout is also essential for any kind of complicated analysis, such as regression or analysis of variance.

Convert a single predictor

The simplest situation is where you have a single response variable and a single predictor, but the values are laid out in sample columns. Here is an example, the data are counts of a freshwater invertebrate at three different sampling locations:

hog3
  Upper Mid Lower
1     3   4    11
2     4   3    12
3     5   7     9
4     9   9    10
5     8  11    11
6    10  NA    NA
7     9  NA    NA

Note that the samples contain different numbers of observations. The “short” columns are padded with NA items when the data are read into R (usually via the read.csv() command).

The stack() command will take the columns and combine them into a single response variable. The names of the columns will be used to form the levels of a predictor variable.

stack(hog3)
   values   ind
1       3 Upper
2       4 Upper
3       5 Upper
4       9 Upper
5       8 Upper
6      10 Upper
7       9 Upper
8       4   Mid
9       3   Mid
10      7   Mid
11      9   Mid
12     11   Mid
13     NA   Mid
14     NA   Mid
15     11 Lower
16     12 Lower
17      9 Lower
18     10 Lower
19     11 Lower
20     NA Lower
21     NA Lower

Note that the columns are labelled values and ind. You can alter these easily enough using the names() command.

hog <- stack(hog3)

names(hog) <- c("count", "site")

head(hog)
  count  site
1     3 Upper
2     4 Upper
3     5 Upper
4     9 Upper
5     8 Upper
6    10 Upper

Now you have a data.frame in recording layout, but the NA items are still in the data.

Remove NA items

The na.omit() command will remove (“knock out”) any rows containing NA items.

hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    NA   Mid
14    NA   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower
20    NA Lower
21    NA Lower
na.omit(hog)
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower

Note that the row names keep their original values.

Re-number row names

If you want to re-number the rows just use the row.names() command.

hog <- na.omit(hog)

row.names(hog) <- 1:length(hog$count)
hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    11 Lower
14    12 Lower
15     9 Lower
16    10 Lower
17    11 Lower

It’s not really essential but it makes it easier to keep track of items if the numbers are sequential.
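
For comparison, the whole stack-then-clean pipeline can be mirrored in plain Python. This is only a sketch (a dict of lists standing in for the data frame), not something the book uses:

```python
# Sample layout: one list per site, short columns padded with None (R's NA).
hog3 = {
    "Upper": [3, 4, 5, 9, 8, 10, 9],
    "Mid":   [4, 3, 7, 9, 11, None, None],
    "Lower": [11, 12, 9, 10, 11, None, None],
}

# stack(): one (count, site) record per observation.
stacked = [(v, site) for site, col in hog3.items() for v in col]

# na.omit(): drop the padding records.
hog = [(count, site) for count, site in stacked if count is not None]

# Sequential row numbering falls out of a simple enumerate.
numbered = list(enumerate(hog, start=1))
```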

Convert multiple predictors

You may have data in a more complicated arrangement, perhaps mirroring the “on the ground” layout. This can happen, for example, in a two-way ANOVA design. In Excel the data have to be set out in a particular way for the Analysis ToolPak to carry out the ANOVA. However, in R this layout is not helpful.

wp
  Water vulgaris sativa
1    Lo        9      7
2    Lo       11      6
3    Lo        6      5
4   Mid       14     14
5   Mid       17     17
6   Mid       19     15
7    Hi       28     44
8    Hi       31     38
9    Hi       32     37

This time you have a column representing one of the predictor variables and the other columns in sample layout. If you try a stack() command you’ll find that only the numeric columns are stacked.

wps <- stack(wp)
Warning message:
In stack.data.frame(wp) : non-vector columns will be ignored

wps
   values      ind
1       9 vulgaris
2      11 vulgaris
3       6 vulgaris
4      14 vulgaris
5      17 vulgaris
6      19 vulgaris
7      28 vulgaris
8      31 vulgaris
9      32 vulgaris
10      7   sativa
11      6   sativa
12      5   sativa
13     14   sativa
14     17   sativa
15     15   sativa
16     44   sativa
17     38   sativa
18     37   sativa

This is a minor problem because you can re-build the first predictor variable by duplicating the original and adding it to the new data.frame.

Replace a factor predictor variable

What you must do is take the original predictor variable and repeat it a number of times (equal to the number of sample columns).

wps <- cbind(wps, water = rep(wp$Water, 2))
wps
   values      ind water
1       9 vulgaris    Lo
2      11 vulgaris    Lo
3       6 vulgaris    Lo
4      14 vulgaris   Mid
5      17 vulgaris   Mid
6      19 vulgaris   Mid
7      28 vulgaris    Hi
8      31 vulgaris    Hi
9      32 vulgaris    Hi
10      7   sativa    Lo
11      6   sativa    Lo
12      5   sativa    Lo
13     14   sativa   Mid
14     17   sativa   Mid
15     15   sativa   Mid
16     44   sativa    Hi
17     38   sativa    Hi
18     37   sativa    Hi

Note that you gave the new variable an explicit name. The other column names need replacing using the names() command:

names(wps)[1:2] = c("height", "species")

names(wps)
[1] "height"  "species" "water"

head(wps)
  height  species water
1      9 vulgaris    Lo
2     11 vulgaris    Lo
3      6 vulgaris    Lo
4     14 vulgaris   Mid
5     17 vulgaris   Mid
6     19 vulgaris   Mid

Now you have a more sensible layout that can be used for aov() and graphical commands.

This approach will not work on every dataset that you get but it will take you a long way forward.
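The stack-and-rebuild logic is not specific to R. As a language-neutral illustration (plain Python lists only; the names simply echo the R example above), the same conversion can be sketched like this:

```python
# Illustrative sketch of the stack-and-rebuild conversion.
# The document itself uses R's stack() and cbind(); this just mirrors the logic.
water = ["Lo", "Lo", "Lo", "Mid", "Mid", "Mid", "Hi", "Hi", "Hi"]
vulgaris = [9, 11, 6, 14, 17, 19, 28, 31, 32]
sativa = [7, 6, 5, 14, 17, 15, 44, 38, 37]

# "Stack" the numeric columns into one response plus an indicator column.
height = vulgaris + sativa
species = ["vulgaris"] * len(vulgaris) + ["sativa"] * len(sativa)

# Re-build the factor predictor by repeating it once per sample column.
water_long = water * 2

# One row per observation: (height, species, water)
wps = list(zip(height, species, water_long))
print(wps[0])   # (9, 'vulgaris', 'Lo')
print(wps[9])   # (7, 'sativa', 'Lo')
```

The key point is the last repetition step: the factor column must be duplicated once for each sample column that was stacked.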

Using Excel for Chi-squared goodness of fit tests

Exercise 9.4.

Statistics for Ecologists (Edition 2) Exercise 9.4

This exercise is concerned with association (Chapter 9), in particular chi squared goodness of fit testing using Excel (Section 9.4).

Using Excel for Chi squared goodness of fit tests

Introduction

Excel has several functions related to the Chi-squared statistic. This allows you to undertake chi-squared goodness of fit testing for example.

In goodness of fit tests, you have one set of frequency data in various categories. You also have a matching set of frequencies that you want to “compare”. The comparison set may be a theoretical set of values or perhaps a previous set of observations. The goodness of fit test is often used in genetic studies where you match up observed phenotypes against a theoretical ratio of expected phenotypes.

The following data will be used:

Data for use in goodness of fit test.

Colour Coat Obs
Green Smooth 116
Green Wrinkled 40
Yellow Smooth 31
Yellow Wrinkled 13

 

The data show the results of crossing two varieties of pea plants. The attributes of colour and coat type are of interest. There are green and yellow coloured varieties and two coat types: wrinkled and smooth. After crossing, the offspring were sorted into categories according to their colour and coat type. The Obs column shows how many offspring there were in each category (a frequency or count).

The theoretical ratio of phenotypes is 9:3:3:1.

The goodness of fit test can compare the observed frequencies to the theoretical ratios. Does the data match up with the genetic theory?

Get the example data: pea genetics.xlsx.

Calculate expected values

Open the spreadsheet pea genetics.xlsx. There are two worksheets, Data contains the basic data and Completed version has all the calculations (and a graph) completed for you.

First of all, you need to fill in the theoretical ratios and then calculate the expected frequencies based on those:

  1. In cell D1 type a label, Ratio, for the ratios. Now in the D column type the values 9, 3, 3, 1.
  2. In B6 type a label, Totals:, for some sum values.
  3. In C6 type a formula to sum the frequencies (the total number of observations): =SUM(C2:C5). You should get 200.
  4. Copy and paste the formula in C6 to D6 to give the sum of the ratios. Your result should be 16.
  5. In cell E1 type a label, Exp, for the expected values.
  6. In E2 type a formula to work out the expected value: =D2/D$6*C$6. You should get 112.5. Note the use of $ to “fix” some of the cell references.
  7. Copy and paste the cell E2 down the rest of the column to fill cells E2:E5 and so calculate all the expected frequencies. You can work out the sum if you like, which should be the same as the original, 200.
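The spreadsheet arithmetic for the expected frequencies can be cross-checked outside Excel; this short Python sketch (illustrative only) mirrors the formula =D2/D$6*C$6:

```python
# Cross-check of the expected frequencies from the 9:3:3:1 ratio.
obs = [116, 40, 31, 13]    # observed counts (the Obs column)
ratio = [9, 3, 3, 1]       # theoretical phenotype ratio

total = sum(obs)           # 200 observations in all
# Each expected value is (ratio share) x (total observations),
# i.e. the same calculation as =D2/D$6*C$6 in the spreadsheet.
exp = [r / sum(ratio) * total for r in ratio]
print(exp)                 # [112.5, 37.5, 37.5, 12.5]
```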

Your spreadsheet should now appear more or less like the following:

Expected values calculated as part of the goodness of fit test.

Colour Coat Obs Ratio Exp
Green Smooth 116 9 112.5
Green Wrinkled 40 3 37.5
Yellow Smooth 31 3 37.5
Yellow Wrinkled 13 1 12.5
200 16 200

 

You are now ready to move on to calculating the chi-squared values and overall significance.

Calculate Chi-squared values

It is useful to see the individual chi-squared values:

  1. In cell F1 type a label, Chi-Sq, for the chi-squared values.
  2. In F2 type a formula to calculate the chi-squared value: =(C2-E2)^2/E2. You should get 0.11.
  3. Now copy and paste the cell F2 to the rest of the column to place the chi-squared values in cells F2:F5.
  4. Sum the individual chi-squared values to give a total in cell F6: =SUM(F2:F5). You should get 1.42.
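The chi-squared contributions from the steps above can be reproduced in a few lines (a Python cross-check, not part of the spreadsheet):

```python
# Per-category chi-squared contributions: (O - E)^2 / E
obs = [116, 40, 31, 13]
exp = [112.5, 37.5, 37.5, 12.5]

chi = [(o - e) ** 2 / e for o, e in zip(obs, exp)]
total = sum(chi)
print(round(total, 2))   # 1.42
```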

You’ll need to compare your total chi-squared to a critical value but you may as well compute the Pearson residuals first.

Pearson residuals

The Pearson residuals are more or less the square root of the chi-squared values.

The advantage is that the sign of the residual tells you about the direction of the difference between the observed and expected values. Pearson residuals are approximately normally distributed (z), so values of about 2 (1.96) or more are significant at p < 0.05. In a goodness of fit test small residuals indicate that there is little difference between the observed and expected values.

  1. In cell G1 type a label, Resid, for the residual values.
  2. In G2 type a formula to calculate the Pearson residual for the first category: =(C2-E2)/SQRT(E2). You should get 0.33.
  3. Copy and paste the cell G2 into the rest of the column to calculate residuals for all the categories, cells G2:G5.
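The Pearson residuals can be checked the same way (again a Python sketch mirroring the spreadsheet formula =(C2-E2)/SQRT(E2)):

```python
import math

obs = [116, 40, 31, 13]
exp = [112.5, 37.5, 37.5, 12.5]

# Pearson residuals: (O - E) / sqrt(E); the sign shows the direction
# of the departure from the expected value.
resid = [(o - e) / math.sqrt(e) for o, e in zip(obs, exp)]
print([round(r, 2) for r in resid])   # [0.33, 0.41, -1.06, 0.14]
```

None of the residuals approaches 1.96, which already hints that the overall result will not be significant.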

You don’t really need a sum for the residuals but you could easily copy the adjacent sum of chi-squared values across.

You can use the residuals in a column chart as a visualization of the results later.

Your spreadsheet should now look more or less like the following:

Chi squared and Pearson residuals computed as part of a goodness of fit test.

Colour Coat Obs Ratio Exp Chi-Sq Resid
Green Smooth 116 9 112.5 0.11 0.33
Green Wrinkled 40 3 37.5 0.17 0.41
Yellow Smooth 31 3 37.5 1.13 -1.06
Yellow Wrinkled 13 1 12.5 0.02 0.14
200 16 200 1.42 -0.18

 

You can move on to the statistical significance of the results shortly, but you can already see from the Pearson residuals that none are > 1.96, so the result is likely to be not significant.

Critical values and significance

You can check to see if your result is a statistically significant one in several ways. Since you have a total chi-squared value you can compare that to a critical value. You can look up a critical value or use the CHIINV function. You can determine the exact p-value for your total chi-squared value using the CHIDIST function. In either case you’ll need the degrees of freedom (df), which is the number of categories (rows) minus 1. In this case 4-1 = 3.

You can also use the CHITEST function to get an exact p-value from the observed and expected values.

  1. In cell E7 type a label, df, for the degrees of freedom.
  2. In F7 type a function to calculate degrees of freedom: =COUNT(C2:C5)-1. The result is 3.
  3. In E8 type a label, p-val, for the exact p-value.
  4. In F8 type a function to compute the exact p-value using observed and expected values: =CHITEST(C2:C5,E2:E5). You should get 0.7.
  5. In E9 type a label, X-sq, for the chi-squared value based on the probability in step 4.
  6. In F9 type a formula to calculate the chi-squared value based on the p-value from step 4: =CHIINV(F8,F7). You should get the same result as the “long” calculation, 1.422.
  7. In E10 type a label, X-crit, for the critical chi-squared value at p = 0.05.
  8. In F10 type a formula to work out the critical value at 5%: =CHIINV(0.05,F7). You should get 7.815 (for df = 3).
  9. In E11 type another label, p-val, for the exact p-value based on your chi-squared result computed the “long” way.
  10. In F11 type a formula to calculate the exact p-value for your calculated chi-squared result: =CHIDIST(F6,F7). You should get 0.7 as before.

Now you have all the appropriate statistics, some of them computed in alternative ways. In this case the result is not significant. The calculated chi-squared result is smaller than the critical value. Also, the p-value is > 0.05.

You have determined that there is no significant departure between the observed and expected frequencies. In other words, the ratio of pea phenotypes you observed does not depart significantly from the theoretical ratio (9:3:3:1).

Visualize the results

There are many ways to visualize chi-squared results. One option is to plot the Pearson residuals, as these show the direction and size of the departures between the Obs and Exp values (the “goodness” of the fit) and also indicate significance.

  1. Use the mouse to highlight the cells containing the category names (and the headers), that is cells A1:B5.
  2. Now hold down the Control key and use the mouse to select the Pearson residual values and the header, cells G1:G5.
  3. You should now have two blocks of cells selected. Click the Insert > Column Chart button and choose the 2D Clustered Column chart option. The chart is created immediately.
  4. Click once on the chart to ensure the Chart Tools menus are active.
  5. Click the Format menu and in the Current Selection section use the drop-down to select Horizontal (Category) Axis. Then click the Format Selection button. You can also double-click (or right-click) the axis on the chart itself. This opens the Format Axis dialogue.
  6. Find the Labels section, then choose Low from the Label Position drop-down. This puts the category labels away from the axis (useful when you have bars hanging downwards).

With a bit of formatting and general juggling you can get a plot that resembles the one below.

You could also plot the Obs and Exp values as a column chart, try it by highlighting the cells A1:C5 then with the control key E1:E5.

Spearman Rank correlation in Excel

Exercise 8.3.2.

Statistics for Ecologists (Edition 2) Exercise 8.3.2

This exercise is concerned with correlation (Chapter 8) and in particular how you can use Excel to calculate Spearman’s Rank correlation coefficient.

Using Excel for Spearman’s Rank correlation

Introduction

Excel has built-in functions that can calculate correlation, but these are only appropriate when data are normally distributed. The CORREL and PEARSON functions both calculate Pearson’s Product Moment correlation coefficient. Once you have the correlation coefficient it is fairly easy to calculate the statistical significance: you compute a t-value, then use TINV to compute a critical value or TDIST to get an exact p-value (see Section 8.3.1 in the book).

This exercise shows how you can use the RANK.AVG function to rank the data, then use CORREL on the ranks to obtain a Spearman Rank coefficient.

The data for this exercise are freshwater correlation.xlsx, which look like this:

Example data for practice with correlation. Abundance of invertebrates and water speed.

Abund Speed
12 18
18 12
17 12
14 7
9 8
7 9
6 6
7 11
5 6
3 2

 

These data represent the abundance of a freshwater invertebrate and water speed.

Pearson correlation

To get the regular Pearson correlation you can use the CORREL (or PEARSON) function. In the spreadsheet the data are in columns B and C. To get the correlation coefficient the function required is:

=CORREL(B2:B11, C2:C11)

The result (0.614) is the “regular” parametric correlation coefficient, which assumes normally distributed data. The PEARSON function gives an identical result (Pearson’s Product Moment). Note that correlation is related to, but not the same as, regression: correlation measures the strength of association between two variables, whereas regression fits a line to predict one variable from the other. The correlation coefficient, r, between two normally distributed variables is calculated like so:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]

Once you have r you can determine a critical value or exact p-value.

Use Pearson’s correlation when you have normally distributed data and when you expect there to be a linear relationship between the two variables.

Spearman’s Rank correlation

Spearman’s Rank correlation is used to determine the strength of the relationship between two numerical variables. The data do not have to be normally distributed and the relationship does not have to be strictly linear either. The relationship should be broadly monotonic though, with the variables generally headed in one direction (so not U-shaped or the inverse).

As the name suggests, the correlation is based on the ranks of the data. The x and y variables are ranked and the ranks of x are compared to the ranks of y like so:

rs = 1 − (6ΣD²) / (n(n² − 1))

In the formula D is the difference between each pair of ranks, and n is the number of pairs of data.

There is no built-in function to work out Spearman Rank correlation but you can work out the ranks using RANK.AVG and calculate the terms in the formula easily enough.

Rank the data

The RANK.AVG function allows you to rank data (note the older RANK function is not the same). In statistical analysis you want the lowest value to have the smallest rank. Tied values should get a tied rank.

To try this out using the example data do the following:

  1. Open the freshwater correlation.xlsx The variables are in columns B & C.
  2. In cell D1 type a label, RA, for the heading of the Ranks of the Abund variable.
  3. In E1 type a label, RS, for the ranks of the Speed variable.
  4. Now in D2 type a formula to get the rank of the first datum (in cell B2): =RANK.AVG(B2,B$2:B$11,1). Note the $ used to “fix” the row references. The final 1 ensures the ranks are in ascending order.
  5. Copy the formula down the rest of the column to fill cells D2:D11.
  6. Copy the cells in D2:D11 across to column E to get ranks in cells E2:E11.

If you wanted to fill out the Spearman Rank formula “the long way” you would now have to subtract each rank in column E from its counterpart in column D. Then square the differences.

However, there is another (easier) way.

Compute Rs

Now you have ranks you can use the CORREL (or PEARSON) function on the ranks to get a close approximation of the Spearman Rank correlation coefficient. It is usually the same to at least 2 decimal places.

  1. In cell A13 type a label, Correl, for the correlation coefficient(s).
  2. In B13 type a formula to work out Pearson’s correlation coefficient for the original data: =CORREL(B2:B11,C2:C11). The result is 0.614 for the example data.
  3. In D13 type a formula to work out the correlation between the ranks (i.e. columns D & E): =CORREL(D2:D11,E2:E11). The result is 0.771.

The correlation between the ranks is a close approximation to the Spearman Rank coefficient (0.773) computed the “long way”.

Get a t-value

You can compare your calculated Spearman Rank coefficient to a table of critical values (e.g. Table 8.5 in the book) or compute a t-value (another approximation).

To get a value for t you need the correlation coefficient and the number of pairs of values. These then go into the following formula:

t = √( rs² × df / (1 − rs²) )

The df in the formula is the degrees of freedom, which is the number of pairs of data minus 2. In the example datafile:

  1. In cell A14 type a label, n, for the number of pairs of data.
  2. In B14 type a formula to work out the number of items you have: =COUNT(B2:B11). You get 10.
  3. In A15 type a label, df, for the degrees of freedom.
  4. In B15 type a formula to work out degrees of freedom: =B14-2. You get 8.
  5. In A16 type a label, t, for the t-value.
  6. In B16 type a formula to work out a t-value: =SQRT(D13^2*B15/(1-D13^2)). The result is 3.420.

Once you have a value for t you are on your way to determining the statistical significance.
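The whole chain of steps above (rank the data, get the coefficient, compute t) can be verified outside Excel. This Python sketch is illustrative only: the rank_avg helper is a hand-rolled stand-in for RANK.AVG, and the variable names are invented for the example:

```python
import math

abund = [12, 18, 17, 14, 9, 7, 6, 7, 5, 3]
speed = [18, 12, 12, 7, 8, 9, 6, 11, 6, 2]

def rank_avg(data):
    # Ascending ranks, with ties given the average rank (like RANK.AVG)
    out = []
    for x in data:
        below = sum(1 for y in data if y < x)
        ties = sum(1 for y in data if y == x)
        out.append(below + (ties + 1) / 2)
    return out

ra, rs_ranks = rank_avg(abund), rank_avg(speed)
n = len(abund)

# Spearman the "long way": rs = 1 - 6*sum(D^2) / (n*(n^2 - 1))
d2 = sum((a - b) ** 2 for a, b in zip(ra, rs_ranks))
rs_long = 1 - 6 * d2 / (n * (n ** 2 - 1))

# Approximation: Pearson correlation of the ranks (CORREL on the ranks)
ma, mb = sum(ra) / n, sum(rs_ranks) / n
cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rs_ranks))
r_ranks = cov / math.sqrt(sum((a - ma) ** 2 for a in ra) *
                          sum((b - mb) ** 2 for b in rs_ranks))

# t approximation: t = sqrt(r^2 * df / (1 - r^2)), df = n - 2
df = n - 2
t = math.sqrt(r_ranks ** 2 * df / (1 - r_ranks ** 2))

print(round(rs_long, 3), round(r_ranks, 3), round(t, 3))  # 0.773 0.771 3.42
```

The two coefficients (0.773 the long way, 0.771 via the ranks) and the t-value (3.420) match the spreadsheet results.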

Determine statistical significance

Now you have a value for t you can:

  • Get a critical value using the TINV function.
  • Get an exact probability (a p-value) using the TDIST function.

The TINV function requires the significance level (usually 0.05) and the degrees of freedom.

The TDIST function requires a t-value, the degrees of freedom and a 2 (for a two-tailed test, which is most often what you need).

  1. In cell A17 type a label, t-crit, for the critical value.
  2. In B17 type a formula to calculate the critical t value: =TINV(0.05,B15). The result is 2.306.
  3. In A18 type a label, p-value, for the exact probability.
  4. In B18 type a formula to calculate the exact p-value: =TDIST(B16,B15,2). The result is 0.009.

You can compare your calculated value of t to the critical value and simply say that your correlation is significant at p < 0.05 or you can use the exact p-value (of 0.009).

This method of calculating Spearman Rank gives a good approximation of the correlation coefficient, but the p-values are a little conservative. This is the method that base R uses in its cor.test() command with method = "spearman" when an exact p-value cannot be computed.

Graphs and matched pairs results

Exercise 7.3.5.

Statistics for Ecologists (Edition 2) Exercise 7.3.5

These notes accompany Chapter 7 and are about using graphs to display matched pairs data (as in the Wilcoxon matched pairs analysis).

Using graphs to visualise matched pairs results

Introduction

Usually you’ll use a bar chart or box-whisker plot to display data when looking at differences between samples. When you have a matched pairs situation however, these sorts of graph may not always be the best way to summarize your data/results.

An alternative is to use a scatter plot, where you plot one sample against the other. If you add an isocline (a straight line with slope 1 and intercept 0) you can see more clearly how the pairs of observations match up with one another.

These notes show you how to prepare such a scatter plot.

Make a scatter plot

When you have matched pair data you obviously have matched pairs. These data could be plotted as a scatter graph with one sample against the other. You can do this easily in Excel or in R.

Scatter plot in Excel

In Excel you can simply select the two samples and then click the Insert > Scatter button. The chart really comes into its own if you add an isocline (a line of parity). To do this you need to add a straight line with a slope of 1 and an intercept of 0. To add an isocline, you need two values:

  • The maximum data value (across the combined dataset).
  • A minimum value – set this to 0 if your axes start from 0 or set the value to whatever the lowest axis value is.

To add the isocline, you:

  1. Click the chart to activate the Chart Tools menus.
  2. Click the Design > Select Data button.
  3. Click the Add button (under Legend Entries (Series)) to add a new data series.
  4. Select the two values in the x-values section.
  5. Select the same two values again in the y-values section.
  6. Click OK.
  7. Now format the data series: turn off the points but add a joining line; this is the isocline.

Once you’ve formatted the chart and tidied up a bit you’ll have a scatter plot and isocline.

Scatter plot in R

In R you can use the plot() command to chart one sample against the other. The isocline can be added using the abline() command, set a = 0, b = 1 to get an intercept of 0 and a slope of 1.

a;b
[1]  8  6 12  2  3  3  1  7
[1] 11  8  4  9  3  5  2  2

plot(a,b)
abline(a = 0, b = 1)

This produces a simple scatter with an isocline.

There are many additional graphical parameters you can add to these basic commands, to alter the plotting character, colours and the style of the isocline for example.

Compare to a boxplot

The scatter plot shows your matched pairs data in a different way to a box-whisker plot or a bar chart. The isocline is the key to the plot: the further the points lie from the isocline, the more “different” the pairs of data are. If all the points lay on the line, there would be no difference between the pairs. If all the points were to one side, then one sample would be consistently different from the other.

Sometimes a basic bar or boxplot is all you need, at other times you may decide that the scatter plot and isocline is the way to go.

Show the IQR

With a little bit of tinkering you can show the averages and variability for your matched pairs data right on the scatter plot.

What you need to do is to plot an additional point with co-ordinates that take the average for one sample vs. the average for the other. Then add error bars to the vertical and horizontal.

In Excel the error bars are calculated as an amount, so you need to determine the “deflection” from the average.

In R you add error bars by drawing on the plot so you need the co-ordinates of the extremes of the variability.

For example, if you wanted to show median and inter-quartile range in Excel you’d need to work out how far “above” the median the upper quartile was, and how far “below” the median the lower quartile was. In R you simply calculate the quartiles and use the appropriate values to draw the error bars.
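The values needed for such a plot can be sketched as follows (Python’s statistics module is used purely for illustration; quartile conventions differ between packages, so the exact quartiles may not match Excel’s, and the sample data here are hypothetical):

```python
import statistics

# A hypothetical sample; its median and quartiles give the co-ordinates
# and "deflections" needed for the error-bar point on the scatter plot.
sample = [8, 6, 12, 2, 3, 3, 1, 7]

median = statistics.median(sample)
q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# Excel wants the deflections FROM the median for its error bars;
# R (lines()/arrows()) wants the raw co-ordinates q1 and q3 directly.
upper_deflection = q3 - median
lower_deflection = median - q1
print(median, q1, q3)
```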

With a bit of tinkering you can get a plot something like this (in Excel):

It is easy to get Excel error bars: click the chart, then the Design menu and the Add Chart Element button. Choose Error Bars > More Error Bars Options and then select the values for the vertical bars.

You should also see horizontal error bars appear in the chart; click them to activate the Horizontal Error Bars options. If you don’t see the horizontal error bars, try the Chart Tools > Format menu and select the error bars from the drop-down menu on the left in the Current Selection section.

In R you can use the lines() or arrows() command to draw the error bars in place, once you have the appropriate values, such as from the quantile() command.

Use Excel for Wilcoxon matched pairs

Exercise 7.3.3.

Statistics for Ecologists (Edition 2) Exercise 7.3.3

This exercise is concerned with matched pairs tests (Section 7.3) and in particular how to carry out the non-parametric Wilcoxon Matched Pairs test using Excel (Section 7.3.3).

Use Excel for Wilcoxon matched pairs test

Introduction

There is no in-built function that will carry out the Wilcoxon Matched Pairs test in Excel. However, you can rank the data and compute the rank sums you require using the RANK.AVG function.

You do need to omit zero differences from the calculations and also to separate ranks of positive differences from ranks of negative differences. In this exercise you can see how to use the IF function to help you do this separation.

The example data

You can get the sample data here: Wilcoxon.xlsx. There are two worksheets, one has the data only and the other has a completed set of calculations so you can check your progress.

Here’s what the data look like:

Matched pairs data for use in Wilcoxon matched pairs analysis

Obs A B
1 8 11
2 6 8
3 12 4
4 2 9
5 3 3
6 3 5
7 1 2
8 7 2

 

You can see that there are 8 pairs of data.

Show the direction of the difference

Start by making a column to show the direction of the difference. Make a heading in cell D1, Dir will do as a heading name.

Now in cell D2 type a formula to show if the difference between B2 and C2 is positive or negative. You’re going to subtract the second column from the first. You also want to take into account possible zero differences:

=IF(B2<C2, "-", IF(B2=C2, "", "+"))

So if cell C2 is larger than cell B2 you’ll get a minus symbol. If the cells are equal, then you get a blank. If B2 is larger than C2 you’ll get a + symbol. You will use these symbols to separate the ranks later.

Copy the formula down the rest of the column to show the direction of each of the differences.

The final column shows the direction of differences between A and B (i.e. A-B).

Obs A B Dir
1 8 11
2 6 8
3 12 4 +
4 2 9
5 3 3
6 3 5
7 1 2
8 7 2 +

 

Note that observation 5 (row 6 of the spreadsheet) shows a blank because the items are the same (i.e. a zero difference).

Calculate the differences

Make a column for the difference between samples, make the heading in cell E1, D(A-B) or something similar.

You want to subtract the values in the second column of data from the first column of data (i.e. Col B – Col C). So in cell E2 type a formula to do that. You’ll need to omit any zero difference, so use an IF function to place a blank “” if the difference is zero:

=IF(B2-C2=0, "", ABS(B2-C2))

You are not interested in the sign of any difference, just the magnitude, so the ABS function is needed.

Copy the result down the rest of the column.

The final column shows the absolute magnitude of differences between A and B.

Obs A B Dir D(A-B)
1 8 11 3
2 6 8 2
3 12 4 + 8
4 2 9 7
5 3 3
6 3 5 2
7 1 2 1
8 7 2 + 5

 

You can see that you have 7 values, with one blank (a zero difference, observation 5).

Calculate ranks of differences

Now you want to work out the ranks of the differences. That is the ranks of the absolute value of the differences that you just worked out (column E) and called D(A-B).

Make a new column label in cell F1, call it Rd or something similar.

Now in cell F2 type a formula to work out the ranks of the items from column E:

=IF(E2="", "", RANK.AVG(E2,$E$2:$E$9,1))

Note that you need to take care of any possible blank cells (corresponding to zero differences). You also need to “fix” the cell range E2:E9 using $ since you will be copying the cell down the rest of the column. The final 1 makes sure that your ranks are sorted in ascending order, with the smallest difference getting the smallest rank.

The final column shows the rank of the (absolute) differences between A and B.

Obs A B Dir D(A-B) Rd
1 8 11 3 4
2 6 8 2 2.5
3 12 4 + 8 7
4 2 9 7 6
5 3 3
6 3 5 2 2.5
7 1 2 1 1
8 7 2 + 5 5

 

You can see that there are some tied ranks (each given a value of 2.5).

Ranks of + and of – differences

Now you need to separate the ranks due to the positive differences and the ranks due to negative differences. Make two more column headings in cells G1 and H1; R+ and R- will do fine.

In cell G2 type a formula that shows the rank if it is due to a positive difference but leaves the cell blank if not:

=IF(D2="+", F2,"")

Copy the cell down the rest of the column G.

In cell H2 type a formula that shows the rank if it is due to a negative difference but leaves the cell blank if not:

=IF(D2="-", F2,"")

Copy the cell down the rest of the column H.

The last two columns show the ranks due to positive differences and negative differences.

Obs A B Dir D(A-B) Rd R+ R-
1 8 11 3 4 4
2 6 8 2 2.5 2.5
3 12 4 + 8 7 7
4 2 9 7 6 6
5 3 3
6 3 5 2 2.5 2.5
7 1 2 1 1 1
8 7 2 + 5 5 5

 

You should now see the ranks split according to the sign of the difference between the samples.

Rank sums

Now you need to add up the ranks for the positive and negative differences (columns G & H). You can use the SUM function for this.

In cells A11 and A12 type labels for the counts and sums, n and Sum, will do nicely.

In cell B11 type a formula to work out the number of observations:

=COUNT(B2:B9)

Copy this into cells C11 and F11; the value in F11 gives the number of non-zero differences.

In cell B12 type a formula to sum the data in column B:

=SUM(B2:B9)

Copy the cell into cells C12, F12, G12 and H12. Cells F12:H12 contain the rank sums: overall, +ve and -ve. Note that the overall sum of ranks should equal the two other rank sums combined.

The final two rows show some summary (count and sum). The test statistic is the smaller of the two rank sums (12 & 16).

Obs A B Dir D(A-B) Rd R+ R-
1 8 11 3 4 4
2 6 8 2 2.5 2.5
3 12 4 + 8 7 7
4 2 9 7 6 6
5 3 3
6 3 5 2 2.5 2.5
7 1 2 1 1 1
8 7 2 + 5 5 5
n 8 8 7
Sum 42 44 28 12 16

Final result

The two rank sums are 12 and 16 so the 12 is the test statistic, W. You’ll need the number of non-zero differences to look up the appropriate critical value (see Table 7.13 in the book).

For Nd = 7 the critical value is 2. The calculated value of W is 12, which is larger than the critical value so the result is not significant.
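The whole procedure above can be condensed into a short script. This Python sketch (illustrative only) mirrors the spreadsheet steps: drop zero differences, rank the absolute differences with average ranks, then sum the ranks separately for positive and negative differences:

```python
a = [8, 6, 12, 2, 3, 3, 1, 7]
b = [11, 8, 4, 9, 3, 5, 2, 2]

# Non-zero differences only, keeping the sign for later separation
diffs = [x - y for x, y in zip(a, b) if x != y]
absd = [abs(d) for d in diffs]

# Average ranks of the absolute differences (like RANK.AVG with ties)
ranks = []
for d in absd:
    below = sum(1 for e in absd if e < d)
    ties = sum(1 for e in absd if e == d)
    ranks.append(below + (ties + 1) / 2)

# Rank sums for positive and negative differences
r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

W = min(r_plus, r_minus)   # test statistic is the smaller rank sum
print(r_plus, r_minus, W)  # 12.0 16.0 12.0
```

The rank sums (12 and 16) and the test statistic W = 12 match the spreadsheet result.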

Using R for the t-test

Exercise 7.1.2.

Statistics for Ecologists (Edition 2) Exercise 7.1.2

This exercise is concerned with how to carry out the t-test (Chapter 7) using R (Section 7.1.2).

Using R for the t-test

Introduction

The t-test is used to compare the means of two samples that have a normal (parametric or Gaussian) distribution. The t.test() command carries out the t-test in R. The default is to compute the Welch two-sample test (unequal variances).

You can have your data in several forms:

  • Two separate samples as two data vectors.
  • Two separate samples but in a single data.frame object (i.e. sample format).
  • A response variable and a predictor variable (i.e. scientific recording format).

In any event you can use the t.test() command to carry out the t-test. The example data for this exercise are the same as in the book and you can get the data in the three forms as an RData file: ridge furrow.RData.

Once you have the data you can type the commands shown here for yourself and so follow along.

Separate data objects

When you have two separate samples (probably as vector objects), you can just name them in the t.test() command.

ridge ; furrow
[1] 4 3 5 6 8 6 5 7
[1]  9  8 10  6  7

t.test(ridge, furrow)

Welch Two Sample t-test

data:  ridge and furrow
t = -2.7584, df = 8.733, p-value = 0.02279
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.5598309 -0.4401691
sample estimates:
mean of x mean of y
      5.5       8.0

The default carries out the Welch two-sample test (with modified degrees of freedom).

To carry out a t-test with the assumption that the variances are equal, you need to add var.equal = TRUE like so:

t.test(ridge, furrow, var.equal = TRUE)

The result gives a slightly different value for t, df and p-value.
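If you want to see what t.test() is doing under the hood, the Welch statistic and its modified (Welch–Satterthwaite) degrees of freedom can be reproduced by hand. This Python sketch mirrors the default var.equal = FALSE calculation for the ridge and furrow data:

```python
import math

ridge = [4, 3, 5, 6, 8, 6, 5, 7]
furrow = [9, 8, 10, 6, 7]

def mean_var(x):
    # Sample mean and sample variance (n - 1 denominator)
    m = sum(x) / len(x)
    v = sum((i - m) ** 2 for i in x) / (len(x) - 1)
    return m, v

m1, v1 = mean_var(ridge)
m2, v2 = mean_var(furrow)

# Welch's t statistic: no pooling of variances
se2 = v1 / len(ridge) + v2 / len(furrow)
t = (m1 - m2) / math.sqrt(se2)

# Welch-Satterthwaite degrees of freedom (hence the non-integer df)
df = se2 ** 2 / (
    (v1 / len(ridge)) ** 2 / (len(ridge) - 1)
    + (v2 / len(furrow)) ** 2 / (len(furrow) - 1)
)
print(round(t, 4), round(df, 3))   # -2.7584 8.733
```

These match the t = -2.7584 and df = 8.733 reported by R above.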

Sample format

If your data are separate samples but contained within a data.frame, you’ll need to alter your approach very slightly so that you can “get at” the variables in the data.frame.

There are three main ways:

  • Use $ syntax to specify the frame and sample name explicitly.
  • Use attach() to place the variables in the search path.
  • Use with() to open the frame temporarily.

Here are the example data:

rf2
  Ridge Furrow
1     4      9
2     3      8
3     5     10
4     6      6
5     8      7
6     6     NA
7     5     NA
8     7     NA

Note that the shorter sample is padded with NA items.

Use $ syntax

You can specify a variable by using the name of the enclosing object, a $ and the variable name:

t.test(rf2$Ridge, rf2$Furrow)

Welch Two Sample t-test

data: rf2$Ridge and rf2$Furrow
t = -2.7584, df = 8.733, p-value = 0.02279

Note that R presents the name exactly as you typed it in the command.

Use attach()

If you try to use a variable that is “inside” a data.frame you get an error:

Ridge
Error: object ‘Ridge’ not found

One way around this is to use attach() to “open” the data.frame and allow the separate variables to be found in the search path. Once you have attached an object its contents appear when you use the search() command and can be used without needing the $ syntax.

Type search() to see the current search path (this is an example):

search()
[1] ".GlobalEnv"        "tools:RGUI"        "package:stats"
[4] "package:graphics"  "package:grDevices" "package:utils"
[7] "package:datasets"  "package:methods"   "Autoloads"
[10] "package:base"

Use attach() to open the data object you want:

attach(rf2)

The rf2 object now appears in the search path:

search()
[1] ".GlobalEnv"        "rf2"               "tools:RGUI"
[4] "package:stats"     "package:graphics"  "package:grDevices"
[7] "package:utils"     "package:datasets"  "package:methods"
[10] "Autoloads"        "package:base"

Now you can use the variables within rf2 in your t-test:

t.test(Ridge, Furrow)

Welch Two Sample t-test

data: Ridge and Furrow
t = -2.7584, df = 8.733, p-value = 0.02279

Note that R presents the names exactly as you typed them.

You should use detach() once you are done; this removes the item from the search() path. Confusion can arise if you have data objects with the same names as those contained within data.frames.

detach(rf2)

The attach() command will not overwrite any data objects, but if you attach a data.frame that contains items with the same names as objects already in your workspace, the workspace versions mask the attach()ed ones until you use detach().

Use with()

The attach() command is useful but you do need to be careful to use detach() after you are done. An alternative approach is to use the with() command, which acts like attach() but only for the duration of one command line.

with(data.name, ...)

So, you give the command the name of the data object you want to “open”, followed by the command you want to execute. In that command you can give the variable names as they are and don’t need the $ syntax.

with(rf2, t.test(Ridge, Furrow, var.equal = TRUE))

In this example the variances are assumed to be equal (var.equal = TRUE).

Since with() only opens the data object for the duration of the command, you don’t need to use detach() afterwards.

Recording format

If your data are in scientific recording format, then you’ll have the data in a different form from that shown previously (sample format). You will have response variables and predictor variables. For a t-test you will have one response and one predictor e.g.

rf1
   count   area
1      4  Ridge
2      3  Ridge
3      5  Ridge
4      6  Ridge
5      8  Ridge
6      6  Ridge
7      5  Ridge
8      7  Ridge
9      9 Furrow
10     8 Furrow
11    10 Furrow
12     6 Furrow
13     7 Furrow

This layout is more flexible than sample format but you need a slightly different way to specify the variables in your t-test. Essentially you give a formula of form: y ~ x where y is the response (count) and x is the predictor (area). You can also give the name of the enclosing data object.
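If you’d like to recreate rf1 to try this out, here is a sketch (the values are from the printout above):

```r
# One response column (count) and one predictor column (area)
rf1 <- data.frame(
  count = c(4, 3, 5, 6, 8, 6, 5, 7, 9, 8, 10, 6, 7),
  area  = rep(c("Ridge", "Furrow"), times = c(8, 5))
)
```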

t.test(count ~ area, data = rf1)

Welch Two Sample t-test

data:  count by area
t = 2.7584, df = 8.733, p-value = 0.02279

Note that R presents the names of the variables as they appear in the enclosing data object.

The formula syntax is very powerful and is used in many statistical and graphical commands. You can extend the formula for more complicated scenarios, such as analysis of variance and multiple regression.
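As a sketch of the extended syntax using R’s built-in datasets (npk and mtcars, not the book’s data):

```r
# One-way ANOVA: a response and a single grouping predictor
summary(aov(yield ~ block, data = npk))

# Multiple regression: two predictors joined with +
summary(lm(mpg ~ wt + hp, data = mtcars))
```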

Welch two-sample t-test

Exercise 7.1.1

Statistics for Ecologists (Edition 2) Exercise 7.1.1

This exercise is concerned with using Excel for the t-test in Chapter 7 (Section 7.1). In particular you’ll see how to modify the degrees of freedom for cases when the variances of the two samples are not equal (which is often the case).

Welch two-sample t-test

Introduction

The t-test is used to compare the means of two samples that have a normal (parametric or Gaussian) distribution. The “classic” t-test has two major variants:

  • Assumption of equal variance for the two samples.
  • Adjustment of degrees of freedom (Satterthwaite modification).

In the first case the common variance is calculated and used in place of the variance in the regular formula. The calculation for this is relatively simple but it is also pointless, since you still have to determine the variance of the two samples.

The most commonly used modification is to adjust the degrees of freedom to make the result of the t-test a little more conservative. The degrees of freedom are reduced slightly using the Satterthwaite modification. This version of the t-test is generally called the Welch 2-sample t-test.

The calculations are relatively easy. You can then use the modified df for looking up critical values or for computing the exact p-value. The Welch 2-sample t-test is carried out by default in R via the t.test() command. In Excel the TTEST function will give you the exact p-value but it will not provide the modified degrees of freedom.

The Excel functions TDIST and TINV will give incorrect results as they assume equal variance and use un-modified degrees of freedom. This exercise works through the t-test and shows how to use the Satterthwaite modification to alter degrees of freedom. This allows you to get the “proper” result in Excel. The calculation matches that used in the Analysis ToolPak, which is available in Windows versions of Excel and later Mac versions.

The exercise uses the data that you can see in the following table:

Abundance of R. repens in ridges and furrows of a mediaeval meadow

     A  B      C
  1     ridge  furrow
  2     4      9
  3     3      8
  4     5      10
  5     6      6
  6     8      7
  7     6
  8     5
  9     7


The data show the abundance of a plant species in two different habitats. The two samples are small but are normally distributed. You can get a copy of the data in Excel .xlsx format here: ridge furrow.xlsx.

Calculation

The calculation of the modified degrees of freedom is in two parts. To start with you determine a statistic called u, which you can think of as a kind of proportion (it varies from 0 to 1):

u = (s1^2/n1) / (s1^2/n1 + s2^2/n2)

Once you have a value for u, you can determine f, the modified degrees of freedom:

f = 1 / (u^2/(n1 - 1) + (1 - u)^2/(n2 - 1))

The formula gives the same value whichever sample you designate as #1 and whichever as #2. The df are reduced slightly from the original: original df = (n1 - 1) + (n2 - 1).

Once you have a modified df you can use it to look up the critical value, either in a table or using the TINV function in Excel. The equivalent in R would be the qt() function.
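The same longhand calculation can be checked in R (a sketch using the ridge and furrow values; compare the results with t.test(), which applies the Welch modification by default):

```r
ridge  <- c(4, 3, 5, 6, 8, 6, 5, 7)
furrow <- c(9, 8, 10, 6, 7)

# u: the proportion of the combined standard-error term due to sample 1
u <- (var(ridge)/length(ridge)) /
     (var(ridge)/length(ridge) + var(furrow)/length(furrow))

# f: the Satterthwaite-modified degrees of freedom (about 8.733 here)
f <- 1/(u^2/(length(ridge) - 1) + (1 - u)^2/(length(furrow) - 1))

# t-statistic using the separate variances
t.val <- abs(mean(ridge) - mean(furrow))/
         sqrt(var(ridge)/length(ridge) + var(furrow)/length(furrow))

qt(0.975, df = f)                        # two-tailed 5% critical value
2*pt(t.val, df = f, lower.tail = FALSE)  # exact p-value (about 0.023)
```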

Carry out basic t-test

Start by opening the ridge furrow.xlsx data file. Go to the Data worksheet (the t-test completed worksheet is provided for you to check your results). The two samples are in columns B and C.

  1. In cell A10 type a label for the number of observations in each sample, “n” will do nicely.
  2. In cell B10 type a formula to determine the number of observations =COUNT(B2:B9)
  3. Copy cell B10 into C10 so that you have a result for each sample
  4. In cell A11 type a label for the mean values, “mean” will do fine.
  5. In cell B11 type a formula to work out the mean = AVERAGE(B2:B9).
  6. Copy cell B11 into C11 so that you have a result for each sample.
  7. In cell A12 type a label for the variance, “variance” seems logical.
  8. In cell B12 type a formula to calculate the variance =VAR(B2:B9)
  9. Copy the variance formula from B12 to C12.
  10. In A13 type a label for the t-value, “t” or “t-value” will do.
  11. In cell B13 type a formula to work out the value of t: =ABS(B11-C11)/SQRT(B12/B10+C12/C10)
  12. In A15 (yes, leave a blank row) type a label “p-value”.
  13. In B15 type a formula to work out the p-value based on the t you calculated: =TDIST(B13,B10+C10-2,2)
Note that the p-value from step 13 (p = 0.019) is based on equal variance (and un-modified degrees of freedom) and so is not really correct. Carry on and calculate a critical value.

  14. In A17 type a label “t-crit”.
  15. In B17 type a formula to work out a critical value for t: =TINV(0.05,B10+C10-2)

The critical value from step 15 (t = 2.201) is also based on equal variance and un-modified degrees of freedom.

  16. In A18 type another label “p-value”, for the exact p-value computed by Excel’s TTEST function.
  17. In B18 type a formula to compute the exact p-value for a t-test with un-equal variance: =TTEST(B2:B9,C2:C6,2,3)

Note that the TTEST function gives a more conservative p-value (of 0.023) because it has calculated the modified degrees of freedom. However, you cannot extract the df directly; if you want to report a more appropriate critical value you’ll need to compute the df longhand.

Compute modified degrees of freedom

The calculation of the modified degrees of freedom (Satterthwaite modification) will require some simple maths using the equations from earlier:

  18. In cell A20 type a label “u” for the result of the first calculation.
  19. In B20 type a formula to calculate u: =(B12/B10)/((B12/B10)+(C12/C10)).
  20. In A21 type a label “f” for the result of the second calculation.
  21. In B21 type a formula to calculate f: =1/(((B20^2)/(B10-1))+(((1-B20)^2)/(C10-1)))

Your result from step 21 should be 8.733. Excel cannot deal with degrees of freedom that are not integer values so you need to round up (Excel always rounds the df value upwards).

  22. In A22 type a label “df” for the integer df result.
  23. In B22 type a formula to round up the value from step 21: =CEILING(B21,1)

Now that you have a modified df (df = 9) you can use it to calculate a critical value. You can also see how the modified df can be used in the TINV and TDIST functions.

  24. In A24 type a label “t-crit” for the new critical value.
  25. In B24 type a function to compute the critical value based on the new df: =TINV(0.05,B22)

Note that the critical value from step 25 (t = 2.262) is higher than the value from step 15, and so a bit more conservative.

  26. In A25 type a label “p-value” for the exact p-value based on the modified df.
  27. In B25 type a formula to work out the exact p-value: =TDIST(B13,B22,2)

Note that the exact p-value from step 27 (p = 0.022) is very similar to that you obtained from the TTEST function in step 17. If you have a p-value you can get the value of t but only if you have the modified degrees of freedom.

  28. In A26 type a label “t-value” for the t calculation.
  29. In B26 type a formula that calculates the value of t based on the p-value and modified df: =TINV(B25,B22)

The final t-value you calculated in step 29 (t = 2.758) is the same as the one from step 11. Unfortunately, because of slight rounding errors you usually get a slightly different p-value using the TTEST function compared to the “long” method. The upshot is that it is always better to calculate the value of t using the means and variance, rather than indirectly via TTEST and TINV.

Use the Analysis ToolPak

You can use the Analysis ToolPak add-in to carry out a t-test. This allows you to get the value of t, the critical value and the exact p-value all at once. The ToolPak is not always “activated”, so you may need to go to the options and find the Add-Ins (the exact method will depend on your version of Excel).

  1. Open the ridge furrow.xlsx data file.
  2. Click the Data > Data Analysis button to open the Data Analysis dialogue window.
  3. Choose the t-test: Two-Sample Assuming Unequal Variances
  4. Now select the appropriate data and fill in the boxes. Make sure you type a zero in the box labelled Hypothesized Mean Difference


  5. Choose the location to place the results. This can be a new workbook or worksheet. In the example above the results are placed in cell E2 of the existing sheet.

Now you just have to interpret the results:

You can ignore the sign of the t Stat result; if the samples were in the other column order the result would be the same value but with the opposite sign.

Note that the df are shown as an integer and the exact value is always rounded upwards.

You want the two-tail results in most cases, which are in the last two rows.

Interactive labels in R pie() charts

Exercise 6.5.1.

Statistics for Ecologists (Edition 2) Exercise 6.5.1

These notes relate to Chapter 6, exploring data using graphs. They are especially relevant to Section 6.5.1, which is about using pie charts to show association data. However, the notes are generally relevant as they show how you can place text onto an existing R plot in an interactive manner, using your mouse as a pointer.

Interactive labels in R pie() charts

Introduction

The locator() command is used to “read” the mouse position and generate x, y co-ordinates. These can be used in various ways, in commands that require those x, y co-ordinates. For example, sometimes the default placement of labels on a plot is not quite what you want. You can use the text() command with locator() to place the labels exactly where you want.

In this exercise you’ll see the locator() command used to place labels on a pie() chart. You can download the example data file birds.RData and use the data to follow along.

A basic pie() chart

Here are some data on bird species and habitat selection:

birds
               Garden Hedgerow Parkland Pasture Woodland
Blackbird          47       10       40       2        2
Chaffinch          19        3        5       0        2
Great Tit          50        0       10       7        0
House Sparrow      46       16        8       4        0
Robin               9        3        0       0        2
Song Thrush         4        0        6       0        0

The data are in matrix form, which you can see using the class() command:

class(birds)
[1] "matrix"

We’ll plot a pie() chart of the 2nd row:

pie(birds[2,])

The resulting plot uses the names attribute for the labels (here the colnames of the original matrix).

Simple locator() placement

The locator() command accepts x, y co-ordinates from a mouse click. You can use the locator() command in text() to place labels where you click the mouse. You need to state in locator() how many “clicks” you want.

In the current example there are 5 categories (one has a value of zero), so we want 5 labels.

First of all, you need to suppress the default labels. Each plotting command has a slightly different way of doing this; in the pie() command you use labels = "".

pie(birds[2,], labels = "")

Now you can add the labels separately. There are 5 categories so you’ll need locator(5) in this example.

text(locator(5), colnames(birds))

Note that the labels are centred over the spot you click, and they are not displayed until the command is finished.

In this pie() chart the labels have been placed with the aid of the locator() command.

It can be a bit tricky to get the placement exactly how you want but there are some additional tools to help you.

Alignment with the text() command

The text() command allows you to tweak the position of the text, relative to the co-ordinates. The pos parameter allows you to specify an integer (or vector of integers) which align the text like so:

  • pos = 1 text is placed below the point (centred).
  • pos = 2 text is placed to the left of the point (the last character next to the point).
  • pos = 3 text is placed above the point (centred).
  • pos = 4 text is placed to the right of the point (the first character next to the point).

In our example the labels need a vector of pos values, as the 2nd is best aligned by the final character and the 4th by the first. The following commands should do the job:

pie(birds[2,], labels = "") # no labels
text(locator(5), colnames(birds), pos = c(1, 2, 1, 4, 1))

To place the labels, you need to click in the plot. The first click will place the “Garden” label, which will be centred just below where you click. The second click will align the last character of the “Hedgerow” label to the left of where you click. The third click is below (centred). The fourth click places the “Pasture” label aligned with the first character to the right of the click-point. The final (fifth) click will be centred just below the click-point.

Custom labels

You can use any label you like by specifying the text explicitly. In a pie() chart you’ll generally want the category label and the frequency or percentage. You can use the prop.table() command to get the proportions (and therefore percentages). The paste() command is useful to join things together to make custom labels.

Use prop.table() to get row or column proportions (margin = 1 for rows, margin = 2 for columns):

prop.table(birds, margin = 1)
                  Garden  Hedgerow  Parkland    Pasture   Woodland
Blackbird      0.4653465 0.0990099 0.3960396 0.01980198 0.01980198
Chaffinch      0.6551724 0.1034483 0.1724138 0.00000000 0.06896552
Great Tit      0.7462687 0.0000000 0.1492537 0.10447761 0.00000000
House Sparrow  0.6216216 0.2162162 0.1081081 0.05405405 0.00000000
Robin          0.6428571 0.2142857 0.0000000 0.00000000 0.14285714
Song Thrush    0.4000000 0.0000000 0.6000000 0.00000000 0.00000000

Use the round() command to display fewer decimal places and multiply by 100 to get percentage values:

round(prop.table(birds, margin = 1)*100,1)
                Garden Hedgerow Parkland Pasture Woodland
Blackbird        46.5      9.9     39.6     2.0      2.0
Chaffinch        65.5     10.3     17.2     0.0      6.9
Great Tit        74.6      0.0     14.9    10.4      0.0
House Sparrow    62.2     21.6     10.8     5.4      0.0
Robin            64.3     21.4      0.0     0.0     14.3
Song Thrush      40.0      0.0     60.0     0.0      0.0

Since we are only plotting the 2nd row let’s display only the 2nd row data:

round(prop.table(birds, margin = 1)*100,1)[2,]
Garden Hedgerow Parkland  Pasture Woodland
  65.5     10.3     17.2      0.0      6.9

Now you can use the paste() command to make a custom label containing the name and percentage:

pie.perc <- round(prop.table(birds, margin = 1)*100,1)
pie.labels <- paste(colnames(pie.perc), ", ", pie.perc[2,], "%", sep = "")

The paste() command combines items, which you separate with commas. The sep parameter determines which character (if any) is used to separate the items; none in this case, since the comma and space are already supplied as an item and we want the % sign to sit adjacent to the percentage value.

Look at the labels you made:

pie.labels
[1] "Garden, 65.5%"   "Hedgerow, 10.3%" "Parkland, 17.2%" "Pasture, 0%"
[5] "Woodland, 6.9%"

You can now use the locator() command as before but specifying your new custom labels.

Making room for labels

Sometimes the labels are simply too large to fit. You may be able to make room by shrinking the text with the cex parameter:

pie(birds[2,], labels = "")
text(locator(5), pie.labels, pos = c(1, 2, 1, 4, 1), cex = 0.8)

The resulting pie() chart resembles the following:

You can make custom labels and use the locator() command to help place them.

In this case the cex parameter was used to shrink the text, allowing it to fit better. This does not always work out and you may need to shrink the pie in its frame. To do this you use the radius parameter (default: radius = 0.8), in the pie() command.

Labels over multiple lines

Another way to make labels fit better is to split them over more than one line. In our current example the percentage value could be placed below the category labels.

The “\n” text is treated as a “newline” character. So, if you use “\n” in your paste() command you can make a custom label spread over more than one line.

pie.perc <- round(prop.table(birds, margin = 1)*100,1)
pie.labels <- paste(colnames(pie.perc), "\n", pie.perc[2,], "%", sep = "")

This is what they look like in the console:

pie.labels
[1] "Garden\n65.5%"   "Hedgerow\n10.3%" "Parkland\n17.2%" "Pasture\n0%"
[5] "Woodland\n6.9%"

Now you proceed as before and place the labels:

pie(birds[2,], labels = "")
text(locator(5), pie.labels, pos = c(1, 2, 1, 4, 1), cex = 0.8)

The resulting pie() chart looks like this:

Placing “\n” in a custom label acts as a newline character.

This gives you another way to display the labels.

Leading lines

If you have labels around the outside of your pie() chart you may want to add leader lines to “join” the label to the appropriate segment. The locator() command can do this by itself, but you’ll have to create one line at a time.

You’ll need to specify two co-ordinates; the start and end point of the line.

locator(2, type = "l")

Note that you need type = “l” (that’s a lower case L not a number 1). Then click the plot at the start and end points. You can use additional graphics parameters to alter the appearance of the line, examples might be col (colour), lwd (line width), lty (line type).
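If you prefer a repeatable script to interactive clicking, you can draw a leader line with the segments() command using explicit co-ordinates (a sketch, assuming the birds matrix from earlier is loaded; the co-ordinates here are purely illustrative):

```r
# Scripted alternative to locator(): leader line plus label at fixed points
pie(birds[2, ], labels = "")
segments(x0 = 0.45, y0 = -0.35, x1 = 0.75, y1 = -0.6)   # leader line
text(0.75, -0.6, "Parkland, 17.2%", pos = 4, cex = 0.8)  # label at line end
```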