Alter sample format to recording format

Exercise 10.1.4.

Statistics for Ecologists (Edition 2) Exercise 10.1.4

When you have data in the “wrong” layout you need to be able to rearrange them into a more “sensible” layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.

Altering data layout – switch from sample format to recording format

Data layout – stack()ing your data

There are two main ways you can layout your data. In sample-format each column is a separate sample, which forms some kind of logical sampling unit. This is a typical way that you layout data if you are using Excel because that’s how you have to have your data to be able to make charts and carry out most forms of analysis.

In scientific recording format each column is a variable; you have response variables and predictor variables. This recording-layout is a more powerful and ultimately flexible layout because you can add new variables or observations easily. In R this layout is also essential for any kind of complicated analysis, such as regression or analysis of variance.

When you have data in the “wrong” layout you need to be able to rearrange them into a more “sensible” layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.

Convert a single predictor

The simplest situation is where you have a single response variable and a single predictor, but the values are laid out in sample columns. Here is an example, the data are counts of a freshwater invertebrate at three different sampling locations:

hog3
  Upper Mid Lower
1     3   4    11
2     4   3    12
3     5   7     9
4     9   9    10
5     8  11    11
6    10  NA    NA
7     9  NA    NA

Note that the samples contain different numbers of observations. The “short” columns are padded by NA items when the data is read into R (usually via theread.csv() command).

The stack() command will take the columns and combine them into a single response variable. The names of the columns will be used to form the levels of a predictor variable.

stack(hog3)
   values   ind
1       3 Upper
2       4 Upper
3       5 Upper
4       9 Upper
5       8 Upper
6      10 Upper
7       9 Upper
8       4   Mid
9       3   Mid
10      7   Mid
11      9   Mid
12     11   Mid
13     NA   Mid
14     NA   Mid
15     11 Lower
16     12 Lower
17      9 Lower
18     10 Lower
19     11 Lower
20     NA Lower
21     NA Lower

Note that the columns are labelled, values and ind. You can alter these easily enough using the names() command.

hog = stack(hog3)

names(hog) <- c("count", "site")

head(hog)
  count  site
1     3 Upper
2     4 Upper
3     5 Upper
4     9 Upper
5     8 Upper
6    10 Upper

Now you have a data.frame in recording layout, but the NA items are still in the data.

Remove NA items

The na.omit() command will “knock-out” any NA items.

hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    NA   Mid
14    NA   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower
20    NA Lower
21    NA Lower
na.omit(hog)
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower

Note that the row names keep their original values.

Re-number row names

If you want to re-number the rows just use the row.names() command.

hog <- na.omit(hog)

row.names(hog) <- 1:length(hog$count)
hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    11 Lower
14    12 Lower
15     9 Lower
16    10 Lower
17    11 Lower

It’s not really essential but it makes it easier to keep track of items if the numbers are sequential.

Convert multiple predictors

You may have data in a more complicated arrangement, perhaps mirroring the on-ground layout. This can happen for example in a two-way ANOVA design. In Excel the data have to be set out in a particular way for the Analysis ToolPak to carry out the anova. However, in R this layout is not helpful.

wp
  Water vulgaris sativa
1    Lo        9      7
2    Lo       11      6
3    Lo        6      5
4   Mid       14     14
5   Mid       17     17
6   Mid       19     15
7    Hi       28     44
8    Hi       31     38
9    Hi       32     37

This time you have a column representing one of the predictor variables and the other columns in sample layout. If you try a stack() command you’ll find that only the numeric columns are stacked.

wps <- stack(wp)
Warning message:
In stack.data.frame(wp) : non-vector columns will be ignored

wps
   values      ind
1       9 vulgaris
2      11 vulgaris
3       6 vulgaris
4      14 vulgaris
5      17 vulgaris
6      19 vulgaris
7      28 vulgaris
8      31 vulgaris
9      32 vulgaris
10      7   sativa
11      6   sativa
12      5   sativa
13     14   sativa
14     17   sativa
15     15   sativa
16     44   sativa
17     38   sativa
18     37   sativa

This is a minor problem because you can re-build the first predictor variable by duplicating the original and adding it to the new data.frame.

Replace a factor predictor variable

What you must do is take the original predictor variable and repeat it a number of times (equal to the number of sample columns).

wps <- cbind(wps, water = rep(wp$Water, 2))
   values      ind water
1       9 vulgaris    Lo
2      11 vulgaris    Lo
3       6 vulgaris    Lo
4      14 vulgaris   Mid
5      17 vulgaris   Mid
6      19 vulgaris   Mid
7      28 vulgaris    Hi
8      31 vulgaris    Hi
9      32 vulgaris    Hi
10      7   sativa    Lo
11      6   sativa    Lo
12      5   sativa    Lo
13     14   sativa   Mid
14     17   sativa   Mid
15     15   sativa   Mid
16     44   sativa    Hi
17     38   sativa    Hi
18     37   sativa    Hi

Note that you gave the new variable an explicit name. The other column names need replacing using the names() command:

names(wps)[1:2] = c("height", "species")

names(wps)
[1] "height"  "species" "water"

head(wps)
  height  species water
1      9 vulgaris    Lo
2     11 vulgaris    Lo
3      6 vulgaris    Lo
4     14 vulgaris   Mid
5     17 vulgaris   Mid
6     19 vulgaris   Mid

Now you have a more sensible layout that can be used for aov() and graphical commands.

This approach will not work on every dataset that you get but it will take you a long way forward.

Comments are closed.