Alter sample format to recording format
Exercise 10.1.4.
Statistics for Ecologists (Edition 2) Exercise 10.1.4
When you have data in the “wrong” layout you need to be able to rearrange them into a more “sensible” layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.
Altering data layout – switch from sample format to recording format
Data layout – stack()ing your data
There are two main ways you can layout your data. In sample-format each column is a separate sample, which forms some kind of logical sampling unit. This is a typical way that you layout data if you are using Excel because that’s how you have to have your data to be able to make charts and carry out most forms of analysis.
In scientific recording format each column is a variable; you have response variables and predictor variables. This recording-layout is a more powerful and ultimately flexible layout because you can add new variables or observations easily. In R this layout is also essential for any kind of complicated analysis, such as regression or analysis of variance.
When you have data in the “wrong” layout you need to be able to rearrange them into a more “sensible” layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.
Convert a single predictor
The simplest situation is where you have a single response variable and a single predictor, but the values are laid out in sample columns. Here is an example, the data are counts of a freshwater invertebrate at three different sampling locations:
hog3 Upper Mid Lower 1 3 4 11 2 4 3 12 3 5 7 9 4 9 9 10 5 8 11 11 6 10 NA NA 7 9 NA NA
Note that the samples contain different numbers of observations. The “short” columns are padded by NA items when the data is read into R (usually via theread.csv() command).
The stack() command will take the columns and combine them into a single response variable. The names of the columns will be used to form the levels of a predictor variable.
stack(hog3) values ind 1 3 Upper 2 4 Upper 3 5 Upper 4 9 Upper 5 8 Upper 6 10 Upper 7 9 Upper 8 4 Mid 9 3 Mid 10 7 Mid 11 9 Mid 12 11 Mid 13 NA Mid 14 NA Mid 15 11 Lower 16 12 Lower 17 9 Lower 18 10 Lower 19 11 Lower 20 NA Lower 21 NA Lower
Note that the columns are labelled, values and ind. You can alter these easily enough using the names() command.
hog = stack(hog3) names(hog) <- c("count", "site") head(hog) count site 1 3 Upper 2 4 Upper 3 5 Upper 4 9 Upper 5 8 Upper 6 10 Upper
Now you have a data.frame in recording layout, but the NA items are still in the data.
Remove NA items
The na.omit() command will “knock-out” any NA items.
hog count site 1 3 Upper 2 4 Upper 3 5 Upper 4 9 Upper 5 8 Upper 6 10 Upper 7 9 Upper 8 4 Mid 9 3 Mid 10 7 Mid 11 9 Mid 12 11 Mid 13 NA Mid 14 NA Mid 15 11 Lower 16 12 Lower 17 9 Lower 18 10 Lower 19 11 Lower 20 NA Lower 21 NA Lower
na.omit(hog) count site 1 3 Upper 2 4 Upper 3 5 Upper 4 9 Upper 5 8 Upper 6 10 Upper 7 9 Upper 8 4 Mid 9 3 Mid 10 7 Mid 11 9 Mid 12 11 Mid 15 11 Lower 16 12 Lower 17 9 Lower 18 10 Lower 19 11 Lower
Note that the row names keep their original values.
Re-number row names
If you want to re-number the rows just use the row.names() command.
hog <- na.omit(hog) row.names(hog) <- 1:length(hog$count) hog count site 1 3 Upper 2 4 Upper 3 5 Upper 4 9 Upper 5 8 Upper 6 10 Upper 7 9 Upper 8 4 Mid 9 3 Mid 10 7 Mid 11 9 Mid 12 11 Mid 13 11 Lower 14 12 Lower 15 9 Lower 16 10 Lower 17 11 Lower
It’s not really essential but it makes it easier to keep track of items if the numbers are sequential.
Convert multiple predictors
You may have data in a more complicated arrangement, perhaps mirroring the on-ground layout. This can happen for example in a two-way ANOVA design. In Excel the data have to be set out in a particular way for the Analysis ToolPak to carry out the anova. However, in R this layout is not helpful.
wp Water vulgaris sativa 1 Lo 9 7 2 Lo 11 6 3 Lo 6 5 4 Mid 14 14 5 Mid 17 17 6 Mid 19 15 7 Hi 28 44 8 Hi 31 38 9 Hi 32 37
This time you have a column representing one of the predictor variables and the other columns in sample layout. If you try a stack() command you’ll find that only the numeric columns are stacked.
wps <- stack(wp) Warning message: In stack.data.frame(wp) : non-vector columns will be ignored wps values ind 1 9 vulgaris 2 11 vulgaris 3 6 vulgaris 4 14 vulgaris 5 17 vulgaris 6 19 vulgaris 7 28 vulgaris 8 31 vulgaris 9 32 vulgaris 10 7 sativa 11 6 sativa 12 5 sativa 13 14 sativa 14 17 sativa 15 15 sativa 16 44 sativa 17 38 sativa 18 37 sativa
This is a minor problem because you can re-build the first predictor variable by duplicating the original and adding it to the new data.frame.
Replace a factor predictor variable
What you must do is take the original predictor variable and repeat it a number of times (equal to the number of sample columns).
wps <- cbind(wps, water = rep(wp$Water, 2)) values ind water 1 9 vulgaris Lo 2 11 vulgaris Lo 3 6 vulgaris Lo 4 14 vulgaris Mid 5 17 vulgaris Mid 6 19 vulgaris Mid 7 28 vulgaris Hi 8 31 vulgaris Hi 9 32 vulgaris Hi 10 7 sativa Lo 11 6 sativa Lo 12 5 sativa Lo 13 14 sativa Mid 14 17 sativa Mid 15 15 sativa Mid 16 44 sativa Hi 17 38 sativa Hi 18 37 sativa Hi
Note that you gave the new variable an explicit name. The other column names need replacing using the names() command:
names(wps)[1:2] = c("height", "species") names(wps) [1] "height" "species" "water" head(wps) height species water 1 9 vulgaris Lo 2 11 vulgaris Lo 3 6 vulgaris Lo 4 14 vulgaris Mid 5 17 vulgaris Mid 6 19 vulgaris Mid
Now you have a more sensible layout that can be used for aov() and graphical commands.
This approach will not work on every dataset that you get but it will take you a long way forward.
Comments are closed.