Working with data in R

In this section you’ll learn a bit about:

Making data items.
Importing data from disk.
Rearranging and managing datasets.

Combine values command

The c() command is used for joining things together to make longer things!

> data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
> data1
[1] 3 5 7 5 3 2 6 8 5 6 9

The items you want to join are listed in the parentheses and separated with commas. The items are joined in the order they appear in the command.
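
As a small illustration (a sketch only; the names data2 and months are made up for this example), c() can also join an existing item to new values, and text items need to be in quotes:

```r
# Start with the values used above
data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# Join an existing item to some new values; the order is preserved
data2 = c(data1, 4, 5, 7, 3, 4)
length(data2)

# Text values work too, but each one must be in quotes
months = c("jan", "feb", "mar")
months
```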

Typing in values using scan()

The c() command is useful but it can be a pain to type all the commas! The scan() command allows you to enter data from the keyboard without commas. Here are some of the possible parameters:

scan(what, sep, dec = ".")

what – the default is to expect numbers but you can also choose what = "character" for text.
sep – the default is to treat any white space as the separator between elements; use sep = "," for commas or sep = "\t" for tabs.
dec – the decimal point character defaults to a full stop (period).

To use scan() you start by assigning a name to “hold” the result. At the same time you tell the command the type of data to expect and if the elements are separated by anything other than spaces.

> mydata = scan()
1:

R now “waits” and you will see the 1: instead of the regular cursor. You can now type your data (in this example it will be expecting numbers).

1: 2 5 6.2 33 25 1.3 8

If you press <enter> then R continues on a new line but still waits:

8:

Type in more data if you like:

8: 111

To stop adding values you press <enter> on a blank line (in other words press <enter> twice).

9:
Read 8 items

You can now see your data by typing the name you assigned to it:

> mydata
[1]   2.0   5.0   6.2  33.0  25.0   1.3   8.0 111.0

If you want to type text values, then use what = "character" as a parameter:

> myscan = scan(what = "character")
1: jan feb mar apr
5: may jun
7:
Read 6 items
> myscan
[1] "jan" "feb" "mar" "apr" "may" "jun"

Note that you don’t have to use quotes around the items as you type (this is why you used what = "character").
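
The interactive prompt is hard to show on a page, so here is a small sketch that uses the text parameter of scan() to read from a quoted string instead of the keyboard. The what, sep and dec parameters should behave just as they do when typing. The object names are made up for illustration, and quiet = TRUE only suppresses the "Read n items" message.

```r
# Numbers separated by white space (the default)
nums = scan(text = "2 5 6.2 33 25 1.3 8", quiet = TRUE)

# Numbers separated by commas
csv_nums = scan(text = "2,5,6.2,33", sep = ",", quiet = TRUE)

# Text items: no quotes needed inside the input itself
words = scan(text = "jan feb mar", what = "character", quiet = TRUE)

# Commas used as the decimal point character
eu_nums = scan(text = "1,5 2,25", dec = ",", quiet = TRUE)
```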

Read the clipboard

The scan() command can read the clipboard, which means you can copy and paste items from other programs (or web pages). Use the sep parameter to match the separator.

Importing data

The next step is to get your data into R. If you have saved your data in a .CSV file then you can use the read.csv(filename) command to get the information. You need to tell R where to store the data and to do this you assign it a name. All names must have at least one letter (otherwise it is a number of course!). Remember, you can use a period (e.g. test.1) or underscore (e.g. test_1) but no other punctuation marks. R is case sensitive so the variable test is different from Test or teSt.

The read.csv() command has various parameters that give you control over how and what is imported. Here are some of the essentials:

read.csv(file, header = TRUE, sep = ",", dec = ".", row.names)

file – the filename of the data you want to read in. Give this in quotes, with the full path e.g. "My documents/mydata.csv". Alternatively you can use file.choose() to bring up a file explorer, allowing you to choose the file.
header – if set to TRUE (the default) the first row of the CSV is used as column labels. If you set this to FALSE, R will treat every row as data and assign standard names for the columns (e.g. V1, V2, V3).
sep – this sets the separator between “fields”, given as a character in quotes. For a CSV this is of course a comma (this is the default). If you have tab separated data use sep = "\t"
dec – this sets the decimal point character. The default is a full stop (period) dec = "." but other characters are possible.
row.names – this allows a column of data to be used as labels (i.e. row names). Usually this would be the first column e.g. row.names = 1, but any column number is allowed.

Here are some examples:

To get a file into R with basic columns of data and their labels use:

> mydata = read.csv(file.choose(), header = TRUE)

To get a file into R with column headings and row headings use:

> mydata = read.csv(file.choose(), row.names = 1)
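
To see what row.names = 1 changes, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back both ways. The temporary file and its contents are stand-ins for your own data file, and the object names are made up for illustration.

```r
# Write a small CSV to a temporary file (a stand-in for your own file)
tmp = tempfile(fileext = ".csv")
writeLines(c("Species,Garden,Hedgerow",
             "Blackbird,47,10",
             "Chaffinch,19,3"), tmp)

# Default import: the first column is ordinary data, rows are numbered
plain = read.csv(tmp)
plain

# row.names = 1: the first column becomes the row labels
labelled = read.csv(tmp, row.names = 1)
labelled
rownames(labelled)
```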

To read a CSV file from a European colleague, where commas were used as decimal points and the data were separated by semi-colons, use:

> mydata = read.csv(file.choose(), dec = ",", sep = ";")
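
The following sketch tries this out on a made-up semi-colon separated file written to a temporary location (the file contents and object names are assumptions for illustration). It also uses read.csv2(), a companion command in base R whose defaults are sep = ";" and dec = ",".

```r
# A tiny "European style" file: semi-colon separators, comma decimals
tmp = tempfile(fileext = ".csv")
writeLines(c("Length;NO3",
             "20;2,25",
             "21;2,15"), tmp)

# Tell read.csv() about the separator and decimal character
eu = read.csv(tmp, dec = ",", sep = ";")
eu$NO3

# read.csv2() uses these settings by default
eu2 = read.csv2(tmp)
```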

On Linux machines file.choose() may not open a graphical file browser (in a plain terminal it simply prompts you to type the file name), so you may need to give the full file and path name (in quotes):

> mydata = read.csv(file = "My Documents/myfile.CSV")

N.B. There are occasions when R won’t like your data file! Check the file carefully. In some cases, adding an extra line break at the end of the file will sort out the issue. To do this open the file in a text editor (or a word processor with non-printing characters displayed), add the extra line break after the last row and save the file.

Seeing your data in R

To view your data once it is imported you simply type the name you assigned to it.

> bird = read.csv(file = file.choose(), row.names = 1)
> bird
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0

In this case the first column of data was set to act as row names, so when you view the data you see those names. In other cases you do not have explicit row names, in which case R assigns a simple index number just to help you navigate the dataset.

> mf = read.csv(file = file.choose())
> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120
7     19    17    65 1.85  95
8     16    14    65 1.75 168

Notice that each column has a name (taken from the CSV) but that simple numbers are used as a “row index”.
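
For larger datasets, printing everything can be unwieldy. As a rough sketch, the base R commands head(), dim(), names() and str() give a quicker overview. Here the mf data are typed in with data.frame() purely so the example is self-contained; normally they would come from read.csv().

```r
# Recreate the mf data shown above
mf = data.frame(
  Length = c(20, 21, 22, 23, 21, 20, 19, 16),
  Speed  = c(12, 14, 12, 16, 20, 21, 17, 14),
  Algae  = c(40, 45, 45, 80, 75, 65, 65, 65),
  NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75, 1.85, 1.75),
  BOD    = c(200, 180, 135, 120, 110, 120, 95, 168)
)

head(mf, 3)   # just the first three rows
dim(mf)       # number of rows, then number of columns
names(mf)     # the column names
str(mf)       # a compact summary of each column
```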

View individual data columns

The variable name you assign to data that you import “covers” the overall data. However, R cannot pick out individual columns (variables) in your data without a bit of help.

> Algae
Error: object ‘Algae’ not found

There are several ways to overcome this; here are two of them:

Use the $ and specify the overall data and column.
Use the attach() command to “open up” the data.

The $ is a general part of the R language and allows you to specify a variable inside another named object like so:

> mf$Algae
[1] 40 45 45 80 75 65 65 65

The attach() command allows the names of the columns in your data to be “seen” by R:

> attach(mf)
> Speed
[1] 12 14 12 16 20 21 17 14

But beware! You may have other data with the same names. To avoid “confusion” you should use detach() after you are done:

> detach(mf)
> Algae
Error: object ‘Algae’ not found
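
A third option, sketched here as an alternative rather than something the text above relies on, is the with() command. It makes the column names visible for a single command only, so there is nothing to detach() afterwards. The data are typed in with data.frame() to keep the example self-contained.

```r
# Two columns of the mf data, typed in directly
mf = data.frame(
  Speed = c(12, 14, 12, 16, 20, 21, 17, 14),
  Algae = c(40, 45, 45, 80, 75, 65, 65, 65)
)

# The $ syntax
mean(mf$Algae)

# Double square brackets with a quoted name do the same job as $
mf[["Algae"]]

# with() "opens up" the data for one command only
with(mf, mean(Algae))
```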

View individual data rows

The $ syntax allows you to “pick out” named elements of a dataset (e.g. the columns) but cannot help you extract individual rows. Instead you can use square brackets to extract one or more elements from a larger item.

data[rows, columns]

If your data are two-dimensional (i.e. have rows and columns) then you always specify the rows first and then the columns. If you leave out one of these, you get “everything”.

> mf
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120
7     19    17    65 1.85  95
8     16    14    65 1.75 168
> mf[1, ]
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200

If you want several rows, then indicate that in the []:

> mf[2:5, ]
  Length Speed Algae  NO3 BOD
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110

The c() command can be helpful here if you want non-sequential rows:

> mf[c(1,3,5), ]
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
3     22    12    45 1.75 135
5     21    20    75 1.95 110

You can specify columns in a similar manner, by giving them after the comma.
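
Here is a short sketch of column selection with square brackets, using the mf data typed in with data.frame() so that it runs on its own. The last line, which picks rows by a condition, goes slightly beyond the text above and is included only as an illustration.

```r
mf = data.frame(
  Length = c(20, 21, 22, 23, 21, 20, 19, 16),
  Speed  = c(12, 14, 12, 16, 20, 21, 17, 14),
  Algae  = c(40, 45, 45, 80, 75, 65, 65, 65),
  NO3    = c(2.25, 2.15, 1.75, 1.95, 1.95, 2.75, 1.85, 1.75),
  BOD    = c(200, 180, 135, 120, 110, 120, 95, 168)
)

mf[, 3]                    # the third column (Algae), all rows
mf[, c("Length", "BOD")]   # columns picked by name
mf[2:3, c(1, 5)]           # rows 2 to 3 of columns 1 and 5
mf[mf$Speed > 15, ]        # rows where Speed is greater than 15
```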

Transposing data frames

Sometimes you need to rotate your data so that the rows and columns are switched; the t() command can do this:

> fw
         count speed
Taw          9     2
Torridge    25     3
Ouse        15     5
Exe          2     9
Lyn         14    14
Brook       25    24
Ditch       24    29
Fal         47    34

> t(fw)
      Taw Torridge Ouse Exe Lyn Brook Ditch Fal
count   9       25   15   2  14    25    24  47
speed   2        3    5   9  14    24    29  34
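
One thing to be aware of is that the result of t() is a matrix rather than a data frame, so the $ syntax no longer works on it. A minimal sketch, with the fw data typed in directly and the names fw_t and fw_df made up for illustration:

```r
fw = data.frame(
  count = c(9, 25, 15, 2, 14, 25, 24, 47),
  speed = c(2, 3, 5, 9, 14, 24, 29, 34),
  row.names = c("Taw", "Torridge", "Ouse", "Exe",
                "Lyn", "Brook", "Ditch", "Fal")
)

fw_t = t(fw)
is.matrix(fw_t)        # TRUE: t() gives a matrix

# Convert back to a data frame if you want to use $ again
fw_df = as.data.frame(fw_t)
fw_df$Exe
```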

Missing values

When you import data from a file (e.g. CSV) R “tidies up” and makes the resulting data a nice rectangular shape. If any data column is shorter than the others it will be padded with NA items.

> grass2
  mow unmow
1  12     8
2  15     9
3  17     7
4  11     9
5  15    NA

In this example the mow column has 5 elements but the unmow column contains only 4. R has “interpreted” the empty cell as a missing value and replaced it with NA (think of it as “not available”).

The NA item is important in data analysis. In this case the NA is simply the result of a “short” column. In other datasets NA may well represent genuinely missing values (R will replace any empty cells of a spreadsheet with NA). Some commands (e.g. mean()) will return a result of NA if the data contain NA items. In most cases you can tell the command to ignore the NA values by including the parameter na.rm = TRUE.
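
A brief sketch of how NA behaves, using the grass2 data typed in with data.frame() (the NA is entered explicitly here, whereas an import would create it for you):

```r
grass2 = data.frame(
  mow   = c(12, 15, 17, 11, 15),
  unmow = c(8, 9, 7, 9, NA)
)

mean(grass2$unmow)                 # NA, because one value is missing
mean(grass2$unmow, na.rm = TRUE)   # 8.25, the NA is ignored

is.na(grass2$unmow)                # which items are NA?
sum(is.na(grass2$unmow))           # how many NA items are there?
```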

Stacking data

Sometimes you need to reorganize your data and convert from one kind of recording layout to another. For example:

> flies
    C  F F.G  G  S
1  75 58  58 57 62
2  67 61  59 58 66
3  70 56  58 60 65
4  75 58  61 59 63
5  65 57  57 62 64
6  71 56  56 60 62
7  67 61  58 60 65
8  67 60  57 57 65
9  76 57  57 59 62
10 68 58  59 61 67

This dataset shows five columns. Each column shows repeated measurements of the size of fly wings under a specific diet. If you want to compare the diets, you’ll need to carry out analysis of variance but R will need the data in a different form.

  size diet
1   75    C
2   67    C
3   70    C
4   75    C
5   65    C
6   71    C

In this layout the numerical data are all in one column, called size. This is the response variable (sometimes called dependent variable). The second column is headed diet, and is a grouping variable (each size measurement is related to a group), where the names of the groups are related to the column headings in the previous layout. This column is the predictor variable (also called independent variable).

The stack() command allows you to reassemble data in multi-sample layout to a stacked layout.

> fly = stack(flies)
> fly
   values ind
1      75   C
2      67   C
3      70   C
4      75   C
5      65   C
6      71   C

Notice that the numerical variable is called values and the grouping variable is called ind.

Selecting columns

You do not have to use all the columns when you stack() the data; use the select parameter to choose:

> stack(flies, select = c("F", "G", "S"))
   values ind
1      58   F
2      61   F
3      56   F
4      58   F
5      57   F
6      56   F

Naming the stacked columns

You can carry out your analyses using the names R has used when it did the stack(). If you want to use different names then use the names() command:

> names(fly) = c("size", "diet")
> fly
   size diet
1   75    C
2   67    C
3   70    C

Note that you give the names wrapped in a c() command and the names themselves in quotes.
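
Putting the steps together, here is a sketch of the whole workflow: stack, rename, then use the stacked layout for group summaries and an analysis of variance. Only the first four rows of the flies data are typed in, to keep the example short, and the aov() call is one possible way to run the analysis rather than the only one.

```r
# First four rows of the flies data, in multi-sample layout
flies = data.frame(
  C   = c(75, 67, 70, 75),
  F   = c(58, 61, 56, 58),
  F.G = c(58, 59, 58, 61),
  G   = c(57, 58, 60, 59),
  S   = c(62, 66, 65, 63)
)

# Convert to stacked layout and give the columns friendlier names
fly = stack(flies)
names(fly) = c("size", "diet")

# Mean wing size for each diet
tapply(fly$size, fly$diet, mean)

# Analysis of variance: response ~ predictor
fly_aov = aov(size ~ diet, data = fly)
summary(fly_aov)
```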

Unstack

The opposite of the stack() command is unstack(). This looks to reassemble the data so that each grouping level is matched up with the appropriate values. If there are equal replicates in each grouping, then your result is a neat data frame:

> unstack(fly)
    C  F F.G  G  S
1  75 58  58 57 62
2  67 61  59 58 66
3  70 56  58 60 65
4  75 58  61 59 63

If the replication is unbalanced the result is several individual elements bundled together in a list.

> grass
  rich graze
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow
> grasses = unstack(grass)
> grasses
$mow
[1] 12 15 17 11 15

$unmow
[1] 8 9 7 9

In this example the grass dataset shows that there are 5 replicates for mow and 4 for unmow.
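
A list like this is still easy to work with. The sketch below rebuilds the grass data with data.frame() and uses sapply() to apply a command to each element of the list; the use of sapply() is a suggestion, not something the text above depends on.

```r
grass = data.frame(
  rich  = c(12, 15, 17, 11, 15, 8, 9, 7, 9),
  graze = c(rep("mow", 5), rep("unmow", 4))
)

grasses = unstack(grass)
is.data.frame(grasses)    # FALSE: unequal replicates give a list

grasses$unmow             # pick out one element with $
sapply(grasses, length)   # number of replicates in each group
sapply(grasses, mean)     # mean of each group
```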

What data are loaded?

After a while you may well end up with quite a few items in R. It is easy to lose track of what data are loaded (in computer memory) into your R console (the name of the main window where you type stuff). The ls() command shows you what items are currently available:

> ls()
 [1] "algae"       "all.samples" "ans1"        "ans2"        "ans3"
 [6] "ans4"        "ans5"        "ans6"        "b"           "barley"
[11] "barplot.eb"  "bats"        "bbel"        "mf"

Note how the “index” on the left helps you to see how many things there are.

You can also see what items are contained in each object by using ls() and specifying an object:

> ls(mf)
[1] "Algae"  "BOD"    "Length" "NO3"    "Speed"
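
If the list of items gets long, the pattern parameter of ls() can narrow it down. A small sketch, with made-up item names, that lists only the items whose names start with "ans":

```r
# Create a few items to list
ans1 = 10
ans2 = 20
barley = c(1.2, 3.4)

# Only names beginning with "ans" (the ^ means "starts with")
ls(pattern = "^ans")
```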

Removing data items

The rm() command allows you to remove items from the computer memory and so also from the R console. Just place the names of the items to “delete” in the parentheses, separated by commas:

rm(name1, name2, name3, …)
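
A short sketch of rm() in action, using throwaway names invented for this example. The exists() command is used only to confirm that the items have gone. The list parameter accepts the names as quoted text, which is handy when you have built up a set of names some other way.

```r
# Make some throwaway items
tmp_a = 1:3
tmp_b = "text"

# Remove them by name
rm(tmp_a, tmp_b)
exists("tmp_a", inherits = FALSE)   # FALSE

# The same thing with names given as quoted text
old1 = 1
old2 = 2
rm(list = c("old1", "old2"))
```

Be careful with rm(list = ls()), which removes everything in the workspace in one go; as with the menu option, there is no undo.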

On Windows and Mac computers the GUI also has toolbar menus you can click to remove everything. Use this with caution as there is no undo button!

2nd August 2019 aJfsfjlser3f Learn R
