Dr. Mark Gardener
Using R for statistical analyses - More about data
This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.
On this page learn how to create and manipulate data without using a spreadsheet. Learn more about reading data files.
R is Open Source
R is Free
R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T Bell Labs, starting in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in the early 1990s, and R was released under the GPL in 1995. It quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.
R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.
Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page; to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.
With larger data sets the most useful method of creating and storing your information remains the use of a spreadsheet. R can read spreadsheet files in .XLS format with the help of add-on packages, but it is probably better to use .CSV. This format is readily opened by text editors and can be easily modified. Your original data set can be kept in native spreadsheet format and you can use 'save as' to create a .CSV file for the analysis you want to run. To remind yourself about creating and reading CSV files see the introduction page.
The most useful function to read data into R is the read.csv() command. Here is a recap:
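variable = read.csv(file.choose(), header=TRUE, row.names=#)
Here file.choose() opens an explorer-type window allowing you to select your file, header=TRUE tells R that the first row contains the column names, and row.names=# gives the number of the column that holds the row names (replace the # with a number, or leave the parameter out if there are no row names).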
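As a minimal sketch of the recap above, the following writes a small made-up CSV file to a temporary location and reads it back. The file contents and the names csv.path and mydata are invented for illustration, and a file path is used in place of file.choose() so that no dialog window opens.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained
csv.path = tempfile(fileext = ".csv")
writeLines(c("site,count,cover",
             "A,12,3.5",
             "B,7,1.2",
             "C,21,6.8"), csv.path)

# header = TRUE: the first line of the file holds the column names
# row.names = 1: use the first column (site) as the row names
mydata = read.csv(csv.path, header = TRUE, row.names = 1)

# Display the resulting data frame
mydata
```

In your own work you would swap csv.path for file.choose() to pick the file interactively.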
This is not the only way to get data into R as we shall find out now.
The c() command is used extensively in R, especially as a parameter within other functions. It is also a quick way to enter small amounts of data.
If you wish to enter a small vector of data it may not be worthwhile creating a spreadsheet and saving it as a CSV file and then reading it into R. It would be much easier to type the data in directly. There are several ways to do this. The first one is using the c() command (c is short for combine). An example will demonstrate its use:
> data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
Here you have created a variable called data1 and assigned the values in the brackets to it. You may now use the variable you created like any other.
You can use the c() command to append data to an existing vector e.g.
> data1 = c(data1, 12, 14, 11, 9)
Now you have added 4 values to your existing variable. The c() command is also used as part of other functions in R. For example, in graphing it is possible to set the limits of the x and y axes; the c() command is called from within the plot() function like so:
plot(data, xlim= c(lower, upper), ylim= c(lower, upper), ...other commands)
See the section on scatter plots for more information on this command.
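The two uses of c() described above can be sketched together. The data values are the ones from the text; the axis limits are arbitrary choices made only for illustration.

```r
# Create a vector, then append four more values to it
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
data1 = c(data1, 12, 14, 11, 9)

# The vector now holds 12 values
length(data1)

# Use c() inside plot() to set the axis limits (limits chosen arbitrarily here)
plot(data1, xlim = c(0, 15), ylim = c(0, 20))
```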
Numeric values can be entered 'as is' but text values must be in "quotes" when using the c() command.
The values you entered using your c() command were obviously numeric. You can enter text values merely by enclosing them in (double) quotes so:
> dates = c("Jan", "Feb", "Mar", "Apr", "May")
You now have a variable called dates which contains five text values.
What if you were to type in the months without quotes? Try and see:
> dates = c(Jan, Feb, Mar, Apr, May)
Error: object 'Jan' not found
Oh dear. So, it appears that you either have to have numbers or text values in quotes. It is possible to get one other data type but you will see that later on.
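A short sketch of the difference between the two kinds of value follows. It uses class() to report the type of each vector, and wraps the unquoted version in try() so that the error does not halt a script. The name result is introduced here only for illustration, and the example assumes no objects called Jan, Feb or Mar already exist in your workspace.

```r
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
dates = c("Jan", "Feb", "Mar", "Apr", "May")

# Ask R what type of data each vector holds
class(data1)   # numeric
class(dates)   # character

# Without quotes R looks for objects called Jan, Feb, ... and fails;
# try() captures the error instead of stopping
result = try(c(Jan, Feb, Mar), silent = TRUE)
inherits(result, "try-error")
```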
scan() is a useful command for adding larger amounts of data. The basic command accepts numeric values only. To read in text values we must use scan(what="char")
Typing in values using the c() command is fine but when you have a substantial sample size you don't necessarily want to type all the commas! R provides another way of entering data using the scan() command. In basic form the scan command works like this:
> more.data = scan()
1:
The 1: indicates that R is waiting for you to type in the first element of your data. What you need to do now is to type in some values; this time you separate them with spaces and don't bother with the commas. You can press the enter key to spread over several lines. Data entry will stop when you enter a blank line e.g.
1: 2 5 6.2 33 25 1.3 8
To see what you entered type the name of the variable e.g.
> more.data
[1]  2.0  5.0  6.2 33.0 25.0  1.3  8.0
You can see that R has appended decimals to our data so that the precision matches for all items in the vector.
If you try the same thing but with text labels what happens?
It looks like you might need to enter the values in quotes again. It is a real pain to enter lots of quotes so let's find a way around that. Try this:
That's better; now you don't have to type quotes around each item, you merely type what="character" to tell the function to expect text values. In fact you cannot read text values into the scan() command in any other way. In addition you cannot mix text and numeric values.
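The scan() behaviour described above can be tried without typing at the prompt by passing the values through the text parameter of scan(), which is used here only so that the sketch can be pasted in one go. The numbers are the ones from the text; the day labels and the names nums, days and bad are made up for illustration.

```r
# Numeric values: the default for scan()
nums = scan(text = "2 5 6.2 33 25 1.3 8")
nums

# Text values: tell scan() to expect character data
days = scan(text = "mon tue wed thu fri", what = "character")
days

# Text values without what="character" cause an error;
# try() captures it so the rest of a script can carry on
bad = try(scan(text = "mon tue wed"), silent = TRUE)
inherits(bad, "try-error")
```

At the prompt you would leave out the text parameter and type the values in after the 1: as described above.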
When you only have 1-2 variables to input and these are of moderate length, it may be worthwhile entering them using scan() or c() commands. However, when you have more data it is usually better to enter the data into a spreadsheet first and then save as a CSV file for input to R. This subject was introduced earlier (see data files) but here you'll see a bit more detail.
So far you have looked at two types of data item, numeric and text. Let's get a data file to illustrate:
You have three variables, height, plant and water. This is the sort of thing you would expect to form the basis for a two-way analysis of variance. In order for R to read the variables from this data file you would attach() the main variable e.g. attach(twoway). However, it is possible to read the variables without doing this.
To access variables from within larger data sets we can use one of several methods:
attach(data.frame) allows the variables to be accessed by typing the name.
data.frame$variable reads a variable directly.
data.frame[row, col] allows you to access a specific row, column or element.
To see the height variable you type the following:
> twoway$height
[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37
You see the vector of numbers, it's obviously a numeric variable. Notice how you type the name of the original variable then append a dollar sign and the name of the variable within it that you wish to see.
If you look at the water variable next:
> twoway$water
lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
This is something new; the variable doesn't appear to be text (the items are not enclosed in quotes). The first couple of lines show you the data items in the order they are in the table and then you see a line starting with "Levels:" This line shows you that there are three 'things' in the water variable, lo, mid and hi. This type of variable is a factor (as opposed to character or numeric). R assumes that all text values in your CSV file are either headings or are factors unless you specifically tell it otherwise. (This was the default before R version 4.0.0; in later versions text is read as character unless you add stringsAsFactors=TRUE to the read.csv() command.) You will see this later.
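To try the $ notation without the original file, here is a sketch that builds a stand-in for the twoway data frame by hand. The height values are those shown above and the water values follow the lo/mid/hi pattern shown above (the last two are assumed to continue it); the plant labels plantA and plantB are hypothetical, as the real ones are not shown on this page.

```r
# A hand-built stand-in for the twoway data (plant labels are invented)
twoway = data.frame(
  height = c(9, 11, 6, 14, 17, 19, 28, 31, 32,
             7, 6, 5, 14, 17, 15, 44, 38, 37),
  plant  = factor(rep(c("plantA", "plantB"), each = 9)),
  water  = factor(rep(rep(c("lo", "mid", "hi"), each = 3), times = 2))
)

# A numeric variable
twoway$height

# A factor: the items are shown without quotes, followed by a Levels: line
twoway$water

# The levels are held in alphabetical order by default
levels(twoway$water)
```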
A single variable is termed a vector. When you create a larger data file (e.g. as a CSV file) the resulting variable (e.g. twoway above) is called a data frame. You can display the individual variables from the data frame by using the $ symbol as you have just seen. However, there is another way. The data frame is composed of rows and columns; you can pull out individual items using the following syntax:
data.frame[row, col]
So, to see the height variable (the first column) you type:
> twoway[, 1]
[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37
Since you left the row blank all rows are displayed.
If you wish to see the water variable (the third column) you type:
> twoway[, 3]
lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
You can display a single row of course:
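The [row, col] syntax can be sketched using the same hand-built stand-in for the twoway data. The plant labels are hypothetical, and the choice of row 5 is arbitrary.

```r
# A hand-built stand-in for the twoway data (plant labels are invented)
twoway = data.frame(
  height = c(9, 11, 6, 14, 17, 19, 28, 31, 32,
             7, 6, 5, 14, 17, 15, 44, 38, 37),
  plant  = factor(rep(c("plantA", "plantB"), each = 9)),
  water  = factor(rep(rep(c("lo", "mid", "hi"), each = 3), times = 2))
)

# Row left blank: all rows of the first column
twoway[, 1]

# Column left blank: all columns of the fifth row
twoway[5, ]

# Both given: a single element (row 5, column 1)
twoway[5, 1]

# Ranges and names also work: rows 1 to 3 of two named columns
twoway[1:3, c("height", "water")]
```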
The transpose command t() is a fast way to re-arrange a data frame by switching rows and columns.
Once you create and enter a CSV file of data you create a data frame. Here is a simple example showing monthly mean temperatures for an Antarctic research station:
Apart from the fact that it is decidedly chilly you can see that you have two variables, month and temp arranged in two columns. If you wished to create a bar chart of these data it may be more useful to have the data arranged in 12 columns, one for each month, rather than the two. You can switch around a data frame using the transpose command t(). To do that you merely type t(dataname) e.g.
> t(vostok)
The data frame has now been switched around. Also you can see that all the data are enclosed in quotes as if they were text. What has happened is that R has taken the data from the data.frame and made it into a matrix. This is a separate type of data item that I won't cover here.
The t() function is useful for producing barplots that may contain both row and column headings as it allows you to display (and therefore graph) the data sorted by row or column.
To see an individual row or column in a matrix you cannot use the $ notation but you can use the [row, col] method e.g.
> t(vostok)[2, ]
This displays the second row only (the temperatures). To see the 2nd column only, you type:
> t(vostok)[, 2]
Interestingly it does not display as you might expect (although it is the 2nd column). You can replace a single number in the square brackets with an expression. So if for example you wanted to see the 2nd, 3rd and 4th columns you could type:
> t(vostok)[, 2:4]
To pick out columns that are not next to one another, list them using the c() command e.g.
> t(vostok)[, c(1, 2, 6, 7)]
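Here is a small sketch of transposing and indexing. The real vostok file is not reproduced on this page, so the data frame below holds only four months and the temperature values are invented for illustration; tv is a name introduced here.

```r
# A cut-down stand-in for the vostok data (temperatures are invented)
vostok = data.frame(
  month = c("Jan", "Feb", "Mar", "Apr"),
  temp  = c(-32.0, -44.3, -57.9, -64.7)
)

# Transpose: the result is a matrix, not a data frame
tv = t(vostok)
tv
is.matrix(tv)

# The matrix has 2 rows (month, temp) and one column per month
dim(tv)

# Second row only (the temperatures, now held as text)
tv[2, ]

# Second column only
tv[, 2]

# Columns 2 to 4, then a hand-picked set of columns
tv[, 2:4]
tv[, c(1, 4)]
```

Because the month column holds text, every item in the transposed matrix is held as text, which is why the values are shown in quotes.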
If you create a data frame in your spreadsheet and save the result as a CSV file for reading into R you get a selection of numeric and factor variables. However, you may wish to have R regard some of the variables as text (i.e. character variables). To do this you add an extra parameter to the read.csv function.
In the example above you only had 2 columns, the file was read into R using a basic command:
> vostok = read.csv(file.choose())
Since the CSV file already contained the column headings no other parameters were required. However, if you wish to alter the 1st column (month) from a factor to a character you need to use the as.is=# parameter like so:
> vostok = read.csv(file.choose(), as.is=1)
Now the 1st column of data will be read as character rather than as a factor. If you wish to include several columns you can use syntax similar to above e.g. x:y or c(x, y, z)
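The effect of as.is can be checked with a small made-up file. In this sketch the file contents and the names csv.path, v1 and v2 are invented, a file path replaces file.choose(), and stringsAsFactors=TRUE is given explicitly so that text is read as a factor whichever version of R you are running.

```r
# Write a small, made-up CSV file to a temporary location
csv.path = tempfile(fileext = ".csv")
writeLines(c("month,temp",
             "Jan,-32.0",
             "Feb,-44.3",
             "Mar,-57.9"), csv.path)

# Text columns read as factors
v1 = read.csv(csv.path, stringsAsFactors = TRUE)
class(v1$month)

# as.is = 1: keep the first column as plain character text
v2 = read.csv(csv.path, stringsAsFactors = TRUE, as.is = 1)
class(v2$month)
```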
A data frame is a rectangular table made up of a number of columns, each containing a series of data as numbers or text. If one column is shorter than the others it will be padded out with NA values. These are ignored by most stats tests but may be included in routines to calculate the mean or median for example. In most cases you may ignore the NA values by including the parameter na.rm= TRUE (see the section on basic stats).
The data frame you are working with may contain several columns, each containing a sample of numeric data. Here is a sample data file (called sugars). Each column shows the growth of an insect fed on a particular diet. These data were used in the demonstration of one-way ANOVA:
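A quick sketch of how NA values affect the mean and median, using a made-up vector x:

```r
# A short vector with one missing value
x = c(4, 7, NA, 5)

# By default the NA is carried through to the result
mean(x)

# na.rm = TRUE drops the NA before calculating
mean(x, na.rm = TRUE)
median(x, na.rm = TRUE)
```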
You can see the data are in 6 columns, each representing a sample. These are the sort of data that would likely be analysed using ANOVA. However, the aov() routine in R requires the data to be organized in a slightly different manner. What is required are two columns only, one for the growth data (i.e. the numbers) and one for the factors (i.e. the types of treatment, the sugars). Ideally you would have entered the data into your spreadsheet in the appropriate manner right at the start but, if for some reason this was not done then all is not lost.
R provides a routine to take the individual columns and stack them together to form a new data frame in the correct fashion for our ANOVA. The command is stack(data.frame) and if we perform this on our sugar data we see something like the following:
The function creates two columns; the numbers are placed in a column entitled values whilst the factors are placed in a column entitled ind. You can now perform your analysis on the stacked data, either by assigning it to a new variable name (easiest option) or replacing the variables in the aov() expression with the stack() variables e.g.
> aov(stack(sugars)$values ~ stack(sugars)$ind)
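The following sketch shows the "easiest option" of assigning the stacked data to a new variable first. The sugars file is not reproduced on this page, so the data frame below uses the six column names mentioned in the text with only three invented growth values each; sugar.all is a name introduced here.

```r
# A cut-down stand-in for the sugars data (all values are invented)
sugars = data.frame(
  C    = c(75, 67, 70),
  F    = c(58, 61, 56),
  F.G  = c(58, 59, 58),
  G    = c(57, 58, 60),
  S    = c(62, 66, 65),
  test = c(63, 64, 66)
)

# Stack the six columns into two: values (numbers) and ind (column of origin)
sugar.all = stack(sugars)
head(sugar.all)

# Run the one-way ANOVA on the stacked data
summary(aov(values ~ ind, data = sugar.all))
```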
It is possible that you may want to extract only some of the columns from a data frame. The stack() command allows you to select which columns to make into the new stacked variable.
In general terms the command is:
stack(data, select= c(var1, var2))
Notice how the list of variables you wish to extract is in the c(item1, item2) format that you have come across before (see also the examples in the section on scatter plots). For the example above, if you wished to extract only "pure" sugars you might use the following command:
> sugar.st = stack(sugars, select= c(C, F, G, S))
The new data frame now contains two columns entitled values and ind as before but you have missed out the samples for F.G and test.
It is possible to give more meaningful names to the two columns of your new stacked data frame. To do this you use the names() command. In this instance you would type:
> names(sugar.st) = c("growth", "sugar")
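Selecting columns and renaming the result can be sketched together. As before, the sugars data frame here is a cut-down stand-in with invented values; only the column names come from the text.

```r
# A cut-down stand-in for the sugars data (all values are invented)
sugars = data.frame(
  C    = c(75, 67, 70),
  F    = c(58, 61, 56),
  F.G  = c(58, 59, 58),
  G    = c(57, 58, 60),
  S    = c(62, 66, 65),
  test = c(63, 64, 66)
)

# Stack only the four "pure" sugar columns
sugar.st = stack(sugars, select = c(C, F, G, S))

# Replace the default column names (values, ind) with more meaningful ones
names(sugar.st) = c("growth", "sugar")
sugar.st
```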
The opposite of stacking is unstacking! Using the example above, you have your stacked sugar/growth data and wish to extract the various samples into individual variables. You use the unstack(data.frame) command so:
Now you have a list of six vectors, one for each sample (e.g. sugar). To see a single sample you use the $ notation:
This can be useful to extract a single sample for some other analysis.
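A sketch of unstacking, again using a cut-down stand-in for the sugars data with invented values. The names stacked and wide are introduced here for illustration, and the formula values ~ ind is given explicitly to show which column holds the numbers and which holds the groups.

```r
# A cut-down stand-in for the sugars data (all values are invented)
sugars = data.frame(
  C    = c(75, 67, 70),
  F    = c(58, 61, 56),
  F.G  = c(58, 59, 58),
  G    = c(57, 58, 60),
  S    = c(62, 66, 65),
  test = c(63, 64, 66)
)

# Stack the data, then reverse the process
stacked = stack(sugars)
wide = unstack(stacked, values ~ ind)
wide

# Pull out a single sample using the $ notation
wide$C
```

When every sample has the same number of items, as here, the samples come back as the columns of a data frame; with samples of unequal length unstack() returns a plain list instead.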