Dr. Mark Gardener

On this page...

Making Data
Combine command
Types of Data
Entering data with scan()
Multiple variables
More types of data
Variables within data
Transposing data
Making text columns
Missing values
Stacking data
Selecting columns
Naming columns
Unstacking data

Using R for statistical analyses - More about data

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page learn how to create and manipulate data without using a spreadsheet. Learn more about reading data files.

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Laboratories, starting in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in the early 1990s, and R was released as open-source software in 1995. It quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page; to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.



read.csv() is the most useful command for entering large and complex data sets into R.

Creating data

With larger data sets the most useful method of creating and storing your information remains the use of a spreadsheet. R can read spreadsheet files in .XLS format (with the help of add-on packages) but it is probably better to use .CSV. This format is readily opened by text editors and can be easily modified. Your original data set can be kept in native spreadsheet format and you can use 'save as' to create a .CSV file for the analysis you want to run. To remind yourself about creating and reading CSV files see the introduction page.

The most useful function to read data into R is the read.csv() command. Here is a recap:

variable = read.csv(file.choose(), header=TRUE, row.names=n)

file.choose() opens an explorer-type window allowing you to select your file.
header=TRUE reads the 1st row as a list of column names (you can set this to FALSE).
row.names=n tells R which column (number n) contains the row names, if any; leave this parameter out if there are none.

This is not the only way to get data into R as we shall find out now.
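For a scripted, repeatable version of the recap above, file.choose() can be swapped for a file path. The sketch below is only an illustration: the file is a temporary one written on the spot, and the column names (site, count, depth) and values are made up for the example.

```r
# Write a small CSV to a temporary file so the example is self-contained
# (the column names and values are invented for illustration).
csv.file = tempfile(fileext = ".csv")
writeLines(c("site,count,depth",
             "A,12,3.5",
             "B,7,1.2",
             "C,9,2.8"), csv.file)

# header=TRUE takes the 1st row as column names;
# row.names=1 uses the 1st column (site) as the row names.
mydata = read.csv(csv.file, header = TRUE, row.names = 1)

mydata            # show the data frame
rownames(mydata)  # the site labels
names(mydata)     # the remaining column names
```

Using a path rather than file.choose() means the same script can be re-run without any clicking.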



The c() command is used extensively in R, especially as a parameter within other functions. It is also a quick way to enter small amounts of data.

Combine values command

If you wish to enter a small vector of data it may not be worthwhile creating a spreadsheet, saving it as a CSV file and then reading it into R. It would be much easier to type the data in directly. There are several ways to do this. The first one is using the c() command (c is short for combine). An example will demonstrate its use:

> data1 = c(2, 4, 5, 2, 3, 7, 8, 4)

Here you have created a variable called data1 and assigned the values in the brackets to it. You may now use the variable you created like any other.

You can use the c() command to append data to an existing vector e.g.

> data1 = c(data1, 12, 14, 11, 9)

Now we have added 4 values to our existing variable. The c() command is also used as part of other functions in R. For example, in graphing it is possible to set the limits of the x and y axes; c() is called from within the plot() function like so:

plot(data, xlim= c(lower, upper), ylim= c(lower, upper), ...other commands)

See the section on scatter plots for more information on this command.
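The two uses of c() described above can be tried together. This is a minimal sketch; the limits variable is a name made up for the example and simply holds a lower and upper value of the kind that xlim or ylim expect.

```r
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)   # 8 values
data1 = c(data1, 12, 14, 11, 9)     # append 4 more
length(data1)                       # number of items now in the vector

# c() building a pair of limits from the smallest and largest values;
# a vector like this could be passed to xlim or ylim in plot().
limits = c(min(data1), max(data1))
limits
```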



Numeric values can be entered 'as is' but text values must be in "quotes" when using the c() command.

Types of data

The values you entered using your c() command were obviously numeric. You can enter text values merely by enclosing them in (double) quotes so:

dates = c("Jan", "Feb", "Mar", "Apr", "May")

You now have a variable called dates which contains five text values.

What if you were to type in the months without quotes? Try and see:

> month = c(Jan, Feb, Mar, Apr, May)
Error: object 'Jan' not found

Oh dear. R treats an unquoted word as the name of an object, and there is no object called Jan. So, it appears that you either have to have numbers or text values in quotes. It is possible to get one other data type but you will see that later on.
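To check what type of data a vector holds you can ask R directly with class(). The short sketch below also shows what happens if numbers and text are mixed in a single c() call; the variable names nums and mixed are made up for the example.

```r
nums  = c(2, 4, 5)
dates = c("Jan", "Feb", "Mar")
class(nums)    # "numeric"
class(dates)   # "character"

# Mixing the two in one c() call turns everything into text
mixed = c(1, "Feb", 3)
class(mixed)   # "character"
mixed
```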



scan() is a useful command for adding larger amounts of data. The basic command accepts numeric values only. To read in text values we must use scan(what="char")

Typing in values using scan()

Typing in values using the c() command is fine but when you have a substantial sample size you don't necessarily want to type all the commas! R provides another way of entering data using the scan() command. In its basic form the scan() command works like this:

> more.data = scan()

1:

The 1: indicates that R is waiting for you to type in the first element of your data. What you need to do now is to type in some values; this time you separate them with spaces and don't bother with the commas. You can press the enter key to spread over several lines. Data entry will stop when you enter a blank line e.g.

1: 2 5 6.2 33 25 1.3 8
8: 111
9:
Read 8 items
>

To see what you entered type the name of the variable e.g.

> more.data
[1] 2.0 5.0 6.2 33.0 25.0 1.3 8.0 111.0
>

You can see that R has appended decimals to our data so that the precision matches for all items in the vector.

If you try the same thing but with text labels what happens?

> more.months = scan()
1: jan feb mar apr
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'jan'
>

It looks like you might need to enter the values in quotes again. It is a real pain to enter lots of quotes so let's find a way around that. Try this:

> more.months = scan(what="character")
1: jan feb mar apr may jun
7: jul
8:
Read 7 items
> more.months
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul"
>

That's better; now you don't have to type quotes around each item, you merely type what="character" to tell the function to expect text values. In fact you cannot read text values into the scan() command in any other way. In addition you cannot mix text and numeric values.
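The examples above need you to type at the prompt. If you want to try scan() from a script instead, its text parameter lets you hand over the values as a single string. This is only a sketch of the same two cases, numeric and character, using the values from above.

```r
# text= supplies the values directly instead of typing them at the prompt
more.data = scan(text = "2 5 6.2 33 25 1.3 8 111")
more.data

# what="character" tells scan() to expect text values
more.months = scan(text = "jan feb mar apr may jun jul", what = "character")
more.months
```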



Multiple variables

When you only have 1-2 variables to input and these are of moderate length, it may be worthwhile entering them using scan() or c() commands. However, when you have more data it is usually better to enter the data into a spreadsheet first and then save as a CSV file for input to R. This subject was introduced earlier (see data files) but here you'll see a bit more detail.


Types of data (again)

So far you have looked at two types of data item, numeric and text. Let's get a data file to illustrate:

> twoway = read.csv(file.choose())
> twoway

 
   height    plant water
1       9 vulgaris    lo
2      11 vulgaris    lo
3       6 vulgaris    lo
4      14 vulgaris   mid
5      17 vulgaris   mid
6      19 vulgaris   mid
7      28 vulgaris    hi
8      31 vulgaris    hi
9      32 vulgaris    hi
10      7   sativa    lo
11      6   sativa    lo
12      5   sativa    lo
13     14   sativa   mid
14     17   sativa   mid
15     15   sativa   mid
16     44   sativa    hi
17     38   sativa    hi
18     37   sativa    hi

You have three variables, height, plant and water. This is the sort of thing you would expect to form the basis for a two-way analysis of variance. In order for R to read the variables from this data file you would attach() the main variable e.g. attach(twoway). However, it is possible to read the variables without doing this.



To access variables from within larger data sets we can use one of several methods:

attach(data.frame) allows the variables to be accessed by typing the name.

data.frame$variable reads a variable directly.

data.frame[row, col] allows you to access a specific row, column or element.

Variables inside data sets

To see the height variable you type the following:

> twoway$height

[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

You see the vector of numbers, it's obviously a numeric variable. Notice how you type the name of the original variable then append a dollar sign and the name of the variable within it that you wish to see.

If you look at the water variable next:

> twoway$water

[1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
[17] hi hi
Levels: hi lo mid

This is something new; the variable doesn't appear to be text (the items are not enclosed in quotes). The first couple of lines show you the data items in the order they are in the table and then you see a line starting with "Levels:". This line shows you that there are three 'things' in the water variable: lo, mid and hi. This type of variable is a factor (as opposed to character or numeric). Here R has assumed that all text values in your CSV file are either headings or factors unless you specifically tell it otherwise. (From R version 4.0.0 onwards, read.csv() reads text columns as character by default; add stringsAsFactors=TRUE to get factors as shown here.) You will see how to control this later.

A single variable is termed a vector. When you create a larger data file (e.g. as a CSV file) the resulting variable (e.g. twoway above) is called a data frame. You can display the individual variables from the data frame by using the $ symbol as you have just seen. However, there is another way. The data frame is composed of rows and columns; you can pull out individual items using the following syntax:

data.frame[row, col]

So, to see the height variable you type:

> twoway[,1]

[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

Since you left the row blank all rows are displayed.

If you wish to see the water variable you type:

> twoway[,3]

[1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
[17] hi hi
Levels: hi lo mid

You can display a single row of course:

> twoway[4,]

 
  height    plant water
4     14 vulgaris   mid
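The access methods above can be tried without a data file. The sketch below builds a cut-down version of the twoway data (rows 1, 4 and 7 only) by hand; the name small is made up for the example, and stringsAsFactors=TRUE is set so that the text columns become factors as in the output shown above.

```r
# A cut-down version of the twoway data (rows 1, 4 and 7)
small = data.frame(height = c(9, 14, 28),
                   plant  = c("vulgaris", "vulgaris", "vulgaris"),
                   water  = c("lo", "mid", "hi"),
                   stringsAsFactors = TRUE)   # text columns become factors

small$height         # a column by name
small[, 1]           # the same column by position
small[2, ]           # the 2nd row, all columns
small[2, 1]          # a single element: row 2, column 1
levels(small$water)  # the factor levels, in alphabetical order
```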


The transpose command t() is a fast way to re-arrange a data frame by switching rows and columns.

Transposing data frames

When you read a CSV file of data into R you create a data frame. Here is a simple example showing monthly mean temperatures for an Antarctic research station:

> vostok

 
   month  temp
1    Jan -32.0
2    Feb -47.3
3    Mar -57.2
4    Apr -62.9
5    May -61.0
6    Jun -70.6
7    Jul -65.5
8    Aug -68.2
9    Sep -63.2
10   Oct -58.0
11   Nov -42.0
12   Dec -30.4

Apart from the fact that it is decidedly chilly you can see that you have two variables, month and temp arranged in two columns. If you wished to create a bar chart of these data it may be more useful to have the data arranged in 12 columns, one for each month, rather than the two. You can switch around a data frame using the transpose command t(). To do that you merely type t(dataname) e.g.

> t(vostok)

  1 2 3 4 5 6 7 8 9 10 11 12
month "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
temp "-32.0" "-47.3" "-57.2" "-62.9" "-61.0" "-70.6" "-65.5" "-68.2" "-63.2" "-58.0" "-42.0" "-30.4"

The data frame has now been switched around. Also you can see that all the data are enclosed in quotes as if they were text. What has happened is that R has taken the data from the data.frame and made it into a matrix. This is a separate type of data item that I won't cover here.

The t() function is useful for producing barplots that may contain both row and column headings as it allows you to display (and therefore graph) the data sorted by row or column.

To see an individual row or column in a matrix you cannot use the $ notation but you can use the [row, col] method e.g.

> t(vostok)[2,]

  1 2 3 4 5 6 7 8 9 10 11 12
temp "-32.0" "-47.3" "-57.2" "-62.9" "-61.0" "-70.6" "-65.5" "-68.2" "-63.2" "-58.0" "-42.0" "-30.4"

This displays the second row only (the temperatures). To see the 2nd column only, you type:

> t(vostok)[,2]

month temp
"Feb" "-47.3"

Interestingly it does not display as you might expect (although it is the 2nd column): a single column is returned as a simple vector, with the row names as labels. You can replace the single number in the square brackets with an expression. So if for example you wanted to see the 2nd, 3rd and 4th columns you could type:

> t(vostok)[,2:4]

  2 3 4
month "Feb" "Mar" "Apr"
temp "-47.3" "-57.2" "-62.9"

The expression now reads, columns 2 to 4. For a more complex arrangement you can use the c() function that you have come across before (see creating data above and the section on scatter plots) e.g.

> t(vostok)[, c(1, 2, 6, 7)]

  1 2 6 7
month "Jan" "Feb" "Jun" "Jul"
temp "-32.0" "-47.3" "-70.6" "-65.5"
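You can confirm what t() has done by testing the result. The sketch below uses the first three months of the vostok data, typed in by hand; the names vostok3, tv and temps are made up for the example. The last part is one possible way (an assumption, not the method used on this page) to keep the temperatures numeric: take the temp column alone and label it with the months.

```r
vostok3 = data.frame(month = c("Jan", "Feb", "Mar"),
                     temp  = c(-32.0, -47.3, -57.2))

tv = t(vostok3)
is.matrix(tv)     # TRUE: the result is a matrix, not a data frame
dim(tv)           # 2 rows, 3 columns
is.character(tv)  # TRUE: mixing text and numbers makes every cell text

# Keep the numbers numeric: a named vector of temperatures
temps = vostok3$temp
names(temps) = vostok3$month
temps
```

A named numeric vector like temps can be handed straight to barplot(), which uses the names as bar labels.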


Making text columns in data frames

If you create a data frame in your spreadsheet and save the result as a CSV file for reading into R you get a selection of numeric and factor variables. However, you may wish to have R regard some of the variables as text (i.e. character variables). To do this you add an extra parameter to the read.csv() function.

In the example above you only had 2 columns, the file was read into R using a basic command:

> vostok = read.csv(file.choose())

Since the CSV file already contained the column headings no other parameters were required. However, if you wish to alter the 1st column (month) from a factor to a character you need to use the as.is parameter, giving it the column number, like so:

> vostok = read.csv(file.choose(), as.is=1)

Now the 1st column of data will be read as character rather than as a factor. If you wish to include several columns you can use syntax similar to above e.g. x:y or c(x, y, z)
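The effect of as.is can be checked with class(). This sketch writes the first three months of the vostok data to a temporary file so that it runs without file.choose(); the names v.factor and v.char are made up for the example. Note that stringsAsFactors=TRUE is set in the first call because, from R 4.0.0 onwards, text columns are no longer converted to factors by default.

```r
csv.file = tempfile(fileext = ".csv")
writeLines(c("month,temp", "Jan,-32.0", "Feb,-47.3", "Mar,-57.2"), csv.file)

# stringsAsFactors=TRUE requests the factor behaviour described here
# (the default in R before version 4.0.0)
v.factor = read.csv(csv.file, stringsAsFactors = TRUE)
class(v.factor$month)   # "factor"

# as.is=1 keeps the 1st column as character
v.char = read.csv(csv.file, as.is = 1)
class(v.char$month)     # "character"
```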



Missing values

A data frame is a rectangular table made up of a number of columns, each containing a series of data as numbers or text. If one column is shorter than the others it will be padded out with NA values. These are ignored by most stats tests but are not skipped automatically by routines such as mean() or median(), which return NA if any value is missing. In most cases you can tell such routines to ignore the NA values by including the parameter na.rm=TRUE (see the section on basic stats).
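A quick illustration of na.rm, using the values of the test column from the sugars data shown further down this page (seven measurements padded with three NA values):

```r
test = c(63, 64, 66, 65, 67, 68, 64, NA, NA, NA)

mean(test)                  # NA, because of the missing values
mean(test, na.rm = TRUE)    # mean of the 7 real values
median(test, na.rm = TRUE)  # median of the 7 real values
sum(is.na(test))            # how many values are missing
```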



Stacking data

The data frame you are working with may contain several columns, each containing a sample of numeric data. Here is a sample data file (called sugars). Each column shows the growth of an insect fed on a particular diet. These data were used in the demonstration of one-way ANOVA:

> sugars

 
    C  G  F F.G  S test
1  75 57 58  58 62   63
2  67 58 61  59 66   64
3  70 60 56  58 65   66
4  75 59 58  61 63   65
5  65 62 57  57 64   67
6  71 60 56  56 62   68
7  67 60 61  58 65   64
8  67 57 60  57 65   NA
9  76 59 57  57 62   NA
10 68 61 58  59 67   NA

You can see the data are in 6 columns, each representing a sample. These are the sort of data that would likely be analysed using ANOVA. However, the aov() routine in R requires the data to be organized in a slightly different manner. What is required are two columns only, one for the growth data (i.e. the numbers) and one for the factors (i.e. the types of treatment, the sugars). Ideally you would have entered the data into your spreadsheet in the appropriate manner right at the start but, if for some reason this was not done then all is not lost.

R provides a routine to take the individual columns and stack them together to form a new data frame in the correct fashion for our ANOVA. The command is stack(data.frame) and if we perform this on our sugar data we see something like the following:

> stack(sugars)

 
   values ind
1      75   C
2      67   C
3      70   C
4      75   C
5      65   C
6      71   C
7      67   C
8      67   C
9      76   C
10     68   C
11     57   G
...

The function creates two columns: the numbers are placed in a column entitled values whilst the factors are placed in a column entitled ind. You can now perform your analysis on the stacked data, either by assigning it to a new variable name (the easiest option) or by replacing the variables in the aov() expression with the stack() variables e.g.

> carbs = stack(sugars)
> aov(values ~ ind, data= carbs)

or...

> aov(stack(sugars)$values ~ stack(sugars)$ind)
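To try the stack-then-analyse route without the data file, here is a minimal sketch using only the first three rows of the C, G and F columns, typed in by hand. The names sugars3, carbs3 and sugars.aov are made up for the example, and with so few values the ANOVA result itself is not meaningful.

```r
# First three rows of the C, G and F columns from the sugars data
sugars3 = data.frame(C = c(75, 67, 70),
                     G = c(57, 58, 60),
                     F = c(58, 61, 56))

carbs3 = stack(sugars3)
carbs3          # 9 rows: numbers in values, sample labels in ind
names(carbs3)   # "values" "ind"

sugars.aov = aov(values ~ ind, data = carbs3)
summary(sugars.aov)   # the ANOVA table
```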


Selecting columns

It is possible that you may want to extract only some of the columns from a data frame. The stack() command allows you to select which columns to make into the new stacked variable.

In general terms the command is:

stack(data, select= c(var1, var2))

Notice how the list of variables we wish to extract is in the c(item1, item2) format that you have come across before (see also the examples in the section on scatter plots). For the example above, if you wished to extract only "pure" sugars we might use the following command:

> sugar.st = stack(sugars, select= c(C, F, G, S))

The new data frame now contains two columns entitled values and ind as before but you have left out the samples for F.G and test.


Naming the stacked columns

It is possible to give more meaningful names to the two columns of your new stacked data frame. To do this you use the names() command. In this instance you would type:

> names(carbs) = c("growth", "sugar")

You will notice how the names are assigned using the c() function that you came across earlier (see also the examples in the section on scatter plots).
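Selecting columns and renaming the result can be tried together on a small data frame. In this sketch sugars3 holds just the first two rows of four of the sugars columns, typed in by hand; the name sugars3 is made up for the example.

```r
sugars3 = data.frame(C = c(75, 67), G = c(57, 58),
                     F.G = c(58, 59), test = c(63, 64))

# Keep only the C and G columns when stacking
sugar.st = stack(sugars3, select = c(C, G))
levels(sugar.st$ind)   # only the selected samples remain

# Give the two stacked columns more meaningful names
names(sugar.st) = c("growth", "sugar")
sugar.st
```

After renaming, the columns are referred to by their new names, e.g. sugar.st$growth.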



Unstack

The opposite of stacking is unstacking! Using the example above, you have your stacked sugar/growth data and wish to extract the various samples into individual variables. You use the unstack(data.frame) command so:

> unstack(carbs)

$C
[1] 75 67 70 75 65 71 67 67 76 68

$F
[1] 58 61 56 58 57 56 61 60 57 58

$F.G
[1] 58 59 58 61 57 56 58 57 57 59

$G
[1] 57 58 60 59 62 60 60 57 59 61

$S
[1] 62 66 65 63 64 62 65 65 62 67

$test
[1] 63 64 66 65 67 68 64

Now you have a list of six vectors, one for each sample (e.g. sugar). To see a single sample you use the $ notation:

> unstack(carbs)$F

[1] 58 61 56 58 57 56 61 60 57 58

This can be useful to extract a single sample for some other analysis.
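Here is a small sketch of unstack() on data that are already in stacked form. The values are the first few from the C and test samples, typed in by hand, and the names carbs3 and samples are made up for the example.

```r
# Samples of unequal length, already in stacked form
carbs3 = data.frame(values = c(75, 67, 70, 63, 64),
                    ind    = factor(c("C", "C", "C", "test", "test")))

samples = unstack(carbs3)
samples        # a list, because the samples differ in length
samples$test   # one sample as a plain vector
```

Note that when every sample has the same length unstack() returns a data frame rather than a list; the $ notation works in either case.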

