Dr. Mark Gardener 

Using R for statistical analyses: Basic Statistics

This page is intended to help you get to grips with the powerful statistical program called R. It is not intended as a course in statistics. If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page learn how to perform simple statistical tests like the t-test, U-test, chi-squared and goodness of fit tests, as well as some basic descriptive statistical functions (e.g. mean, variance).

I run courses in using R, which may be held at various locations.

R is Open Source 
What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Laboratories, beginning in the 1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in the early 1990s and was released as open-source software in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.


By default R keeps NA (missing) values in variables. These arise, for example, from data sets containing columns of unequal length. To ensure that you are only using the real data, add na.rm = TRUE to your commands.
Basic stats

R provides a number of functions for basic statistics, e.g. mean(variable), median(variable) and var(variable).

It is possible to use these and other functions in combination as if you were using a calculator. Other functions include sqrt(variable) to determine the square root. To raise a number to a power use the caret character, e.g. 2^3 gives 2 to the power of 3 (i.e. 8).

If your data set is made up of several columns they may not all be of the same length. By default R pads out the 'missing' cells with NA. If your variable contains NA values then this will affect your calculations: most summary functions will return NA. To get around this use na.rm = TRUE in the command, e.g. mean(variable, na.rm = TRUE)
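As a rough illustration of the points above, here is a minimal sketch using a small made-up vector. The variable name x and its values are assumptions for the example only; var() and sd() are the base R functions for the sample variance and standard deviation.

```r
# Hypothetical sample containing one missing value (NA)
x <- c(4, 8, 15, 16, 23, 42, NA)

mean(x)                    # NA, because the missing value propagates
mean(x, na.rm = TRUE)      # 18
median(x, na.rm = TRUE)    # 15.5
var(x, na.rm = TRUE)       # sample variance
sd(x, na.rm = TRUE)        # sample standard deviation

# Functions can be combined as on a calculator
sqrt(var(x, na.rm = TRUE)) # same value as sd(x, na.rm = TRUE)
2^3                        # 8
```

Note that na.rm = TRUE has to be given to each function separately; it is not a global setting.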

The t-test defaults to the Welch procedure, which does not assume that the variances are equal.
T-test

The t-test is used to determine statistical differences between two samples. There is also a version that can be used as a paired test, i.e. when you have measurements collected as matched pairs.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables can be referred to by name.

To perform a t-test you type:

> t.test(var1, var2)

Welch Two Sample t-test
data: x1 and x2

This version of the test does not assume that the variance of the two samples is equal and performs a Welch two sample t-test. The "classic" version of the t-test can be run as follows:

> t.test(var1, var2, var.equal=TRUE)

Two Sample t-test
data: x1 and x2


A paired t-test is simple to run; just add paired = TRUE to the basic command.

Now the variances of the two samples are considered equal and the basic version is performed. To run a t-test on paired data you add a new term:

> t.test(var1, var2, paired=TRUE)

Paired t-test
data: x1 and x2

T-test Step by Step
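The stages described in the T-test section might be sketched as follows. This is only an illustration under stated assumptions: the data values and the names dat, csv.file and my.data are made up, the .CSV file is written to a temporary location so that the example is self-contained, and with() is used in place of attach() so that the search path is left unchanged.

```r
# Hypothetical measurements; the column names var1 and var2 are assumptions
dat <- data.frame(var1 = c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1),
                  var2 = c(6.3, 5.9, 7.1, 6.4, 6.8, 6.0, 6.6, 7.0))

# Stage 1: arrange the data in a .CSV file (normally done in a spreadsheet)
csv.file <- tempfile(fileext = ".csv")
write.csv(dat, csv.file, row.names = FALSE)

# Stage 2: read the file into memory and give it a sensible name
my.data <- read.csv(csv.file)

# Stage 3: run the tests, referring to the columns by name
welch   <- with(my.data, t.test(var1, var2))                   # Welch (default)
classic <- with(my.data, t.test(var1, var2, var.equal = TRUE)) # equal variances
paired  <- with(my.data, t.test(var1, var2, paired = TRUE))    # matched pairs

# The result is a list; individual components can be extracted
welch$method
classic$parameter   # degrees of freedom
paired$p.value
```

Assigning each result to a variable, as here, means the components (statistic, parameter, p.value and so on) can be reused in later calculations.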


The Mann-Whitney U-test is also known as the Wilcoxon rank sum test. If you have tied ranks R will give you a warning message.
U-test

The Mann-Whitney U-test is commonly used to test for significant differences between two samples when the data do not meet the assumptions of a parametric test. In R the test is, perhaps confusingly, called the Wilcoxon test and can be applied to two samples or paired data.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables can be referred to by name.

The basic U-test is performed on two samples like so:

> wilcox.test(var1, var2)

Wilcoxon rank sum test with continuity correction
data: x1 and x3
Warning message:

To run a paired test simply add paired = TRUE to the basic command.

If you have paired data you can run a matched pair test:

> wilcox.test(var1, var2, paired=TRUE)

Wilcoxon signed rank test with continuity correction
data: x1 and x3
Warning messages:

In the above examples we see that there are several warning messages. These arise from tied ranks, which prevent R from computing an exact p-value, so it uses a normal approximation instead; we can safely ignore them. Also, the test runs with continuity correction as the default. If you want to turn this off (I cannot see why you would) then add correct=FALSE to the parameters, e.g.

> wilcox.test(var1, var2, correct=FALSE)

U-test Step by Step
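A minimal sketch of the U-test variants described in the U-test section, using made-up samples (the values are assumptions and contain no ties, so no warning is produced). With small samples and no ties R computes an exact p-value, so exact = FALSE is added here purely to show the normal approximation and the effect of the continuity correction.

```r
# Hypothetical samples with no tied values
var1 <- c(12, 15, 9, 20, 17, 11, 14)
var2 <- c(22, 18, 25, 19, 30, 23, 21)

# Basic two-sample test; an exact p-value is computed for data like these
exact.u <- wilcox.test(var1, var2)

# Normal approximation, with and without the continuity correction
approx.u    <- wilcox.test(var1, var2, exact = FALSE)
uncorrected <- wilcox.test(var1, var2, exact = FALSE, correct = FALSE)

approx.u$method
approx.u$p.value
uncorrected$p.value   # smaller, because the correction is conservative

# Matched pairs: Wilcoxon signed rank test
paired.u <- wilcox.test(var1, var2, paired = TRUE, exact = FALSE)
paired.u$method
```

The correction only matters when the normal approximation is in use, which is also what happens automatically when the data contain ties.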


Paired tests

Most of the regular stats routines provide an option to run as a paired variant; for example, both t.test() and wilcox.test() accept paired = TRUE.

Chi-squared tests

Tests for association are easily performed in R. The basic function is chisq.test()

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that a data matrix is created.

To perform the chi-squared test you type something like the following:

> chisq.test(your.data)

Pearson's Chi-squared test
data: your.data

This gives you a basic result but you will want more than that in order to interpret the statistic. The test produces more data than is displayed; to see what you have to work with type:

> names(chisq.test(your.data))
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals" "stdres"

This shows you that there are other data that you can call upon to help you. It is cumbersome to run the test each time, so it would be better to assign the chi-squared test result to a variable. It's a good habit to get into when using R and means that you can use the results in further calculations. In this instance you might try:

> your.chi = chisq.test(your.data)

To see the observed values (i.e. the original data) type:

> your.chi$observed

To see the expected values type:

> your.chi$expected

To see the residuals type:

> your.chi$residuals

The residuals calculated are the Pearson residuals, i.e. (observed - expected) / sqrt(expected). You can examine these and easily pick out which are the most important associations (and their direction).

You do not actually need to type the full command to see the components of the chi-squared test. After the $ sign you can type a short version and as long as it is unique it will be interpreted, e.g.

> your.chi$obs

R will produce the desired table. If you wish to extract a single value from one of these tables then you can do that by appending an extra part, e.g.

> your.chi$res["row.name", "col.name"]

In other words, add a square bracket and type in the row and column headings (in quotes) that define the value you wish.

Yates Correction is appropriate for 2 x 2 contingency
tables.

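Yates Correction

When using a 2 x 2 contingency table it is common practice to reduce the absolute |O - E| differences by 0.5. The chisq.test() function applies this correction to 2 x 2 tables by default; you can state it explicitly by adding correct=TRUE to the original function, e.g.

> your.chi = chisq.test(your.data, correct=TRUE)

To turn the correction off use correct=FALSE. If your table is larger then the correction will not be done (the basic test will run instead).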
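To see the effect of the correction, here is a small sketch with a made-up 2 x 2 table; the counts and the row and column labels are assumptions for illustration only.

```r
# Hypothetical 2 x 2 table of counts (filled column by column)
tab <- matrix(c(12, 5, 7, 9), nrow = 2,
              dimnames = list(c("site.a", "site.b"),
                              c("present", "absent")))

with.yates    <- chisq.test(tab)                   # correct = TRUE is the default
without.yates <- chisq.test(tab, correct = FALSE)  # plain Pearson statistic

with.yates$method
with.yates$statistic      # smaller than the uncorrected statistic
without.yates$statistic
```

Because the correction shrinks each |O - E| difference, the corrected statistic is never larger than the uncorrected one, which makes the test more conservative.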

Chi-squared Step by Step
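The chi-squared workflow described in the Chi-squared tests section could be sketched like this. The table of counts, its row and column names, and the temporary .CSV file are all assumptions made so that the example runs on its own; in practice you would read your own file.

```r
# Hypothetical contingency table: three habitats by three species
counts <- data.frame(sp.a = c(20, 12, 8),
                     sp.b = c(10, 15, 25),
                     sp.c = c(15, 18, 12),
                     row.names = c("wood", "heath", "marsh"))

# Stage 1: the data arranged in a .CSV file with row and column names
csv.file <- tempfile(fileext = ".csv")
write.csv(counts, csv.file)

# Stage 2: read the file; row.names = 1 says the first column holds row names
your.data <- read.csv(csv.file, row.names = 1)

# Run the test once and keep the result for later use
your.chi <- chisq.test(your.data)

names(your.chi)                 # components available for extraction
your.chi$expected               # expected values
your.chi$residuals              # Pearson residuals
your.chi$res["marsh", "sp.b"]   # a single residual, using a short name
```

Large positive or negative residuals point to the cells that contribute most to the association; here the marsh/sp.b cell has more observations than expected.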


Goodness of Fit test

We can use the chi-squared distribution to calculate goodness of fit to predetermined distributions. The function is chisq.test(), which is the same as discussed above in the section on chi-squared tests. If you haven't already done so it is a good idea to look over that first. In this case we will have a list of observations and another list of the expected ratios, proportions or values.

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore. One column should contain the observed values and the other should contain the corresponding ratios, proportions or values.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that a data matrix is created.

The next stage is to attach your data set so that the individual variables can be referred to by name.

We now run the analysis using the chisq.test() function, e.g.

> your.chi = chisq.test(observed.data, p = expected.values, rescale.p = TRUE)

In this case observed.data is the column of your measured data and expected.values is the column of ratios (or expected values in some form). The rescale.p = TRUE part tells R to convert the expected values so that they add up to unity. It is a good habit to add this parameter, as then it does not matter in what form your expected values come; R will convert them to proportions.

Here is an example of a goodness of fit analysis:

> gfit = read.csv(file.choose(), row.names=1)
> attach(gfit)

Chi-squared test for given probabilities
data: visit

As before we can extract the expected values and the residuals:

> gfit.g$exp

Goodness of Fit test - Step by Step
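A goodness of fit analysis along the lines described might look like the sketch below. The names gfit, visit and gfit.g follow the example in the text, but the counts, the row names and the column name ratio are assumptions, and the data frame is built directly rather than read from a file. As before, with() stands in for attach().

```r
# Hypothetical observed counts and the ratio expected for each category
gfit <- data.frame(visit = c(40, 25, 20, 15),
                   ratio = c(3, 3, 2, 2),
                   row.names = c("red", "blue", "yellow", "white"))

# rescale.p = TRUE converts the ratios 3:3:2:2 into proportions summing to 1
gfit.g <- with(gfit, chisq.test(visit, p = ratio, rescale.p = TRUE))

gfit.g$method
gfit.g$exp         # expected counts: 30 30 20 20
gfit.g$res         # Pearson residuals
gfit.g$p.value
```

Without rescale.p = TRUE the call would stop with an error here, because the supplied values do not sum to 1.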

