Dr. Mark Gardener
Using R for statistical analyses - Graphs 1

This page is intended to help you get to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page you can find information on producing a range of graphs to illustrate your analyses. Specifically you'll find information on bar charts, histograms and box-whisker plots. For information on scatter plots, pie charts and stem and leaf plots you need to go to the graph2 page.

I run courses in using R; these may be held at various locations. If you are interested then see our Courses page or contact us for details.

My publications about R and Data Science

See my books about R and Data Science on my Publications page. I have more projects in hand - visit my Publications page from time to time. You might also like my random essays on selected R topics in MonogRaphs. See also my Writer's Bloc page, which has details about my latest writing project including R scripts developed for the book.
R is Open Source.

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Labs, beginning in the mid-1970s. The R project was started in the early 1990s by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland, and R was released as open source in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that you can copy and paste into R to perform a variety of statistical analyses.
Introduction to Graphing

R has great graphical power but it is not a point and click interface. This means that you must use typed commands to get it to produce the graphs you desire. This can be a bit tedious at first but once you have the hang of it you can save a list of useful commands as text that you can copy and paste into the R command line.
The barplot() command makes bar charts; you can make bars vertical or horizontal.

Bar charts

The bar chart is familiar to everyone and is a useful graphical tool that may be used in a variety of ways. The basic function is:

barplot(data)

Before you can draw a graph you need to get your data into an appropriate format. R has many ways of manipulating data but it is often easiest to assemble and manipulate your data in a spreadsheet (you can save in .CSV format).

The first stage is to arrange your data in a .CSV file. You may have your data arranged in columns or in rows. You may also have both row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name. When using barplots you may have both row and column names so don't forget to tell R that you are using row names if you are.

Simple multi-category chart

Your data may consist of a simple row of means e.g. here are some data on death rates in Virginia. These data come with the basic distribution of R and are called VADeaths. The means have been extracted and assigned to the variable VADmeans.
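The two stages above, and the extraction of the means, might look something like the following. This is only a sketch: it assumes that VADmeans holds the column means of VADeaths (the page does not show how they were extracted), and the temporary file and the name vad are made up for the illustration.

```r
# Stage 1 (stand-in): write the built-in VADeaths matrix to a CSV file,
# as if it had been saved from a spreadsheet. The path is hypothetical.
csv_path <- tempfile(fileext = ".csv")
write.csv(VADeaths, csv_path)

# Stage 2: read the file into memory and tell R that the first column
# holds the row names.
vad <- read.csv(csv_path, row.names = 1)

# read.csv() tidies the column names, so "Rural Male" becomes "Rural.Male".
print(names(vad))

# Assumed extraction: one mean for each column (category).
VADmeans <- colMeans(vad)
print(round(VADmeans, 2))
```

If your own file has no row names you would simply leave out the row.names instruction.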
There are four categories, one for each column of VADeaths (Rural Male, Rural Female, Urban Male and Urban Female). To create a basic bar chart you simply call the barplot() function:

> barplot(VADmeans, main="Death Rates in Virginia", xlab="Categories", ylab="Mean Deaths")

This produces a very basic plot; I have added a main title and labels for the x and y axes using fairly simple instructions. When plotting a graph R opens a graphics window. If you select the window (by clicking on or in it) you may then copy to the clipboard and paste into a variety of applications.
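Multiple category barplots can be displayed as stacked or grouped (beside = TRUE).
Titles can be added to graphs using the title() command.

Stacked charts or not?

The VADeaths dataset consists of a matrix of values with both column and row labels:

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0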
If you attempt to produce a bar chart of these data you get something like the following:

> barplot(VADeaths, legend= rownames(VADeaths))

This time a legend was added using the legend instruction (a shortened form of legend.text) along with the row names of the dataset. You see that by default a stacked bar chart is produced. To unstack the bars and plot them alongside one another you add an instruction to the command:

> barplot(VADeaths, legend= rownames(VADeaths), beside= TRUE)

This is fine but the colour scheme is kind of boring. Here is a new set of commands:

> barplot(VADeaths, beside = TRUE, col = c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk"), legend = rownames(VADeaths), ylim = c(0, 100))
> title(main = "Death Rates in Virginia", font.main = 4)

This is a bit better. You have specified a list of colours to use for the bars, one for each row of the data. Note how the list is in the form c(item1, item2, item3, item4). The ylim instruction sets the limits of the y-axis, in this case a lower limit of 0 and an upper limit of 100. It is in the form ylim= c(lower, upper); note again the use of the c(item1, item2) format. The legend takes its names from the row names of the dataset. The y-axis limit was set to leave room for the legend box.

It is possible to specify the title of the graph as a separate command, which is what was done above. The title() command achieves this but of course it only works when a graphics window is already open. The font.main instruction sets the typeface; 4 produces a bold italic font.
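If you want to check the difference between the two layouts, barplot() returns the positions of the bars it has drawn, and you can store these. The sketch below repeats the commands above with the full instruction name legend.text; the names mids_stacked and mids_grouped are made up for the illustration.

```r
# Stacked (the default): one bar per column of the matrix.
mids_stacked <- barplot(VADeaths, legend.text = rownames(VADeaths))

# Grouped: one bar for every value in the matrix.
mids_grouped <- barplot(VADeaths, beside = TRUE,
                        col = c("lightblue", "mistyrose", "lightcyan",
                                "lavender", "cornsilk"),
                        legend.text = rownames(VADeaths),
                        ylim = c(0, 100))

# title() adds to the plot that is currently open.
title(main = "Death Rates in Virginia", font.main = 4)

print(length(mids_stacked))  # number of stacked bars
print(dim(mids_grouped))     # rows x columns of grouped bars
```

The stored positions are handy later if you want to add text or error bars above particular bars.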
The table() command tabulates frequencies of your data in various categories.

Frequency plots

Sometimes you will have a single column of data that you wish to summarize. A common use of a bar chart is to produce a frequency plot showing the number of items in various ranges. Here is a vector of numbers:

75 67 70 75 65 71 67 67 76 68

These have been assigned to a variable called carb and we wish to make a frequency plot. Let's try:

> barplot(carb)

Oops. That wasn't really what you wanted at all. What's happened is that each item has been plotted as a separate entity. You need to tabulate the frequencies. Fortunately there is an easy way to do this: you use the table() function. Let's redraw the graph using the following:

> barplot(table(carb))

This is much better. Now you have the frequencies for the data arranged in several categories (sometimes called bins). As with other graphs you can add titles to axes and to the main graph. You can look at the table() function directly to see what it produces.

> table(carb)
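If you want to try this yourself, the following sketch builds carb from the numbers listed above and tabulates it. The name carb_counts and the axis labels are made up for the illustration.

```r
# The ten values listed above.
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# table() counts how many times each distinct value occurs;
# for example 67 occurs three times and 75 twice.
carb_counts <- table(carb)
print(carb_counts)

# One bar per distinct value, with height equal to its count.
barplot(carb_counts, xlab = "Value", ylab = "Frequency")
```

Note that table() makes one category for each distinct value; it does not group neighbouring values into ranges.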
You can see that the function has summarised the data for us into various numerical categories.

You may wish to show the frequencies as a proportion of the total rather than as raw data. To do this you simply divide each item by the total number of items in your dataset:

> barplot(table(carb)/length(carb))

This shows exactly the same pattern but now the heights of all the bars add up to one.
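Here is a short sketch of the same calculation that you can run on its own. It also compares the result with the built-in prop.table() function, which is an alternative I have swapped in for comparison; the names carb_props and same are made up for the illustration.

```r
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# Divide each count by the number of items to get proportions.
carb_props <- table(carb) / length(carb)
print(carb_props)
print(sum(carb_props))

# prop.table() does the same division for you.
same <- all.equal(as.vector(carb_props),
                  as.vector(prop.table(table(carb))))
print(same)

barplot(carb_props, xlab = "Value", ylab = "Proportion")
```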
The horiz = TRUE instruction makes bars horizontal in the barplot() command.

Horizontal bar plots

It is straightforward to rotate your plot so that the bars run horizontal rather than vertical (which is the default). To produce a horizontal plot you add horiz= TRUE to the command e.g.

> barplot(table(carb), horiz=TRUE, col="lightgreen", xlab="Frequency", ylab="Range")

This time I have used the title() command to add the main title separately. The font.main value of 4 sets the font to bold italic (try other values).
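The complete sequence might look something like this. It is only a sketch: the wording of the title is a placeholder of my own, and the name mids is made up for the illustration.

```r
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# Horizontal bars: the frequencies now run along the bottom (x) axis.
mids <- barplot(table(carb), horiz = TRUE, col = "lightgreen",
                xlab = "Frequency", ylab = "Range")

# Add the main title afterwards; font.main = 4 gives bold italic.
# The title text here is a placeholder.
title(main = "Horizontal frequency plot", font.main = 4)
```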
The hist() command makes histograms.
Show probabilities rather than frequency by using the probability = TRUE instruction.
The breaks instruction allows you to control the breakpoints.

Histograms

The barplot() function can be used to create a frequency plot of sorts but it does not produce a continuous distribution along the x-axis. A true frequency distribution should have the bar categories (i.e. the x-axis) as continuous items. The frequency plot produced previously has discontinuous categories.

To create a frequency distribution chart you need a histogram, which has a continuous range along the x-axis. The command in R is:

hist(variable)

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a histogram you type:

> hist(test.data)

To plot the probabilities (i.e. proportions) rather than the actual frequency you need to add the instruction probability = TRUE like so:

> hist(test.data, probability = TRUE)

This is useful but the plots are a bit basic and boring. You can change axis labels and the main title using the same instructions as for the barplot() command. Here is a new plot with a few enhancements:

> hist(test.data, col="cornsilk", xlab="Data range", ylab="Frequency of data", main="Histogram", font.main=4)

These instructions are largely self-explanatory. The 4 in the font.main instruction sets the font to bold italic (try some other values).

By default R works out where to insert the breaks between the bars. You can change the number of breaks by adding a simple instruction e.g.

> hist(data.set, breaks=10) # 10 breaks, or just hist(data.set, 10)

The # tells R that what follows is a comment, useful for creating your own library of commands. Note that R treats a single number as a suggestion only and picks tidy break points close to that number. Alternatively you can be more specific and set the breaks exactly:

> hist(data.set, breaks=c(0,1,2,3,4,5,10,20,max(data.set))) # specify break points exactly

Notice how the exact break points are specified in the c(x1, x2, x3) format.

You can manipulate the axes by changing the limits e.g. make the x-axis start at zero and run to 6 by another simple instruction:

> hist(test.data, 10, xlim=c(0,6), ylim=c(0,10))

This asks for about 10 break-points and sets the y-axis from 0-10 and the x-axis from 0-6. Notice how the limits are in the format c(lower, upper). The xlim and ylim instructions are useful if you wish to prepare several histograms and want them all to have the same scale for comparison.
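If you want to see which break points R has actually used, you can store the result of hist() and look inside it. The following is a sketch along those lines; the names h_default and h_exact, and the particular break points chosen, are made up for the illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1,
               4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

# plot = FALSE returns the calculations without drawing anything.
h_default <- hist(test.data, plot = FALSE)
print(h_default$breaks)  # the break points R chose
print(h_default$counts)  # the number of items between each pair of breaks

# Exact break points every 0.5 units, with fixed axis limits so that
# several histograms could share the same scale.
h_exact <- hist(test.data, breaks = seq(1, 6, by = 0.5),
                xlim = c(0, 6), ylim = c(0, 10),
                col = "cornsilk", xlab = "Data range",
                main = "Histogram", font.main = 4)
print(h_exact$counts)
```

The exact break points must span the whole range of the data, otherwise hist() stops with an error.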
The boxplot() command makes box-whisker plots.
Control how outliers are evaluated/displayed using the range instruction.

Box and whisker plots

Single sample plot

A box and whisker graph allows you to convey a lot of information on one simple plot. Generally they are used for data that are not normally distributed (i.e. where non-parametric methods are appropriate). You can plot a single sample or create a more complex plot of categories within a data set. The basic function is boxplot().

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a box-whisker plot you type:

> boxplot(test.data)

Not the most exciting graph ever but you can jazz it up later. What you see is a box with a line through it. The line represents the median of the sample. The box itself shows the upper and lower quartiles. The whiskers show the range (i.e. the largest and smallest values). It is easy to see that this sample has a skewed distribution.

You can add axis labels, a main title and colour the box using simple instructions. These instructions are the same as those used in producing barplots and histograms. For example:

> boxplot(test.data, xlab="Single sample", ylab="Value axis", main="Simple Box plot", col="lightblue")

Let's make the data even more skewed and add an outlier:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8 12.0

Save these as test.data2 and redraw the graph. This time the main title is left out of the boxplot() command so that it can be added using a separate title() command:

> boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue")

Now you see the outlier separately. R doesn't automatically show the full range of data (as I implied earlier). You can control the range shown using a simple instruction, range= n. If you set n to 0 then the full range is shown. Otherwise each whisker extends to the most extreme value that lies within n x the inter-quartile range of the box; anything further out is drawn as a separate point. The default is n = 1.5.

> boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue", range=0)
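You can check what the range instruction does by storing the result of boxplot(), which holds the numbers used to draw the plot. The sketch below uses the data listed above; the names bp_default and bp_full are made up for the illustration.

```r
test.data2 <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
                1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8, 12.0)

# Default (range = 1.5): the value 12 is drawn as a separate point.
bp_default <- boxplot(test.data2, xlab = "Single sample",
                      ylab = "Value axis", col = "lightblue")
print(bp_default$out)    # values plotted as outliers

# range = 0: the whiskers run to the smallest and largest values.
bp_full <- boxplot(test.data2, xlab = "Single sample",
                   ylab = "Value axis", col = "lightblue", range = 0)
print(bp_full$stats)     # whisker, quartile, median, quartile, whisker
print(bp_full$out)       # nothing left over as an outlier
```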
The formula syntax (y ~ x) allows you to plot data using grouping variables.

Plotting several samples

So you can see how to represent a single sample but often you wish to compare samples. For example, you may have raised broods of flies on various sugars. You measure the size of the individual flies and record the diet for each. Your data file would consist of two columns: one for growth and one for sugar.

These data are the same as you saw in the example on analysis of variance. You have one variable, growth, and several samples (i.e. the different sugars). To plot these you use the boxplot() command with slightly different syntax e.g. boxplot(y ~ x). This model syntax is used widely in R, for setting up ANOVA and regression analyses for example. To create a summary boxplot you type something like:

> boxplot(growth ~ sugar, data=fly, xlab="Sugar type", ylab="Growth", col="bisque", range=0)

Now you can see that the different sugar treatments appear to produce differing growth in your subjects.
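To try the formula syntax without the original data file you could make up a small data frame with the same two columns. In the sketch below the growth values and sugar names are invented purely for illustration; they are not the data from the analysis of variance example.

```r
# Made-up values for illustration only.
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63, 65),
  sugar  = factor(rep(c("glucose", "fructose", "sucrose"), each = 3))
)

# growth ~ sugar means "growth, split up by sugar": one box per sugar.
bp <- boxplot(growth ~ sugar, data = fly, xlab = "Sugar type",
              ylab = "Growth", col = "bisque", range = 0)

print(bp$names)  # the groups, in the order they are drawn
print(bp$stats)  # one column of summary values per group
```

The groups are drawn in the order of the factor levels, which is alphabetical unless you specify otherwise.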
Add horizontal = TRUE to the boxplot() command to make horizontal charts.
In horizontal charts the x-axis is still the "bottom" axis and the y-axis is still the "left" axis.

Horizontal box plots

It is straightforward to rotate your plot so that the boxes run horizontal rather than vertical (which is the default). To produce a horizontal plot you add horizontal= TRUE to the command e.g.

> boxplot(growth ~ sugar, data=fly, ylab="Sugar type", xlab="Growth", col="mistyrose", range=0, horizontal=TRUE)

Once again I have used the title() command separately to add a main title. The 4 in the font.main instruction sets bold italic (try other values). The ylab and xlab instructions refer to the left and bottom axes respectively so it is important to switch these around; it is easy to forget.
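Put together, the horizontal plot and its separate title might look like the following sketch. As before, the fly data are invented for illustration and the title text is a placeholder of my own.

```r
# Made-up values for illustration only.
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63, 65),
  sugar  = factor(rep(c("glucose", "fructose", "sucrose"), each = 3))
)

# horizontal = TRUE turns the boxes on their side, so the labels swap:
# ylab (left axis) now names the groups, xlab (bottom axis) the values.
bp_h <- boxplot(growth ~ sugar, data = fly, ylab = "Sugar type",
                xlab = "Growth", col = "mistyrose", range = 0,
                horizontal = TRUE)

# Main title added separately; the text is a placeholder.
title(main = "Growth on different sugars", font.main = 4)
```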