Axis labels in R plots using expression() command

Exercise 6.4.2.

Statistics for Ecologists (Edition 2) Exercise 6.4.2

These are some notes about axis labels in R plots, particularly how you can use superscript, bold and so on.

The labelling of your graph axes is an important element in presenting your data and results. You often want to incorporate text formatting to your labelling. Superscript and subscript are particularly important for scientific graphs. You may also need to use bold or italics (the latter especially for species names).

The expression() command allows you to build strings that incorporate these features. You can use the results of expression() in several ways:

  • As axis labels directly from plotting commands.
  • As axis labels added to plots via the title() command.
  • As marginal text via the mtext() command.
  • As text in the plot area via the text() command.

You can use the expression() command directly or save the “result” to a named object that can be used later.

Introduction

The expression() command

The expression() command takes regular characters and uses them in a special way, allowing you to build more complicated strings. You don’t need quotes (most of the time) as the usual letters and numbers are not “interpreted”. R usually takes strings that are un-quoted and tries to interpret them as objects or commands.

What the expression() command does do, though, is look for certain characters or phrases, which are treated as “switches” that do something, like turn on superscript or bold font.

  • ~ Acts as a space character (actual spaces are ignored in R commands).
  • * Acts as a connector, this allows you to join several elements.
  • “” Quotes are used to enclose items that would otherwise be treated as a special character (like ~ or *).

So, type a ~ when you want a space and “~” when you want a ~. The * is a connector, which can be used to join sections of the expression. This allows you to “turn off” superscript for example, or switch font face.

There are various “reserved” characters e.g. + - / * ? ^ (mostly they are not letters or numbers), and these should be inside quotes. Items in quotes should be bracketed by ~ and/or * characters.

When you type an expression() any spaces you type are ignored. You can type spaces to help yourself see clearly what you have typed but they are all stripped out. If you display an expression() result R will place a single space (for clarity) between various elements of your expression().

The following expression():

expression(The~"~"~character~forms~spaces)

Would appear like “The ~ character forms spaces” when used in titles or text.
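As a minimal sketch of the rules above, the following saves a label to a named object and then reuses it. The object name lab1 and the plotted values are made up for illustration.

```r
# Build a label once and store it; spaces inside expression() are only for readability
lab1 <- expression(The ~ "~" ~ character ~ forms ~ spaces)

# Hypothetical data, used only to give the label something to sit on
plot(1:10, (1:10)^2, xlab = lab1, ylab = "")

# The * connector joins elements with no gap; ~ inserts a space
title(main = expression(Joined * text ~ and ~ spaced ~ text))
```

Saving the expression() first keeps the plotting command short, which helps when the label gets complicated.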

Superscript & subscript

The most common thing you’ll want to do in axis labels is to make superscripts and subscripts.

  • ^ Anything following the caret is displayed as superscript.
  • [] Anything inside the square brackets is displayed as subscript.

The [] are simple enough to use, anything that you want to be subscripted goes inside the brackets.

Note that you cannot start an expression() with a [ so you have to “fool” the system and use a pair of empty quotes:

expression(""[x]*X)

Note also that if you do that you have to use a connector (*) afterwards (or a space character ~)! The preceding example would produce text like so:

xX (with the first x set as a subscript)

Superscript is “started” by the caret ^ character. Anything after ^ is superscript. The superscript continues until you use a * or ~ character. If you want more text as superscript, then enclose it in quotes. The only exception is + or - when preceded by a number.

In the following example only the word “script” would appear superscript:

expression(Super^script~text)

Superscript text (only “script” is raised)

The following uses quotes to get the two words superscripted.

expression(Super^"script text")

Superscript text (both “script” and “text” are raised)

The following commands produce a plot with superscript and subscript labels:

opt = par(cex = 1.5) # Make everything a bit bigger

xl <- expression(Speed ~ ms^-1 ~ by ~ impeller)
yl <- expression(Abundance ~ by ~ Kick ~ net[30 ~ sec] ~ sampling)

plot(abund ~ speed, data = fw, xlab = xl, ylab = yl)

par(opt) # Reset the graphical parameters

The expression() command used to make superscripts and subscripts in axis labels.

Note that R does not “like” subscripts beginning with numbers and continuing with letters! So [2xyz] gives an error but [2 * xyz] is fine.
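You can check this without drawing anything. The sketch below uses parse() to test whether a piece of text is valid R; the object names ok and bad are just for illustration.

```r
# parse() turns text into R code, so a syntax problem surfaces as an error
ok  <- try(parse(text = "expression(net[2 * xyz])"), silent = TRUE)
bad <- try(parse(text = "expression(net[2xyz])"), silent = TRUE)

inherits(ok, "try-error")   # FALSE: the connector makes the subscript valid
inherits(bad, "try-error")  # TRUE: a number followed directly by letters fails
```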

Font face: bold, italic, underline

You can alter the basic font face by enclosing the items you require in a command-like element:

  • plain() Anything in the parentheses is regular plain font face.
  • italic() Italic.
  • bold() Bold.
  • bolditalic() Italic and bold.
  • underline() Underlined.

The font face element must be preceded by a ~ or a * so that R can recognize it as a font face element.

The title() command allows you to specify a general font face as part of the command. Similarly the par() command allows you to specify font face for various plot elements:

  • font – the main text font face.
  • font.lab – axis labels.
  • font.main – main title.
  • font.sub – sub-title.

You specify the font face as an integer:

  • 1 = Plain.
  • 2 = Bold.
  • 3 = Italic.
  • 4 = Bold & Italic.

You can set the font face(s) from par() or as part of the plotting command. This is useful for the entire label/title but does not allow for mixed font faces. To mix font faces use the expression() elements italic(), bold() and so on.
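A minimal sketch of setting a whole-label font face via par() might look like this. It uses the par() settings font.lab and font.main for the axis labels and main title; the data and label text are made up for illustration.

```r
# Save the old settings, then make axis labels italic and the main title bold italic
opt <- par(font.lab = 3, font.main = 4)

# Hypothetical data, for illustration only
plot(1:10, 1:10, xlab = "Whole label in italic", ylab = "Also italic",
     main = "Bold italic title")

par(opt)  # Restore the previous settings
```

Every character of each label gets the same face here, which is why mixed faces need expression() instead.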

The following lines give some simple examples:

opt <- par(cex = 1.5)

em <- expression(Abundance~of~italic(Gammarus~pulex)~plain(a~shrimp))
ey <- expression(Abundance~30*s~underline(kick~sample))
ex <- expression(Speed~bold(ms^-1))

plot(abund ~ speed, data = fw, xlab = ex, ylab = ey, main = em)

par(opt)

To get mixed font faces you need the expression() command.

Note that expression() “does not like” a number followed directly by letters (e.g. 30s), so split them using the * character (e.g. 30*s).

Maths expressions

The expression() command can also produce a range of mathematical symbols and… expressions! You can create fractions, degree signs, arrows and all manner of items. These are generally less useful in axis labels but here are a few of the expressions to whet your appetite:

Expression Result
x + y Produces x + y.
x - y Produces x − y.
x == y Produces x = y.
x != y Produces x ≠ y.
x %~~% y Produces x ≈ y.
x %+-% y Produces x ± y.
x %/% y Produces x ÷ y.
bar(x) Produces x with an overbar.
frac(x, y) Produces a fraction with x over y.
x %up% y Produces an up arrow, x ↑ y.
x %down% y Produces a down arrow, x ↓ y.
x %->% y Produces a right arrow, x → y.
x %<-% y Produces a left arrow, x ← y.
sum(x, a, b) Produces a sum (capital sigma) symbol, ∑, with optional sub and superscripts.
sqrt(x), sqrt(x, y) Produces a square root symbol, √x, with optional root, y√x.
infinity Produces an infinity symbol, ∞.
alpha to omega Greek letters in lowercase.
Alpha to Omega Greek letters in uppercase.
180 * degree Produces a degree symbol, 180°.
x ~ y Produces a space, x y.

There are plenty of others, type help(plotmath) into your R console to get the help entry page.

Ways to incorporate expression() into plots

There are four main ways you can incorporate expression() objects into your plots:

  • In the plot area via text()
  • As a title via title()
  • Directly via the plotting command; essentially the same as title()
  • In the margin via mtext()

As far as the expression() part goes there is no difference between these methods; the main difference is the placement.

Add text to a plot

The text() command allows you to add text, and expression() objects to an existing plot window.

text(x, y, labels, ...)

You need to type in the co-ordinates and then the text, quoted or as an expression(). There are other graphical parameters you can add such as:

  • col The colour of the text.
  • cex The character expansion (size) of the text.

There are a whole lot more besides, but this article is primarily about axis labels so I’ll gloss over text() for the moment, except to demonstrate some mathematical symbols.

Math symbols

The math symbols can be used in axis labels via plotting commands or title() or as plain text in the plot window via text() or in the margin with mtext().

The following commands place some text into a plot window but the expression() parts would work in axis labels, margins or titles.

opt <- par(cex = 1.5)
plot(1:10, 1:10, type = "n", xlab = "X-vals", ylab = "Y-vals")

text(1, 1, expression(hat(x)))
text(2, 1, expression(bar(x)))
text(2, 2, expression(alpha==x))
text(3, 3, expression(beta==y))
text(4, 4, expression(frac(x, y)))
text(5, 5, expression(sum(x)))
text(6, 6, expression(sum(x^2)))
text(7, 7, expression(bar(x) == sum(frac(x[i], n), i==1, n)))
text(8, 8, expression(sqrt(x)))
text(9, 9, expression(sqrt(x, 3)))

par(opt)

The expression() command used with text() to create math formulae.

You can create quite complicated formulae using expression() but it can also be confusing, especially if you are using a plotting command. Create your expression() first and save the result to a named object to help keep yourself organised.

Add axis titles

You can use the title() command to add titles to the main marginal areas of an existing plot. In general, you’ll use xlab and ylab elements to add labels to the x and y axes. However, you can also add a main or sub title too.

Most graphical plotting commands allow you to add titles directly, so the title() command is perhaps redundant. However, it is often easier to set your titles to "" (i.e. blank) in the plotting command and then use title() afterwards, especially if they are complicated.

If you are using expression() to make a label/title then save the expression() result as a named object, which is easier to use in the subsequent command(s) that use them.

The title() command has an additional “trick” up its sleeve, the line parameter. This allows you to select a position for the title(s) in lines from the edge of the plot.

  • Set line = 0 to place the title beside the axis (where the tick-marks usually are).
  • Set line = 1 to place the title one line in (where the axis values usually are).

The maximum value you can set depends on the margin sizes. In practice the largest usable value is the margin size minus one. To see the currently set margin sizes:

par("mar")

You’ll get a vector of four values (bottom, left, top, right).

You can also set the title to appear inside the plot using negative values, line = -1 will be adjacent to the axis and just inside.
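The line parameter can be sketched as follows. The data and label text are made up for illustration, and the line values are just ones that fit inside the default margins.

```r
# Hypothetical data; blank labels leave room for title() to add them afterwards
plot(1:10, 1:10, xlab = "", ylab = "")

par("mar")  # Current margins: bottom, left, top, right

xl <- expression(Speed ~ ms^-1)
title(xlab = xl, line = 2)             # Two lines out from the plot edge
title(ylab = "Abundance", line = -1)   # Just inside the plot area
```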

Add marginal text

The mtext() command allows you to place text and expression() objects into any of the margins of a plot. The mtext() command allows you a bit more control over the placement of the text, compared to the title() command.

The general form of the command is:

mtext(text, side = 3, line = 0, outer = FALSE, at = NA,
      adj = NA, padj = NA, cex = NA, col = NA, font = NA, ...)

 

text The text to write. This can be a character string or an expression.
side = 3 The side of the plot to use. The sides are 1 = bottom, 2 = left, 3 = top, 4 = right. The default is the top.
line = 0 The line of the margin to use. The default is 0, which is adjacent to the outside of the plot area. Positive values move outward and negative values inward.
outer = FALSE If outer = TRUE, the outer margin is used if available.
at = NA How far along the side to place the text in relation to the axis scale. Text is centered on this point.
adj = NA How far along the side to place the text as a proportion. The default is effectively 0.5, which places the text halfway along. If text is oriented parallel to the axis, adj = 0 will result in left or bottom placement. Text is centered.
padj = NA Adjusts the text perpendicular to the reading direction. This permits “tweaking” of the placement. Positive values place text lower; negative values higher.
cex = NA The character expansion. Values > 1 make text larger; values < 1 make text smaller.
col = NA The color for the text. The default, NA, means use the current setting par("col").
font = NA The font to use. The default, NA, means use the current setting par("font"). Use font = 1 for regular text; 2 = bold, 3 = italic, 4 = bold+italic.
... Additional graphics parameters can be used. Of particular interest is las, which controls the text direction:

  • las = 0: Text parallel to axis (default).
  • las = 1: Text horizontal.
  • las = 2: Text perpendicular to axis.
  • las = 3: Text vertical.

The following commands will demonstrate some of the parameters.

Make a basic plot

plot(1:10, 1:10, type = "n", xlab = "x-vals", ylab = "y-vals")

 

Add marginal text

mtext("mtext(side = 1, line = -1, adj = 1)", side=1, line=-1, adj=1)
mtext("mtext(side = 1, line = -1, adj = 0)", side=1, line=-1, adj=0)
mtext("mtext(side = 2, line = -1, font = 3)", side=2, line=-1, font=3)
mtext("mtext(side = 3, font = 2)", side=3, font=2)
mtext("mtext(side = 3, line = 1, font = 2)", line=1, side=3, font=2)
mtext("mtext(side = 3, line = 2, font = 2, cex = 1.2)", cex=1.2, line=2, side=3, font=2)
mtext("mtext(side = 3, line = -2, font = 4, cex = 0.8)", cex=0.8, font=4, line=-2)
mtext("mtext(side = 4, line = 0)", side=4, line=0)

Using mtext() to place text or expression() objects into plot margins.

The mtext() command allows for fine placement of marginal text. In the example any font face changes were applied directly and the entire text is altered. If you want to have mixed font face then replace the text in quotes with an expression().
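As a minimal sketch of that last point, the following places expression() objects in the margins. The object name mt and the label text are made up for illustration.

```r
plot(1:10, 1:10, type = "n", xlab = "x-vals", ylab = "y-vals")

# Mixed font faces need an expression rather than a quoted string
mt <- expression(Abundance ~ of ~ italic(Gammarus ~ pulex))
mtext(mt, side = 3, line = 1)

# Subscripts work in the margin too; adj = 1 pushes the text to the right-hand end
mtext(expression(x[i] ~ values), side = 1, line = -1, adj = 1)
```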

Ordering boxes in an R boxplot()

Exercise 6.3.2.

Statistics for Ecologists (Edition 2) Exercise 6.3.2

These notes concern box-whisker plots and in particular how you can rearrange the order of the boxes in such plots.

Introduction

The boxplot() command is one of the most useful graphical commands in R. The box-whisker plot is useful because it shows a lot of information concisely. However, the boxes do not always appear in the order you would prefer. These notes show you how you can take control of the ordering of the boxes in a boxplot().

There are four main methods, which in turn depend on the layout of the data:

  • Use order() to select column order when you have separate samples (i.e. vectors, columns in a data.frame or a list).
  • Use [row, column] to select an explicit column order when you have separate samples.
  • Use reorder() to change the order of a factor variable according to a function (e.g. mean), when you have response and predictor variables.
  • Use ordered() to make a custom ordered factor variable when you have response and predictor variables.

There are subtle differences between these methods but essentially you are creating an index, which you can use in the boxplot() command to control the order the boxes appear in the plot.

Data in sample format

If your data are arranged as samples in a data.frame (or matrix) you can use boxplot() to plot the data in “one go”. The order of the boxes will depend on the order of the columns.

hog3
   Upper Mid Lower
1     3   4    11
2     4   3    12
3     5   7     9
4     9   9    10
5     8  11    11
6    10  NA    NA
7     9  NA    NA

boxplot(hog3)

You can specify an explicit order for the columns using column numbers:

boxplot(hog3[, 3:1])

The boxplot on the left uses the default column order. The boxplot on the right uses an explicit order x[, columns].

Note the [row, column] syntax to specify the order for plotting.

Order columns by a function

Rather than give an explicit order you may want to have the boxplot appear in order of some function (e.g. mean or median). You can use the order() command to arrange items in ascending (or descending) order. To proceed use these general steps:

  1. Use a command that gives you the values you require e.g. colMeans(), apply().
  2. Pass the result from step 1 to order() to make an index.
  3. Use the index from step 2 to define the order of the columns in the boxplot().

The apply() command is most flexible:

m <- apply(hog3, MARGIN = 2, FUN = median, na.rm = TRUE)
m
Upper   Mid Lower
    8     7    11

Now you can set an order based on the medians you calculated:

o <- order(m, decreasing = FALSE)
o
[1] 2 1 3

Use the x[row, column] syntax like before but use your calculated order:

boxplot(hog3[, o])

If you want decreasing order set decreasing = TRUE.
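Putting the steps together for a decreasing order might look like the sketch below. It rebuilds the hog3 data from the values printed earlier, so you can run it without the original data file.

```r
# Rebuild the hog3 data shown above (NA pads the shorter samples)
hog3 <- data.frame(
  Upper = c(3, 4, 5, 9, 8, 10, 9),
  Mid   = c(4, 3, 7, 9, 11, NA, NA),
  Lower = c(11, 12, 9, 10, 11, NA, NA)
)

# Column medians, then an index, then the plot
m <- apply(hog3, MARGIN = 2, FUN = median, na.rm = TRUE)
o <- order(m, decreasing = TRUE)   # Largest median first

boxplot(hog3[, o])
names(hog3)[o]   # "Lower" "Upper" "Mid"
```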

Data in a list

If your data are in a list you can use the same principles but need a slightly modified procedure:

hogl = list(U = hog3$Upper, M = hog3$Mid, L = hog3$Lower)
hogl

$U
[1] 3 4 5 9 8 10 9

$M
[1] 4 3 7 9 11 NA NA

$L
[1] 11 12 9 10 11 NA NA

Use the lapply() command to work out the median over the list elements.

m <- lapply(hogl, median, na.rm = TRUE)

If you try to order() the result you get an error, so you must unlist() the result first:

order(unlist(m))
[1] 2 1 3

Now save the new order and use it in the plot.

o <- order(unlist(m))
boxplot(hogl[o])

Note that you don’t use [row, column] for the list, just give [element], as the list is one-dimensional.
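An alternative you might try (not the lapply() route described above) is sapply(), which simplifies its result to a named vector, so order() can use it without unlist(). A minimal sketch, rebuilding the list from the values printed above:

```r
hogl <- list(U = c(3, 4, 5, 9, 8, 10, 9),
             M = c(4, 3, 7, 9, 11, NA, NA),
             L = c(11, 12, 9, 10, 11, NA, NA))

# sapply() returns a named numeric vector here, which order() accepts directly
m <- sapply(hogl, median, na.rm = TRUE)
o <- order(m)

boxplot(hogl[o])   # Boxes appear as M, U, L
```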

Data in scientific recording layout

When your data are in scientific recording format you will have a column for each variable and will have response variables and predictor variables e.g.

hog2
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    11 Lower
14    12 Lower
15     9 Lower
16    10 Lower
17    11 Lower

These are the same data as before but in a more “sensible” layout. However, when you try a boxplot() you get the boxes plotted in alphabetical order.
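The alphabetical ordering comes from the factor levels, as this sketch shows. It rebuilds hog2 from the values printed above and converts the site column to a factor explicitly, so the level order is visible.

```r
hog2 <- data.frame(
  count = c(3, 4, 5, 9, 8, 10, 9, 4, 3, 7, 9, 11, 11, 12, 9, 10, 11),
  site  = rep(c("Upper", "Mid", "Lower"), times = c(7, 5, 5))
)

# Converting to a factor sorts the levels alphabetically, whatever the row order
hog2$site <- factor(hog2$site)
levels(hog2$site)   # "Lower" "Mid" "Upper"

boxplot(count ~ site, data = hog2)   # Boxes follow the level order
```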

Order a factor using a function

You can use the reorder() command to reorder a predictor variable by a function applied to the response variable. In other words, you can determine the order of the boxes using a median or other function. Use the following general process:

Use reorder(predictor, response, FUN) to determine an order for the predictor variable.

Use the result of reorder() in place of the original predictor variable in the boxplot() command.

bpm <- with(hog2, reorder(site, count, FUN = median))
boxplot(count ~ bpm, data = hog2)

Here the with() command is used to “see inside” the hog2 data. You could use:

attach(hog2)
bpm <- reorder(site, count, FUN = median)
detach(hog2)

The result is ordered ascending. If you want a descending order simply add a minus sign in front of the response variable:

bpm <- with(hog2, reorder(site, -count, FUN = median))
boxplot(count ~ bpm, data = hog2)

The procedure works with multiple predictors but you can only reorder() one at a time.
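You can inspect the levels of the reordered factor to confirm the order before plotting. A minimal sketch, rebuilding hog2 from the values printed earlier (the object names up and dn are just for illustration):

```r
hog2 <- data.frame(
  count = c(3, 4, 5, 9, 8, 10, 9, 4, 3, 7, 9, 11, 11, 12, 9, 10, 11),
  site  = factor(rep(c("Upper", "Mid", "Lower"), times = c(7, 5, 5)))
)

up <- with(hog2, reorder(site, count, FUN = median))    # Ascending medians
dn <- with(hog2, reorder(site, -count, FUN = median))   # Descending medians

levels(up)   # "Mid" "Upper" "Lower"
levels(dn)   # "Lower" "Upper" "Mid"
```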

Make a factor in an explicit order

You can make a factor variable into an explicit order using the ordered() command. You just give the name of the factor you want to order and then the names of the levels in the order you want.

The result of the ordered() command is an ordered factor. The upshot is that the order you set will take precedence over the default alphabetical order.

o <- ordered(hog2$site, levels = c("Upper", "Lower", "Mid"))
o
[1] Upper Upper Upper Upper Upper Upper Upper Mid   Mid   Mid   Mid   Mid
[13] Lower Lower Lower Lower Lower
Levels: Upper < Lower < Mid

boxplot(count ~ o, data = hog2)
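If you only need the boxes in a set order and do not need the levels to be treated as ranked, a plain factor() with explicit levels is an alternative to ordered(). The sketch below is that alternative, with hog2 rebuilt from the values printed earlier and f as a made-up object name.

```r
hog2 <- data.frame(
  count = c(3, 4, 5, 9, 8, 10, 9, 4, 3, 7, 9, 11, 11, 12, 9, 10, 11),
  site  = rep(c("Upper", "Mid", "Lower"), times = c(7, 5, 5))
)

# Explicit levels fix the plotting order without making the factor ordered
f <- factor(hog2$site, levels = c("Upper", "Mid", "Lower"))
is.ordered(f)   # FALSE
levels(f)       # "Upper" "Mid" "Lower"

boxplot(count ~ f, data = hog2)
```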

Gridlines in graphs and charts

Exercise 6.3.1b.

Statistics for Ecologists (Edition 2) Exercise 6.3.1b

Here are some notes regarding the use of gridlines in graphs and charts, to supplement Chapter 6.

Introduction

Gridlines are potentially useful items you might want to incorporate in your charts. Gridlines can help the reader to gauge the height of bars in a column chart more easily for example, and so the readability is improved.

On the other hand, gridlines can “get in the way” and hinder readability by making your chart cluttered. In scatter plots you may require both horizontal and vertical gridlines, as having gridlines on one axis only can “lead the eye”. Knowing when to apply gridlines is part of the skill of presentation.

Gridlines are added and edited easily in Excel. In R you can add gridlines using the abline() command.

Gridlines in Excel charts

You can easily add gridlines using Excel. Many chart templates incorporate them (sometimes when you do not require them) and once you have a chart you can easily add them via the Chart Tools menus. In Excel 2013 there is an Add Chart Element button (you can also use the + button that appears beside a chart you have clicked on).

In previous versions of Excel (e.g. 2010) you can find the Gridlines button on the Layout menu of the Chart Tools.

Editing Excel gridlines

Once you have added your Excel gridlines you can choose to alter their appearance. You can double-click or right-click on the gridlines directly or you can use the Current Selection section of the Chart Tools > Format menu, which allows you to select, then format chart elements.

Once you have chosen to format the gridlines you will be presented with a range of formatting options, allowing you to choose the colour, width and style for example. You may not want the gridlines to be too bold so a mid-gray and dashed line might be more appropriate than a solid black line.

Altering Excel gridline visibility

By default, your chart colours will be solid. This means that on a column chart the gridlines will disappear behind the bars and only be visible between them.

In most cases this is exactly what you want but there may be occasions when you want to see the gridlines through the bars. You can edit the data series and choose to alter the transparency of the bars (on a column chart).

You can easily change the level of transparency to get the effect you want.

Of course generally you won’t want to allow the gridlines to be visible through the bars but it is a handy trick to have up your sleeve.

Gridlines in R plots

Use the abline() command to add gridlines to R plots. The command adds straight lines to existing plots. You can use the command to add a line of best-fit by specifying intercept and slope (indeed the command can read the results of other commands that produce coefficients), but for gridlines you can use one of the following:

  • abline(h = x, …) For horizontal lines.
  • abline(v = x, …) For vertical lines.

You specify x, which is the position of the line(s). You can use various methods to produce a set of values that define the position of the lines:

Command Detail
c(…) Give the values explicitly, separated by commas.
start:end Give the start and end points, this will produce a primitive sequence with an interval of 1.
seq(from = , to = , by = ) Make a sequence, you specify start and end points and the interval.
pretty(start:end, n) Makes a sequence from a simple range (start:end) that is split into “pretty” intervals. You also give n, which is an idealized number of intervals. The command does its best to produce n items.

 

The seq() command is probably the easiest to use as the results are completely defined by the user. The pretty() command is generally used internally to make the plot axes intervals so using it would likely match the axis. However, you may not want gridlines at exactly the same intervals so it’s probably best to stick to seq().
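A minimal sketch comparing the two approaches follows. The plot range and interval values are made up for illustration.

```r
plot(0:12, 0:12, type = "n", xlab = "x", ylab = "y")

h <- seq(from = 2, to = 10, by = 2)   # Exactly 2, 4, 6, 8, 10
v <- pretty(0:12, n = 4)              # R chooses "round" values covering 0 to 12

abline(h = h, lty = "dashed", col = "gray60")   # Horizontal lines at user-defined positions
abline(v = v, lty = "dotted", col = "gray60")   # Vertical lines at "pretty" positions
```

With seq() you know in advance where every line will fall; with pretty() it is worth printing the result to see what R chose.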

You can use various additional parameters to alter the appearance of the lines for example:

Parameter Detail
col The colour of the line(s). Usually as a “name” but you can specify an integer, which will use a colour from the current colour palette.
lwd The width of the line. Think of it as an expansion factor. Values >1 make the line wider, whilst values <1 make it thinner.
lty The line type, 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash. You can also specify the type as a “string” that matches the names given here.

 

The abline() command thus gives you good control over the position and format of gridlines.

Place gridlines behind other R plot elements

When you add gridlines to an R plot the lines are usually drawn over the top of any points or bars already present.

It may be that this is what you want, but generally it is desirable to have the gridlines disappear behind the bars. If you are adding gridlines to a barplot() or a boxplot() then you can easily achieve this by re-plotting with add = TRUE in the plotting command.

If you are using a scatter plot and the regular plot() command you need a different approach:

  1. Use the plot() command but set type = "n" to create the plot without drawing any points.
  2. Add the gridlines using the abline() command.
  3. Add the data points using the points() command.
These simple “tricks” should ensure that your gridlines end up where you want them.
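The three scatter plot steps can be sketched as follows. The data values, colours and gridline positions are made up for illustration.

```r
# Hypothetical data, for illustration only
x <- 1:10
y <- c(3, 5, 4, 8, 7, 10, 9, 12, 11, 14)

plot(x, y, type = "n")                    # 1. Axes only, no points
abline(h = seq(4, 14, 2), v = seq(2, 10, 2),
       lty = "dashed", col = "gray70")    # 2. Gridlines first
points(x, y, pch = 16, col = "blue")      # 3. Points drawn last, on top
```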

Example code

The bar chart with error bars shown earlier was drawn using the following code:

hog3 # The data
  Upper Mid Lower
1     3   4    11
2     4   3    12
3     5   7     9
4     9   9    10
5     8  11    11
6    10  NA    NA
7     9  NA    NA
# Get median values for each column
med <- apply(hog3, MARGIN = 2, median, na.rm = TRUE)
# Get the upper and lower quartiles, which will form the error bars
up <- apply(hog3, MARGIN = 2, quantile, na.rm = TRUE, probs = 0.75)
dn <- apply(hog3, MARGIN = 2, quantile, na.rm = TRUE, probs = 0.25)
dat <- rbind(med, up, dn) # Make a matrix of the data and error bars
dat
    Upper Mid Lower
med   8.0   7    11
up    9.0   9    11
dn    4.5   4    10
# Draw the bar chart
barplot(dat["med",], col = "lightblue")
# Add the gridlines
abline(h = seq(2,10,2), lty = "dashed", col = "gray30")
# Add axis titles
title(xlab = "Sample Site", ylab = "Abundance")

Note that the re-plotted bar chart is saved to a named object (bp), which holds the x-values needed to position the error bars. Use add = TRUE to allow the bars to be plotted over the existing ones, thus ending up “on top” of the gridlines.

# Re-plot bars over gridlines
bp <- barplot(dat["med",], col = "lightblue", add = TRUE)
# Draw the error bars using the inter-quartile values
arrows(bp, dat["up",], bp, dat["dn",], length = 0.1, angle = 90, code = 3)

Legends on graphs and charts

Exercise 6.3.1a.

Statistics for Ecologists (Edition 2) Exercise 6.3.1a

These notes about legends in graphs/charts supplement the text in Chapter 6.

Introduction

A legend is a tool to help explain a graph. You are most commonly going to want to add one to a bar chart where you have several data series. You’ll also want to add one to a line or scatter plot when you have more than one series. Essentially you use a legend to help make a complicated plot more understandable.

In R you can add a legend to any plot using the legend() command. You can also use the legend = TRUE parameter in the barplot() command. The barplot() command is the only general plot type that has a legend parameter (the others need a separate legend).

The legend() command has a host of parameters, which can be tweaked to produce the finished article. Generally, the most difficult part is making room on the chart for the legend itself!

Legend Essentials

The legend() command has a wealth of parameters at its disposal. This gives it a great deal of flexibility and customizability <sic> but this also makes it daunting and hard to get to grips with.

The legend() command has the following general form:

legend(x, y = NULL, legend, col, pch, lty, lwd, fill, border, bty, ncol, y.intersp)

x, y = NULL The co-ordinates to place the legend (its top-left corner). You can also specify a shortcut location as a text string: "top", "right", "bottomleft" and so on, which allows a general spot to be filled quickly.
legend The text to be used for the legend entries. The default is taken from the data.
col The colors to be used for lines or points that appear in the legend.
pch The plotting character(s) to use.
lty The line type (style) to use.
lwd The line width to use.
fill = NULL A set of colors to appear in boxes beside the legend text entries. If NULL (the default) an empty box is placed. To suppress the box omit the fill parameter.
border = "black" The border color for the box if fill is specified (as a color or NULL).
bty = "o" The border type for the overall legend, use bty = "n" for no border.
ncol = 1 The number of columns for the legend, the default is 1 (i.e. a vertical legend).
y.intersp = 1 The spacing between legend lines, set >1 to space the lines out. There is also an x.intersp parameter, which operates horizontally.

 

There are a number of other parameters that I’ve not listed. The ones here are the most essential.

When adding a legend you need to make sure that the items in the legend() command match the parameters you set in the plotting command. It helps to specify pch, col, lty and so on explicitly in the plotting command, as you can match the parameters more easily than if you relied on the defaults.
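One way to keep the plot and the legend in step is to store the settings in named objects and use them in both commands. A minimal sketch, with made-up data and series names:

```r
# Hypothetical data: two series sharing the same x-values
x  <- 1:10
y1 <- x
y2 <- 0.5 * x + 3

# Store the symbols and colours once, then reuse them everywhere
pchs <- c(16, 17)
cols <- c("blue", "darkgreen")

plot(x, y1, pch = pchs[1], col = cols[1], ylim = c(0, 12), xlab = "x", ylab = "y")
points(x, y2, pch = pchs[2], col = cols[2])

legend("topleft", legend = c("Series 1", "Series 2"),
       pch = pchs, col = cols, bty = "n")
```

If you later change a colour or symbol you only edit it in one place, and the legend cannot drift out of step with the plot.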

The barplot() command is the only general plotting command that has a legend parameter. You can pass additional parameters to the legend using the args.legend parameter, as you’ll see shortly.

Adding legends from the barplot() command

The barplot() command allows use of a legend parameter, which calls legend() with its basic settings. You can pass parameters to the legend() command by adding args.legend and giving the details as a list().

The biggest problem is usually how to make appropriate space for the legend to fit in the plot window. There are two main options:

  • Alter the axis size to give extra room (vertically or horizontally).
  • Place the legend into the plot margin.

The following examples use a matrix dataset that gives the abundance of some butterfly species at a site over several sample years:

> bf
        1996 1997 1998 1999 2000
M.bro     88   47   13   33   86
Or.tip    90   14   36   24   47
Paint.l   50    0    0    0    4
Pea       48  110   85   54   65
Red.ad     6    3    8   10   15
Ring     190   80   96  179  145
> class(bf) # Check you have a matrix
[1] "matrix"

You can get the sample data here: butterfly.RData.

Altering the y-axis to make room

In a basic plot there will often not be enough room to accommodate the legend:

> barplot(bf, legend = TRUE)

Note that the default location for the legend is “topright”. In this case the simplest way to make room is to resize the y-axis using the ylim parameter.

> barplot(bf, legend = TRUE, ylim = c(0,550))

A simple rescale of the y-axis will often allow the legend to fit.

You may have to play around with the axis setting to get the best values to use.

Pass legend parameters via barplot()

To pass parameters (arguments) to the legend() command from barplot() the parameters need to be passed as a list() with the args.legend parameter.

> barplot(bf, beside = TRUE, col = terrain.colors(6), ylim = c(0, 250), legend = TRUE, args.legend = list(bty = "n", x = "top", ncol = 3))
> title(xlab = "Sample Year", ylab = "Abundance")

In this case the y-axis is lengthened and additional parameters passed to legend(). The legend box is suppressed (bty = "n"), placed at the top center (x = "top"), and made into 3 columns (ncol = 3).

Altering the x-axis to make room

You can alter the x-axis using the xlim parameter, which allows you to place the legend at the right.

> mycols = c("tan", "orange1", "magenta", "cyan", "red", "sandybrown")
> barplot(bf, beside = TRUE, col = mycols, legend = TRUE, xlim = c(0, 45))

Note that the xlim parameter in this example set the axis from 0 to 45. Each bar in the barplot() takes up a space, so you need to allow about 1 unit per bar plus a bit extra for the legend itself.

Note also that the col parameter set the colors for the plot, which were passed automatically to legend() without requiring the args.legend parameter. Of course if you wanted other parameters (such as suppressing the legend box) you would require the args.legend parameter.
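
The "1 unit per bar plus a bit extra" rule can be worked out from the data. The following is a rough sketch, assuming the default bar width and spacing that barplot() uses with beside = TRUE (one unit per bar and one unit of space before each group). The 10 extra units for the legend are an assumed value that you would adjust by eye, and args.legend is used here to suppress the legend box.

```r
# Re-create the butterfly matrix from the text (rows = species, columns = years)
bf <- matrix(c( 88,  47, 13,  33,  86,
                90,  14, 36,  24,  47,
                50,   0,  0,   0,   4,
                48, 110, 85,  54,  65,
                 6,   3,  8,  10,  15,
               190,  80, 96, 179, 145),
             nrow = 6, byrow = TRUE,
             dimnames = list(c("M.bro", "Or.tip", "Paint.l",
                               "Pea", "Red.ad", "Ring"),
                             as.character(1996:2000)))
mycols <- c("tan", "orange1", "magenta", "cyan", "red", "sandybrown")

# Each group of bars takes one unit per bar plus one unit of spacing
bar_units <- ncol(bf) * (nrow(bf) + 1)

# Extend the x-axis to leave room for the legend (10 units is a guess)
mids <- barplot(bf, beside = TRUE, col = mycols, legend = TRUE,
                xlim = c(0, bar_units + 10),
                args.legend = list(bty = "n"))   # suppress the legend box
```

The barplot() command returns the mid-points of the bars (saved here as mids), which is a handy way to check where the last bar actually ends.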

Place a legend in a plot margin

Altering the x-axis or y-axis size to accommodate the legend is a fairly simple matter. Sometimes however you want a legend to be at the bottom or even the left, and however you alter the axes you will not make space!

What you need is to be able to place the legend in the margin of the plot, so that it does not overlap the plotting zone at all. To do this you need to tweak the graphical parameters via the par() command.

The general running order is as follows:

  1. Set the plot margins as you need to give the space in the required margin.
  2. Make your plot (any plot).
  3. Reset the graphical parameters back to defaults.
  4. Now set the plot margins to 0 and at the same time set plotting to allow “overplot”.
  5. Use legend() to place the legend where your extra-large margin is.
  6. Reset the graphical parameters back to defaults.

Steps 3 and 6 are not absolutely essential but preferable, as you can get into an awful mess if you forget the current settings.

Legend in the plot margin for a bar chart

To make the plot margin larger use par(oma = c(b, l, t, r)) where b, l, t, r are values for the margin sizes at the bottom, left, top and right. For example:

> opar = par(oma = c(0,0,0,4)) # Large right margin for plot
> mycols = c("tan", "orange1", "magenta", "cyan", "red", "sandybrown")
> barplot(bf, beside = TRUE, col = mycols)
> par(opar) # Reset par

Now you have to set the graphical parameters again, this time set oma, mar and new = TRUE. The last parameter is important because it tells R not to wipe the existing plot. Once you’ve altered the graphical parameters you can set the legend():

> opar = par(oma = c(0,0,0,0), mar = c(0,0,0,0), new = TRUE)
> legend(x = "right", legend = rownames(bf), fill = mycols, bty = "n", y.intersp = 2)
> par(opar) # Reset par

Place a legend in the bottom margin of a plot

The bottom margin of a plot can present some slight difficulties because the axis labels get in the way. The easiest solution is to use the inset parameter and shift the legend outwards (use a small negative value).

Start by making space in the bottom outer margin, then make the basic plot:

> opar = par(oma = c(2,0,0,0))
> barplot(bf, beside = TRUE, col = cm.colors(6))
> par(opar) # Reset par

Now set the margins to zero and set the overplot. The legend() command can now be used with the inset parameter:

> opar = par(oma = c(0,0,0,0), mar = c(0,0,0,0), new = TRUE)
> legend(x = "bottom", legend = rownames(bf), fill = cm.colors(6), bty = "n", ncol = 3, inset = -0.15)
> par(opar) # reset par

The inset parameter shifts the legend position slightly, to avoid the axis labels.

Note that the direction of the inset shift is determined by the position you set in the command. If you used x = "bottom" then positive values shift the position upwards (a value of 0.5 is about half-way up). If you used x = "right" then positive values shift the position left.

Legend in the plot margin for a scatter or line plot

If you use a line plot or a scatter plot you’ll have to contend with different plotting characters and line types. It is easier to set the parameters explicitly in the plot() command so that they are more easily matched in the legend().

In the following example a matplot() is used to create a multiple-series line plot. The legend is placed in the right margin.

> t(bf) # Rotate the data as matplot() takes columns as series
     M.bro Or.tip Paint.l Pea Red.ad Ring
1996    88     90      50  48      6  190
1997    47     14       0 110      3   80
1998    13     36       0  85      8   96
1999    33     24       0  54     10  179
2000    86     47       4  65     15  145

> mycols = c("tan", "orange1", "magenta", "cyan", "red", "sandybrown")
> opar = par(oma = c(0,0,0,5.5)) # Set right margin

## Plot without axes (because axis labels are numbers)
> matplot(t(bf), ylab = "", type = "b", pch = 1:6, lty = 1:6,
                axes = FALSE, lwd = 2, col = mycols)

> axis(2) # Use default y-axis

# Custom x-axis using years as labels
> axis(1, at = 1:5, labels = colnames(bf))

> box() # Bounding box around plot

> title(ylab = "Abundance", xlab = "Year")
> par(opar) # Reset par

> opar = par(oma = c(0,0,0,0), mar = c(0,0,0,0), new = TRUE)

# Legend pars to match matplot
> legend(x = "right", legend = rownames(bf), col = mycols,
         pch = 1:6, lty = 1:6, bty = "n",
         ncol = 1, text.col = "blue", y.intersp = 2)
> par(opar) # Reset par

Note that in this example the colour of the legend text has also been customized via the text.col parameter. For all the parameters used by legend() type help(legend) into R.

Using colour in graphs and charts

Exercise 6.3.

Statistics for Ecologists (Edition 2) Exercise 6.3

On this page you’ll find some additional notes regarding the use of colour in graphs and charts. These items did not quite make the final edit to the book itself but make a useful (I hope) addition to the topic in Chapter 6.

Colouring in: Using colour in graphs and charts:

Introduction

Colour is very important in presenting data and results. Both Excel and R have a wide range of colours you can use when creating your graphs and charts (certainly more than 50 shades of gray!).

Controlling and managing the colours you display is an important element in presenting your work. With an increasing volume of work being presented via the Internet, colour is something not to take for granted. Using default colours is “easy” but for maximum impact you should think carefully about how to present the best colours for the job.

Traditional journals generally use monochrome, which you can think of as just another set of colours, but even if you are “stuck” with shades of grey you need to think carefully. Pattern filling can be an especially useful option when using monochrome.

Using colour in Excel charts

You can set colours in Excel in several ways:

  • General color can be set from the Page Layout menu, which sets overall color themes.
  • When you make a chart the Chart Tools > Design > Change Colors button allows you to “override” the general colors and set your own.
  • You can format elements in charts directly from the Chart Tools > Format menu.
  • Right-click or double-click a chart element directly.

When you set a colour explicitly you can also incorporate fill effects, with Pattern Fill being the most useful.

Quick settings in Excel charts

Whenever you make a chart in Excel it will have a default colour palette. You can alter the general colour theme for an existing chart from the Chart Tools menu. The exact option you select depends on the version of Excel you use e.g.

  • 2013: Chart Tools > Design > Change Colors button.
  • 2010: Chart Tools > Design > Chart Styles section.

This gives you a few options for altering the general flavour of the chart.

The colour options you get depend on the overall color theme that’s in operation; you set this from the Page Layout menu.

Overall setting of colours in Excel

You can set the general colour theme using the Page Layout > Colors button. This sets the default colors for charts as well as other Excel items (such as headings in Pivot Tables). Once you’ve set an overall theme and created a chart you can alter the chart colors using the Design > Change Colors button.

Once you have a general color set you can of course still tinker with the individual colors by formatting chart elements directly. Don’t just settle for the defaults; your graphs are important and it is worth spending a little bit of time to get the “best” look you can. If you use basic defaults your chart will look lacklustre and people will think that your work is similarly lacking. Your results graphs are the most important aspect of your work, make them count!

Setting colour explicitly in Excel charts

Whatever color theme you have set or applied you can always alter the individual colors of the chart elements. Usually this means altering colors for the various data series but you can also set colors for axis lines and labels.

The most reliable method of selecting a chart element is to use the Chart Tools > Format > Format Selection menu item. However, you can also right-click or double-click on the chart directly.

Once you have selected a data series you’ll most likely have opened the Format Data Series dialogue box. You can alter the Fill settings by using solid colors and choosing the color you want.

You can also choose a Pattern Fill. The pattern is especially helpful with monochrome charts, such as those intended for paper publication.

You can set the general color for the pattern (the Foreground button); the background defaults to white but you can set it separately if you like.

There are plenty of options for patterns, look to keep things simple though.

Using a Pattern Fill is helpful in monochrome charts.

Generally, you want to avoid crazy effects, your aim is to help make the chart readable, not to assault the senses!

Using color in R plots

R has over 650 named colors to choose from. You can see the colors using the colors() command.

> colors() # Only the first 30 names are shown here
 [1] "white"          "aliceblue"      "antiquewhite"
 [4] "antiquewhite1"  "antiquewhite2"  "antiquewhite3"
 [7] "antiquewhite4"  "aquamarine"     "aquamarine1"
[10] "aquamarine2"    "aquamarine3"    "aquamarine4"
[13] "azure"          "azure1"         "azure2"
[16] "azure3"         "azure4"         "beige"
[19] "bisque"         "bisque1"        "bisque2"
[22] "bisque3"        "bisque4"        "black"
[25] "blanchedalmond" "blue"           "blue1"
[28] "blue2"          "blue3"          "blue4"

The colors used for most graphical commands are taken from a default palette, which can be different for different commands. For example the barplot() command uses shades of gray whilst the pie() command uses a palette of pastel shades.

In most graphical commands you can set the colors for the plot using the col parameter. Colors can be named explicitly or given numbers, in which case the numbers are taken to be the colors of the existing color palette().

Specifying color

You can set the colors of an R plot in several ways:

  • Give color names (in quotes) as listed by the colors() command.
  • Give numbers – the numbers will refer to the colors in the currently set palette().
  • Give the name of a built-in color palette (and the number of shades).

You can set the colors directly in a plotting command using the col parameter and a vector of names (in quotes), for example:

> barplot(VADeaths, beside = TRUE,
col = c("aliceblue", "bisque", "coral", "seagreen", "tomato"))

If you plan to use the colors more than once you might want to save your vector of colors as a named object.

> mycol = c("aliceblue", "bisque", "coral", "seagreen", "tomato")
> pie(VADeaths[1,], col = mycol)

If you don’t provide enough colors then they are recycled. If there are too many colors the un-required ones are ignored.

If you give the colors as numbers the plotting command will take the colors from their position in the vector of colors that are in the color palette(), not the position of the color from the colors() command.
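
The difference between the two lookups is easy to check for yourself. This is a small sketch using only base R commands; the comments assume that the default palette is in force, and pal is simply a name chosen for the example.

```r
pal <- palette()   # the palette currently in force

pal[2]             # the color that col = 2 refers to
colors()[2]        # the second entry of colors(), which is "aliceblue"

# Numbers are looked up in palette(), so this uses palette entries 2 to 6
barplot(VADeaths[, 1], col = 2:6)
```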

You can also use one of the built-in palettes by specifying the palette name and the number of shades needed.

Setting a color palette

The palette() command allows you to set a default color palette. The default palette() can be viewed using an empty command:

> palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"
[6] "magenta" "yellow"  "gray"

Now, whenever a color is referred to by an integer value the color is taken as the position in the current palette(). Some plotting commands have their own default colors, which are used if you do not specify the colors explicitly.

  • The barplot() command uses a palette() of gray.
  • The pie() command uses a palette() of pastel shades.

You can set the palette() by giving a vector of colors:

> palette(mycol)
> palette()
[1] "aliceblue" "bisque"    "coral"     "seagreen"  "tomato"

To restore the default palette():

> palette("default")

To use the colors in your palette() you can either give the numbers of the position of the colors in the palette() or specify col = palette().

> palette()
[1] "black" "red" "green3" "blue" "cyan"
[6] "magenta" "yellow" "gray"

> mycol
[1] "aliceblue" "bisque"    "coral"     "seagreen"  "tomato"

> palette(mycol)
> pie(VADeaths[,1], col = palette())

The pie() command uses a set range of colors unless told otherwise. Here a custom palette() has been used.

Once you have set a palette() the colors remain “operational” until you change the palette(). Any colors referred to by number will refer to the current palette() but of course you can over-ride the palette() by specifying colors by name.
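
Because the palette stays in force, it can be useful to keep a copy of the old one so that you can put it back afterwards. A minimal sketch follows, relying on the fact that palette() returns the previous palette (invisibly) when you set a new one; the object name old is just a choice for this example.

```r
mycol <- c("aliceblue", "bisque", "coral", "seagreen", "tomato")

# Set the new palette and keep hold of the previous one
old <- palette(mycol)

pie(VADeaths[, 1], col = 1:5)   # the numbers now refer to mycol

palette(old)                    # put the earlier palette back
```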

Built-in color palettes

You can make your own color palette() by specifying color names explicitly. However, R has several built-in palettes that you can use to create a series of “co-ordinated” colors.

There are six built-in palettes.

  • rainbow()
  • heat.colors()
  • terrain.colors()
  • topo.colors()
  • cm.colors()
  • gray.colors()

In general you specify how many colors you require in the palette and the command produces a series of colors graded across the range. With rainbow() and gray.colors() you can also specify starting and ending points.

Type help(palette) to get more information from within R.
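
As a brief illustration, the following sketch calls some of the built-in palette commands from the grDevices package. The number of shades and the start and end values are arbitrary choices for the example, not recommendations.

```r
n <- 5   # number of shades wanted (one per row of VADeaths)

heat.colors(n)      # each command returns a vector of n colors
terrain.colors(n)

# rainbow() and gray.colors() accept start and end points
rainbow(n, start = 0, end = 0.5)            # only part of the hue range
greys <- gray.colors(n, start = 0.2, end = 0.9)

barplot(VADeaths, beside = TRUE, col = greys)
```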

Shading lines

In some chart types it may be helpful to have shading lines, especially when using monochromatic color palettes. The barplot() and pie() commands allow you to specify shading lines from within the command. There are two parameters:

  • density – gives the density of the shading lines in lines per inch. If not specified or negative the color is solid, even if angle is given.
  • angle – gives the angle to draw the shading lines measured anti-clockwise, 0 is horizontal and the default is 45 degrees.
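
A quick sketch of shading lines in a bar chart follows. The density and angle values are simply picked to make the five series look different from one another, and the ylim value is an assumption to leave room for the legend.

```r
# One density and one angle per row of VADeaths (five age groups)
mids <- barplot(VADeaths, beside = TRUE, col = "black",
                density = c(5, 10, 20, 30, 40),   # lines per inch
                angle = c(0, 45, 90, 135, 45),    # degrees, anti-clockwise
                legend = TRUE, ylim = c(0, 100))
```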

Shading is not available in the boxplot() command. It is rather less useful there anyhow, as the bars are all labelled. It is possible to find a solution using the polygon() command but this is not for the faint hearted.

Summary

Don’t just rely on the default colors for your graphs and charts. If you are presenting a graphic that uses color then think carefully about how easy it will be for readers to differentiate the colors. Think about the colors themselves, perhaps a particular color theme would be especially suitable for your presentation.

You can use color to highlight a particularly important finding, making some data stand out from the rest. You can also make some data “disappear” into obscurity!

For monochrome charts try a wide range so that individual series are identifiable. This is where pattern fills are especially useful.

Don’t forget that the purpose of your chart is to help a reader understand the data/results easily.

Tally plots in R

Exercise 6.2.2.

Statistics for Ecologists (Edition 2) Exercise 6.2.2

Recently I saw a message in a forum asking about the difference between dot plots and histograms. This got me thinking and so I decided to work out how to make R produce a dot plot from scratch. These notes also supplement Chapter 6 (Graphics).

Dot charts as an alternative to the histogram

A histogram is a way of showing the frequency of your numeric data in a visual manner. The histogram looks more or less like a bar chart except that the bars are touching – the x-axis is a continuous scale rather than being discrete categories. Look at the following data:

> mydata = c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

Stem-leaf plot

You can visualise the distribution using a stem-leaf plot:

> stem(mydata)
The decimal point is at the |
 2 | 0
 4 |
 6 | 000000
 8 | 0000
10 | 0

The stem() command does not give much flexibility when it comes to the bins separating the data categories but you can use the scale = n instruction. The default is 1 so making the value larger will increase the number of bin categories:

> stem(mydata, scale = 2)
The decimal point is at the |
 3 | 0
 4 |
 5 |
 6 | 000
 7 | 000
 8 | 00
 9 | 00
10 | 0

Making the scale smaller gives a different impression:

> stem(mydata, scale = 0.5)
The decimal point is 1 digit(s) to the right of the |
0 | 3
0 | 6667778899
1 | 0

The stem() command can be useful but it does not really match the histogram.

Make a frequency table with the table() command

Another method of looking at the data is to make a frequency table:

> table(mydata)
mydata
3  6  7  8  9 10
1  3  3  2  2  1

Not very visual but it does a job. It splits the data into chunks and shows the frequency for each. The table() command really only works sensibly on integer values (otherwise you end up with loads of “bins”).
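
If your values are not integers, one option is to group them into size classes with the cut() command before making the table. This is a sketch only: the values in x are made up for illustration and the breakpoints are an arbitrary choice.

```r
# Hypothetical non-integer measurements (not from the text)
x <- c(6.2, 7.1, 8.4, 7.7, 6.0, 3.3, 8.9, 9.5, 10.0, 7.3, 6.6, 9.1)

table(x)   # every value is different, so you get 12 "bins" of 1

# Group the values into classes 2-4, 4-6, 6-8 and 8-10 first
bins <- cut(x, breaks = seq(2, 10, by = 2))
table(bins)
```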

Visualize frequency with a bar chart

The resulting table can be turned into a visual representation of the data if you make a bar chart:

> barplot(table(mydata))

The resulting bar chart gives you an impression of the frequency distribution:

The barplot is useful but can be misleading. The bars are discrete categories (bins or size classes) and are discontinuous. In the preceding barplot you can see that there is a jump from the 3-bin to the 6-bin. The barplot() command is very flexible and you can customize your plot in many ways but you cannot get around this problem.

A true histogram

A true histogram has a continuous x-axis and you can make one using the hist() command:

> hist(mydata)

The histogram can be jazzed up and customized in various ways, which I won’t delve into at this point. However, one important aspect is the control of the x-axis. The x-axis is a continuous scale and you can see the difference between this and the earlier barplot by looking at the position of the axis labels. In the barplot they are in the middle of each bar but in the histogram they are placed at the edges of the bars.

You can control the breakpoints using the breaks instruction. The default is breaks = “sturges”, which uses an algorithm to determine the breakpoints. You can also specify the number of breakpoints you want or even specify the “exact” position of the breakpoints by giving the values explicitly.
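
The two ways of using breaks can be sketched as follows. Note that when you give a single number R treats it as a suggestion only and picks "pretty" breakpoints close to it; the values used here are arbitrary examples.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# A single number: a suggested number of bins (R may not use it exactly)
h_num <- hist(mydata, breaks = 3)

# A vector: the exact breakpoints to use
h_exact <- hist(mydata, breaks = seq(2, 10, by = 2))

h_exact$breaks   # the breakpoints actually used
h_exact$counts   # the frequency in each bin
```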

Developing a script to draw a tally plot or dot histogram

What I wanted was to make a chart that replaced the bars with dots, the number of dots in each column being equal to the frequency. One feature of the hist() command is that you can make a histogram without actually making the final plot. In other words you can calculate all the required statistics. I started by making a result object of the histogram data like so:

> hg = hist(mydata, plot = FALSE)

The result contains several elements in a list; useful elements are the mid-points of the columns and the counts (frequency):

> hg$mids
[1] 3.5 4.5 5.5 6.5 7.5 8.5 9.5

> hg$counts
[1] 1 0 3 3 2 2 1

I reasoned that I could use the $mids as the x-values in a regular plot. The y-values would come from the $counts data. A frequency of 3 would get plotted three times, at y = 1, y = 2 and y = 3. This meant I had to replicate the count data to make a sequence, which would have to be matched up to the x-data.

A loop of some sort seemed unavoidable and the number of times the loop would need to run would be equal to the number of bins, that is the number of bars. Put another way, it is the number of breaks-1. It is simplest to count the number of items in the $counts:

> bins = length(hg$counts)

To make the y-values I needed to make each frequency into a series, so a value of 3 would become 1, 2, 3. I also needed to take care of 0 values so I decided to make each frequency a series 0:frequency. Actually it was logical to do this the other way around, frequency:0, so the loop becomes:

> yvals = numeric(0)
>  for(i in 1:bins) {
     yvals = c(yvals, hg$counts[i]:0)
  }

The first line simply creates a blank numeric vector. The loop creates the appropriate values and appends them to the vector. For the data under consideration this produces:

> yvals
[1] 1 0 0 3 2 1 0 3 2 1 0 2 1 0 2 1 0 1 0

Each count value becomes a sequence ending in zero; a count that was zero remains a single zero.

The x-values are derived from the $mids result. Since I added an extra 0 to each y-value, each item needed to be repeated a number of times equivalent to the count +1. This has the bonus of dealing with the 0 count, as a repeat of 0 would be “difficult”. A loop is needed again and it will run for as many times as there are bin categories.

> xvals = numeric(0)
>  for(i in 1:bins) {
     xvals = c(xvals, rep(hg$mids[i], hg$counts[i]+1))
  }
> xvals
[1] 3.5 3.5 4.5 5.5 5.5 5.5 5.5 6.5 6.5 6.5 6.5 7.5 7.5 7.5 8.5 8.5 8.5 9.5 9.5

The xvals and yvals cannot be used directly because there are zero items and we don’t want points plotted at 0. The simplest way to deal with this is to join up the values in a data.frame and then remove rows where y = 0.

> dat = data.frame(xvals, yvals)
> dat = dat[yvals > 0, ]
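
As an aside, the same points can be produced without loops. This is an alternative sketch rather than the method used in the script: rep() repeats each mid-point by its count and sequence() builds 1, 2, ... up to each count, so zero counts simply drop out. The names xvals2, yvals2 and dat2 are made up to avoid clashing with the objects above.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
hg <- hist(mydata, plot = FALSE)

# Repeat each mid-point once for every observation in that bin
xvals2 <- rep(hg$mids, times = hg$counts)

# For each bin make the series 1, 2, ..., count (nothing for a zero count)
yvals2 <- sequence(hg$counts)

dat2 <- data.frame(xvals = xvals2, yvals = yvals2)
dat2
```

The rows come out in a different order within each bin (ascending rather than descending) but the set of points to plot is the same.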

Now the data are ready to make into a plot. A regular scatter plot will do the job via the plot() command:

> plot(yvals ~ xvals, data = dat)

However, the points are too small and the plot does not look “tidy”.

The trick is to remove the axes, allow the points to spill over the plot area a little and to make the points larger. In addition, it is helpful to plot each point a little bit higher on the y-axis so that the bottom row does not overlap the axis too much. A few extra tweaks are also necessary to get the axis scales to come out right. After a bit of tweaking I get the final plot to appear thus:

The finished command, hg_dot(), uses the default breaks = "sturges" to work out the breakpoints; you can specify other breakpoints in exactly the same way as for the hist() command. The plotting symbols are set to pch = 19 (a solid circle) and enlarged somewhat with cex = 3. You can specify other values. The offset = 0.4 parameter plots each point slightly “upwards”. By altering the offset, cex and pch parameters you can get the appearance you want.

The biggest alteration you can make is with the graphics window. It seemed a lot of hassle to attempt to match the plot window size to the other parameters. It is easiest to simply use the mouse to resize the plot window to give the appearance you like. You can easily save the plot to a file once it is completed.

The hg_dot() command

When made up into a function the command lines look like the following:

## Dotplot histogram
## Mark Gardener 2013
## www.dataanalytics.org.uk
hg_dot <- function(x, breaks = "sturges",
                      offset = 0.4,
                      cex = 3,
                      pch = 19, ...) {

#   x = data vector
# ... = other instructions for plot

hg <- hist(x, breaks = breaks, plot = FALSE) # Make histogram data but do not plot
bins <- length(hg$counts)                    # How many bins are needed?
yvals <- numeric(0)                 # A blank variable to fill in

for(i in 1:bins) {                  # Start a loop
yvals <- c(yvals, hg$counts[i]:0)  # Work out the y-values
}                                  # End the loop

xvals <- numeric(0)                                 # A blank variable

for(i in 1:bins) {                                  # Start a loop
xvals <- c(xvals, rep(hg$mids[i], hg$counts[i]+1))  # Work out x-values
}                                                   # End the loop

dat <- data.frame(xvals, yvals)  # Make data frame of x, y variables
dat <- dat[yvals > 0, ]          # Knock out any zero y-values
minx <- min(hg$breaks)  # Min value for x-axis
maxx <- max(hg$breaks)  # Max value for x-axis
miny <- min(dat$yvals)  # Min value for y-axis
maxy <- max(dat$yvals)  # Max value for y-axis

# Make the plot, without axes, allow points to overspill plot region
plot(yvals + offset ~ xvals, data = dat,
        xlim = c(minx, maxx), ylim = c(miny, maxy),
        axes = FALSE, ylab = "", xpd = NA,
        cex = cex, pch = pch, ...)
axis(1)   # Add in the x-axis

# Make results of original data, histogram and plot data
result <- list(hist = hg, original = x, plot.data = dat)
invisible(result)  # Save all the results invisibly
  } # end
## END

Once you run the command your chart will be created in whatever size your default graphics window is set to. Simply drag the window to a new size as appropriate.
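
If you would rather not resize by hand, you can open a file-based graphics device with a fixed size instead. The following is only a sketch: the file name and the width and height are assumptions, and hist() is used as a stand-in so that the example runs on its own (you would call hg_dot(mydata) in its place once the function is loaded).

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# Hypothetical output file in the session's temporary folder
out_file <- file.path(tempdir(), "dot-histogram.pdf")

pdf(out_file, width = 8, height = 3)   # size in inches (assumed values)
hist(mydata)                           # stand-in for hg_dot(mydata)
dev.off()                              # close the device to write the file
```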

The command produces a list result that contains the following:

  • the original data $original
  • the histogram statistics $hist
  • the values plotted $plot.data

If you assign a named object to the command you can access these results afterwards.

> hg = hg_dot(mydata)

> names(hg)
[1] "hist"      "original"  "plot.data"

You can get the R script here: Dot Histogram Script.

Tally plots for data visualisation in Excel

Exercise 6.2.1.

Statistics for Ecologists (Edition 2) Exercise 6.2.1

Some notes on data visualisation to supplement Chapter 6.

Introduction

Data distribution is important. You need to know the shape of your data so that you can determine the best:

  • Summary statistics
  • Analytical routines

Some statistical tests use the properties of the normal (Gaussian) distribution, whilst others use data ranks. So, it’s important to know if your data are normally distributed or otherwise.

The classic way to look at the shape of your data is with a histogram, showing the frequency of observations that fall into certain size classes (called bins). However, you can make a “quick and dirty” histogram using pencil and paper, a tally plot. The tally plot is useful as it is something you can do in a notebook in the field as you collect data.

The classic histogram

A classic histogram uses an x-axis that is a continuous variable, like the following example drawn using R:

A classic histogram has a continuous x-axis (this drawn using R).

The x-axis is split according to the bin boundaries and the bars show the frequency of observations that fall between two bin boundaries. This classic histogram is easy to draw using R with the hist() command. The command works out the axis breakpoints and calculates the frequencies for you.

Excel cannot draw a “proper” histogram; the nearest it can get is a column chart with separate categories representing the bin classes:

In Excel the column chart is used in lieu of a true histogram. The x-axis is split into categories representing the bin sizes.

This is not so terrible, but it is not quite the same as a true histogram. In the preceding example the bars have been widened to reduce the gap width, which makes the chart appear more like a real histogram.

Using Excel to calculate data frequency

You can draw the histogram (column chart) in Excel easily enough but need to compute the frequency of observations for each bin class. To do this you need the FREQUENCY function.

FREQUENCY(data_array, bins_array)

To use the function, follow these steps:

  1. Make sure you have your data!
  2. Work out the minimum value you’ll need (the MIN function is helpful).
  3. Work out the maximum (use the MAX function).
  4. Determine the interval you need to give you the number of bins you want.
  5. Type in the values for the bins somewhere in your worksheet.
  6. Highlight the cells adjacent to the bins you just made, these will hold the frequencies.
  7. Type the formula for FREQUENCY, highlighting the appropriate cells (or type their cell range) for the data and bins.
  8. Do NOT press Enter since FREQUENCY is an array function. Instead press Ctrl+Shift+Enter (on a Mac Cmd+Shift+Enter), which will complete the function and place the result into all the cells you highlighted.

Figure 11. The FREQUENCY function places a result into an array of cells.

Now you have the bins and the frequency so you can build your histogram (column chart), using the Insert > Column Chart button.
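
If you also use R, you can cross-check the spreadsheet frequencies. The sketch below is an R analogue and not part of the Excel workflow: it assumes the usual behaviour of FREQUENCY (each bin counts values up to and including the bin value and above the previous one, with one extra result for anything larger than the last bin). The data and bin values are made up for the example.

```r
# Hypothetical data and bin values
x <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
bins <- c(4, 6, 8, 10)

# Intervals are open on the left and closed on the right, as in FREQUENCY;
# the final interval catches anything above the last bin
freq <- table(cut(x, breaks = c(-Inf, bins, Inf)))
freq
```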

Frequency calculation via Analysis ToolPak

You could skip the frequency calculation stage altogether and use the Analysis ToolPak (Data > Data Analysis), which will also draw a column chart (histogram) for you if you like.

If you cannot see the Data > Data Analysis button you may not have the ToolPak installed. Go to the Excel options and the Add-Ins section, where you can enable it.

The Histogram section of the Analysis ToolPak requires you to have some data and a range of bins. So:

  1. Make sure you have your data!
  2. Work out the minimum value you’ll need (the MIN function is helpful).
  3. Work out the maximum (use the MAX function).
  4. Determine the interval you need to give you the number of bins you want.
  5. Type in the values for the bins somewhere in your worksheet.
  6. Click the Data > Data Analysis button.
  7. Scroll down and choose the Histogram option.
  8. Fill in the boxes for the data range and the bins ranges.
  9. Click OK and the Analysis ToolPak will compute the frequencies for you.

Once you’ve clicked OK you should see the results; the bins will be repeated alongside the appropriate frequencies. If you selected to have a chart output, then you’ll see a column chart too.

If you do not have Windows or indeed Excel, then you can’t use the Analysis ToolPak but you can still use the FREQUENCY function. Both FREQUENCY and REPT are part of the armoury of OpenOffice and LibreOffice (and probably others).

Making a Tally plot

You may not want a graphic and require only a simple tally plot. It is easy to do this in Excel using the FREQUENCY and REPT functions. You use the FREQUENCY function as described before to determine the frequency of observation in various bins. Once you have the frequencies you use the REPT function to repeat some text a number of times (corresponding to the frequency).

The REPT function repeats some text a number of times. Use with FREQUENCY to make a tally plot.

In the example I have set the character text I want to use in cell D6. The frequencies are in E2:E12. The REPT function takes the text I want to display as a tally mark and repeats it by the number in the frequency column. Note the cell reference D$6, which “fixes” the row so that the formula stays correct as it is copied down (from G2 to G12).

You could use the text directly in the REPT formula e.g. =REPT("X", E2) but then if you wanted to alter the tally plot you’d need to alter all the cells. If you point to a single cell you can alter the tally plot simply by typing a new tally character into that one cell.

The tally plot doesn’t really replace the histogram but it can be useful to have a plain text representation of your data rather than a graphic. You can copy the spreadsheet cells into various programs.

Bins
16 X
18 X
20 X
22 XXX
24 XXX
26 XXXXX
28 XXX
30 XXX
32 XXX
34 X
36 X
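
For comparison, a plain text tally like this can also be made in R, where strrep() plays the role of REPT. This is a sketch only: the frequencies are read off the tally plot above and the object names are made up.

```r
bins <- seq(16, 36, by = 2)
freq <- c(1, 1, 1, 3, 3, 5, 3, 3, 3, 1, 1)   # read from the tally plot above

mark <- "X"                  # change this one value to alter the tally mark
tally <- strrep(mark, freq)  # repeat the mark once per observation

cat(sprintf("%2d %s", bins, tally), sep = "\n")
```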

It is not too hard to get the tally plot oriented with the tallies vertical. You’ll need to format the tally cells so that they are oriented at 90˚ but otherwise this is straightforward. However, in the “vertical” orientation the cells do not copy/paste so nicely!

Getting data from Excel to R

Exercise 3.3.

Statistics for Ecologists (Edition 2) Exercise 3.3

Incomplete final line error on CSV import

Sometimes you can get an error message when trying to import a CSV file, most likely with the read.csv() or read.table() command. The “incomplete final line” message (which R reports as a warning) arises when there is a “missing” return in the last data row of your CSV file. You need to add an extra line-feed at the end of the file.

There are two ways you can set about this:

  • Open the file in a text editor and add an extra line. Then import the file to R.
  • Send an additional line-feed directly to the file from R itself. Then import the file to R.

Whichever method you choose it is a good idea to make a backup of the file, just in case. If you open your “faulty” file in a spreadsheet it will look absolutely fine; it is only in a text editor or word processor that you can see the end-of-line characters (not all editors display hidden characters).

Edit file in text editor

The OS-based method of fixing this problem is to open the file in any kind of text editor. WordPad is a good choice in Windows as it can handle line-breaks more effectively (Windows uses CR & LF as end-of-line characters, whereas Linux and current versions of macOS use LF alone and older Mac systems used CR; WordPad displays all correctly, Word does not). Go to the bottom of the file and if you cannot get the cursor past the end of the final line there is a missing linefeed. Simply press the Enter key on the keyboard to add the final linefeed and save the file.

Now when you try to import the file you should not get the error.

Send extra linefeed from R

You can send a linefeed to a file directly from R using the cat() command. Use a command like so:

cat("\n", file = file.choose(), append = TRUE)

The “\n” is a “newline” character (you do need the quotes). Note that you need to use append = TRUE otherwise you will overwrite the file with nothing except your newline character. If you are using Linux then you’ll need to specify the filename explicitly but in Windows or Mac the file.choose() part will allow you to choose the file.
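
If you are not sure whether a file needs the extra linefeed, you can check the last byte before appending. The following is a minimal sketch and not a built-in R command: the function name ensure_final_newline() is made up for this example, and it reads the whole file into memory so it is best kept for modestly sized files.

```r
# Hypothetical helper: add a final newline only if one is missing
ensure_final_newline <- function(path) {
  size <- file.size(path)
  if (is.na(size) || size == 0) return(invisible(FALSE))  # missing or empty

  bytes <- readBin(path, what = "raw", n = size)  # read the file as raw bytes
  last <- bytes[length(bytes)]

  if (last == as.raw(10)) return(invisible(FALSE))  # already ends in a linefeed

  cat("\n", file = path, append = TRUE)             # same fix as shown above
  invisible(TRUE)
}

# Example with a temporary file that lacks the final linefeed
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,2"), tmp)
fixed <- ensure_final_newline(tmp)   # TRUE: a linefeed was added
again <- ensure_final_newline(tmp)   # FALSE: nothing more to do
```

The function returns TRUE (invisibly) when it has altered the file, so running it twice on the same file is harmless.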