Tally plots in R

Exercise 6.2.2.

Statistics for Ecologists (Edition 2) Exercise 6.2.2

Recently I saw a message in a forum asking about the difference between dot plots and histograms. This got me thinking and so I decided to work out how to make R produce a dot plot from scratch. These notes also supplement Chapter 6 (Graphics).

Dot charts as an alternative to the histogram

A histogram is a way of showing the frequency of your numeric data in a visual manner. The histogram looks more or less like a bar chart except that the bars are touching – the x-axis is a continuous scale rather than being discrete categories. Look at the following data:

> mydata = c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

Stem-leaf plot

You can visualise the distribution using a stem-leaf plot:

> stem(mydata)
The decimal point is at the |
 2 | 0
 4 |
 6 | 000000
 8 | 0000
10 | 0

The stem() command does not give much flexibility when it comes to the bins separating the data categories but you can use the scale = n instruction. The default is 1 so making the value larger will increase the number of bin categories:

> stem(mydata, scale = 2)
The decimal point is at the |
 3 | 0
 4 |
 5 |
 6 | 000
 7 | 000
 8 | 00
 9 | 00
10 | 0

Making the scale smaller gives a different impression:

> stem(mydata, scale = 0.5)
The decimal point is 1 digit(s) to the right of the |
0 | 3
0 | 6667778899
1 | 0

The stem() command can be useful but it does not really match the histogram.

Make a frequency table with the table() command

Another method of looking at the data is to make a frequency table:

> table(mydata)
mydata
3  6  7  8  9 10
1  3  3  2  2  1

It is not very visual but it does a job: it splits the data into chunks and shows the frequency for each. The table() command only really works sensibly on integer (or other discrete) values; with continuous measurements almost every distinct value ends up as its own “bin”.
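For non-integer data, one possible way to keep the number of bins manageable is to group the values into intervals with cut() before tabulating. The following is a minimal sketch; the vector x and the breakpoints are made up for illustration and are not part of the original exercise.

```r
# Hypothetical non-integer measurements (illustrative values only)
x <- c(3.2, 6.1, 6.8, 7.4, 9.9)

# Group the values into intervals of width 2: (2,4], (4,6], (6,8], (8,10]
bins <- cut(x, breaks = seq(2, 10, by = 2))

# Tabulate the intervals; empty intervals are kept with a count of 0
freq <- table(bins)
print(freq)
```

Because cut() returns a factor with one level per interval, the empty (4,6] interval still appears in the table with a count of zero.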

Visualize frequency with a bar chart

The resulting table can be turned into a visual representation of the data if you make a bar chart:

> barplot(table(mydata))

The resulting bar chart gives you an impression of the frequency distribution:

The barplot is useful but can be misleading. The bars represent discrete categories (bins or size classes), so the x-axis is discontinuous. In the preceding barplot you can see that the axis jumps straight from the 3-bin to the 6-bin, with no space left for the missing values 4 and 5. The barplot() command is very flexible and you can customize your plot in many ways, but the x-axis remains a set of categories rather than a continuous scale.
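A partial workaround, which is not part of the original notes, is to tell R about the unobserved values by converting the data to a factor with explicit levels. This sketch assumes the integer range 3 to 10 is the one of interest; the bars for 4 and 5 then appear with zero height, although the axis is still categorical.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# Declare every integer from 3 to 10 as a category, including unobserved ones
counts <- table(factor(mydata, levels = 3:10))
print(counts)

# The bars for 4 and 5 now appear with zero height,
# but the x-axis is still a set of categories, not a continuous scale
barplot(counts)
```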

A true histogram

A true histogram has a continuous x-axis and you can make one using the hist() command:

> hist(mydata)

The histogram can be jazzed up and customized in various ways, which I won’t delve into at this point. However, one important aspect is the control of the x-axis. The x-axis is a continuous scale and you can see the difference between this and the earlier barplot by looking at the position of the axis labels. In the barplot they are in the middle of each bar but in the histogram they are placed at the edges of the bars.

You can control the breakpoints using the breaks instruction. The default is breaks = "Sturges", which uses Sturges' formula to work out how many bins to use (the name is matched without regard to case, so "sturges" also works). You can also give a single number as the number of bins you want, which R treats as a suggestion rather than a fixed value, or specify the exact position of the breakpoints by giving the values explicitly.
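The two non-default ways of using breaks can be sketched as follows. The particular breakpoints (2, 4, 6, 8, 10) and the suggested bin number (4) are arbitrary choices for illustration; plot = FALSE is used so that only the computed statistics are returned.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

# Explicit breakpoints: the bins are [2,4], (4,6], (6,8], (8,10]
h_exact <- hist(mydata, breaks = seq(2, 10, by = 2), plot = FALSE)
h_exact$counts   # 1 3 5 3

# A single number is only a suggestion; R picks "pretty" breakpoints near it
h_n <- hist(mydata, breaks = 4, plot = FALSE)
h_n$breaks
```

With explicit breakpoints the values must span the whole range of the data, otherwise hist() stops with an error.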

Developing a script to draw a tally plot or dot histogram

What I wanted was to make a chart that replaced the bars with dots, the number of dots in each column being equal to the frequency. One feature of the hist() command is that you can make a histogram without actually making the final plot. In other words you can calculate all the required statistics. I started by making a result object of the histogram data like so:

> hg = hist(mydata, plot = FALSE)

The result contains several elements in a list; useful elements are the mid-points of the columns and the counts (frequency):

> hg$mids
[1] 3.5 4.5 5.5 6.5 7.5 8.5 9.5

> hg$counts
[1] 1 0 3 3 2 2 1

I reasoned that I could use the $mids as the x-values in a regular plot. The y-values would come from the $counts data. A frequency of 3 would get plotted three times, at y = 1, y = 2 and y = 3. This meant I had to replicate the count data to make a sequence, which would have to be matched up to the x-data.

A loop of some sort seemed unavoidable and the number of times the loop would need to run would be equal to the number of bins, that is, the number of bars. Put another way, it is the number of breaks minus 1. It is simplest to count the number of items in the $counts:

> bins = length(hg$counts)
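The relationship between the breaks, the bins and the mid-points can be checked directly. This is only a quick sanity check on the hist() result, using the same data as above; the names n_breaks and mids_check are introduced here for illustration.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
hg <- hist(mydata, plot = FALSE)

n_breaks <- length(hg$breaks)
bins <- length(hg$counts)

# There is one bin between each pair of neighbouring breaks
bins == n_breaks - 1

# Each mid-point is the average of the two breaks that bound its bin
mids_check <- (hg$breaks[-1] + hg$breaks[-n_breaks]) / 2
all.equal(mids_check, hg$mids)
```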

To make the y-values I needed to turn each frequency into a series, so a value of 3 would become 1, 2, 3. I also needed to take care of 0 values, so I decided to make each frequency a series 0:frequency. In the end it was more logical to run the series the other way around, as frequency:0, so the loop becomes:

> yvals = numeric(0)
>  for(i in 1:bins) {
     yvals = c(yvals, hg$counts[i]:0)
  }

The first line simply creates a blank numeric vector. The loop creates the appropriate values and appends them to the vector. For the data under consideration this produces:

> yvals
[1] 1 0 0 3 2 1 0 3 2 1 0 2 1 0 2 1 0 1 0

Each count has become a sequence ending in zero; the count that was zero remains a single zero.

The x-values are derived from the $mids result. Since I added an extra 0 to each series of y-values, each mid-point needed to be repeated count + 1 times. This has the bonus of dealing with the 0 count, as a repeat of 0 would be “difficult”: the empty bin would drop out of the x-values and they would no longer line up with the y-values. A loop is needed again and it will run once for each bin category.

> xvals = numeric(0)
>  for(i in 1:bins) {
     xvals = c(xvals, rep(hg$mids[i], hg$counts[i]+1))
  }
> xvals
[1] 3.5 3.5 4.5 5.5 5.5 5.5 5.5 6.5 6.5 6.5 6.5 7.5 7.5 7.5 8.5 8.5 8.5 9.5 9.5

The xvals and yvals cannot be used directly because there are zero items and we don’t want points plotted at 0. The simplest way to deal with this is to join up the values in a data.frame and then remove rows where y = 0.

> dat = data.frame(xvals, yvals)
> dat = dat[yvals > 0, ]
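The same set of points can probably be built without loops or the zero-removal step. The following is an alternative sketch, not the method used in these notes: it relies on rep() with a times vector and on sequence(), and the names xvals2, yvals2 and dat2 are chosen here only to avoid clashing with the objects above.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
hg <- hist(mydata, plot = FALSE)

# Repeat each mid-point once per observation in its bin (empty bins drop out)
xvals2 <- rep(hg$mids, times = hg$counts)

# Stack height within each bin: 1, 2, ..., count (nothing for empty bins)
yvals2 <- sequence(hg$counts)

dat2 <- data.frame(xvals = xvals2, yvals = yvals2)
nrow(dat2) == length(mydata)   # one row (one dot) per observation
```

The rows come out in ascending order within each bin rather than descending, which makes no difference to the plot since every point is drawn the same way.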

Now the data are ready to make into a plot. A regular scatter plot will do the job via the plot() command:

> plot(yvals ~ xvals, data = dat)

However, the points are too small and the plot does not look “tidy”.

The trick is to remove the axes, allow the points to spill over the plot area a little and make the points larger. In addition, it is helpful to plot each point a little bit higher on the y-axis so that the bottom row does not overlap the axis too much. A few extra tweaks are also necessary to get the axis scales to come out right. After a bit of tweaking I got the final plot to appear thus:

The command uses the default breaks = "sturges" to work out the breakpoints; you can specify other breakpoints in exactly the same way as for the hist() command. The plotting symbols are set to pch = 19 (a solid circle) and enlarged somewhat with cex = 3. You can specify other values. The offset = 0.4 parameter plots each point slightly “upwards”. By altering this offset along with the cex and pch parameters you can get the appearance you want.

The biggest alteration you can make is with the graphics window. It seemed a lot of hassle to attempt to match the plot window size to the other parameters. It is easiest to simply use the mouse to resize the plot window to give the appearance you like. You can easily save the plot to a file once it is completed.
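As an alternative to resizing the window by hand, a graphics device can be opened with a fixed size so that the proportions are reproducible. The sketch below assumes PDF output; the temporary file name and the dimensions are arbitrary choices that do not come from the original notes, and the plotting call is a simplified stand-in for the final function.

```r
mydata <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
hg <- hist(mydata, plot = FALSE)
x <- rep(hg$mids, times = hg$counts)    # one x-value per observation
y <- sequence(hg$counts)                # stack height within each bin

out_file <- tempfile(fileext = ".pdf")  # hypothetical output location
pdf(out_file, width = 6, height = 3)    # size in inches, chosen arbitrarily
plot(x, y + 0.4, pch = 19, cex = 3, axes = FALSE,
     xlim = range(hg$breaks), ylim = c(1, max(y)),
     xlab = "mydata", ylab = "", xpd = NA)
axis(1)
dev.off()                               # close the device to write the file
```

Changing width and height here has the same effect as dragging the window, so some trial and error with those two values is still likely to be needed.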

The hg_dot() command

When made up into a function the command lines look like the following:

## Dotplot histogram
## Mark Gardener 2013
## www.dataanalytics.org.uk
hg_dot <- function(x, breaks = "sturges",
                   offset = 0.4,
                   cex = 3,
                   pch = 19, ...) {

  #      x = data vector
  # breaks = breakpoints, as for hist()
  # offset = amount to shift each point upwards
  #    ... = other instructions for plot

  hg <- hist(x, breaks = breaks, plot = FALSE)  # Make histogram data but do not plot
  bins <- length(hg$counts)                     # How many bins are needed?
  yvals <- numeric(0)                           # A blank variable to fill in

  for (i in 1:bins) {                           # Start a loop
    yvals <- c(yvals, hg$counts[i]:0)           # Work out the y-values
  }                                             # End the loop

  xvals <- numeric(0)                           # A blank variable

  for (i in 1:bins) {                                     # Start a loop
    xvals <- c(xvals, rep(hg$mids[i], hg$counts[i] + 1))  # Work out x-values
  }                                                       # End the loop

  dat <- data.frame(xvals, yvals)  # Make data frame of x, y variables
  dat <- dat[yvals > 0, ]          # Knock out any zero y-values
  minx <- min(hg$breaks)           # Min value for x-axis
  maxx <- max(hg$breaks)           # Max value x-axis
  miny <- min(dat$yvals)           # Min value for y-axis
  maxy <- max(dat$yvals)           # Max value for y-axis

  # Make the plot, without axes, allow points to overspill plot region
  plot(yvals + offset ~ xvals, data = dat,
       xlim = c(minx, maxx), ylim = c(miny, maxy),
       axes = FALSE, ylab = "", xpd = NA,
       cex = cex, pch = pch, ...)
  axis(1)   # Add in the x-axis

  # Make results of original data, histogram and plot data
  result <- list(hist = hg, original = x, plot.data = dat)
  invisible(result)  # Save all the results invisibly
} # end
## END

Once you run the command your chart will be created in whatever size your default graphics window is set to. Simply drag the window to a new size as appropriate.

The command produces a list result that contains the following:

  • the original data $original
  • the histogram statistics $hist
  • the values plotted $plot.data

If you assign the result of the command to a named object you can access these results afterwards.

> hg = hg_dot(mydata)

> names(hg)
[1] "hist"      "original"  "plot.data"

You can get the R script here: Dot Histogram Script.
