Dr. Mark Gardener 



Statistics – A guide
These pages are aimed at helping you learn about statistics: why you need them, what they can do for you, which routines are suitable for your purposes and how to carry out a range of statistical analyses.



The natural world is variable so several measurements need to be taken. Summary statistics help make sense of these repeated measurements. 
Summarizing data
When you want to measure something in the natural world you usually have to take several measurements. This is because things are variable, so you need several results to get an idea of the situation. Once you have these measurements you need to summarize them in some way because sets of raw numbers are not easily interpreted by most people. The following sections give you an idea of the most useful summary statistics you can use.

Four key elements in data summary: centrality, dispersion, replication and shape (distribution).
Basics of summarizing data
There are four key areas to consider when summarizing a set of numbers: centrality, dispersion, replication and shape (distribution).
You need to present the first three summary statistics in order to summarize a set of numbers adequately. There are different measures of centrality and dispersion; the measures you select are based on the last item, shape (or data distribution).

Averages are measures of centrality. The most appropriate average depends on data shape (distribution).
Averages
An average is a measure of the middle point of a set of values. This central tendency (centrality) is an important measure and is usually what you are comparing when looking at differences between samples, for example. There are three main kinds of average: the mean, the median and the mode.
Of these three, the mean and the median are most commonly used in statistical analysis. The most appropriate average depends on the shape of the data sample.

Arithmetic mean: Measure of centrality for normally distributed data 
Mean
The arithmetic mean is calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication):
$$\bar{x} = \frac{\sum x}{n}$$
In the formula, the ∑ symbol represents "sum of" and n represents the replication. The final mean is indicated using an overbar. This shows that the mean is your estimate of the true mean, because you usually measure only some of the items in a "population"; this subset is called a sample. If you measured everything then you would be able to calculate the true mean, which would be indicated by giving it a µ symbol.
The mean should only be used when the shape of the sample is appropriate. When the data are normally distributed the mean is a good summary of the average. If the data are not normally distributed the mean is not a good summary and you should use the median instead.
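The calculation of the mean can be sketched in a few lines of Python. This is only an illustration: the function name `arithmetic_mean` and the sample values are made up for the example and do not come from the original page.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the replication n."""
    n = len(values)                 # the replication
    if n == 0:
        raise ValueError("need at least one value")
    return sum(values) / n          # sum of x, divided by n


# Hypothetical sample, invented for illustration only.
sample = [4, 6, 7, 9, 14]
print(arithmetic_mean(sample))      # 8.0
```

In everyday work you would more likely call `statistics.mean` from the Python standard library, which gives the same result for this sample.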

Median: Measure of centrality that is not dependent on data shape (distribution). When data are normally distributed the median and mean are very close.
Median
The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data. The "formula" for working out the median depends on the ranks of the values: you want the value whose rank is (n/2) + 0.5.
If you have an odd number of values in your sample the median is simply the middle value. In the original worked example this middle value, and so the median, is 7.
When you have an even number of values the middle will fall between two items. What you do is use a value midway between the two items in the middle. In the original worked example this is midway between 4 and 7, which gives 5.5.
The median is a good general choice for an average because it is not dependent on the shape of the data. When the data are normally distributed the mean and the median are coincident (or very close).
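The rank rule for the median can be sketched as follows. The two samples are hypothetical: they were chosen only so that the results agree with the values quoted in the text (7 for an odd number of items, 5.5 for an even number) and are not the original example data.

```python
import math


def median(values):
    """Return the value whose 1-based rank is (n / 2) + 0.5."""
    ranked = sorted(values)                 # arrange the numbers in order (rank)
    n = len(ranked)
    if n == 0:
        raise ValueError("need at least one value")
    middle_rank = n / 2 + 0.5               # whole number if n is odd, x.5 if even
    lower = ranked[math.floor(middle_rank) - 1]
    upper = ranked[math.ceil(middle_rank) - 1]
    return (lower + upper) / 2              # midway between the middle item(s)


odd_sample = [12, 2, 9, 4, 7]               # hypothetical, n = 5
even_sample = [12, 2, 9, 4, 7, 3]           # hypothetical, n = 6
print(median(odd_sample))                   # 7.0
print(median(even_sample))                  # 5.5
```

When n is odd the two lookups land on the same item, so the "midway" step leaves the middle value unchanged.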

Mode: Most frequent value in a sample. Not much used in statistical analysis.
Mode
The mode is the most frequent value in a sample. It is calculated by working out how many there are of each value in your sample. The one with the highest frequency is the mode. It is possible to get tied frequencies, in which case you report both values. The sample is then said to be bimodal. You might get more than two modal values!
The mode is not commonly used in statistical analysis. It tends to be used most often when you have a lot of values, and where you have integer values (although it can be calculated for any sample). The mode is not dependent on the shape of your sample. Generally speaking you would expect your mode and median to be fairly close, although in strongly skewed samples they can differ. If the sample is normally distributed the mode will usually also be close to the mean.
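Counting frequencies and reporting ties can be sketched like this. The function name `modes` and the sample values are invented for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(values)                    # how many there are of each value
    highest = max(counts.values())              # the highest frequency
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical samples, invented for illustration only.
print(modes([1, 2, 2, 3]))                      # [2]
print(modes([3, 5, 5, 6, 8, 8, 9]))             # [5, 8], a bimodal sample
```

The standard library function `statistics.multimode` does a similar job, although it returns tied values in the order they were first seen rather than sorted.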

Dispersion: How spread out the values are around the average. High dispersion indicates high variability in a sample. The most useful general measures of dispersion are standard deviation and interquartile range. Choice of measure depends on data shape (distribution).
Dispersion
The dispersion of a sample refers to how spread out the values are around the average. If the values are close to the average then your sample has low dispersion. If the values are widely scattered about the average your sample has high dispersion.
[Figure: example samples with different dispersion]
The example figure shows samples that are normally distributed, that is, they are symmetrical around the average (mean). As far as dispersion goes, the principle is the same regardless of the shape of the data. However, different measures of dispersion will be more appropriate for different data distributions. There are various measures of dispersion, such as the standard deviation, the interquartile range and the range.
The choice of measurement depends largely on the shape of the data and what you want to focus on. In general with normally distributed data you use the standard deviation. If the data are not normally distributed you use the interquartile range.

Standard deviation: A measure of dispersion for normally distributed data 
Standard deviation
The standard deviation is used when the data are normally distributed. You can think of it as a sort of "average deviation" from the mean. The general formula for calculating standard deviation looks like the following:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
To work out standard deviation follow the formula from the inside out: calculate the mean, subtract the mean from each value, square each of these differences, add the squared differences together, divide the total by n − 1, and finally take the square root.
The final result is called s, the standard deviation. In most cases you will have taken a sample of values from a larger "population", so your value of s is your estimate of standard deviation (the sample standard deviation). This is also why you used n − 1 as the divisor in the formula. If you measured the entire population you can use n as the divisor. You would then have σ, which is the "true" standard deviation (called the population standard deviation). In effect the −1 is a compensation factor. As n gets larger and therefore closer to the entire population, subtracting 1 has a smaller and smaller effect on the result. In most statistical analyses you will use sample standard deviation (and so n − 1).
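The standard deviation calculation can be sketched step by step as below. The `population` switch and the sample values are assumptions added for illustration; the default uses n − 1 as the divisor, as described in the text.

```python
import math


def standard_deviation(values, population=False):
    """Sample standard deviation s (divisor n - 1) or, if population=True, sigma (divisor n)."""
    n = len(values)
    divisor = n if population else n - 1
    if divisor < 1:
        raise ValueError("not enough values")
    mean = sum(values) / n                              # the mean
    squared_diffs = [(x - mean) ** 2 for x in values]   # squared deviations from the mean
    return math.sqrt(sum(squared_diffs) / divisor)      # divide the sum, then square root


# Hypothetical sample, invented for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(sample))                   # about 2.14 (divisor n - 1)
print(standard_deviation(sample, population=True))  # 2.0 (divisor n)
```

The two results illustrate the compensation effect: with only eight values the n − 1 divisor gives a noticeably larger estimate than dividing by n. The standard library equivalents are `statistics.stdev` and `statistics.pstdev`.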

Other measures of dispersion for normally distributed data
There are other measures that can be used to represent dispersion when your sample is normally distributed. I will add notes about these at a later date.

Interquartile range (IQR): A measure of dispersion for data that are not normally distributed. Based on the ranks of the data items.
Interquartile range
The interquartile range (IQR) is a useful measure of the dispersion of data that are not normally distributed (see shape). You start by working out the median; this effectively splits the data into two chunks, with an equal number of values in each part. For each half you can now work out the value that is halfway between the median and the "end" (the maximum or minimum). This gives you values for the two quartiles (lower and upper). The difference between them is the IQR, which you usually express as a single value. The IQR essentially "knocks off" the most extreme portions of the data sample, leaving you with a core 50% of your original data. A small IQR denotes a small dispersion and a large IQR a large dispersion.
As a byproduct of working out the IQR you'll usually end up with five values: the minimum, the lower quartile, the median, the upper quartile and the maximum. These five values split the data sample into four parts, which is why they are called quartiles. You can calculate the quartiles from the ranks of the data values.
If you are using Excel you can compute the quartiles using the QUARTILE function.
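The five values and the IQR can be sketched with Python's standard `statistics` module. The rank formulas from the original page are not available here, so this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, which to my understanding follows the same convention as Excel's QUARTILE function. Other conventions give slightly different quartiles, especially for small samples. The function name and the sample values are invented for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # Assumption: 'inclusive' interpolation; other rank conventions exist.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def interquartile_range(values):
    """Upper quartile minus lower quartile, expressed as a single value."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


# Hypothetical sample, invented for illustration only.
sample = [1, 3, 4, 7, 9, 12, 15]
print(five_number_summary(sample))   # (1, 3.5, 7.0, 10.5, 15)
print(interquartile_range(sample))   # 7.0
```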

Range is max − min. Not very useful as a measure of dispersion.
Range
The range is simply the difference between the maximum and the minimum values. It is quite a crude measure and not very useful. The interquartile range is much more useful, and makes use of the maximum and minimum values in the calculation.

Replication is the number of values in your sample 
Replication
This is the simplest of the summary statistics but it is still important. The replication is simply how many items there are in your sample (that is, the number of observations). The value n, the replication, is used in calculating other summary statistics, such as standard deviation and IQR, but it is also helpful in its own right. You should look at the dispersion and replication together. A certain value for dispersion might be considered "high" if n is small but quite "low" if n is very large.

Data shape affects the kind of summary statistic and analytical approach. Data shape relates to the distribution of values around the average.
Shape
The shape of the data affects the type of summary statistics that best summarize them. The "shape" refers to how the data values are distributed across the range of values in the sample. Generally you expect there to be a "cluster" of values around the average. It is important to know if the values are more or less symmetrically arranged around the average, or if there are more values to one side than the other. There are two main ways to explore the shape (distribution) of a sample of data values: drawing the distribution as a graph, and calculating shape statistics.
The ultimate goal is to determine what kind of distribution your data forms. If you have a normal distribution you have a wide range of options when it comes to data summary and subsequent analysis.

Types of data distribution include normal (Gaussian). Normal distribution is called parametric. Other distributions are nonparametric.
Types of data distribution
There are many "shapes" of data; the one you will refer to most often is the normal (Gaussian) distribution.
In general your aim is to work out if you have a normal distribution or not. If you do have a normal distribution you can use mean and standard deviation for summary. If you do not have a normal distribution you need to use median and IQR instead. The normal distribution (also called Gaussian) has well-explored characteristics and such data are usually described as parametric. If data are not parametric they can be described as skewed or nonparametric.

Visualize data distribution with a frequency chart: a tally plot or a histogram.
Drawing the distribution
There are two main ways to visualize the shape of your data: tally plots and histograms. In both cases the idea is to make a frequency plot. The data values are split into frequency classes, usually called bins. You then determine how many data items are in each bin. There is little difference between a tally plot and a histogram; they show the same information but are constructed in slightly different ways.

A tally plot is a simple frequency chart that can be drawn in a notebook. Split data into size classes (bins) and determine frequency of data in each size class.
Tally plots
A tally plot is a kind of frequency graph that you can sketch in a notebook. This makes it a very useful tool for times when you haven't got a computer to hand. To draw a tally plot, split the range of your data into size classes (bins), then work through the data making a tally mark in the appropriate bin for each value.
You will now be able to assess the shape of the data sample you've got.
[Figure: tally plot of the first example dataset]
The tally plot for the first example dataset shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle. So here the mean and standard deviation would be good summary values to represent the data.
The first bin, labelled 18, contains values up to 18. There are two in the dataset (17 and 16). The next bin is 21 and therefore contains items that are greater than 18 but not greater than 21 (there are three: 21, 19 and 21).
A second example dataset is not normally distributed.
[Figure: tally plot of the second example dataset]
Note that the same bins were used for the second dataset. The range for both samples was 16 to 36. The data in the second sample are clearly not normally distributed. The tallest size class is not in the middle and there is a long "tail" towards the higher values (see shape statistics). For these data the median and interquartile range would be appropriate summary statistics.
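A tally plot can also be sketched in code. The dataset below is hypothetical: the original data are not reproduced here, so these values were invented to span 16 to 36 and to agree with the counts quoted for the first two bins. Each bin is labelled by its upper limit, as in the text; the bin width of 3 follows from the labels 18 and 21.

```python
def tally_counts(values, first_bin, bin_width, n_bins):
    """Count values per bin; each bin is labelled by its upper limit."""
    uppers = [first_bin + i * bin_width for i in range(n_bins)]
    counts = {upper: 0 for upper in uppers}
    for v in values:
        for upper in uppers:
            if v <= upper:               # first bin whose upper limit is not exceeded
                counts[upper] += 1
                break
        else:
            raise ValueError(f"{v} is larger than the last bin")
    return counts


# Hypothetical dataset, invented for illustration only.
data = [16, 17, 19, 21, 21, 22, 23, 24, 24, 25, 26, 26,
        27, 27, 27, 28, 29, 29, 30, 31, 32, 33, 34, 36]

for upper, count in tally_counts(data, first_bin=18, bin_width=3, n_bins=7).items():
    print(f"{upper:>3} {'|' * count}")   # one tally mark per value
```

The printed marks are tallest in the middle bin (27) and fall away evenly on both sides, which is the roughly symmetrical shape you would expect from normally distributed data.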

Histogram: A kind of bar chart showing frequency of data in various size classes 
Histograms
A histogram is like a bar chart. The bars represent the frequency of values in the data sample that correspond to various size classes (bins). Generally the bars are drawn without gaps between them to highlight the fact that the x-axis represents a continuous variable. There is little difference between a tally plot and a histogram but the latter can be produced easily using a computer (you can sketch one in a notebook too). To make a histogram you follow the same general procedure as for a tally plot, the main difference being that you draw a bar whose height shows the frequency in each bin rather than making tally marks.
You can draw a histogram by hand or use your spreadsheet. The example histograms were drawn using the same data as for the tally plots in the preceding section. The first histogram shows normally distributed data.
[Figure: histogram of the normally distributed dataset]
The next histogram shows a nonparametric distribution.
[Figure: histogram of the dataset that is not normally distributed]
In both these examples the bars are shown with a small gap; more properly the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes.
[Figure: histogram with labels at the divisions between size classes]
Note that this last histogram uses slightly different size classes to the earlier ones.
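The idea of labelling a histogram at the divisions between size classes can be sketched by working from bin edges rather than bin labels. The dataset is hypothetical (invented to span 16 to 36), and the lowest edge of 15 is an assumption made so that every class has the same width of 3. Each class here includes its upper edge but not its lower edge.

```python
from bisect import bisect_left


def histogram_counts(values, edges):
    """Count the values falling in each class (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_left(edges, v) - 1        # index of the class that holds v
        if i < 0 or i >= len(counts):
            raise ValueError(f"{v} lies outside the bin edges")
        counts[i] += 1
    return counts


# Hypothetical dataset and edges, invented for illustration only.
data = [16, 17, 19, 21, 21, 22, 23, 24, 24, 25, 26, 26,
        27, 27, 27, 28, 29, 29, 30, 31, 32, 33, 34, 36]
edges = list(range(15, 37, 3))               # 15, 18, 21, ..., 36

for lower, upper, count in zip(edges, edges[1:], histogram_counts(data, edges)):
    print(f"{lower:>2}-{upper:<2} {'#' * count}")   # one row per size class
```

A plotting library or spreadsheet would draw the same counts as touching vertical bars; the text rows here are only a stand-in for that chart.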

Shape statistics are numerical values that help you characterize the distribution (its shape): skewness and kurtosis.
Shape statistics
Visualizing the shape of your data samples is usually your main goal. However, it is possible to characterize the shape of a data distribution using shape statistics. There are two, which are used in conjunction with each other: skewness and kurtosis.
If you are producing a numerical data summary these two values are useful statistics.

Skewness is a measure of how central the average is. Use SKEW in Excel.
Skewness
The skewness of a sample is a measure of how central the average is in relation to the overall spread of values. The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s.
In practice you'll use a computer to calculate skewness; Excel has a SKEW function that will compute it for you. A positive value indicates that the bulk of the values (and so the average) lies towards the left of the range, that is, there is a long "tail" of more positive values. A negative value indicates the opposite. The further the value is from zero, the more skewed the sample is.
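The skewness calculation can be sketched as below. The formula from the original page is not available here, so this sketch assumes the sample-adjusted formula documented for Excel's SKEW function, n / ((n − 1)(n − 2)) × ∑((x − x̄) / s)³, where s is the sample standard deviation. The function name and the sample values are invented for illustration.

```python
import math


def sample_skewness(values):
    """Skewness using the adjusted formula assumed to match Excel's SKEW."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample sd
    if s == 0:
        raise ValueError("all values are identical")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical samples, invented for illustration only.
symmetrical = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 2, 3, 3, 3, 10]
print(sample_skewness(symmetrical))       # effectively zero
print(sample_skewness(long_right_tail))   # positive: tail of more positive values
```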

Kurtosis is a measure of how "pointed" a distribution is. Use KURT in Excel.
Kurtosis
The kurtosis of a sample is a measure of how pointed the distribution is (see drawing the distribution). It is also a way to think about how clustered the values are around the middle. The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s.
In practice you'll use a computer to calculate kurtosis; Excel has a KURT function that will compute it for you. KURT reports kurtosis relative to the normal distribution, so normally distributed data give a value close to zero. A positive result indicates a pointed distribution, with values tightly clustered around the middle. A negative result indicates a flat distribution, with values spread more evenly across the range. The further the value is from zero, the more extreme the pointedness or flatness of the distribution.
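The kurtosis calculation can be sketched in the same way. As with skewness, the formula from the original page is not available here, so this sketch assumes the formula documented for Excel's KURT function, which subtracts a correction term so that normally distributed data score close to zero. The function name and the sample values are invented for illustration.

```python
import math


def sample_kurtosis(values):
    """Excess kurtosis using the formula assumed to match Excel's KURT."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least four values")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample sd
    if s == 0:
        raise ValueError("all values are identical")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical samples, invented for illustration only.
pointed = [1, 5, 5, 5, 5, 5, 5, 9]      # most values clustered at the middle
flat = [1, 2, 3, 4, 5, 6, 7, 8]         # values spread evenly
print(sample_kurtosis(pointed))         # positive (about 3.5)
print(sample_kurtosis(flat))            # negative (about -1.2)
```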

Data summary: centrality, dispersion and replication. Shape of data determines which statistics are most appropriate. Explore shape (distribution) using graphs and shape statistics.
Summary
You should always summarize a sample of data values to make them more easily understood (by you and others). At the very least you need to show centrality (an average), dispersion and replication.
The shape of the data (its distribution) is also important because the shape determines which summary statistics are most appropriate to describe the sample. Your data may be normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or they may be skewed and therefore nonparametric. You can explore and describe the shape of data using graphs (tally plots and histograms).
You can also use shape statistics (skewness and kurtosis). The shape of the data also leads you towards the most appropriate ways of analyzing the data, that is, which statistical tests you can use.
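The overall recommendation can be sketched as a small helper that reports replication together with either mean and standard deviation or median and IQR. Deciding whether a sample is normally distributed is left to you (for example after drawing a tally plot), so the sketch takes that decision as a flag. The function name, the flag and the sample values are invented for illustration, and the quartiles assume the "inclusive" method of `statistics.quantiles`.

```python
import statistics


def summarize(values, normally_distributed):
    """Return replication plus the centrality and dispersion suited to the shape."""
    n = len(values)                                     # replication
    if normally_distributed:
        return {"n": n,
                "mean": statistics.mean(values),
                "sd": statistics.stdev(values)}         # sample sd, divisor n - 1
    # Assumption: 'inclusive' quartiles; other conventions differ slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"n": n,
            "median": statistics.median(values),
            "iqr": q3 - q1}


# Hypothetical samples, invented for illustration only.
print(summarize([2, 4, 4, 4, 5, 5, 7, 9], normally_distributed=True))
print(summarize([1, 3, 4, 7, 9, 12, 15], normally_distributed=False))
```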
