Choosing the right statistical test/analysis

There are many different types of statistical analysis. Choosing the correct analytical approach for your situation can be a daunting process. In this section you’ll get an overview of the statistical procedures that are potentially available and under what circumstances they are used.

You should plan your statistical approach at the start of your project, before you collect any data. Different statistical tests have different requirements and planning in advance has various benefits:

Knowing the statistical approach will allow you to plan the way you collect your data.
You will save time because you’ll only collect relevant data.
You will save effort.

If you simply collect data and then look for a way to carry out an analysis you may find that you do not have quite what you need to answer your research question.

Knowing what type of project you have and what sort of data you will collect can be useful in determining the best analytical approach.

Types of Data

There are two main sorts of data variable you will collect:

Response – sometimes called dependent. These are the “things” that are affected by the other variables and/or experimental situation.
Predictor – sometimes called independent (or factor, or grouping). These are the variables (factors) that affect the response variable(s).

The response and predictor variables that you measure and record come in different forms. The form of data will affect the kinds of statistical approach you take.

Interval – these are “real” measurements, such as height, weight, abundance. You have numerical values that you can arrange in order and can tell the interval between measurements (determined by the precision).
Ordinal – these are measurements that can be ranked in order (of size) but you cannot tell the interval between measurements. For example: Large, Medium, Small.
Categorical – these are not measurements but classifications, e.g. red, blue, green.
Count or Frequency – these are essentially the same as categorical; you have a count (frequency) of observations for each of various categories.

The kind of data you collect affects the statistical approach. You may decide to use ordinal measurements to save time for example, but this will limit the kinds of analysis you will be able to conduct subsequently.

Types of Project

Knowing what kind of project you are undertaking can be a big help in working your way towards the most appropriate statistical approach. The following list covers a range of possibilities:

Descriptive
- Summary statistics – describing the general properties of data.
- Discipline-related description (e.g. community classification in ecology).
Differences – you split your data into groups.
- Differences between sample groups.
- Differences between sample data distributions.
- Clustering – splitting data into different groups.
Links – you join variables.
- Correlations and regressions – variables are generally continuous. These analyses are sometimes called Supervised Machine Learning.
- Associations – variables are always categorical.
  - 2-Classes – Chi squared tests using frequency data.
  - 1-Class – Goodness of fit tests using frequency data.
- Classification – hierarchies of “relatedness” (see also Patterns and classification). These kinds of analysis are sometimes called Unsupervised Machine Learning.
  - Similarity and dissimilarity.
  - Hierarchical classification.
  - Ordination – some methods of ordination allow you to link response and predictor variables.
Patterns and classification
- 1D – linear patterns (runs tests).
- 2D – nearest neighbour (clusters).
- nD – multi-dimensional patterns (ordination).
- Classification – hierarchies of “relatedness” (see also Links).
  - Similarity and dissimilarity.
  - Hierarchical classification.
- Time related – patterns in time data.

There are some miscellaneous analyses concerned with properties of the data and statistical tests:

Distribution – checking that data conform to a particular distribution (usually normal).
Power – checking the discriminatory power of your differences tests.
Assumptions – checking assumptions of particular tests, such as equality of variance.
Adjustments for multiple testing – the more tests you run, the greater the chance that at least one result appears significant purely by chance. There are methods to account for running multiple tests (particularly for differences tests).
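
To make the idea of adjusting for multiple tests concrete, here is a minimal sketch of two widely used adjustments, Bonferroni and Holm. Neither method is named in this guide, so treat them as illustrative choices rather than a recommendation. The p-values are hypothetical, and the function names `bonferroni` and `holm` are made up for this example.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment; never more conservative than Bonferroni."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, p_values[i] * (m - rank))
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values from four separate tests (illustration only).
raw = [0.010, 0.040, 0.030, 0.005]
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

You compare the adjusted values with your usual significance threshold, so a result that looked significant on its own may no longer be so once the number of tests is taken into account.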

There are also various miscellaneous categories that crop up in certain disciplines. In ecology for example:

Population estimates – working out the number of individuals in a given area.
Community classification – describing a community of plants (or animals).
Diversity measures – the number of different species and their relative abundance.
Indicator species – identifying species that are indicative of certain groups.
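
As a rough illustration of a diversity measure, the sketch below computes species richness (the number of different species) and the Shannon index, which also takes relative abundance into account. The Shannon index is just one common choice and is an assumption here, not something this guide prescribes. The species names and counts are hypothetical.

```python
import math


def richness(counts):
    """Number of species with at least one individual recorded."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over the species proportions p."""
    total = sum(counts.values())
    proportions = [n / total for n in counts.values() if n > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical counts of individuals per species at two sites.
even_site = {"species_a": 10, "species_b": 10, "species_c": 10, "species_d": 10}
uneven_site = {"species_a": 37, "species_b": 1, "species_c": 1, "species_d": 1}

for name, site in [("even", even_site), ("uneven", uneven_site)]:
    print(name, richness(site), round(shannon_index(site), 3))
```

Both sites have the same richness, but the site dominated by a single species has the lower Shannon index, which is why relative abundance matters as well as the number of species.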

Most projects will fit into one (or more) of the preceding types. In the key you will be able to choose the most appropriate type for your situation and work your way towards an analytical approach (or perhaps several).

Recording data

How you record your data is important. If your data are written down in a sensible arrangement you can make sense of them more easily and carry out any statistical analysis more easily and effectively. Having a good data recording system is an important aspect of any project.

In general, you want to use a scientific recording layout for storing your data. In this format you have a column for each variable. Each row represents a single observation (replicate).

For example, here is a simple dataset recorded as two samples. The data show the jawbone lengths (in mm) of male and female golden jackal specimens.

Table 2. Data set out in sample format. Lengths of jawbones of male and female golden jackal.

Female	Male
110	120
111	107
107	110
108	116
110	114
105	111
107	113
106	117
111	114
111	112

In scientific recording format you would represent the data like so (only part of the dataset is shown):

Table 3. Data set out in scientific format with a response (dependent) variable and a predictor (independent) variable. Length of mandibles of male and female golden jackal.

Length (mm)	Sex
120	Male
107	Male
110	Male
110	Female
111	Female
107	Female

The first column is the response variable, the length of the jawbone, which you think is affected by the predictor variable. The second column shows the predictor variable (sex), which shows two levels, male and female.

Having this layout makes it easier for most statistical programs to deal with the data. It also helps you to manage your data. You can use pivot tables to rearrange the data and help you explore the dataset.
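
If your data are already in sample format, converting them is straightforward. The following is a minimal sketch in Python that turns the two jackal samples into scientific format, with one row per observation. The field names `length_mm` and `sex` are invented for this example.

```python
# Sample format: one list per group, as in Table 2.
female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]

samples = {"Male": male, "Female": female}

# Scientific format: one row per observation, one column per variable.
records = [
    {"length_mm": length, "sex": sex}
    for sex, lengths in samples.items()
    for length in lengths
]

# Show the first three rows of each group, matching the layout of Table 3.
print("Length (mm)\tSex")
for sex in samples:
    rows = [r for r in records if r["sex"] == sex]
    for row in rows[:3]:
        print(f"{row['length_mm']}\t{row['sex']}")
```

Each record now carries its own value of the predictor variable, so adding a further predictor later only means adding another field to each row.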

Key to statistical analysis

Work through the key below to find the most appropriate statistical analysis for your situation.

Descriptive: describing data.
- Describing a sample of data – descriptive statistics (centrality, dispersion, replication), see also Summary statistics.
Data distribution: tests looking at data “shape” (see also Data distribution).
- Testing for the normal (parametric) distribution – Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test.
- Comparing the shape of a sample to another sample – Kolmogorov-Smirnov test.
- Comparing the shape of a sample to a known distribution – Kolmogorov-Smirnov test.
Assumptions: testing the assumptions required for a statistical analysis.
- Equality of variance:
  - Data are normally distributed – Levene’s test, Bartlett test (also Mauchly test for sphericity in repeated measures analysis).
  - Data are non-parametric – Ansari-Bradley, Mood test, Fligner-Killeen test.
- Normality of the data – Shapiro-Wilk test, Kolmogorov-Smirnov test (also graphical methods e.g. histograms, Quantile-Quantile plots).
- Matching a particular data distribution – Kolmogorov-Smirnov test.
Differences: splitting data into “chunks” (sampling units) to explore differences between them.
- Comparing one sample to a particular value.
  - Data are normally distributed – one-sample t-test.
  - Data are non-parametric – one-sample Wilcoxon signed rank test.
- Comparing two samples.
  - Data are normally distributed – Student’s t-test (there are two versions: the classic version assumes equal variance; the other, Welch’s version, does not).
  - Data are non-parametric – U-test (also called Mann-Whitney U test or Wilcoxon rank sum test).
- Comparing data with one predictor variable that has two levels – same as two-samples.
- Repeated measures.
  - Data are normally distributed – The t-test for matched pairs.
  - Data are non-parametric – Wilcoxon signed rank test (for matched pairs).
- Matched Pairs – same as Repeated Measures.
- Comparing data with more than two samples (but only one predictor variable).
  - Data are normally distributed – analysis of variance (ANOVA), linear regression.
  - Data are non-parametric – Kruskal-Wallis test.
- Comparing data with more than one predictor variable.
  - Data are normally distributed – analysis of variance (ANOVA), linear regression.
  - Data are non-parametric – Friedman test, Quade test, generalized linear modelling.
- Comparing proportions – Proportion test (proportions are frequency data and so are more allied to tests of association).
- Comparing two data distributions (see also Data distribution) – Kolmogorov-Smirnov test.
- Comparing a response variable at different times (see also Time related data) – repeated measures ANOVA.
- Splitting a dataset into smaller units – see Patterns and classification.
Discriminatory power.
- Working out what data you need to achieve a certain level of discriminatory power for various differences tests. The sample size (replication) is the most commonly calculated variable – Power test (various versions, for t-test, U-test and so on).
Links: linking variables together.
- Linking two interval or ordinal variables – correlation.
  - Data are normally distributed – Pearson’s Product Moment (also simple regression).
  - Data are non-parametric – Spearman’s Rank (Rho) or Kendall’s Tau.
- Linking a response variable with one or more predictor variables – regression.
  - Data are normally distributed – linear regression (a form of supervised machine learning).
  - Data are non-parametric – generalized linear modelling (if data distribution is known, e.g. logistic regression for binomial data), or non-linear modelling.
- Linking a response variable to a time component (see also Time related data).
- Linking two sets of count or frequency data – Pearson’s Chi Squared association test.
- Linking one set of count or frequency data to another – goodness of fit test or G-test.
- Linking one data distribution to another – see Data distribution.
- Comparing proportions – proportions are frequencies (see also Differences) – Proportion test.
Patterns and classification: looking for patterns in data (e.g. clusters).
- One-dimensional – looking for a pattern in binary data, i.e. where there are only two options – runs test.
- Two-dimensional – looking for a pattern in the spatial arrangement of items (e.g. species, nests) – nearest neighbour.
- Multi-dimensional – several response variables (e.g. species abundance) and one or more predictor variables – cluster analysis, k-means analysis, ordination (e.g. principal coordinates, principal components, multidimensional scaling).
- Classification patterns
  - Clustering into groups – k-means, cluster analysis, ordination (e.g. multidimensional scaling).
  - Hierarchies of relatedness – hierarchical cluster analysis, dissimilarity.
- Time related data: when time is a major or minor component.
  - Looking for patterns in data collected over time.
  - Comparing something at different times (see Differences).
  - Linking some variable with changes over time (see Links).
- Population estimates: you are looking to estimate the size of a population of mobile organisms.
- Community classification: you are looking to define a type of community (plants or animals) according to some set scheme (such as the National Vegetation Classification).
- Diversity: you have several species, in one or more areas, and wish to explore the diversity.
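
As a worked instance of one branch of the key, take the jackal jawbone data: one response variable (length) and one predictor with two levels (sex), which leads you to "Comparing two samples". The sketch below computes the t statistic for the version of the t-test that does not assume equal variance (Welch's version), using only the Python standard library. It assumes the data are normally distributed, which you would check first. The function name `welch_t` is made up for this example.

```python
import math
import statistics

# Jawbone lengths (mm) of golden jackal, from the sample-format table.
female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]


def welch_t(x, y):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    # Squared standard error of each sample mean (sample variance / n).
    se2_x = statistics.variance(x) / len(x)
    se2_y = statistics.variance(y) / len(y)
    t = (mean_x - mean_y) / math.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(x) - 1) + se2_y ** 2 / (len(y) - 1)
    )
    return t, df


t, df = welch_t(male, female)
print(f"mean male = {statistics.mean(male)}, mean female = {statistics.mean(female)}")
print(f"t = {t:.3f}, df = {df:.1f}")
```

The standard library has no t distribution, so this stops at the test statistic and degrees of freedom. To obtain a p-value you would look the result up in a t table or use a dedicated statistics package.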
