Dr. Mark Gardener

 
About

Providing training for:

  • Ecology
  • Data analysis
  • Statistics
  • R, the statistical programming language
  • Data management
  • Data mining

Statistics – A guide

These pages are aimed at helping you learn about statistics: why you need them, what they can do for you, which routines are suitable for your purposes and how to carry out a range of statistical analyses.


Plan your statistical approach before you collect any data

Planning:

Saves time
Saves effort
Informs your data collection methodology


Choosing the right statistical analysis

There are many different types of statistical analysis. Choosing the correct analytical approach for your situation can be a daunting process. In this section you'll get an overview of the statistical procedures that are potentially available and under what circumstances they are used.

You should plan your statistical approach at the start of your project, before you collect any data. Different statistical tests have different requirements and planning in advance has various benefits:

  • Knowing the statistical approach will allow you to plan the way you collect your data.
  • You will save time because you'll only collect relevant data.
  • You will save effort.

If you simply collect data and then look for a way to carry out an analysis you may find that you do not have quite what you need to answer your research question.

Knowing what type of project you have and what sort of data you will collect can be useful in determining the best analytical approach. Read notes on types of project, types of data and recording your data next or jump direct to the key. The key should allow you to work out what statistical test is most appropriate for your project. At some future point I'll add notes about how to carry out some of the analyses.

Types of Project | Types of Data | Recording data | Key to statistics


Types of project:

Differences
Correlations & Regressions
Association
Classification
Spatial Patterns

Some disciplines have special types of project, e.g. in ecology:

Population
Community classification
Diversity


Types of project

Knowing what kind of project you are undertaking can be a big help in working your way towards the most appropriate statistical approach. The following list covers a range of possibilities:

  • Descriptive
  • Differences – you split your data into groups.
    • Differences between sample groups.
    • Differences between sample data distributions.
    • Clustering – splitting data into different groups.
  • Links – you join variables.
    • Correlations and regressions – variables are generally continuous.
    • Associations – variables are always categorical.
      • 2-Classes – Chi squared tests using frequency data.
      • 1-Class – Goodness of fit tests using frequency data.
    • Classification – hierarchies of "relatedness" (see also Patterns and classification).
      • Similarity and dissimilarity.
      • Clustering.
      • Hierarchical classification.
    • Ordination – some methods of ordination allow you to link response and predictor variables.
  • Patterns and classification
    • 1D – linear patterns (runs tests).
    • 2D – nearest neighbour (clusters).
    • nD – multi-dimensional patterns (ordination).
    • Classification – hierarchies of "relatedness" (see also Links).
      • Similarity and dissimilarity.
      • Clustering.
      • Hierarchical classification.
  • Time related – patterns in time data.

There are some miscellaneous analyses concerned with properties of the data and statistical tests:

  • Distribution – checking that data conform to a particular distribution (usually normal).
  • Power – checking the discriminatory power of your differences tests.
  • Assumptions – checking assumptions of particular tests, such as equality of variance.
  • Adjustments for multiple testing – the more tests you run the greater the chance of something being significant. There are methods to account for the running of multiple tests (particularly for differences tests).

There are also various miscellaneous categories that crop up in certain disciplines. In ecology for example:

  • Population estimates – working out the number of individuals in a given area.
  • Community classification – describing a community of plants (or animals).
  • Diversity measures – the number of different species and their relative abundance.
  • Indicator species – identifying species that are indicative of certain groups.

Most projects will fit into one (or more) of the preceding types. In the key you will be able to choose the most appropriate type for your situation and work your way towards an analytical approach (or perhaps several).


Two sorts of variable:

Response (dependent)
Predictor (independent or factor)

Three kinds of data:

Interval
Ordinal
Categorical

The kind of data affects the statistical approach


Types of data

There are two main sorts of data variable you will collect:

  • Response – sometimes called dependent. These are the "things" that are affected by the other variables and/or experimental situation.
  • Predictor – sometimes called independent (or factor). These are the variables (factors) that affect the response variable(s).

The response and predictor variables that you measure and record come in different forms. The form of data will affect the kinds of statistical approach you take.

  • Interval – these are "real" measurements, such as height, weight, abundance. You have numerical values that you can arrange in order and can tell the interval between measurements (determined by the precision).
  • Ordinal – these are measurements that can be ranked in order (of size) but you cannot tell the interval between measurements. For example: Large, Medium, Small.
  • Categorical – these are not measurements but classifications, e.g. red, blue, green.
  • Count or Frequency – these are essentially the same as categorical; you have a count (frequency) of observations for each of various categories.

The kind of data you collect affects the statistical approach. You may decide to use ordinal measurements to save time for example, but this will limit the kinds of analysis you will be able to conduct subsequently. You'll be able to see how the kind of data affects your analytical approach in the key.


Use a scientific recording layout to maximize usefulness and effectiveness

Each column represents a single variable (response and predictors)

Each row represents a single record, that is, the observations or replicates

See Scientific Recording Format article in Writer's Bloc


Recording data

How you record your data is important. If your data are written down in a sensible arrangement you can make sense of them more easily and carry out any statistical analysis more easily and effectively. Having a good data recording system is an important aspect of any project.

In general you want to use a scientific recording layout for storing your data. In this format you have a column for each variable. Each row represents a single observation (replicate).

For example, here is a simple dataset recorded as two samples. The data show the lengths of jawbones (in mm) from male and female golden jackal specimens.

Female Male
110 120
111 107
107 110
108 116
110 114
105 111
107 113
106 117
111 114
111 112

In scientific recording format you would represent the data like so (only part of the dataset is shown):

Length (mm) Sex
120 Male
107 Male
110 Male
110 Female
111 Female
107 Female

The first column is the response variable, the length of the jawbone, which you think is affected by the predictor variable. The second column shows the predictor variable (sex), which shows two levels, male and female.

Having this layout makes it easier for most statistical programs to deal with the data. It also helps you to manage your data. You can use pivot tables to rearrange the data and help you explore the dataset. See more about scientific recording format in the Writer's Bloc section.
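As a minimal sketch (pure standard-library Python; the `group_means` helper is my own illustration, not from any particular package), here is the jackal dataset stored in scientific recording format and summarized by the levels of the predictor:

```python
from statistics import mean

# Jackal jawbone lengths (mm) in scientific recording format:
# one column per variable (length, sex), one row per observation.
records = [
    (120, "Male"), (107, "Male"), (110, "Male"), (116, "Male"), (114, "Male"),
    (111, "Male"), (113, "Male"), (117, "Male"), (114, "Male"), (112, "Male"),
    (110, "Female"), (111, "Female"), (107, "Female"), (108, "Female"), (110, "Female"),
    (105, "Female"), (107, "Female"), (106, "Female"), (111, "Female"), (111, "Female"),
]

def group_means(rows):
    """Summarize the response (length) for each level of the predictor (sex)."""
    groups = {}
    for length, sex in rows:
        groups.setdefault(sex, []).append(length)
    return {sex: mean(values) for sex, values in groups.items()}

print(group_means(records))
```

Because every observation carries its own copy of the predictor value, regrouping the data (as a pivot table would) is a simple pass over the rows.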



Key to statistical analysis

Start by selecting the most appropriate type of project or analysis from the list in the first column of the table below. The second column contains notes about options for the various project types to guide your decision. Use the hyperlinks to jump to sections containing new choices; keep going until you end up with an appropriate type of analysis. You may find that your project falls into more than one type or that you end up with several options at the end. You can use the back button on your browser to back up, and the key link to return to the start point, at any time.

I am an ecologist so my examples will generally lie in that area. This is a work in progress; there are many omissions, and alterations are inevitable, so please be patient...

Start Here

Type Notes
Descriptive

Summary statistics and discipline-related:

  • Describing a sample of data – summary statistics.
  • Describing the elements of a community.
  • Autecological – describing the characteristics of a species.
Data distribution

Tests of data shape:

  • Testing for the normal (parametric) distribution.
  • Comparing the shape of a sample to another sample.
  • Comparing the shape of a sample to a known distribution (e.g. Poisson).
Assumptions

Testing the assumptions required for a statistical analysis, such as homogeneity of variance.

Differences

Splitting data into sampling "chunks" to explore differences between them:

  • Comparing one sample to a particular value.
  • Comparing two samples.
  • Comparing data with one predictor variable that has two levels.
  • Comparing data with more than two samples.
  • Comparing data with more than one predictor variable.
  • Comparing proportions – proportions are frequency data and so are more allied to tests of association (see Links).
  • Comparing two data distributions (see Data distribution).
  • Comparing a response variable at different times (see also Time related data).
  • Splitting a dataset into smaller units (see Patterns and Classification).
Discriminatory power – Working out what data you need to achieve a certain level of discriminatory power for various differences tests (see also Differences). The sample size (replication) is the most commonly calculated variable.
Links

Linking variables together:

  • Linking two interval or ordinal variables – correlation.
  • Linking a response variable with one or more predictor variables – regression.
  • Linking a response variable to a time component (see also Time related data).
  • Linking two sets of count or frequency data – association.
  • Linking one set of count or frequency data to another – goodness of fit.
  • Linking one data distribution to another – see Data distribution.
  • Comparing proportions – proportions are frequencies (see also Differences).
Patterns and classification

Patterns:

  • One-dimensional – looking for a pattern in binary data, i.e. where there are only two options.
  • Two-dimensional – looking for a pattern in the spatial arrangement of items (e.g. species, nests).
  • Multi-dimensional – several response variables (e.g. species abundance) and one or more predictor variables.
  • Classification patterns – clustering into groups.
  • Classification patterns – hierarchies of relatedness.
Time related data

Time can be a major or minor component:

  • Looking for patterns in data collected over time.
  • Comparing something at different times (see Differences).
  • Linking some variable with changes over time (see Links).
Population estimates – You are looking to estimate the size of a population of mobile organisms.
Community classification – You are looking to define a type of community (plants or animals) according to some set scheme (such as the National Vegetation Classification).
Diversity – You have several species, in one or more areas, and wish to explore the diversity.

Identify the most appropriate starting point from the preceding table. Click the links to go to the relevant sections. You can return to the start using the key link. In future there will be links to pages giving more details about how to carry out some of the statistical tests, with examples and so on. Please be patient...


Descriptive:

Summarizing Samples
Discipline specific


Descriptive

There are two sorts of descriptive project to consider.

You should always summarize your data. At the very least you should know what data distribution you are dealing with, as this will affect the kind of analytical approach you can take.

Other kinds of descriptive project will be specific to your discipline. In ecology for example you may classify a community of plants or describe the "features" of a species. See Community classification for more details.


Data distribution:

Testing for normal distribution
Comparing two samples


Data distribution

The shape of the data affects the kind of analyses you can carry out. The most common test is to compare to the normal distribution but you can also compare a distribution to another known distribution or to another sample. There are various choices:

  • Graphical analysis – histograms or quantile-quantile plots.
  • Comparing to the normal distribution – the Shapiro-Wilk test, Kolmogorov-Smirnov test.
  • Comparing two distributions – the Kolmogorov-Smirnov test.

In some kinds of statistical testing it is not the actual data that need to be normally distributed but the residuals (for example in ANOVA, correlations and regressions that use the properties of the normal distribution). You can check these residuals after conducting the test. If the residuals are not normally distributed then you should attempt to transform the data to improve the situation, or find an alternative test.
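As an illustration, and assuming the scipy library is available, a Shapiro-Wilk check might look like this in Python (the female jackal sample from the recording-data example is reused; the 0.05 threshold is the conventional choice, not a rule from this guide):

```python
from scipy import stats

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed.
stat, p = stats.shapiro(female)
if p > 0.05:
    print("No evidence against normality; parametric tests are reasonable.")
else:
    print("Data look non-normal; consider a rank-based test or a transformation.")
```

The Kolmogorov-Smirnov comparisons mentioned above are available in the same library as `stats.kstest` (against a named distribution) and `stats.ks_2samp` (two samples).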


Assumptions:

Homogeneity of variance


Assumptions

Most statistical analyses make assumptions about the data. Usually the shape of the data is the most important, and comparing the data distribution is an important element of the analysis.

Other assumptions are concerned with measures of dispersion. The options depend on the distribution of the data:

  • Parametric data – homogeneity of variance tests: Bartlett test, variance test.
  • Non-parametric data – differences in scale parameter: Ansari-Bradley, Mood test.
  • Non-parametric data – rank-based homogeneity of variance: Fligner-Killeen test.
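A sketch of the variance checks in Python, assuming the scipy library is available (the jackal jawbone samples from the recording-data example are reused for illustration):

```python
from scipy import stats

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]

# Bartlett test (for parametric data) and Fligner-Killeen test (rank-based):
# both test the null hypothesis that the samples have equal variance.
b_stat, b_p = stats.bartlett(female, male)
f_stat, f_p = stats.fligner(female, male)
print(f"Bartlett p = {b_p:.3f}, Fligner-Killeen p = {f_p:.3f}")
```

A non-significant result here supports using the equal-variance version of a subsequent differences test.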

Differences projects split data into sampling groups


Differences between samples

Features of differences projects:

  • You are looking to split your data into chunks, which you then compare.
  • You may have one or more response variables.
  • You may have one or more predictor variables.
  • Predictor variables are usually factors (i.e. not numerical).
  • You can compare one group with one or more others or to a "fixed point".
  • Your comparison can be in matched pairs.
  • Your response variable can be interval or ordinal (but not categorical).
  • Your response variable can be a percentage or proportion (but see Links).
  • Your response variable can be a population estimate or diversity.

Now select the option that most corresponds to what you want to do.

Option Notes
One sample You have a single sample and want to see if the average is different from a fixed value.
Two samples You have two samples and wish to look at differences between them. If you have more than one predictor variable proceed as if you had multiple samples.
Matched pairs You have two samples, which are closely linked. Each observation in one sample corresponds to a single observation in the second sample.
Multiple samples You have more than two samples and wish to look at differences between them. You may have more than one response variable. You may have more than one predictor variable.
Proportions Your data are in the form of proportions (see also Links).
Population You have the population size estimate for one or more species.
Diversity You have the diversity for one or more sites/samples.

Choose the most appropriate option from the preceding table. Click the links to go to the relevant sections. You can return to the start using the key link.


You can compare a single numerical sample to a fixed value


One-sample differences

If you have a single sample of numerical values you can compare the average (mean or median) to a fixed value. Your data can be interval or ordinal but not categorical.

There are two main choices, which depend on the shape of the data (the data distribution). If you have ordinal data or the data are not normally distributed then you will use the one-sample Wilcoxon signed-rank test. If you have normally distributed data you can use the one-sample t-test.

You can also use a Permutation test, which involves repeatedly re-sampling your dataset.


The one-sample Wilcoxon signed-rank test compares the median of one sample to a fixed value
One-sample Wilcoxon signed-rank test for non-parametric data

If your data are ordinal or are non-parametric (i.e. the data do not have a Gaussian distribution) you can use a one-sample Wilcoxon signed-rank test. Essentially you "convert" the sample values into ranks (from smallest to largest). The test then compares the median to a fixed value, which you specify.


The one-sample t-test compares the mean of one sample to a fixed value
One-sample t-test for parametric data

If your data are normally distributed (they must be interval level data) you can use the one-sample t-test to compare the mean of your sample to a fixed value.
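Both one-sample tests can be sketched in Python, assuming scipy is available (scipy's one-sample `wilcoxon` works on the differences from the fixed value, so the fixed value is subtracted first; the female jackal sample and the fixed value of 105 mm are used purely for illustration):

```python
from scipy import stats

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]

# One-sample t-test: is the mean different from a fixed value (here 105 mm)?
t_stat, t_p = stats.ttest_1samp(female, popmean=105)

# One-sample Wilcoxon signed-rank test: the rank-based equivalent,
# comparing the sample median to the same fixed value.
w_stat, w_p = stats.wilcoxon([x - 105 for x in female])
print(f"t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```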


Two samples can be compared using the U-test or the t-test, depending on data distribution


Two-sample differences

If you have two samples of numerical values you can compare the average (mean or median) of the two samples. Your data can be interval or ordinal but not categorical. You can only have a single response variable and a single predictor variable. When you have two samples you effectively have a single predictor variable.

There are two main choices, which depend on the shape of the data (the data distribution). If you have ordinal data or the data are not normally distributed then you will use the Wilcoxon rank-sum test (also called the U-test or Mann-Whitney test). If you have normally distributed data you can use the Student's t-test.

There are versions of both tests for the situation where the samples are not independent but are in matched pairs.

You can also use a Permutation test, which involves repeatedly re-sampling your dataset.


The U-test compares medians between two non-parametric samples

The Wilcoxon signed-rank test is used if samples are in matched pairs


Two-sample Wilcoxon rank-sum test (U-test) for non-parametric data

When you have two samples that are ordinal or otherwise non-parametric you can compare their medians using a U-test. The original values are converted to ranks and these ranks are compared to assess the degree of overlap between the two samples.

There are two main options for the U-test:

  • Data samples are independent – Wilcoxon rank sum (U-test).
  • Samples are in matched pairs – Wilcoxon signed-rank test.

If the samples are independent you use the regular U-test. If samples are in matched pairs you use the Wilcoxon signed-rank test.

Although the tests are largely independent of the shape of the data, the two samples should have a similar shape to one another (see Data distribution and Assumptions).

If you have multiple samples you might use several pairwise U-tests, getting a p-value for each one. If so then you should modify the p-values to take into account the multiple tests. However, it may be better to use a multiple sample approach.
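As a concrete illustration, the jackal jawbone samples from the recording-data example can be compared with a U-test; this sketch assumes the scipy library is available (the function name and `alternative` parameter are scipy's):

```python
from scipy import stats

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]

# Two-sample Wilcoxon rank-sum / Mann-Whitney U-test: the original values are
# converted to ranks and the overlap between the two samples is assessed.
u_stat, p = stats.mannwhitneyu(female, male, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```

For matched pairs you would instead pass the two samples to `stats.wilcoxon`, which runs the signed-rank version.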


The t-test compares means between two parametric samples

Versions for:

Matched pairs
Equal variance
Unequal variance


Two sample t-test (Student's t-test) for parametric data

If your data are normally distributed (usually this will be interval data) you can use the properties of the normal distribution to assess the difference in means between the samples. Student's t-test is the tool for the job.

There are three main versions of the two-sample t-test:

  • Data are in matched pairs – matched pair t-test.
  • Samples are independent and with equal variance – Student's t-test.
  • Samples are independent but with unequal variance – t-test with Welch (Satterthwaite) modification.

The "classic" t-test assumes equal variance for both samples. If the variance is unequal the Welch two-sample t-test should be used; this adjusts the degrees of freedom to compensate. There are tests for equality of variance (see Assumptions).

If you have multiple samples you might use several pairwise t-tests, getting a p-value for each one. If so then you should modify the p-values to take into account the multiple tests. However, it may be better to use a multiple sample approach.
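The classic and Welch versions differ only by a flag in most software; here is a sketch in Python, assuming scipy is available, again using the jackal jawbone data for illustration:

```python
from scipy import stats

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]

# Classic Student's t-test, which assumes equal variance in both samples...
t_eq, p_eq = stats.ttest_ind(female, male, equal_var=True)

# ...and the Welch version, which adjusts the degrees of freedom
# to compensate for unequal variances.
t_w, p_w = stats.ttest_ind(female, male, equal_var=False)
print(f"Student p = {p_eq:.4f}, Welch p = {p_w:.4f}")
```

For the matched-pair version you would use `stats.ttest_rel`, which requires the two samples to be the same length and in corresponding order.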


Matched pairs:

Observations are not independent; each observation has a corresponding measurement in the other sample


Matched pairs

If you have two samples that are "matched", you can use special versions of the t-test and the U-test. The test you use depends on the shape of the data. In matched pairs tests a single observation from one sample can be matched to a corresponding observation from the other sample. The samples are therefore not independent.

Examples of matched pairs include:

  • Observations taken on items at two different times. Each observation has two measurements, which match up (e.g. before and after some treatment).
  • Observations taken from a sampling unit, which has two "levels" e.g. the north and south sides of a tree or a sticky trap with a yellow half and a green half.

If you can pick an observation from one sample and find its natural partner in the second sample, then you have a matched pair design. Just because you collected an observation from a quadrat and labelled it "1" does not mean it matches a random observation in another site that you also labelled "1".

If you do have matched pairs the t-test is used for parametric data and the U-test for non-parametric data. See Two-sample tests for more information.

You can also use a Permutation test, which involves repeatedly re-sampling your dataset.


If you carry out multiple tests you need to modify the p-values to take the multiplicity into account


Adjustments for multiple testing

If you have carried out several statistical tests you will have several p-values, one for each test. However, the more tests you run the greater the likelihood that something will be statistically significant. In this case you should modify your p-values to take into account the multiple tests.

If, for example, you had several samples you could use the t-test to explore differences between the samples pair by pair. You would end up with a p-value for each pairwise comparison. You should then modify the p-values. The most conservative method is the Bonferroni correction: you multiply each p-value by the number of tests you carried out. There are other, less conservative, methods (e.g. Holm).

There are usually alternatives to multiple tests: for example, use ANOVA instead of multiple t-tests, or Kruskal-Wallis instead of multiple U-tests. These approaches test the overall situation and permit post-hoc testing of the pairwise comparisons with a less severe penalty on the p-values.
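The two corrections can be sketched in pure Python (the three p-values are made-up illustrative numbers, as if from three pairwise tests):

```python
# Hypothetical p-values from three pairwise tests (illustrative numbers only).
p_values = [0.010, 0.020, 0.400]
n = len(p_values)

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
bonferroni = [min(p * n, 1.0) for p in p_values]

# Holm: multiply the smallest p-value by n, the next smallest by n-1, and so
# on, enforcing that adjusted values never decrease down the ordered list.
order = sorted(range(n), key=lambda i: p_values[i])
holm = [0.0] * n
running_max = 0.0
for rank, i in enumerate(order):
    adjusted = min(p_values[i] * (n - rank), 1.0)
    running_max = max(running_max, adjusted)
    holm[i] = running_max

print("Bonferroni:", bonferroni)
print("Holm:      ", holm)
```

Notice that Holm never gives a larger adjusted p-value than Bonferroni, which is why it is the less conservative of the two.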


Multiple sample testing. Method depends on shape of data and number of predictor variables

For normal data use ANOVA and post-hoc tests

For non-parametric data use:

Kruskal-Wallis
Friedman
Quade

Permutation tests are becoming popular as they are virtually independent of data shape


Multiple-sample differences

If you have more than two samples of numerical data your approach depends on the data distribution and the number of predictor variables:

Normal (Gaussian), 1 predictor:

  • 1-way analysis of variance (ANOVA). After the main test you can use post-hoc analysis to look at pairwise comparisons.
  • Linear modelling.
  • Multiple t-tests with modification of the p-values for multiple testing.

Normal (Gaussian), more than 1 predictor:

  • Analysis of variance (n-way ANOVA). After the main test you can use post-hoc analysis to look at pairwise comparisons.
  • Linear modelling.

Non-parametric, 1 predictor:

  • Kruskal-Wallis Rank Sum test. After the main test you can use post-hoc analysis to look at pairwise comparisons.
  • Multiple U-tests with modification of the p-values for multiple testing.
  • Generalized Linear Modelling. If you know the shape of the distribution you can use GLM and specify the appropriate distribution.
  • Permutation tests using least squares are a good alternative that is not reliant on any particular data distribution.

Non-parametric, 2 predictors:

  • Friedman test or Quade test. N.B. These only work for unreplicated block experimental designs. Post-hoc analysis is not sensible.
  • Generalized Linear Modelling. If you know the shape of the distribution you can use GLM and specify the appropriate distribution.
  • Permutation tests using least squares are a good alternative that is not reliant on any particular distribution.

Non-parametric, more than 2 predictors:

  • Try to transform the data to make them parametric, or convert to ranks and carry out regular ANOVA.
  • Generalized Linear Modelling. If you know the shape of the distribution you can use GLM and specify the appropriate distribution.
  • Permutation tests using least squares are a good alternative that is not reliant on any particular distribution.

When you have more than two samples your options depend largely on the shape of your data. If you have normally distributed data (parametric) then analysis of variance is the way to go. You can carry out ANOVA when you have more than one predictor variable. Regular ANOVA is sometimes called 1-way ANOVA to indicate that you have a single predictor. If you have 2 predictors then you use 2-way ANOVA, and so on.

Once you've carried out ANOVA you can use post-hoc testing to "drill down" into specific pairs of samples.

If your data are non-parametric your options are limited. The non-parametric equivalent to 1-way ANOVA is the Kruskal-Wallis test. Once you have an overall result you can carry out post-hoc testing on pairs of samples.

If you have two predictors then the Friedman and Quade tests can work, but only for unreplicated block designs. Post-hoc testing using these tests is not available.

If you have more than two predictors, or do not have the unreplicated block design, your options are limited. You may be able to transform data to make it more normal (using logarithms or some other transformation). You could convert all the values to their ranks and carry out regular ANOVA; this is a very conservative test.

Permutation tests are becoming more popular as they are virtually independent of the shape of the data. However, they can be "computer intensive" as you need to randomly re-sample your data many times.
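The parametric and non-parametric routes for one predictor can be sketched in Python, assuming scipy is available (the three samples are made-up illustrative numbers, as if one predictor had three levels):

```python
from scipy import stats

# Three hypothetical samples (one predictor variable with three levels).
a = [12, 14, 11, 13, 15]
b = [16, 18, 17, 19, 15]
c = [22, 21, 23, 20, 24]

# 1-way ANOVA for normally distributed data...
f_stat, anova_p = stats.f_oneway(a, b, c)

# ...and the Kruskal-Wallis rank sum test as the non-parametric equivalent.
h_stat, kw_p = stats.kruskal(a, b, c)
print(f"ANOVA p = {anova_p:.4g}, Kruskal-Wallis p = {kw_p:.4g}")
```

A significant result from either test tells you only that the samples differ overall; post-hoc testing is then needed to identify which pairs differ.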


Analysis of variance (ANOVA) is used when data are normally distributed

MANOVA can be used with multiple response variables

Post-hoc testing takes place after the main ANOVA and conducts pairwise analyses


Analysis of Variance – ANOVA

Analysis of variance is used when your data are normally distributed (but see Data distribution). The properties of the normal distribution are used to assess the differences between sample means. The simplest situation is where you have one response variable and one predictor variable with two levels, that is, you have two samples. In this case the result is equivalent to the t-test.

Most often you use ANOVA when you have a predictor variable with more than 2 levels, i.e. you have >2 samples. When you have 1 predictor (regardless of how many levels) the analysis is usually called 1-way ANOVA. If you have two predictor variables the process is known as 2-way ANOVA, and so on.

If you have more than one response variable there is a special version of ANOVA called MANOVA (multivariate analysis of variance). This assesses the response variables together, taking into account the correlations between them.

ANOVA is closely allied to linear modelling and in many cases the two methods are interchangeable. Usually when you run ANOVA your response variable is interval and the predictor variables are categorical. If you also have a continuous predictor, the approach is sometimes called analysis of covariance (ANCOVA).

After the main "result" your ANOVA will show you the probability that means between samples were different. You can focus on specific pairs of samples using post-hoc testing. Essentially post-hoc tests are a modified version of the t-test, which take into account the fact that you have run multiple tests.


The Kruskal-Wallis Rank Sum test is the equivalent of 1-way ANOVA for non-parametric data


Kruskal-Wallis Rank Sum test

If your data are not parametric and you have a single predictor variable you can use the Kruskal-Wallis Rank Sum test. This is sometimes called a non-parametric 1-way ANOVA. If your predictor variable has only two levels (i.e. you have 2 samples) use the U-test instead.

Essentially the procedure converts the original values into ranks. The ranks are then assessed for the amount of overlap. You are comparing sample medians.

After the main "result" your test will show you the probability that medians between samples were different. You can focus on specific pairs of samples using post-hoc testing. Essentially post-hoc tests are a modified version of the U-test, which take into account the fact that you have run multiple tests.


For unreplicated block designs with non-parametric data:

Friedman test
Quade test


Friedman and Quade tests

When you have two predictor variables and the data are non-parametric, your options are limited. The Friedman and Quade tests compare medians for unreplicated block designs. In other words, you have only one value for each combination of predictor variables.

The data should be arranged in groups and blocks. Each block will contain one observation from each group. The data are "converted" to ranks and these are compared across blocks. Differences between the groups are assessed.

These methods do not have a sensible post-hoc methodology. It is possible to look at pairwise comparisons but the power of the tests is so small that differentiation is all but impossible.
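The Friedman test can be sketched in Python, assuming scipy is available (the data are a made-up unreplicated block design: three treatments, one observation per treatment in each of six blocks):

```python
from scipy import stats

# Hypothetical unreplicated block design: three treatments (groups) with
# exactly one observation per treatment in each of six blocks. Position i
# in each list is the observation from block i.
treatment_a = [10, 12, 13, 11, 12, 14]
treatment_b = [14, 15, 17, 15, 16, 18]
treatment_c = [20, 19, 22, 21, 20, 23]

# Friedman test: values are ranked within each block, then the rank sums
# of the groups are compared across blocks.
stat, p = stats.friedmanchisquare(treatment_a, treatment_b, treatment_c)
print(f"Friedman chi-squared = {stat:.2f}, p = {p:.4g}")
```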


Permutation tests:

Resample the data many times and estimate p-values from the results.

Can be carried out for virtually any differences analysis


Permutation tests

Permutation tests are virtually independent of the shape of the data. You can carry out differences tests using permutation procedures for any of the differences scenarios. Such tests are sometimes called bootstrapping, randomization or Monte Carlo methods.

Essentially you take your samples and examine the difference in the average (you can use the mean or median, as seems most appropriate). Then you shuffle the values between the samples and recalculate the difference, repeating this many times (generally 1000 or so). After the permutations you end up with many "re-runs" of the original result. You can get an approximate p-value by looking at how often the permuted differences were as large as, or larger than, the original.

You can run permutation tests on virtually any data that you would explore for differences in sample average. Such tests are becoming more common as computers become more powerful.
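The shuffle-and-recount procedure described above can be sketched in pure standard-library Python (the `permutation_test` helper is my own illustration; the jackal jawbone samples are reused, and 2000 permutations with a fixed seed are arbitrary choices):

```python
import random
from statistics import mean

female = [110, 111, 107, 108, 110, 105, 107, 106, 111, 111]
male = [120, 107, 110, 116, 114, 111, 113, 117, 114, 112]

def permutation_test(x, y, n_perm=2000, seed=42):
    """Two-sample permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # reshuffle the group labels
        perm_diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if perm_diff >= observed:                 # as extreme as the original?
            count += 1
    return (count + 1) / (n_perm + 1)             # approximate p-value

print(f"Permutation p = {permutation_test(female, male):.4f}")
```

Swapping `mean` for `median` in the function gives the median-based version of the test.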


Compare proportions:

Proportion test
Binomial test

These use count (frequency) data so are allied more to goodness of fit tests than differences


Proportions

If you have proportions you really have count data; that is, you have a count of "successes" and a number of "trials". Proportion data fall under the categorical data heading, so tests of proportions are more similar to tests of goodness of fit than to tests of differences. There are two main options:

  • Proportion test – you have a list of counts and a corresponding list of trials, count÷trial gives the proportions. The proportion test gives an approximate p-value. You can have 2 or more proportions.
  • Binomial test – this gives exact p-values. You can also specify the hypothesized probability of success (not necessarily 0.5).

See also the Links section (Association tests).
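As a sketch of how the exact binomial test works (Python purely for illustration; R's binom.test() is the ready-made equivalent):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_test(successes, trials, p=0.5):
    """Exact two-sided binomial test: add up the probabilities of every
    outcome that is no more likely than the one observed."""
    observed = binom_pmf(successes, trials, p)
    total = sum(binom_pmf(k, trials, p) for k in range(trials + 1)
                if binom_pmf(k, trials, p) <= observed * (1 + 1e-9))
    return min(1.0, total)

# made-up example: 7 "successes" in 10 trials, hypothesized p of 0.5
p_val = binomial_test(7, 10, p=0.5)   # ≈ 0.344, not significant
```

Because the p-value is built directly from binomial probabilities it is exact, which matters for small numbers of trials where the proportion test's approximation is poor.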


Power tests allow you to determine discriminatory power of your tests under different conditions

KEY

Top

Discriminatory power

Power tests are designed to help you plan your data collection. The most common use for power tests is to determine the sample size(s) required to achieve a certain level of discriminatory power. There are power tests for many common analyses, including t-tests, ANOVA, correlation and tests of proportions.
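As an example, the sample size per group for a two-sample comparison of means can be sketched with the normal approximation (a Python illustration; R's power.t.test() uses the t distribution and gives a slightly larger answer):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sample comparison of
    means, via the normal approximation. effect_size is the standardized
    difference between means (Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # desired power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# a medium effect (d = 0.5) at 5% significance and 80% power
n = n_per_group(0.5)   # 63 per group
```

Note how the required sample size shrinks rapidly as the effect size grows, and grows as you demand more power.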


Links between variables or samples:

Correlation
Regression
Association
Classification
Ordination

KEY

Top

Links between variables

When you are looking at links you are not looking for differences between sampling units but are generally interested in finding links between variables. There are several approaches:

  • Correlation – you have two data variables (interval or ordinal) and are looking for the strength of the relationship between them.
  • Regression (modeling) – similar to correlation but there is an assumption of a mathematical relationship between variables. You can have more than two data variables and categorical variables can be dealt with.
  • Association – you have categorical data (including count or frequency data). In classic chi squared tests you associate one set of categories with another set. Goodness of fit tests are similar but here you have one set of categories and match to a "fixed" set. This includes proportion tests.
  • Classification – involves linking data into groups (or splitting into groups) and forming clusters of items that are similar to one another (hence the links).
  • Ordination – usually you have many response variables. The various methods of ordination arrange data into "order" (hence the name) so that previously unseen patterns are revealed.

There is some crossover between approaches – I will keep correlation, regression and association in the Links heading and deal with classification and patterns separately. In the following table you can see some examples of projects for the various approaches.

Type Notes
Correlation

You have two variables and are looking for the strength of the link between them:

  • The variables are interval or ordinal but not categorical.
  • The variables are matrices with the same number of rows and columns – Mantel tests.
Regression

You have 2 or more variables and are looking for the strength of the link as well as the mathematical formula:

  • Two interval level variables that are normally distributed – Pearson's product moment (see also Correlation).
  • One response (parametric) and more than one predictor variable – Multiple regression (linear modelling). The expectation is of a linear fit, i.e. a relationship of the form y = mx+c.
  • One response (binomial, i.e. two options) and any number of predictor variables – Logistic regression (a form of Generalized Linear Model).
  • A response variable with known distribution and any number of predictor variables – Generalized Linear Modelling (GLM).
  • A response variable and one or more predictors. The relationship between variables does not conform to a y = mx+c form – Non-linear modelling.
  • A response variable and one or more predictors where the variables do not conform to a standard distribution or the fit of the model is not linear – see Non-linear modelling.
Association

Your variables are categorical (count or frequency data):

  • You are looking for the association between two sets of categories (you may also have additional grouping variables) – Chi squared tests.
  • You have one set of categories and want to test against a "fixed" set – Goodness of fit tests (see also Data distribution).
  • You want to compare the distribution of a sample with a known distribution or to another sample – see Data distribution.
  • You want to look at differences in proportions – Proportion or Binomial tests (see also Differences: Proportions).

Classification and Ordination

The basis for classification and ordination is the similarity (or dissimilarity) between samples. The general feature is that you usually have many response variables (e.g. species abundance). Methods include:

  • Splitting (or aggregating) samples into clusters, thus identifying similar groups.
  • Creating hierarchies of "relatedness" like a family tree.
  • Taking multi-dimensional data (i.e. many response variables) and presenting 2-dimensional patterns, thus spotting patterns not obvious from the original data.

Use the links in the Type column of the preceding table to jump to the most appropriate section or follow links in the Notes column to jump direct to specific topics.


Correlation assesses the strength of the link between two variables

KEY

Top

Correlation

In correlation you are looking for the strength (and direction) of the link between two continuous variables (but see Mantel tests). The variables can be interval or ordinal. There are three main options: parametric correlation (Pearson's product moment), non-parametric correlation (Spearman's Rho or Kendall's Tau) and matrix correlation (Mantel tests).

Generally speaking you will have your data arranged so that each row is a single observation and each column is a variable. You can also perform a correlation between two matrix objects, as long as they have the same dimensions.


Parametric correlation

Pearson's product moment

KEY

Top

Parametric correlation

If your data variables are normally distributed you can use parametric correlation. The Pearson Product Moment correlation uses the properties of the normal distribution to determine the relationship.

This method is similar to linear modelling (regression) except that you only get the strength of the link between the variables. The coefficient varies between -1 and 1; the closer it is to -1 or 1, the stronger the relationship. A positive correlation implies that as the value of one variable increases so does the other.

Just because you have a statistically significant correlation does not mean that there is direct cause and effect.
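The coefficient itself is straightforward to compute (a Python sketch with made-up data; in R use cor.test(x, y, method = "pearson")):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # covariance of x and y divided by the product of their spreads
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# made-up example: y is exactly twice x, a perfect positive relationship
r = pearson_r([1, 2, 3], [2, 4, 6])   # ≈ 1.0
```

The numerator measures how the two variables vary together; the denominator rescales so the result always lands between -1 and 1.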


Non-parametric correlation:

Spearman's Rho
Kendall's Tau

KEY

Top

Non-parametric correlation

If your data variables are ordinal or otherwise non-parametric you cannot use the properties of the normal distribution. Methods of correlating variables in these instances convert the original data to ranks (i.e. a form of ordinal data) and assess the strength of the relationship using the "match-up" of ranks.

There are two general correlation coefficients for non-parametric data:

  • Spearman's Rho.
  • Kendall's Tau.

The coefficient varies between -1 and 1; the closer it is to -1 or 1, the stronger the relationship. A positive correlation implies that as the value of one variable increases so does the other.

Just because you have a statistically significant correlation does not mean that there is direct cause and effect.

Although the tests are largely independent of the shape of the data, the two samples should have a similar shape (see Data distribution and Assumptions).
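The rank-conversion step can be sketched as follows (a Python illustration with made-up helper names; in R use cor.test(x, y, method = "spearman")):

```python
def ranks(values):
    """Average ranks (1 = smallest), giving tied values equal ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average rank of the tied run
        for m in range(i, j + 1):
            r[order[m]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# a monotonic but non-linear relationship still gives rho = 1
rho = spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])   # ≈ 1.0
```

Kendall's Tau takes a different route, counting concordant and discordant pairs of observations rather than correlating the ranks directly.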


Matrix correlation:

Mantel tests

KEY

Top

Correlation between matrices

If you have two matrix objects, with the same number of rows and columns, you can carry out correlation to compare their relationship using Mantel tests.

You can think of the Mantel test as a multivariate correlation.

The Mantel tests use regular correlation coefficients but applied to the matrices under test. So you can specify:

  • Pearson.
  • Spearman.
  • Kendall.

Mantel tests use permutation to determine the statistical significance of the relationship between the two matrices.
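A minimal sketch of the procedure (Python, with made-up function names and data; in R, packages such as vegan provide a mantel() function):

```python
import random

def upper_triangle(m):
    """Flatten the upper triangle of a square (distance) matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def mantel_test(m1, m2, n_perm=999, seed=1):
    """Mantel test: correlate the upper triangles of two square matrices,
    then assess significance by permuting the row/column order of one."""
    observed = pearson(upper_triangle(m1), upper_triangle(m2))
    rng = random.Random(seed)
    n = len(m1)
    count = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)                # shuffle sample labels
        shuffled = [[m2[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(m1), upper_triangle(shuffled)) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```

The key point is that the rows and columns are permuted together, so each permutation corresponds to relabelling the samples rather than scrambling individual distances.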


Regression describes the links between variables as a mathematical model

Linear modelling
Multiple regression
Generalized Linear Modelling
Logistic regression
Non-linear modelling

KEY

Top

Regression and modelling

There are several forms of regression but you can think of it as an extension of correlation where not only do you look at the strength of the relationship between variables but also describe the mathematical properties of the link(s). There are forms of regression that can handle multiple predictor variables and data that are interval, ordinal or categorical.

In the following table you'll see the various sorts of regression with some notes about when each can be used.

Type Notes
Linear regression a.k.a.
Multiple regression or
Linear modelling

In linear modelling the aim is to fit a general y = mx+c model to the data. The x term can be logarithmic or some other power function.

  • You have a response variable at interval level and one or more predictor variables. The residuals of the final model should be normally distributed.
  • You have several response variables at interval level and one or more predictor variables.
  • You have non-parametric data, one or more response variables and one or more predictors – see Non-linear modelling (and non-parametric regression).
Generalized Linear Modelling

GLM is similar to regular regression but the data can be non-parametric. If the data conform to a "standard" distribution (e.g. Poisson, binomial, gamma, Gaussian) then GLM can be used.

  • You have a response variable and one or more predictor variables. The residuals do not have to be parametric but should conform to a known distribution.
  • You have several response variables and one or more predictor variables. The residuals do not have to be parametric but should conform to a known distribution.
Logistic regression A version of Generalized Linear Modelling where the response variable is binomial, i.e. has two forms (e.g. presence or absence, 0 or 1).
Non-linear modelling and
Non-parametric regression

In non-linear modelling the regression equation does not conform to y = mx+c, hence is non-linear.

In non-parametric regression the residuals do not have to be normally distributed. There are methods to cope with linear or non-linear models.

  • non-linear modelling
  • generalized additive models
  • permutational anova
  • ordinal regression (probit regression)
   

Use the preceding table to guide you to the most appropriate kind of regression for your situation. Use the hyperlinks in the Type column to go to the appropriate section.


Linear regression (linear modelling) is closely allied to ANOVA

KEY

Top

Linear regression (multiple regression) and linear modelling

Linear regression (variously called multiple regression or linear modelling) is a method closely allied to ANOVA. There is a general expectation of a relationship between the variables in the form y = mx+c (linear) with parametric residuals. Once you have carried out the regression you should check the normality of the residuals.

Your response variables will usually be interval data but the predictor variables can be interval, ordinal or categorical. The "classic" form of multiple regression uses interval data (i.e. numbers on a continuous scale). When the predictors are categorical the regression is most like ANOVA. When there is a mixture the method is sometimes called analysis of covariance (ANCOVA). You can also carry out linear modelling when you have several response variables.

Linear modelling is often used to assess the relative strengths (importance) of the various predictor variables on the response. In model-building you aim to add predictor variables one at a time, based on their importance, until you have incorporated only the statistically significant components.
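For the simplest case, one predictor, the least-squares fit has a closed-form solution (a Python sketch with made-up data; in R this is lm(y ~ x)):

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = mx + c for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)                      # spread of x
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))     # co-variation
    m = sxy / sxx                # slope
    c = my - m * mx              # intercept (line passes through the means)
    return m, c

# made-up data lying exactly on y = 2x + 1
m, c = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])   # m ≈ 2, c ≈ 1
```

With more than one predictor the same idea is solved with matrix algebra, which is exactly what statistical software does behind the scenes.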


Generalized Linear Modelling is linear regression using alternative data distribution

KEY

Top

Generalized Linear Modelling (GLM)

Generalized linear modelling (GLM) is very similar to linear regression. In GLM you use a model-fitting approach but the data do not have to be normally distributed. The general relationship between variables is still assumed to be y = mx+c but a range of data distribution types can be assessed; these include Poisson and gamma. If you use a Gaussian distribution the result is identical to linear regression.

You can also use a binomial response variable with GLM. This tends to be known as Logistic regression.


Logistic regression is a binomial GLM

KEY

Top

Logistic regression

If your response variable can take one of two forms you have binomial data (e.g. presence or absence, 0 or 1). In this case you use GLM, which is known as logistic regression. Your predictor variables can be interval, ordinal or categorical. Essentially you are assessing how likely it is that given values of your predictors will result in the response being one form or the other.
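A minimal sketch of the fitting process for a single predictor (Python, using plain gradient descent purely for illustration with made-up data; real software uses more sophisticated algorithms, and in R this is glm(y ~ x, family = binomial)):

```python
import math

def sigmoid(z):
    """Squash any number into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, epochs=5000):
    """Logistic regression with one predictor, fitted by gradient
    descent on the likelihood. A sketch, not production code."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi   # prediction error
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# made-up data: the response switches from 0 to 1 as x increases
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
# sigmoid(b0 + b1 * x) now gives the fitted probability of a "1"
```

The fitted model predicts a probability, not a value, which is why logistic regression answers "how likely is the response to be one form or the other" rather than "how big is the response".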


Non-linear modelling is used when y = mx+c is not the expected relationship

KEY

Top

Non-linear modelling

In non-linear model fitting there is not an expectation of a y = mx+c relationship. The method essentially adjusts the parameters and repeatedly re-fits the model until the "best" set of parameters is found.


Association tests use categorical (count or frequency) data

KEY

Top

Association

In association tests you have categorical data, that is count or frequency data. There are several options:

  • Linking two sets of categories – the chi squared test.
  • Linking one set of categories to a "standard" – goodness of fit tests.
  • Comparing data distributions – see Data distribution.
  • Comparing proportions – Proportion & Binomial tests (see also Proportions in the Differences section).
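The chi squared calculation for the first option compares observed counts with the counts expected if the categories were independent (a Python sketch with made-up data; in R use chisq.test()). The p-value helper below applies only to 2×2 tables, i.e. 1 degree of freedom:

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table,
    given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / grand   # expected count
            stat += (obs - exp) ** 2 / exp
    return stat

def p_value_df1(stat):
    """Upper-tail p-value for 1 degree of freedom (2x2 table), using the
    fact that a chi-squared(1) variable is a squared normal variable."""
    return math.erfc(math.sqrt(stat / 2))

# made-up 2x2 table of counts
stat = chi_squared([[10, 20], [20, 10]])   # ≈ 6.67
p = p_value_df1(stat)                      # < 0.05: evidence of association
```

For larger tables the statistic is computed the same way but must be compared to a chi-squared distribution with (rows−1)×(columns−1) degrees of freedom.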

KEY

Top

Patterns and classification

Runs tests, nearest neighbour, clustering, k-means, ordination.


 

Ordination

In ordination you generally have many response variables. In ecology for example this is usually abundance of species. There is some crossover with methods of Patterns and Classification.


KEY

Top

Time-related data

There are several approaches when you have time-related data.

  • Looking for periodicity in the data.
  • Looking for differences between time intervals.
  • Linking a response variable to changes over time.

KEY

Top

Population

If you have samples from two (or more) populations, or from the same population at two (or more) times, you can proceed as if you were comparing samples.


KEY

Top

Community classification

NVC, Ellenberg and miscellaneous methods.


KEY

Top

Diversity

You can compare diversity from one or more samples using a range of approaches.


  This page is still under construction... please come back soon.
Navigate:

<= Summarizing Data | Introduction | Next Topic =>

Excel Tips & Tricks | Tips & Tricks for R | Learn R | MonogRaphs


See my Publications about statistics and data analysis.

Courses in data analysis, data management and statistics.

