Abiodun Egbetokun, based at the National Centre for Technology Management in Ile-Ife, Nigeria, tackles a common challenge faced by doctoral students and young researchers: the choice of appropriate statistical tests. In this post, he offers four practical tips for making the right choice.
What kind of statistical analysis should I perform on my research data? As researchers, this is a question we often face. Over the years, I have developed some key steps for deciding on the most appropriate statistical tests to apply to my research data, especially if it is primary data.
Before proceeding, though, let me make a general observation. If you are only raising the question of how to analyse your data when you get to the data analysis stage of your research, then it is already too late. The statistical analysis you choose to perform will influence the kind of data you collect – and both of these are decisions that should be made at the research design stage.
Below are four of the most important tips that I have discovered for selecting the most appropriate data analysis method(s):
1. Consider your research question(s) or objective(s)
The story that a researcher tells with a research project can be summarised in one or more clear-cut statements or questions. These questions will dictate the data requirements, and thus the possible statistical tests that could be applied to the data.
Imagine, for instance, that a research team is aiming to determine whether a new anti-hypertension drug is effective. They will need data on two groups of people with hypertension – one of which will take the drug and the other of which will not. They will need to collect data on hypertension indicators in the two groups, both before and after the drug is administered. To analyse this data, any method that allows comparison between two groups can work, provided it suits the way the outcome is measured. The independent-samples t-test (for a continuous indicator such as blood pressure) and the Chi-square test (for a categorical outcome such as whether blood pressure is under control) are just two examples.
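To make the two-group comparison concrete, here is a minimal sketch of an independent-samples t-test. It uses Welch's version, which does not assume equal variances in the two groups; that choice, the function name `welch_t` and the blood pressure figures are all assumptions made for illustration, not data or methods from a real trial.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / n)
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical change in systolic blood pressure (after minus before), in mmHg
treated = [-12, -9, -15, -7, -11, -14, -8, -10]
control = [-2, 1, -4, 0, -3, 2, -1, -5]

t_stat, df = welch_t(treated, control)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A large negative t statistic here would suggest that blood pressure fell more in the treated group than in the control group. Turning the statistic into a p-value requires the t distribution, which a statistics package would normally provide.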
2. Consider the level of measurement of the variables
There are four well-known levels of measurement: nominal, ordinal, interval and ratio. For a detailed discussion of each of these levels, this article by Dr Barbara Sommer, a retired lecturer from the University of California, Davis, is a good place to start. The level you use to measure the dependent variable in your research dictates the statistical tests or mathematical operations that will be meaningful.
Take nominal data as an example. This is data that cannot be measured or ordered, but is allocated into distinct categories – for instance, religion or nationality – and, as such, it can only be summarised in frequencies and percentages. Meanwhile, ratio data (e.g. income, distance) can be analysed by almost any kind of statistical technique. For example, it would be unreasonable to take the average (mean) of religions, but mean distance makes perfect sense. Ordinal data, such as responses on a Likert scale, sits in between: the categories can be ranked, but the gaps between them are not necessarily equal, which is why it is generally considered bad practice to apply average-based statistical tests to Likert-scale variables.
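The link between level of measurement and meaningful summary can be sketched in a few lines. The function `summarise` and the sample values are hypothetical, and using the median for ordinal data is one common convention rather than the only option.

```python
from collections import Counter
from statistics import mean, median


def summarise(values, level):
    """Return a summary that is meaningful for the given level of measurement."""
    if level == "nominal":
        # Unordered categories: only counts and percentages make sense
        counts = Counter(values)
        total = len(values)
        return {category: 100 * n / total for category, n in counts.items()}
    if level == "ordinal":
        # Ranked categories coded as integers: the median respects the ordering
        return median(values)
    if level in ("interval", "ratio"):
        # Equal spacing between values: the mean is meaningful
        return mean(values)
    raise ValueError(f"unknown level of measurement: {level}")


nationality = ["Nigerian", "Ghanaian", "Nigerian", "Kenyan"]  # nominal
agreement = [1, 2, 2, 4, 5]  # ordinal (Likert scale, 1 to 5)
distance_km = [2.5, 4.0, 10.0, 3.5]  # ratio

nominal_summary = summarise(nationality, "nominal")
ordinal_summary = summarise(agreement, "ordinal")
ratio_summary = summarise(distance_km, "ratio")
print(nominal_summary, ordinal_summary, ratio_summary)
```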
3. Consider the distribution of the outcome variable
Distribution refers to all the different possible values of the data and how often they occur. A good understanding of distributions is very important for data analysis. For a more detailed discussion about distribution, take a look at this article, by Aswath Damodaran, Professor of Finance at New York University.
Several distributions are relevant for statistical purposes, and most parametric statistical methods are built upon a particular distribution. For instance, t-tests assume normally distributed data and rely on Student's t distribution, while logistic regression is built on the logistic distribution. The best statistical test to apply depends on the distribution of the outcome variable. If the variable follows a normal distribution (e.g. height, weight or IQ), a linear regression or a t-test will work well. On the other hand, if it is a binary variable (like whether or not someone has a disease), a Chi-square test or a test of proportions will be more appropriate.
All too often, I have seen papers in which the authors apply a linear regression to an outcome variable that is clearly not normally distributed, reflecting a poor knowledge of distributions. It is also important to add that distributions and levels of measurement are closely connected: the level of measurement of a variable constrains the distributions it can follow. For instance, discrete variables hardly ever follow a normal distribution.
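For a binary outcome, such as whether or not someone has a disease, the Chi-square test mentioned above can be sketched as follows. This is a minimal Pearson Chi-square test for a 2x2 table without a continuity correction; the function name `chi_square_2x2` and the counts are assumptions for illustration only.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson Chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a Chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts. Rows: exposed / not exposed. Columns: disease / no disease.
table = [[30, 70], [15, 85]]
chi2, p_value = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

The closed-form p-value used here only holds for one degree of freedom, which is the case for a 2x2 table. Larger tables need the general Chi-square distribution from a statistics package.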
4. Consider any specific peculiarities of the outcome variable
Sometimes research data will have peculiarities that should not be ignored in data analysis. For instance, it is possible that the values of a variable are not observable above a certain threshold (e.g. this could occur in a situation where a person weighs more than the maximum measurement on a bathroom scale). This is called censoring – and it’s important to choose a statistical method that takes this into account.
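The bathroom scale example can be simulated to show why censoring matters. The scale limit, the weights and the helper `record_on_scale` below are hypothetical. The sketch only shows the bias that a naive average picks up and is not itself a method for analysing censored data.

```python
from statistics import mean

SCALE_MAX_KG = 150  # hypothetical maximum reading of the scale


def record_on_scale(true_weights, scale_max=SCALE_MAX_KG):
    """Return (recorded value, censored flag) pairs for each true weight."""
    return [(min(w, scale_max), w > scale_max) for w in true_weights]


true_weights = [72, 95, 148, 163, 181]  # hypothetical true weights in kg
recorded = record_on_scale(true_weights)

n_censored = sum(flag for _, flag in recorded)
naive_mean = mean([value for value, _ in recorded])
true_mean = mean(true_weights)
print(n_censored, naive_mean, true_mean)
```

Because every weight above the limit is recorded as the limit itself, the naive mean of the recorded values understates the true mean. Keeping the censored flag alongside each value is what allows a censoring-aware method to be applied later.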
At other times, we may find that our discrete data contains many zero values (e.g. number of patents held by each individual in a population of scientists). This is known as zero-inflation. Sometimes, the zeros arise from different processes: one individual may have zero patents because they are not an inventor at all, while another is an inventor who simply has not yet been granted a patent. There are specific techniques for handling this kind of data, and other techniques are unlikely to yield meaningful results.
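A quick way to see whether count data might be zero-inflated is to compare the observed share of zeros with the share that a Poisson distribution with the same mean would predict. This is a rough diagnostic under the assumption that a Poisson model is the natural baseline. It is not one of the specific techniques for handling such data (zero-inflated and hurdle count models are examples of those), and the patent counts and the function name `zero_excess` are made up for illustration.

```python
from math import exp
from statistics import mean


def zero_excess(counts):
    """Return the observed share of zeros and the share a Poisson model predicts."""
    observed = sum(1 for c in counts if c == 0) / len(counts)
    # For a Poisson variable with rate m, P(X = 0) = exp(-m)
    expected = exp(-mean(counts))
    return observed, expected


# Hypothetical number of patents held by ten scientists
patents = [0, 0, 0, 0, 0, 0, 0, 3, 5, 12]
observed_zero_share, poisson_zero_share = zero_excess(patents)
print(f"observed: {observed_zero_share:.2f}, Poisson: {poisson_zero_share:.2f}")
```

An observed share of zeros far above the Poisson prediction is a hint, not proof, that a model which treats the zeros separately deserves consideration.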
To summarise, these four tips on choosing statistical tests are not exhaustive, but they provide a useful framework for thinking about meaningful data analysis. For further reading on the subject, I have personally found the website of UCLA’s Institute for Digital Research and Education very useful.
Abiodun Egbetokun has a PhD in Economics and earlier degrees in Mechanical Engineering and Technology Management. He is the Head of the Science Policy and Innovation Studies Department at the National Centre for Technology Management in Ile-Ife, Nigeria. Find out more about him at: www.egbetokun.com.