The independent t-test: usage when the outcome is skewed

In this practical we’ll practise using the independent t-test to analyse an outcome that is strongly skewed.

Rationale

What if, based on your research question, your data appear suitable for analysis via an independent t-test, but you find that the outcome is actually heavily skewed? This is particularly common with count data (i.e. variables that can only take integer values from 0 upwards), where outcomes are typically right skewed because there are no negative counts, many low counts and fewer high counts. To be clear, there are more robust and tailored approaches to analysing right-skewed data, such as generalised linear models (which are an extension of the linear regression framework), but they are beyond this course. Therefore, given that the independent t-test assumes, at least approximately, normally distributed data within each comparison group, let’s look at one crude but often quite effective way we can still use the independent t-test to analyse a right-skewed outcome in relation to a binary independent variable.

Practice

Scenario

You and your colleagues have been tasked by your district’s health authorities to help them deal with an outbreak of Ebola. Many villages are affected and case counts per village over the last month have been collected. The health authorities want to understand where best to focus their limited resources, so they want you to help them address a descriptive research question: what is the association between village density, specifically villages with <10 individuals per 100m² compared to those with ≥10 per 100m², and the mean number of Ebola cases?

Exercise 1: use the independent t-test to analyse how the mean number of Ebola cases per village differs between villages with <10 individuals per 100m² and those with ≥10 per 100m²

Load the “Ebola data.sav” SPSS dataset.

Again this is a simulated dataset. There are three variables and every observation is meant to represent a record from a separate village in a hypothetical region suffering from an Ebola outbreak. The variables contain the following data:

n_ebola_cases = number of Ebola cases per village over the last month.
n_chw = number of community health workers per village.
pop_density = relative population density per village (1 = <10 per 100m², 2 = ≥ 10 per 100m²).

Using an independent t-test we’ll analyse whether there is a difference in the mean number of Ebola cases per village over the last month between villages with <10 individuals per 100m² and those with ≥10 per 100m², i.e. between lower and higher density villages.

Step 1: check the assumptions of the independent t-test.

1. Continuous outcome

Technically an independent t-test assumes a continuous outcome variable, but as long as the following two assumptions are satisfied it’s fine to use a discrete outcome with an independent t-test. Here our outcome is a discrete numerical variable.

2. Independent groups

We will assume this holds here as we didn’t build non-independence into the data, but it may well not hold if this was real data. Can you think why? It’s likely that villages that are closer together have correlated Ebola rates, given spread between villages that are closer together is more likely than between villages that are farther apart. If this was a big issue we could most simply aggregate or pool villages that are closer together to create cluster summary values of the outcome, if these clusters were large enough that there was negligible transmission between clusters, or we could use a more sophisticated technique like a multi-level regression model.

3. The outcome is approximately normally distributed within each group

See the “Step 1: check the assumptions of the independent t-test” section in the “Inferential analysis 1: the independent t-test” section above for a reminder of how to do this using histograms, but this time use the n_ebola_cases variable as the outcome and the pop_density variable as the rows grouping variable in your histograms. You’ll see that the distribution of Ebola cases per population density group is right skewed, particularly for the lower density group. Ideally we’d analyse such data using a special model for count data like a Poisson or negative binomial model, but those approaches are beyond the scope of this introductory course.

Instead, we can try and transform the outcome to address the skew and then use an independent t-test that assumes normal data. There are many possible ways to transform data, and many are quite complicated and again beyond this course. However, for right skewed data two common and simple transforms that are often sufficient are to:

Take the square root of every value.
Take the logarithm (usually the natural log) of every value.

These transformations will hopefully “pull in” the right skewed values so the distribution becomes more symmetrical and approximately normal (although it will never be truly normal). Log transforms pull in any skew more strongly than square root transforms, so they are often preferred and are probably best to try first. Note: as you cannot take the square root or logarithm of negative numbers, you can only apply these transformations to outcomes with no negative values. Also, while you can take the square root of 0 (it’s 0), you cannot take the logarithm of 0 (it’s “undefined”). Therefore, if your outcome has any values of 0 then you cannot take its logarithm unless you first add a constant to every value. This is a bit of a fudge and is really not ideal, but it is done. People usually add a small value, most often 1 (or sometimes 0.1), but there’s no right or wrong number really.
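
To illustrate how the two transformations “pull in” right skew, here is a minimal Python sketch (Python is not part of this SPSS practical, and the counts are made up for illustration rather than taken from the course dataset). It uses a simple moment-based skewness measure, where values further above 0 indicate stronger right skew.

```python
import math
import statistics

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed counts (illustration only, NOT the course data)
counts = [0, 0, 1, 1, 1, 2, 2, 3, 4, 6, 9, 15, 30]

sqrt_counts = [math.sqrt(c) for c in counts]
# A constant of 1 is added first because the counts include 0
ln_counts = [math.log(c + 1) for c in counts]

skew_raw = sample_skewness(counts)
skew_sqrt = sample_skewness(sqrt_counts)
skew_ln = sample_skewness(ln_counts)

print(f"raw: {skew_raw:.2f}, square root: {skew_sqrt:.2f}, ln(x+1): {skew_ln:.2f}")
```

For these made-up counts the skewness falls with the square root transform and falls further with the log transform, but neither removes it entirely.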

4. Equal variances in each group

We can either use the histograms we produce to compare variances in each group or use the result of the Levene’s test once we run the independent t-test.

Step 2: transform the outcome

Let’s create a ln (natural log) transformed variable using the Compute tool to deal with the right-skew in the outcome’s distribution within each group.

First explore the variable by creating a frequency table. See the “Categorical variables” sub-section in the “Exercise: create a ‘Table 1’ summarising the key characteristics of the SBP data study sample” section above if you need a reminder of how to create a frequency table. Note: these instructions were for categorical variables but you can use numerical variables too.
What do you notice about the range of the outcome? It includes 0. Therefore, we must add a constant before transforming via the logarithm. We’ll add 1 as it’s probably the most common value used in such a situation.
From the main menu go: Transform > Compute Variable. Then in the Compute Variable tool call the new variable ln_n_ebola_cases, but this time enter the transform command “LN(n_ebola_cases+1)” (without quotes) and then click OK. This command first adds 1 to every outcome value before taking the natural log.
Once you’ve computed your log-transformed variable check the distribution of values for the transformed outcome in each group using histograms again. You should see that they look more normal now.
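
The reason for adding the constant in the steps above can be sketched in Python as follows (an illustration outside SPSS, using made-up counts). The expression `math.log(c + 1)` mirrors the SPSS command LN(n_ebola_cases+1).

```python
import math

# Hypothetical case counts including a 0 (illustration only)
counts = [0, 1, 3, 12]

# The natural log of 0 is undefined, so Python raises an error
try:
    math.log(0)
    log_of_zero_defined = True
except ValueError:
    log_of_zero_defined = False

# Adding 1 before taking the log keeps every value valid
ln_plus_one = [math.log(c + 1) for c in counts]

print(log_of_zero_defined)
print([round(v, 3) for v in ln_plus_one])
```

A count of 0 becomes ln(1) = 0 after the transformation, so villages with no cases stay at 0 on the new scale.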

Step 3: run the independent t-test

Run an independent t-test comparing the mean of ln_n_ebola_cases between the lower and higher groups of the pop_density variable. If you can’t remember how, refer back to the previous “Step 3: run the independent t-test” section above. When defining the groups enter Group 1 as “1” (the <10 individuals per 100m² group) and Group 2 as “2” to replicate the results I present, but you could of course compare them the other way round.
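
For readers curious about what the unequal-variance (Welch) form of the test calculates, here is a rough Python sketch on the log scale. Everything in it is an assumption for illustration: the counts are made up, the function name `welch_mean_difference` is hypothetical, and the confidence interval uses a normal quantile as a large-sample stand-in for the t quantile that SPSS uses, so the interval here is slightly too narrow for small groups.

```python
import math
import statistics

def welch_mean_difference(group1, group2, critical_value):
    """Mean difference (group1 - group2) with an unequal-variance interval."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error contributed by each group
    se1_sq = statistics.variance(group1) / n1
    se2_sq = statistics.variance(group2) / n2
    se = math.sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximate degrees of freedom
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    ci = (diff - critical_value * se, diff + critical_value * se)
    return diff, ci, df

# Hypothetical case counts per village (NOT the course dataset)
low_density = [0, 1, 1, 2, 3, 0, 2, 5]
high_density = [4, 6, 9, 3, 12, 7, 5, 20]

# Same transformation as in Step 2: ln(count + 1)
ln_low = [math.log(c + 1) for c in low_density]
ln_high = [math.log(c + 1) for c in high_density]

# ASSUMPTION: normal quantile used instead of the t quantile
z = statistics.NormalDist().inv_cdf(0.975)

ln_diff, ln_ci, welch_df = welch_mean_difference(ln_low, ln_high, z)
print(f"ln-scale difference: {ln_diff:.2f}")
print(f"approximate 95% CI: {ln_ci[0]:.2f}, {ln_ci[1]:.2f}")
print(f"Welch df: {welch_df:.1f}")
```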

Step 4: understand the results tables and extract the key results

Again, refer back to the previous “Step 4: understand the results tables and extract the key results” section above if you need a refresher, but we are just extracting the same results here.

Step 5: report and interpret the results

Because Levene’s test is significant and the histograms indicate unequal variances, use the “equal variances not assumed” row of results. You should get the following result, which could be reported as follows (remember to explain and justify your analysis process in your methods):

Comparing the <10 per 100m² group (mean Ebola cases per village = 1.94) to the ≥10 per 100m² group (mean Ebola cases per village = 6.63), there is a mean difference in the natural-log number of Ebola cases per village of -1 (95% CI: -1.1, -0.9).

Note: I’ve presented the group means on the original scale so they can be interpreted easily, and I’ve not bothered presenting the associated p-value above. Again it tells us nothing more and far less than the effect size or mean difference and the associated 95% confidence intervals.
However, this mean difference and its 95% CI is for our outcome but on the natural log scale (i.e. how we transformed it), so it isn’t easy to interpret: what does a mean difference of -1 ln Ebola cases mean? Luckily we can transform (back transform) this mean difference back onto the original scale by using exponentiation with the base e applied to each value (https://en.wikipedia.org/wiki/Exponentiation). This is actually maybe most easily and quickly done just using the Google search engine’s calculator functions. Just type “exp(X)” into Google, where X is either the value of the mean difference or the upper or lower confidence interval value, and it will back-transform those values back to their original scales.
Do this for the mean difference and each 95% confidence limit and you should get the following result:

Back-transformed mean difference = 0.36 (95% CI: 0.32, 0.4).
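
The same back-transformation can be sketched in Python with `math.exp` as an alternative to Google. The inputs below are the rounded log-scale values reported above, so the output differs slightly in the second decimal place from the figures quoted, which were presumably computed from the unrounded SPSS output (an assumption).

```python
import math

# Rounded ln-scale results as reported in the text
ln_diff = -1.0
ln_lower = -1.1
ln_upper = -0.9

# Exponentiation with base e reverses the natural log
ratio = math.exp(ln_diff)
ratio_lower = math.exp(ln_lower)
ratio_upper = math.exp(ln_upper)

print(f"{ratio:.2f} (95% CI: {ratio_lower:.2f}, {ratio_upper:.2f})")
```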

So how do we interpret this now? We must take care because we have calculated a difference on natural-log transformed data (via the independent t-test) and then back-transformed that mean difference of ln-transformed values. What we actually then ultimately get is a ratio between the geometric mean (https://en.wikipedia.org/wiki/Geometric_mean) of the outcome in the two comparison groups, rather than a difference in the arithmetic mean of the outcome in the two comparison groups, as we get with an independent t-test where we do not ln-transform the data and then back-transform the results.
For example, ln(2) – ln(4) is -0.6931472, and if you calculate the exponential (with base e) of -0.6931472 you get 0.5, which is the ratio 2/4 = 0.5. Or vice versa: if you calculate the exponential of ln(4) – ln(2) you get 2, and the ratio 4/2 is 2! Therefore, the exponential back-transformed result represents a ratio. And because the t-test compares the means of the log-transformed values, which become geometric means when back-transformed, the ratio is between the geometric means of the outcome in each group. As with risk/odds ratios, this result is on a ratio or multiplicative scale, so the null value (i.e. the value of no difference between the two groups) is not 0 but 1, because any number divided by itself = 1. And just like with risk/odds ratios we interpret the result as how many “times” the first group’s (geometric) mean is of the second (reference) group’s.
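
This identity can be checked numerically. The Python sketch below uses two made-up groups of positive values (not the course data) and shows that exponentiating the difference in mean log values gives the ratio of geometric means, which is not the same as the ratio of arithmetic means. Note that when a constant is added before taking logs, as in this practical, the identity applies to the shifted values (cases + 1).

```python
import math
import statistics

# Hypothetical positive values for two groups (illustration only)
group_a = [1, 2, 4, 8]
group_b = [2, 8, 16, 32]

# Difference in the means of the ln-transformed values
mean_ln_diff = (
    statistics.fmean(math.log(v) for v in group_a)
    - statistics.fmean(math.log(v) for v in group_b)
)

# Back-transform the difference by exponentiation
back_transformed = math.exp(mean_ln_diff)

# Ratio of geometric means versus ratio of arithmetic means
gm_ratio = statistics.geometric_mean(group_a) / statistics.geometric_mean(group_b)
am_ratio = statistics.fmean(group_a) / statistics.fmean(group_b)

print(f"exp(difference in mean ln): {back_transformed:.4f}")
print(f"ratio of geometric means:   {gm_ratio:.4f}")
print(f"ratio of arithmetic means:  {am_ratio:.4f}")
```
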
Therefore, in a results section we can say that “in villages with <10 individuals per 100m² the geometric mean number of Ebola cases was 0.36 (95% CI: 0.32, 0.4) times the geometric mean number of Ebola cases found in villages with ≥10 individuals per 100m².”
You can also view this ratio of means in percentage terms by converting the result using one of the following simple sums, which you may find easier to interpret:

When the exponentiated difference (D) is <1 the % decrease = (1 - D) x 100. When the exponentiated difference (D) is >1 the % increase = (D - 1) x 100.

So for our result we can calculate that the geometric mean number of Ebola cases was (1 - 0.36) x 100 = 64% lower in low density areas compared to high density areas (you should also convert each confidence limit onto the percentage scale and present them along with the point estimate in any results section).
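
The two percentage formulas can be sketched as a small Python helper (the function name `percent_change` is hypothetical). Applied to the point estimate and the two confidence limits reported above, it shows that the smaller ratio limit gives the larger percentage decrease, so the order of the limits flips on the percentage scale.

```python
def percent_change(ratio):
    """Convert an exponentiated difference D into (direction, percent)."""
    if ratio < 1:
        return "decrease", (1 - ratio) * 100
    # A ratio of exactly 1 is reported as a 0% increase
    return "increase", (ratio - 1) * 100

point = percent_change(0.36)
from_lower_limit = percent_change(0.32)
from_upper_limit = percent_change(0.40)

print(f"{point[1]:.0f}% {point[0]}")
print(f"{from_upper_limit[1]:.0f}% to {from_lower_limit[1]:.0f}% {point[0]}")
```
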
Note: geometric means are typically very similar to arithmetic means for outcomes that don’t have a huge range, i.e. that don’t span a number of orders of magnitude. Therefore, for many outcomes you can think of the geometric mean as being approximately equivalent to the arithmetic mean, but this won’t be the case for outcomes with big ranges spanning orders of magnitude from the smallest to the largest value.
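
As a quick check of this note, the Python sketch below compares the two kinds of mean for a made-up narrow-range sample and a made-up sample spanning several orders of magnitude (neither is from the course data).

```python
import statistics

# Hypothetical samples (illustration only)
narrow = [8, 9, 10, 11, 12]
wide = [1, 10, 100, 1000, 10000]

am_narrow = statistics.fmean(narrow)
gm_narrow = statistics.geometric_mean(narrow)
am_wide = statistics.fmean(wide)
gm_wide = statistics.geometric_mean(wide)

print(f"narrow range: arithmetic {am_narrow:.2f}, geometric {gm_narrow:.2f}")
print(f"wide range:   arithmetic {am_wide:.2f}, geometric {gm_wide:.2f}")
```
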
Lastly, what if we are comparing a numerical variable between two groups and our data are still badly skewed despite transformation? Then we can use a non-parametric test, such as the Mann-Whitney U test (the most common fallback if the independent t-test cannot be used). We will not cover that in this class so we have more time for more sophisticated tests, but you should have no serious difficulties running and interpreting such a test now using one of the many online or textbook guides available (see the MWU SPSS.pdf files in the “Computer Practical sessions” “Additional materials” folder on Minerva). The two big limitations of this test are its reduced power and the fact that it only gives you a p-value to accompany your difference (which, given the skewed data, should arguably be summarised via the median) but no confidence intervals.

Additional exercise: use the independent t-test to analyse how the mean number of Ebola cases per village differs between villages with <5 community health workers and those with ≥5

Using the “Ebola data.sav” dataset and the process outlined above, use an independent t-test to analyse the relationship between community health worker number and Ebola case number.

Divide n_chw into two groups based on n_chw values <5 and those ≥5.
As we know the outcome n_ebola_cases is right skewed we must first transform it. We know a natural log (ln) transform works reasonably well, so transform this variable (or use the already transformed version you’ll have created earlier). Remember as there are 0s in the outcome we must also first add a constant before transforming. When I did the analysis I used a constant of 1, so I suggest you use this to avoid any differences, although they should be very minor.
Compare the number of Ebola cases per village for villages with <5 community health workers to those with ≥5 community health workers using the independent t-test.
Extract the mean difference and confidence intervals around this estimate and back transform them by exponentiation onto their original scale. Remember you can do this quickly via Google by Googling exp(x) where x is the ln-transformed mean difference/upper or lower confidence interval of the ln-transformed mean difference.
In the “Exercises” folder open the “Exercises.docx” Word document and scroll down to Independent t-test with a skewed outcome: the relationship between community health worker number and Ebola case rate within villages.
Write a couple of sentences reporting the results of your analysis. Include the basic descriptive statistics: sample size, group sizes and group outcome means (on the original scale). Also be sure to include sufficient details about the outcome variable and the comparison made, including how the two independent groups were defined, as well as the type of analysis used, and of course the key inferential results, but you also need to mention the fact that the data were ln-transformed prior to analysis to deal with the right skew/non-normality and that the mean difference and confidence intervals presented were then back-transformed. Remember you need to interpret the result carefully because the transformation and back-transformation mean that the mean difference is now no longer a simple mean difference in reality. See “Step 5: report and interpret the results” if you need reminding. Round results to one decimal place. You don’t need to explain anything about the study or interpret the clinical or practical importance of the result.
Write a sentence or two about the key limitations of this analysis in terms of interpreting the result.
Once you’ve completed this compare your reporting to the below example text.
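
For the first step in the list above, the rule for dividing n_chw into two groups can be sketched in Python as follows. The group codes 1 and 2 are an assumption chosen to mirror the pop_density coding, the function name `chw_group` is hypothetical, and the values are made up. The key point is that a village with exactly 5 community health workers belongs to the ≥5 group.

```python
def chw_group(n_chw):
    """Return 1 for villages with <5 community health workers, 2 for >=5."""
    return 1 if n_chw < 5 else 2

# Hypothetical n_chw values (illustration only)
n_chw_values = [0, 3, 4, 5, 6, 12]
groups = [chw_group(n) for n in n_chw_values]

print(groups)
```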

Example results reporting text

Using an independent t-test I analysed the relationship between the number of community health workers per village (<5 compared to ≥5) and the number of Ebola cases per village. Out of a total sample size of 250 villages, 107 had <5 community health workers (arithmetic mean Ebola cases = 3.1) and 143 had ≥5 community health workers (arithmetic mean Ebola cases = 3.4). Due to a strongly right-skewed outcome I first transformed the outcome by taking the natural log of each outcome value (first adding a constant of 1 due to the presence of 0s) before back-transforming the resulting independent t-test results via exponentiation with the base e. This indicated that villages with <5 community health workers had a geometric mean number of Ebola cases that was 0.9 times (95% CI: 0.7, 1.04) the geometric mean number of Ebola cases among villages with ≥5 community health workers, or equivalently the geometric mean number of Ebola cases among villages with <5 community health workers was 10% lower than the geometric mean number of Ebola cases among villages with ≥5 community health workers.

Therefore, based on the confidence intervals there was no clear or statistically significant relationship between the number of community health workers in a village and the number of Ebola cases. The key limitation of this analysis is that it does not adjust for any other confounding variables, of which there are likely to be many (especially in an observational cross-sectional study like this). Therefore, this is likely to represent a biased estimate of the independent relationship between community health worker number per village and the number of Ebola cases per village in the target population.