Population characteristics
In this practical we will practise some common inferential analysis methods for estimating univariate statistical properties of characteristics in a population.
Rationale
Previously we saw how we can describe useful properties of univariate characteristics in a sample, such as the mean of a numerical variable or the percentage frequency of one level of a categorical variable. However, we are usually ultimately interested in these same statistical properties of our variables of interest in the target population, not just in our specific sample, and that’s what we’ll do here. To do this, we want to compute confidence intervals to go with the sample statistics for our univariate characteristics of interest. The sample statistics are our best “point estimates” (i.e. single values) of the unknown population parameters of interest, while the confidence intervals allow us to make inferences about the likely range of values within which those unknown population parameters lie, given that we can only estimate them with some amount of sampling error. Usually, when we are estimating population parameters that are some univariate characteristic of a variable we call that variable an outcome variable, but to be clear there’s nothing special about an outcome variable as opposed to any other variable.
We will also see how we can calculate such measures for subgroups within the target population of interest.
If you are confident about the concepts of descriptive/sample statistics and inferential statistics, and with the idea and interpretation of a confidence interval, then feel free to skip to the exercises. However, if you are not confident about these topics then we suggest reading the following.
Descriptive/sample statistics and inferential statistics
It’s important to be very clear on the distinction between descriptive and inferential statistics.
Descriptive statistics
In summary, descriptive statistics, which are sometimes also called sample statistics or summary statistics, summarise statistical properties of individual variables (via “univariate” analyses) or summarise relationships between variables (typically via “bivariate” analyses) as they exist in your sample. For example, common univariate statistics are means, which summarise the typical value of a numerical variable, such as the mean systolic blood pressure in mmHg in the sample, and proportions/percentages, which summarise the frequency of occurrence of some event or condition, represented as one level of a binary/categorical variable, such as the proportion/percentage of smokers in the sample (as opposed to non-smokers).
Assuming you have no sources of bias in your study these descriptive statistics will reflect the truth about your sample. For example, if you have no bias then the true mean systolic blood pressure in mmHg in your sample will be the sample mean of all the systolic blood pressure values.
Inferential statistics
However, on their own these descriptive statistics do not allow you to make robust generalisations about the same statistical properties of individual variables or relationships between variables in your target population. Remember, when we are interested in the statistical properties of individual variables or relationships between variables in a target population we call the statistical quantities that reflect these properties “population parameters”, and we assume that they are fixed for the population and time point of interest. In theory, if we could take a census of the whole target population and measure our outcomes and relationships of interest without error then we could measure our population parameters without error. However, almost always our target population is far too large to do this and we need to take a sample, ideally using robust, representative, probability sampling methods, as we have seen. The equivalent sample statistic is then our best “point estimate” of the population parameter of interest, but as the name implies it is an estimate with an unquantified amount of error.
It is easy to see why this is the case. Let’s assume we want to infer the mean systolic blood pressure in mmHg for some target population that contains 100,000 individuals. Let’s assume we take a simple random sample of just 2 individuals and measure, without error, their systolic blood pressure in mmHg. The mean of these 2 systolic blood pressure values will be the true mean systolic blood pressure for our microscopically small sample of 2! However, does anyone believe that a mean of just two individuals’ systolic blood pressures, however representative they are, is likely to accurately reflect the true mean systolic blood pressure for the overall target population? Of course by chance it might be really accurate, but in general we’re very likely to get a sample that produces an estimated mean that is not reflective of the true population-level mean. And if we instead took another simple random sample of 2 other individuals and computed a new mean systolic blood pressure it would be very likely to be different from the first mean. So if each sample mean would vary, and we can usually only collect one sample, how can we tell how accurate our sample mean is?
This is the problem of sampling error and sample size. Each sample would likely produce a different estimate of our fixed population parameter, and depending on the sample size the estimate would be likely to vary more/less between each repeated sample.
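If it helps to make this concrete, below is a small, purely illustrative simulation in Python (not part of the SPSS practical; the population values and parameters are invented) showing how much sample means vary between repeated samples of size 2 compared with repeated samples of size 200.

```python
# Illustrative sketch only: the population and its parameters are made up.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical target population of 100,000 systolic blood pressures (mmHg)
population = rng.normal(loc=125, scale=20, size=100_000)

for n in (2, 200):
    # Draw 1,000 independent simple random samples of size n and record each sample mean
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(1_000)]
    print(f"n = {n:3d}: middle 95% of sample means roughly "
          f"{np.percentile(sample_means, 2.5):.1f} to {np.percentile(sample_means, 97.5):.1f} mmHg")
```

The means from samples of 2 are scattered widely around the population mean, while the means from samples of 200 cluster tightly around it.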
Therefore, to let us infer the likely value of our population parameter we need to combine our sample statistic (i.e. our best point estimate) with some inferential measure. As we will discuss in the relevant lectures, this can be done most effectively by calculating the corresponding confidence intervals for the sample statistic, or a different approach involving hypothesis testing would be to compute an associated p-value. Note: this broad inferential approach is not just for when we are inferring the statistical properties of individual variables in a target population, but also for when we are inferring relationships in a target population. We combine the relevant sample statistic with a suitable inferential measure.
Descriptive statistics and descriptive studies/research
The terminology around descriptive statistics and descriptive studies can be a source of confusion. The key thing to remember is that descriptive statistics describe samples only, and are used in all studies to initially explore our data, plan our analyses, and describe the key characteristics of the sample so we can judge how representative our sample is compared to the target population in terms of these key characteristics. Whereas a descriptive study, or a study where one or more quantitative research questions are about description, is almost always about the goal of describing characteristics/relationships in a target population via a sample and using inferential statistics (even if that target population is not clearly specified). You can certainly find studies that have only described a sample alone using sample statistics and no inferential measures, but in my experience that always seems to be because the authors have misunderstood statistical inference and don’t seem to understand what they are doing or how limited their results are.
So if a study says it is aiming to describe characteristics or relationships in a given target population, and it has taken a sample from that target population to do this, then, assuming the authors know what they are doing, they mean that they are going to use inferential statistics (typically confidence intervals around point estimates) to try to infer the likely (but ultimately unknown) values of those characteristics/relationships in the target population.
Confidence intervals
Interpretation
Note: formally the interpretation of a confidence interval applies to a fully defined target population that you have sampled using a probability sampling method. Once you start trying to interpret a confidence interval in relation to a target population that you haven’t sampled from, or based on an analysis of a sample taken using some kind of purposive approach, then strictly speaking there is no robust, mathematical basis for that interpretation. However, you can certainly think about interpreting a confidence interval in relation to a different target population that you didn’t sample from, and you can certainly calculate confidence intervals based on purposively sampled samples, but you need to carefully consider how these factors affect the robustness and validity of the interpretation.
Confidence intervals take the form of a lower and upper value that surround the point estimate. For example, let’s say we estimate the mean age in a target population as 30, based on computing the mean age in a simple random sample from that population. Let’s assume we then compute a 95% confidence interval around that mean with lower and upper values of 25 and 35. We would usually write these statistics all together as:
Mean age: 30 (95% CI: 25, 35)
Where CI = confidence interval. Note that we separate the lower and upper limits with a comma rather than a hyphen “-”, as hyphens can be mistaken for minus signs.
Strictly speaking confidence intervals are a mathematical statement about hypothetical resampling in relation to our point estimate of interest. Unfortunately, this can be quite a confusing, technical concept. More concretely, what this confidence interval is telling us is that if we were to repeat our random sampling process infinitely many times and each time we were to compute the mean age and its 95% confidence interval then 95% of those 95% confidence intervals would contain the true (but unknown) value of the mean age in the target population. Hence, we can never be certain what the exact mean age of the target population is based on the confidence interval, but it does give us an interval (i.e. a range of values) within which it is likely to lie. Yes, this is a frustratingly indirect statement about the true mean that we want to measure, but that is how the theory works, and that is the technically correct interpretation.
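If you want to see this resampling interpretation in action, here is an illustrative simulation in Python (not part of the SPSS practical; the population mean, standard deviation and sample size are invented). It repeatedly draws samples, computes a t-based 95% confidence interval each time, and checks how often the interval contains the true mean.

```python
# Illustrative sketch only: all population values are invented for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, true_sd, n = 125.0, 20.0, 100   # hypothetical population parameters and sample size

covered = 0
n_repeats = 10_000
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=n)
    se = stats.sem(sample)                           # standard error of the mean
    lo, hi = stats.t.interval(0.95, df=n - 1,        # t-based 95% confidence interval
                              loc=sample.mean(), scale=se)
    covered += (lo <= true_mean <= hi)

print(f"Proportion of 95% CIs containing the true mean: {covered / n_repeats:.3f}")
# This should print a value very close to 0.95
```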
Note: many people, including many researchers/papers, would say we can be “95% confident” the true (but unknown) population parameter lies within the 95% confidence interval. However, technically speaking this phrase “95% confident” doesn’t have any precise, concrete, mathematical meaning. Also, the probability that the population parameter of interest lies within a given confidence interval is either 100% or 0%. We could know which of these scenarios was true if we were able to measure the whole population and calculate the population parameter. So it’s pretty questionable whether this is a valid or useful approach, but you will see it often.
Note: There is nothing special about a 95% confidence interval compared to a higher or lower level of confidence, such as a 90% confidence interval. However, for a given variable, as you increase the confidence level the interval widens. For example, for our hypothetical situation above we may find that a 99% confidence interval around our mean age of 30 runs from 10 to 50, while an 80% confidence interval runs from 28 to 32. The interpretation remains the same. For example, for 99% confidence intervals we obtain confidence interval limits that, if we were to repeat the sampling and estimation an infinite number of times, would contain the true population parameter value 99% of the time.
Therefore, there is a trade-off in the usefulness of the confidence interval as you vary the confidence level. Too high a confidence level results in a high level of certainty but for a very wide interval, while too low a confidence level results in a low level of certainty but for a very narrow (precise) interval. The widespread use of 95% confidence intervals in the health sciences is largely just a convention, i.e. due to tradition. For a given confidence level, all else being equal, if we increase the sample size the interval will become narrower. Therefore, if you have sufficient sample size then a higher level of confidence would clearly be preferable.
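As a quick, illustrative sketch of this trade-off (again in Python, with an invented sample of ages), the same sample summarised at 80%, 95% and 99% confidence gives progressively wider intervals:

```python
# Illustrative sketch only: the sample of "ages" is simulated, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ages = rng.normal(30, 10, size=50)     # hypothetical sample of 50 ages
se = stats.sem(ages)

for level in (0.80, 0.95, 0.99):
    lo, hi = stats.t.interval(level, df=len(ages) - 1, loc=ages.mean(), scale=se)
    print(f"{int(level * 100)}% CI: {lo:.1f} to {hi:.1f} (width {hi - lo:.1f})")
```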
Note: it is critically important to be aware, and to always remember when interpreting a confidence interval, that the accuracy and validity of the above interpretation depend entirely on there being no bias in the study. Confidence intervals quantify the amount of variation in the sample data relative to the sample size. So they allow us to judge how closely we are likely to have estimated the population parameter of interest given the sample size and the variation in the outcome, which may come from the fundamental/natural variation in the outcome’s values in that target population, plus some extra random variation from non-differential error such as truly random measurement error.
However, confidence intervals tell you nothing about any sources of bias, such as selection bias, or any non-random (differential) sources of error, like non-random missing data or non-random measurement error. Remember, study bias can occur throughout the entire study cycle, from planning the data collection tools through to reporting the study results. Therefore, if the study results are affected by any sources of bias, which is inevitable to some extent, then the accuracy and validity of the confidence interval will be affected, usually to an unknown extent. For example, using the age example from before: if older people were less likely to agree to participate in the survey that collected their age data, then this would be a form of selection bias, and the resulting 95% confidence interval and point estimate for the mean age of that population would be skewed downwards, and the confidence intervals may no longer include the true (but unknown) value of the mean age in the population 95% of the time. Again, it is usually not possible to estimate the effect of such biases, so they must be judged more qualitatively via rigorously exploring and understanding all possible sources of bias and the likely size of their impact.
Computing confidence intervals
Don’t worry: we will leave the computation of our confidence intervals up to the software. However, when computing a confidence interval we need to make an assumption about the type of probability distribution that we assume the sample statistic of interest comes from (i.e. the probability distribution we would see if we were to take many repeated samples, calculate the sample statistic each time, and plot their distribution on a histogram). Here we will just be focusing on estimating population-level means for numerical outcomes and percentages for the levels of a binary/categorical outcome.
For continuous variables a reasonable assumption about the appropriate probability distribution will often be the normal distribution, and if we plot the distribution of the variable’s values and they are approximately normal it is usually safe to assume a normal distribution will be appropriate for the confidence interval. If the distribution is skewed, as will usually be the case for discrete variables such as counts, we can try and transform the distribution of the outcome variable before computing the confidence interval, while still assuming a normal distribution when computing the confidence interval, or we can use a more appropriate distribution, but that is beyond this course.
For binary/categorical variables you can also assume a normal distribution, even though this will never be strictly correct, as long as it is approximately correct. When the percentage being estimated isn’t too small or too large, the normal distribution is usually an okay approximation. “Too small” and “too large” are usually taken to mean below 10% and above 90% (i.e. the percentage should lie between 10% and 90%), but these are just really rough rules of thumb. However, we can compute confidence intervals that make assumptions about the likely probability distribution of the sample statistic that are more appropriate for binary/categorical variables. There are actually many such possible methods, with little clear evidence on which is best and under what circumstances. We will look at this in a bit more detail in the exercise.
Practice
Scenario
You and your colleagues have been tasked by the Kapilvastu district authorities in Nepal to help them understand the problem of high blood pressure and hypertension, and the associations between socio-demographic and health related characteristics and blood pressure level/hypertension. You have carried out a cross-sectional survey to address these aims, and collected data on systolic blood pressure, common socio-demographic characteristics, and some additional health-related characteristics. So far, you have cleaned and prepared the data, and computed descriptive statistics for key characteristics. As per your statistical analysis plan you now need to use the data to estimate the likely mean systolic blood pressure and prevalence (%) of hypertension for the whole target population and for women and men separately.
Exercise 1: estimate a mean and a proportion/percentage and their associated 95% confidence intervals in relation to the overall target population
Aim: estimate the mean systolic blood pressure (mmHg) and the prevalence (%) of hypertension and their associated 95% confidence intervals to allow statistical inferences to be drawn about the typical level of systolic blood pressure and the prevalence of hypertension in the overall target population.
First, load the analysis-ready “SBP data final.sav” SPSS dataset.
Next, follow the instructions below on how to compute two common univariate inferential statistics for numerical and categorical variables, specifically means and percentages, with associated confidence intervals to allow inference about these measures in the entire target population.
Sorry: at present there is no video-based set of instructions for this method.
Written instructions: calculating means, percentages and their 95% confidence intervals in SPSS
Computing means and their 95% confidence intervals for continuous/discrete numerical variables with approximately normal distributions
Computing a confidence interval for any sample statistic requires us to make an assumption about the probability distribution that we believe the sampling distribution of the sample statistic comes from. We can make this decision based on our knowledge of the likely data generating process and the shape of the distribution (as visualised via a histogram). If a numerical variable is approximately normally distributed then we can describe the typical value (or more technically the central tendency) via the sample mean and compute a 95% confidence interval around this mean, to allow us to make a statistical inference about the likely population mean.
When a numerical variable is approximately normal we can usually safely compute a confidence interval based on a normal distribution (i.e. we are unlikely to go too far wrong and compute a very biased confidence interval if this is the case). However, in practice most software will actually compute one based on a t-distribution. This is because a t-distribution is appropriate for continuous data that follow a symmetrical, bell-curve shape, like the normal distribution, but it allows for more variation at smaller sample sizes by having wider “tails” at either end of the distribution. This can be shown to give more accurate confidence intervals that include the true population parameter at the desired coverage rate. For example, a 95% confidence interval for a mean based on a small sample size (e.g. <30) computed via the normal distribution might exclude the true population parameter >5% of the time, while a t-based interval should exclude it only 5% of the time, as desired. Technically, a confidence interval computed using the normal distribution is only valid when the sample size is infinite, and at larger sample sizes the normal distribution and the t-distribution become equivalent, so there is no real reason not to use the t-distribution over the normal when possible.
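The following short Python sketch (illustrative only, using a simulated small sample) shows the practical consequence: for the same data, the normal-based interval is a little narrower than the t-based one, which is why it tends to under-cover at small sample sizes.

```python
# Illustrative sketch only: the small sample is simulated for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(125, 20, size=15)          # hypothetical small sample, n = 15
m, se = sample.mean(), stats.sem(sample)

z_lo, z_hi = stats.norm.interval(0.95, loc=m, scale=se)                   # normal (z) based interval
t_lo, t_hi = stats.t.interval(0.95, df=len(sample) - 1, loc=m, scale=se)  # t-based interval

print(f"Normal-based 95% CI: {z_lo:.1f} to {z_hi:.1f}")
print(f"t-based 95% CI:      {t_lo:.1f} to {t_hi:.1f}  (slightly wider)")
```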
Note: technically speaking we are just considering analytical confidence intervals based on probability distribution functions here. However, there are also empirical/numerical approaches, such as bootstrap confidence intervals, which you can read a bit about here.
So let’s check that the variable is approximately normally distributed, otherwise the mean may not be the best measure of the typical value of the variable, and its 95% confidence interval would not be accurate (technically speaking it may under- or over-cover the true population parameter value on average).
From the main menu go to: Graphs > Legacy Dialogs > Histogram. Then in the Histogram tool click on the sbp variable and add it to the Variable: box by clicking the blue arrow next to the box. Tick the Display normal curve box below. This will display a theoretical normal distribution curve based on the mean and SD of the variable, which helps you to see how closely the data match the distribution we would expect if they were normally distributed. Then just click OK.
What does the distribution of the data look like? I would say there is no evidence of any important skew.
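If you also work outside SPSS, the same visual check can be sketched in Python. This is illustrative only and assumes the dataset has been exported to a CSV file called “sbp_data_final.csv” (a hypothetical name) with a column named “sbp”.

```python
# Illustrative sketch only: file name and column name are assumptions, not part of the practical.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

data = pd.read_csv("sbp_data_final.csv")         # hypothetical export of the SPSS dataset
sbp = data["sbp"].dropna()

plt.hist(sbp, bins=30, density=True, alpha=0.6)  # histogram of the observed values
x = np.linspace(sbp.min(), sbp.max(), 200)
plt.plot(x, stats.norm.pdf(x, loc=sbp.mean(), scale=sbp.std()))  # fitted normal curve
plt.xlabel("Systolic blood pressure (mmHg)")
plt.ylabel("Density")
plt.show()
```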
Now let’s compute the mean of the variable and a t-based 95% confidence interval to allow us to make a statistical inference about the likely value of the true (but unknown) mean systolic blood pressure in mmHg in the entire target population.
To compute our t-based 95% confidence intervals we will use a “one-sample t-test”. There are other types of “t-tests” that we cover, which are used for comparing means between groups, but we don’t look at the one-sample t-test in detail as it is rarely used. If you wish to read about it a brief overview with SPSS instructions is here. In brief though, a one-sample t-test tests the hypothesis that the mean of the sample variable equals a given value that you choose. We are not interested in the hypothesis testing aspect of this test, but we can use it to give us the mean of a numerical variable and the t-based 95% confidence intervals for that estimated mean.
To do this from the menu go: Analyze > Compare Means > One-Sample T Test. Then add the sbp variable into the Variables: box using the arrow. Note that we leave the Test Value: box with the default 0 (see below for why). Then click OK to run the test.
You will get three tables returned in the results window. The “One-Sample Statistics” table presents the sample size for the variable, which is useful because you should always present the sample size (i.e. count) for any estimate and the relative frequency (e.g. percentage) of complete/missing data for the variable.
Note: by default SPSS removes any observations from the variable with missing values when computing the results.
You can then look at the “One-Sample Test” table. The first column we are interested in is the “Mean Difference”. Technically, the one-sample t-test compares the mean of the variable to an assumed population mean value that you choose. By default SPSS sets this value to 0, and indeed as above we left it as 0. Therefore, when the test computes the difference between the outcome mean and 0 it is subtracting 0 from the outcome mean, and of course anything minus 0 is just itself. Hence, the “Mean Difference” is actually just our outcome mean. You can also see the sample mean, i.e. the point estimate of the population mean, in the “One-Sample Statistics” table, and they will match. We can also see the t-based 95% lower and upper confidence limits for the mean difference under the “95% Confidence Interval of the Difference” “Lower” and “Upper” columns. Again, while these are technically the confidence limits for the mean difference, because the difference is between the outcome mean and 0 we can simply treat them as the confidence interval for the outcome mean.
So we can see that the sample mean systolic blood pressure is 126.5 mmHg, and the 95% confidence intervals indicate that the population mean is likely to be between 124.9 and 128.2 mmHg, so we have a pretty accurate estimate given the sample size and variability in the outcome values (assuming no bias).
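For readers who also use Python, the equivalent calculation can be sketched as below. This is illustrative only; it assumes the same hypothetical CSV export and “sbp” column as the earlier sketch, and it is not part of the SPSS instructions.

```python
# Illustrative sketch only: file and column names are assumptions.
import pandas as pd
from scipy import stats

data = pd.read_csv("sbp_data_final.csv")    # hypothetical export of the SPSS dataset
sbp = data["sbp"].dropna()                  # drop missing values, as SPSS does by default

mean = sbp.mean()
lo, hi = stats.t.interval(0.95, df=len(sbp) - 1, loc=mean, scale=stats.sem(sbp))
print(f"Mean SBP: {mean:.1f} mmHg (95% CI: {lo:.1f}, {hi:.1f})")

# Equivalently, a one-sample t-test against 0 uses the same machinery,
# which is why the SPSS "Mean Difference" is just the sample mean here.
t_stat, p_value = stats.ttest_1samp(sbp, popmean=0)
```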
Computing proportions/percentages and their 95% confidence intervals for categorical variables
Note: this approach is applicable to both binary variables or categorical variables with three or more category levels. In the case of a categorical variable with three or more category levels we just treat each category level as a separate binary variable in practice (with each unit of observation either being in that category or not). For example, for the variable “education level” with category levels “none”, “primary”, and “more than primary” we estimate the proportion/percentage of individuals in the 1) “none” category vs “primary or more than primary”, 2) the “primary” category vs “none or more than primary”, and 3) the “more than primary” vs “none or primary”.
As above, computing a confidence interval requires an appropriate probability distribution given the data. Clearly, a binary variable cannot formally be normally distributed: it can only take values of 0 and 1. However, as long as the underlying proportion is not “too extreme” we can actually treat it as an approximately normally distributed variable and compute normal-based confidence intervals. By “too extreme” the usual rule of thumb is that the proportion for the variable must be >0.1 and <0.9. Beyond these limits the implied distribution becomes too non-normal for the approximation to be useful. For example, the lower/upper confidence limit will often fall below 0 or above 1, and the bias in the accuracy of the confidence interval’s coverage becomes more substantial.
Alternatively and more appropriately, we can use a probability distribution formally applicable to a binary variable and our assumed data generating process, such as the binomial distribution. Generally, such a confidence interval is called a binomial proportion confidence interval.
Unlike for numerical variables that are approximately normal, there are actually a large number of different approaches available, and there is a surprisingly limited amount of research comparing these methods in different situations. However, the current rough consensus appears to be that the “Wilson score interval” method with the continuity correction generally does well. There is a Wikipedia page with a great summary of the situation [here](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Comparison_and_discussion).
SPSS offers the option to calculate a wide range of these different binomial proportion confidence intervals at the same time. Let’s see how we can do this for one category level of a categorical variable (here the binary hypertension variable).
From the menu go Analyze > Compare Means > One-Sample Proportions. Then add the htn variable from the Variable box into the Test Variable(s): box using the blue arrow.
Next click on the button labelled “Confidence intervals” at the top right. Then under Select Type(s) tick Clopper-Pearson (“Exact”), which is based directly on the binomial distribution, Wald, which just uses the normal distribution (often called normal approximation confidence intervals), Wilson Score and lastly Wilson Score (Continuity Corrected). Then click Continue.
Now we need to decide which category level we want to calculate our proportion and confidence intervals for. If you right click on the htn variable and select Variable Information you can see that the variable is numerically coded with 1 = hypertension and 0 = no hypertension. So we want to look at the proportion/percentage of individuals coded 1 (with hypertension).
To do this, in the Define Success area select the button next to Value(s) and in the box to the right of this button enter the value 1. Then click OK.
Note: if you have a categorical variable coded with letters or words instead of numbers you can also just enter the relevant letter or word into the Define Success: Value(s): box to compute the proportion and associated 95% confidence interval for that category level.
All the results we want are in the “One-Sample Proportions Confidence Intervals” table. The results for each confidence interval method are on a separate row, and you can see which method is on which row under the “Interval Type” column. Under the “Observed” heading there are three columns giving the number of “Successes”, which means the count of observations (i.e. individuals) in the category we selected, the number of “Trials”, which just means the number of observations (i.e. the sample size of the variable), and the corresponding “Proportion”. You can of course multiply this by 100 to convert it to the percentage scale. We then have the lower and upper 95% confidence limits under the correspondingly named final two columns. As you can see, for this variable these methods give very similar results, and indeed the method using the normal approximation (“Wald”) is pretty much the same as the others. You would see larger differences if estimating proportions nearer 0 or 1.
While all the results are pretty similar, some evidence suggests that the Wilson Score approach can perform better (give more reliable inferences), so using that result we can see that 25% of individuals (136/543) have hypertension in the sample, and the % in the target population is likely between 21.6% and 28.9%. So again, given our sample size we’ve got quite an accurate estimate of the population parameter (assuming no bias).
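If you want to reproduce this kind of calculation outside SPSS, the statsmodels Python library offers several of the same interval types. The sketch below is illustrative only and uses the counts reported above (136 with hypertension out of 543); note that the continuity-corrected Wilson interval is not among the methods shown.

```python
# Illustrative sketch only, using the counts quoted in the text above.
from statsmodels.stats.proportion import proportion_confint

successes, trials = 136, 543

for method in ("normal", "wilson", "beta"):   # "beta" is the Clopper-Pearson ("exact") interval
    lo, hi = proportion_confint(successes, trials, alpha=0.05, method=method)
    print(f"{method:>7}: {successes / trials:.1%} (95% CI: {lo:.1%}, {hi:.1%})")
```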
Exercise 2: estimate means and proportion/percentages and their associated 95% confidence intervals in relation to subgroups within the target population
Aim: for women and men estimate the mean systolic blood pressure (mmHg) and the prevalence (%) of hypertension and their associated 95% confidence intervals to allow statistical inferences to be drawn about the typical level of systolic blood pressure and the prevalence of hypertension in these two subgroups within the target population.
We will continue to use the same dataset (“SBP data final.sav”), and we will be computing the same measures. The only difference is that we will do this separately for women and men. Follow the instructions below, which involve creating new variables before repeating the process from the previous exercise to compute the results of interest.
Sorry: at present there is no video-based set of instructions for this method.
Written instructions: calculating means, percentages and their 95% confidence intervals for subgroups in SPSS
Create new outcome variables for women and men
We will compute new outcome variables for each subgroup (sex), with the values of the outcomes missing for the subgroup that is not of interest (i.e. the female outcome variable will have missing values for all males in the dataset and vice versa). Then we will just repeat the same process as for exercise 1 to compute our desired measures for each variable, and they will therefore be specific to each subgroup only.
First we’ll create a new women-only systolic blood pressure variable. From the main menu go Transform > Compute Variable.
Let’s call our new variable “sbp_fm”. In the Compute Variable tool window that appears, type “sbp_fm” in the Target Variable box. Now double click on the sbp variable in the variable list; it should appear in the Numeric Expression box. If we don’t add anything else, this expression simply tells SPSS to create a new variable with the name given in the Target Variable box, whose values are the same as those in the sbp variable; effectively it copies the values row by row from sbp into sbp_fm. However, we only want to do this for the women in the dataset.
Therefore, next in the bottom left of the tool window click the blue-bordered If.. button. This is where we will tell SPSS to only copy and paste if the observations are from women. At the top of the tool window select the button that is next to the text Include if case satisfies condition:. Double click on the Sex (male/female) variable in the variable list. It should appear in the box below the text reading “Include if case satisfies condition:”. Add the text “= 2” so that the full text in the box reads:
sex = 2
You can also just copy and paste this command. This is a logical condition telling SPSS to only carry out our logical expression (copying and pasting) for observations where sex = 2. If you right click on the sex variable and select Variable Information you can see that this categorical variable is numerically coded and the value 2 represents women.
Note: if you wanted to do this with a categorical variable that was string coded you would just type the relevant string. For example, if the sex variable actually had values of “male” or “female” to select males you would type (including the quotation marks):
sex = "male"
Finally, click Continue to get back to the original tool window and then click OK. You should see a new variable appear to the right of your existing variables called sbp_fm. If you scroll down you will see that when you get to the men in the dataset the new variable only has missing values. This is what we want.
Now repeat the above process but call your new variable sbp_m and use the following command in the If.. logical selection part:
sex = 1
You should now have two new variables for the systolic blood pressure values of just the women and just the men in the dataset.
You can now repeat this process for the hypertension variable, creating two new variables: one for women’s hypertension outcomes and one for men’s hypertension outcomes.
You can then repeat the process used in exercise 1 to estimate the mean systolic blood pressure and the % prevalence of hypertension and their associated 95% confidence intervals for women and men separately. Simply use the separate sex-specific systolic blood pressure and hypertension outcome variables that you have created in place of the overall ones you used previously. SPSS will automatically ignore/exclude any missing values, which for each sex-specific outcome variable will include the values for all individuals of the opposite sex.
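As an aside for anyone working in Python as well, the subgroup analysis can be sketched without creating separate variables, by filtering on the sex variable directly. This is illustrative only and assumes the same hypothetical CSV export with columns “sex”, “sbp” and “htn”, coded as described in the instructions (sex: 1 = male, 2 = female; htn: 1 = hypertension).

```python
# Illustrative sketch only: file name, column names and codings are assumptions based on the text.
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

data = pd.read_csv("sbp_data_final.csv")     # hypothetical export of the SPSS dataset

for sex_code, label in [(2, "Women"), (1, "Men")]:
    subgroup = data[data["sex"] == sex_code]

    sbp = subgroup["sbp"].dropna()
    lo, hi = stats.t.interval(0.95, df=len(sbp) - 1,
                              loc=sbp.mean(), scale=stats.sem(sbp))
    print(f"{label}: mean SBP {sbp.mean():.1f} mmHg (95% CI: {lo:.1f}, {hi:.1f})")

    htn = subgroup["htn"].dropna()
    p_lo, p_hi = proportion_confint(int(htn.sum()), len(htn), method="wilson")
    print(f"{label}: hypertension {htn.mean():.1%} (95% CI: {p_lo:.1%}, {p_hi:.1%})")
```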
You can check that your results match what they should be by revealing the results below.
Systolic blood pressure (mmHg)
Women
- Mean systolic blood pressure = 120.5 mmHg (95% CI: 118.1, 122.8)
Men
- Mean systolic blood pressure = 132 mmHg (95% CI: 129.7, 134.2)
Hypertension (confidence intervals based on Wilson Score method)
Women
- % hypertension = 16% (95% CI: 12%, 20.9%)
Men
- % hypertension = 33.2% (95% CI: 28%, 38.9%)