Complex sample/survey design analysis


In this practical we will look at how to appropriately analyse data that have been collected using sampling methods with “complex” design features. However, we will be using the same broad methods of analysis that we have already looked at, such as estimating population means/percentages, and linear and logistic regression.


Rationale


Overview

Complex sample design data are datasets (i.e. samples) that have come from a probability sampling method with any combination of the following three “complex design features”, with all three often being present together:

  1. Stratification

  2. Clustered sampling (often multi-stage clustered sampling)

  3. Unequal probabilities of selection for sampling units

Such data are also often referred to as “complex survey design” data, but strictly speaking a survey is not a study design: data for any study design could be sampled using methods with complex features like these, so the more general term is complex sample design data.

Such data may have come from a single cross-sectional sample, multiple cross-sectional samples (i.e. multiple samples across time of different sampling units), or be true longitudinal data (i.e. multiple samples across time of the same sampling units).

This type of data is most commonly encountered/analysed in global health research when working with large-scale, typically national-scale, cross-sectional household surveys, such as the Demographic and Health Surveys (DHS: https://dhsprogram.com/), the Multiple Indicator Cluster Surveys (MICS: https://mics.unicef.org/), the WHO STEPwise Approach to Surveillance surveys (STEPS: https://www.who.int/ncds/surveillance/steps/en/), and many countries’ own household surveys. This is because large-scale household surveys almost always use all three of the key complex sampling design features: stratification, cluster sampling, and unequal probabilities of selection of sampling units. They do so for a mix of pragmatic, logistical, and statistical/research reasons. Stratified sampling is typically employed so that each stratum (usually an administrative level such as the region, often further stratified into urban and rural areas) can be sampled independently, ensuring each stratum has a sufficient sample size for inferences to be made at the stratum level. Cluster sampling is primarily employed to reduce the logistical costs of sampling across huge geographical areas. And together, the stratification and multi-stage cluster sampling methods employed result in unequal probabilities of selection for the final sampling units (typically individuals).

Standard methods of analysis, or more technically the maths underlying those methods, such as the independent t-test and linear regression, assume that the data being analysed come from a simple random sample. Any of the complex sampling design features mentioned above violates that assumption, and if this is not adjusted for it results in biased point estimates (e.g. biased regression coefficients) and/or biased standard errors of those point estimates (i.e. biased variance estimates), which in turn means biased confidence intervals and p-values. More specifically, not adjusting for clustering typically results in falsely narrow confidence intervals and falsely small p-values. Not adjusting for stratification means the analysis “misses out” on the benefits of a stratified sample, which can reduce the overall variation in the outcome and produce more precise results for a given sample size. And not adjusting for unequal probabilities of selection of sampling units typically results in both biased/inaccurate point estimates of characteristics/relationships and falsely narrow confidence intervals/falsely small p-values. It is therefore critical that we account/adjust for any complex sampling design features in our analyses to obtain unbiased results.
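To give a rough sense of how much clustering alone can matter, the inflation of the variance caused by cluster sampling is often summarised by the “design effect” (deff), which under some simplifying assumptions (equal cluster sizes) is approximately:

deff = 1 + (m − 1) × ICC

where m is the average number of observations per cluster and ICC is the intra-cluster correlation (how similar observations within the same cluster are to each other). For example, with 20 women sampled per cluster and an ICC of just 0.05, deff = 1 + 19 × 0.05 ≈ 1.95, i.e. the true sampling variance is roughly double what a calculation assuming a simple random sample would suggest.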

Luckily there are complex-design versions of, or modifications to, all commonly used analytical methods, including t-tests and regression models (i.e. linear and logistic, and others that we’ve not looked at). And also fortunately we don’t have to worry about the underlying maths. With statistical software like SPSS, essentially all we have to do is tell the software which variables code for, or identify, the different complex design features, and then select the appropriate complex sampling design version of the analysis we want to do. We then proceed with our analysis much as we would if we were analysing data from a simple random sample, i.e. using the processes outlined in the earlier practical sessions.

In any high quality complex sampling dataset that used stratification, cluster sampling, and unequal probabilities of selection of sampling units in its sampling design, there will be a variable that indicates which stratum each observation (sampling unit) comes from and a variable that indicates which cluster (sampling unit) each observation comes from. There will also be a third variable that contains the survey weights, which are used to adjust for the unequal probabilities of selection of sampling units.

A very brief description of survey weights

Read/hide

Sampling with unequal probabilities of selection of sampling units results in unrepresentative samples (i.e. unrepresentative of the target population). This may occur when using stratified sampling with disproportionate strata, or when using a multi-stage cluster sampling approach. To obtain representative results we must therefore “map” the survey sample back onto the target population. To do this, survey data producers calculate sampling weights. In the analysis these weights upweight (i.e. increase the contribution of) observations from sampling units that were undersampled relative to the proportion of the target population that they represent, and downweight (i.e. decrease the contribution of) observations from sampling units that were oversampled relative to the proportion of the target population that they represent. These weights are then usually combined with two other sets of weights: one that attempts to correct for any missing observations, and one that uses pre-existing, recent, robust data on characteristics of the target population (such as a recent census) to adjust for any remaining lack of representativeness in the sample. Once all three types of weight are combined, the result is a final or ultimate survey weight variable.
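As a simple illustration (with made-up numbers, not from any particular survey): the basic design weight for a sampling unit is the inverse of its overall probability of selection,

design weight = 1 / probability of selection

so a woman who had a 1-in-2,000 chance of being selected gets a design weight of 2,000 (she “represents” 2,000 women in the target population), while a woman who had a 1-in-500 chance gets a design weight of 500. The non-response and calibration adjustments described above are then multiplied into this basic design weight to produce the final survey weight.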

Therefore, once we’ve identified these three variables and “told” SPSS which variables they are we can analyse our data and largely forget/ignore the fact that we are doing a complex sampling design version of our chosen analysis. The type of results we get and their interpretation will, broadly speaking, not change apart from in a few areas that we will look at in this practical session.

Further reading

Read/hide

Unfortunately, the best and most beginner-friendly textbook on sampling design data analysis, which also includes examples using SPSS (as well as other statistical software), is no longer available from the library (it was previously available online via the library). The book is: Heeringa, S., West, B.T., Berglund, P.A. (2017). Applied Survey Data Analysis (2nd ed.). Boca Raton, FL, USA: CRC Press. If you do plan on conducting sampling design analyses in your later careers I would strongly recommend buying the book, as it’s well worth it. The accompanying website also has lots of practice examples of analyses in SPSS and other statistical software. Unfortunately, I am not aware of any similar textbooks on sampling design analysis.

However, there is a nice overview/summary of the main issues around sampling design data analysis in the following paper, by one of the authors of the book I’ve just mentioned, which is freely available here:

https://pubmed.ncbi.nlm.nih.gov/18956450/

Preparing for sampling design data analysis: a checklist

The following is an excellent checklist, for future reference, of steps to undertake when preparing to analyse sampling design data in a real analysis. It is taken from the Applied Survey Data Analysis textbook mentioned previously. For this practical we will assume we have carried out steps 1-3 and step 5 to save time, and some of the issues mentioned are beyond the scope of this introductory course/session. However, if you do a sampling design analysis for a real study or later in your career you should educate yourself enough to understand all these terms and the issues they refer to (e.g. by buying and reading the recommended textbook).

Read/hide
  1. Review the documentation for the data set provided by the data producer, specifically focusing on sections discussing the development of the final survey weights and sampling error (standard error) estimation. Contact the data producer if any questions arise.

  2. Identify the correct weight variable for the analysis, keeping in mind that many survey data sets include separate weights for different types of analyses. Perform simple descriptive analyses of the weight variable, noting the general distribution of the weights, whether the weights have been normalized, and whether there are missing or 0 weight values for some cases. Select a few key variables from the survey data set and compare weighted and unweighted estimates of descriptive parameters (e.g., means, proportions) for these variables to understand the effect the weights have.

  3. Identify the variables in the data set containing the “sampling error calculation codes” that define the sampling error calculation model [JPH: these are just the variables defining the strata and clusters]. Examine how many clusters were selected from each sampling stratum (according to the sampling error calculation model), and whether particular clusters have small sample sizes. If only a single sampling error cluster is identified in a sampling stratum, contact the data producer or consult the documentation for the data set for guidance on recommended variance estimation methods. Determine whether replicate sampling weights are present if sampling error calculation codes are not available, and make sure that the statistical software is capable of accommodating replicate weights (Section 4.2.1).

  4. Create a final analysis data set containing only the analysis variables of interest (including the survey weights, sampling error calculation variables, and case identifiers) [JPH: this is optional. It’s nice to keep your dataset tidy and no bigger than necessary for speed and manageability, but it obviously doesn’t affect the analysis in any way]. Examine univariate and bivariate summaries for the key analysis variables to determine possible problems with missing data or unusual values on the individual variables.

  5. Review the documentation provided by the data producer to understand the procedure (typically nonresponse adjustment) used to address unit nonresponse or nonresponse to a wave or phase of the survey data collection. Analyse the rates and patterns of item missing data for all variables that will be included in the analysis. Investigate the potential missing data mechanism by defining indicator variables flagging missing data for the analysis variables of interest. Use statistical tests (e.g., chi-square tests, two-sample t-tests) to see if there are any systematic differences between respondents providing complete responses and respondents failing to provide complete responses on important analysis variables (e.g., demographics). Choose an appropriate strategy for addressing missing data using the guidance provided in Section 4.4 and Chapter 12 [JPH: we will not look at missing data issues in this practical, but most high quality sampling design datasets, such as the DHS surveys, include weights that incorporate adjustments for missing data that, while not perfect by any means, go some way to addressing bias from missing data by simply including the weights in the analysis as you should be doing anyway].

  6. Define indicator variables for important analysis subclasses. Do not delete cases that are not a part of the primary analysis subclass. Assess a cross-tabulation of the stratum and cluster sampling error calculation codes for the subclass cases to identify the distribution of the subclass across the strata and clusters defined by the sampling error calculation model. Consult a survey statistician prior to analysis of subclasses that exhibit the “mixed class” characteristic illustrated in Figure 4.4. Make sure to employ appropriate software options for unconditional subclass analyses if using TSL for variance estimation [JPH: we will look at this issue in more detail later in this practical session].


Practice


Scenario

You have been tasked by your ministry of health to use your country’s recent DHS survey data to explore associations between socio-demographic characteristics and the time women report it takes them to collect water.

Exercise 1: describe the association between location and time taken to collect water

The dataset we will use here is one of the DHS’s “model datasets”, and as such it was produced by the DHS for practicing analysis of DHS survey data. It is therefore representative of the types of data and data features (e.g. sampling design features) found in DHS surveys, but it doesn’t contain any real data from any country. DHS surveys typically result in the production of a number of datasets, primarily a household-level dataset and datasets for women, men, and children, plus potentially others. You can read more about DHS datasets here: https://dhsprogram.com/data/

However, we will just use the “individual recode” dataset in the SPSS .sav format (DHS helpfully makes their datasets available in other formats too). This contains data from the female respondents of the survey, i.e. all women aged 15-49 in the selected households. This dataset is included in the “Datasets” folder of the MSc & MPH computer sessions practical files, and is called “ZZIR62FL.SAV”. The “odd” naming is due to DHS’s naming conventions, which you can read about here: https://www.dhsprogram.com/data/File-Types-and-Names.cfm

More specifically, we will use the dataset to answer the descriptive research question “what factors are associated with the time women report it takes someone from their household to reach the household’s source of drinking water?”

Step 1: prepare and explore the data

First of all we will identify, explore, and edit as necessary the key complex design variables of the survey. We will then explore our independent and outcome variables, and then produce descriptive statistics to describe the sample’s characteristics in terms of the variables measured, as would commonly be found in the “Table 1” of a paper. Then we will produce our analytical model results.

  • Load the “ZZIR62FL.sav” SPSS dataset.

Video instructions: prepare and explore the data

Written instructions: prepare and explore the data

Read/hide

Basic variable details obtained from the documentation describing the dataset

  • V005 = Women’s individual sample weight (6 decimals) - numerical

  • V021 = Primary sampling unit - nominal

  • V023 = Stratification used in sample design - nominal

  • V025 = Type of place of residence - nominal

  • V115 = Time to get to water source - numerical (minutes)

  • V152 = Age of household head - numerical (years)

  • V190 = Wealth index - ordinal

Prepare and explore the data

As per steps 1-3 of the checklist, for any sampling design analysis you must first thoroughly review the survey technical/methodological documentation to understand the design of the survey and the sampling process, to identify which variables contain the stratification, clustering, and weighting information, and to understand any modifications of those variables that are required. To save time we will assume we have done this, and that we have removed all variables from the dataset other than these design variables and our independent and outcome variables.

Therefore, open the ZZIR62FL.SAV dataset. If a window appears with a message starting “IBM SPSS Statistics is running in Unicode encoding mode…” just click Yes. We can assume that from reading the relevant technical/methodological literature on this survey we have identified the relevant complex design variables and our variables of interest and removed all other variables from the dataset. The information on the complex design variables can be found in the “DHS Guide to Statistics” (https://dhsprogram.com/data/Guide-to-DHS-Statistics/Analyzing_DHS_Data.htm), but you should also review the survey specific literature (each DHS survey has a final report with specific methodological details relevant to that survey, along with a full copy of the questionnaire used that you should always review).

Explore and edit the complex design variables

As listed above, V005 contains the survey weights, V021 identifies the clusters (often called primary sampling units, as they are the first stage of a multi-stage cluster sample), and V023 identifies the strata. However, and this shows why you must read survey methodological literature carefully, all DHS weights are stored multiplied by 1,000,000. This is for technical data-storage reasons (avoiding decimals/floating point issues) that we don’t need to go into here, but the point is that before we can use the weight variable we first need to divide all its values by 1,000,000.

To do this from the main menu go: Transform > Compute Variable. Then in the Target Variable: box give our new variable a suitable name, say “survey_weight”, and in the Numeric Expression: box enter the following:

V005 / 1000000

Then click OK. You can now delete the old V005 variable if you wish.
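If you prefer syntax to the menus, a minimal equivalent of this step (using the new variable name chosen above) is:

COMPUTE survey_weight = V005 / 1000000.
EXECUTE.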

Before we go any further let’s rename the other variables for ease of reference. We’ll use the following names:

  • V021 = cluster

  • V023 = strata

  • V025 = location

  • V115 = time_water

  • V152 = age_hhh

  • V190 = wealth

And let’s also change the measurement level (the Measure column in the Variable View) of the variables cluster, strata, location, and wealth to Nominal so SPSS treats them as categorical variables (because they use numerical coding, SPSS automatically treats them as numerical/scale variables unless told otherwise).
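For reference, roughly equivalent syntax for the renaming and the measurement-level changes would be:

RENAME VARIABLES (V021 = cluster) (V023 = strata) (V025 = location)
  (V115 = time_water) (V152 = age_hhh) (V190 = wealth).
VARIABLE LEVEL cluster strata location wealth (NOMINAL).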

Now let’s explore the complex design variables. For a quick overview of their descriptive statistics remember you can just right click on each variable in either the Data View or Variable View and click on Descriptive Statistics. As our strata and cluster variables are both categorical this is probably sufficient for them, but it’s also worth producing a histogram for the numerical survey_weight variable to visualise its distribution.
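If you want to produce the histogram via syntax rather than the menus, one option is:

GRAPH
  /HISTOGRAM=survey_weight.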

What do the descriptive statistics and histogram show you for the three complex design variables?

Read/hide

There are no missing values for any of the variables. The observations are fairly evenly distributed across clusters, but much less evenly distributed across strata. However, there are no concerning issues, such as a single observation in a cluster or stratum, that might indicate data errors. The histogram of the weights shows a strongly right-skewed distribution with some potentially concerning extremely high weight values. If this were a real analysis it would be worth looking into these more closely, but we’ll assume they’re all fine.

Explore and edit the data variables

Next let’s explore each of the data variables in turn. Again for a quick univariate exploration we can just use the same process as we used to explore the complex design variables.

What do the descriptive statistics and histogram show you for the four data variables?

Read/hide

There are no missing values for location, and there is a ~40%:60% split between urban and rural locations of the surveyed women’s households. There are 13 missing values for our outcome variable time_water, and a histogram indicates that the variable appears to have a strongly “bimodal” distribution with two distinct “peaks”, but we’ll come back to this shortly. There are 6 missing values for age_hhh, and a histogram indicates that the variable is fairly normally distributed. There are no missing values for wealth, and observations are evenly distributed across its five levels. This is to be expected because in DHS surveys the numerical household wealth index is categorised into five wealth quintiles by evenly separating increasing values of the variable into five levels/groups.

So back to the outcome variable. This is a good example of why a careful data exploration is important prior to any analysis, and why understanding your survey data and methodological documentation is so important. You can access the questionnaire that the model dataset is based on here: https://dhsprogram.com/pubs/pdf/DHSQ6/DHS6_Questionnaires_5Nov2012_DHSQ6.pdf

Scroll down to question 104 “How long does it take to go there, get water, and come back?”, which is where our variable comes from.

What can you see about the possible response values?

Read/hide

Respondents can provide the number of minutes it takes them to go and collect water and come back, but for those who don’t know, the response is coded as 998, which is the second peak we see on the histogram. Therefore, the distribution is not really bimodal at all; the second “peak” is just the 998 “don’t know” code rather than true values.

Therefore, let’s just look at the distribution of the non-missing values of our outcome. Instead of deleting all observations where the response to our question of interest was “don’t know”, we can tell SPSS to treat the value 998 as a missing value for this variable. In the Variable View, in the time_water row click on the cell under the Missing column and then click on the box with three dots that appears at the right of the cell. In the Missing Values box that appears click the Discrete missing values button, enter 998 in the first box, and then click OK.
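The equivalent syntax, should you want it, is simply:

MISSING VALUES time_water (998).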

Now re-run the histogram. What do you see?

Read/hide

The distribution of values is clearly strongly right-skewed. Remember that linear regression assumes that the residuals are approximately normally distributed, not the raw values of the outcome, so once we adjust for all our independent variables our model may still be valid. However, with such a strong right-skew it looks likely that we’ll have to transform the outcome (another option would be to use a regression model for count data, such as the Poisson or negative binomial regression model, but that’s beyond this course).

Explore bivariate relationships

Let’s look at the relationships between our independent variables and our outcome to check for any clearly non-linear relationships. In a real analysis we would also want to check for the presence of strong interactions, ideally that we would expect from theory, but that is beyond the scope of this practical and course.

As previously, for numerical independent variables we use scatter plots.

  • From the menu go: Graphs > Legacy Dialogs > Scatter/Dot, then select the default top-left Simple Scatter option and click Define. Then add age_hhh and time_water to the X Axis: and Y Axis: boxes respectively and click OK.
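A syntax sketch of the same scatter plot (the legacy dialog pastes something very similar) is:

GRAPH
  /SCATTERPLOT(BIVAR)=age_hhh WITH time_water.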

It’s hard to see much of a relationship there, and certainly no clear non-linear relationship.

Then for the two categorical independent variables again we can use box plots.

  • From the main menu go: Graphs > Legacy Dialogs > Boxplot, then select Simple and click Define. Add the time_water variable into the Variable: box and location into the Category Axis: box and click OK.
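As a syntax alternative for the grouped box plot, the EXAMINE procedure produces the same style of plot:

EXAMINE VARIABLES=time_water BY location
  /PLOT=BOXPLOT
  /STATISTICS=NONE.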

It’s also hard to see much of a clear relationship here: the median time for urban looks slightly higher than the median time for rural, but clearly there’s a lot of variation within both category levels. If you repeat the graph for wealth, what do you see? Again, personally I can’t see any clear relationships here.

Step 2: create the SPSS “complex samples plan file”

Video instructions: create the SPSS “complex samples plan file”

Written instructions: create the SPSS “complex samples plan file”

Read/hide

As mentioned previously, in order to undertake a valid analysis of a complex design dataset we must first tell SPSS which variables code for the complex design features in our sample. We only need to do this once.

  • To do this from the main menu go: Analyse > Complex Samples > Prepare for Analysis. Then, with the Create a plan file button selected, click Next. In the Save Data As window that opens we need to create the complex samples plan file in which SPSS will store the information about which variables code for the complex design features. Navigate to a suitable folder, enter a suitable name for the plan into the File name: box, such as “water_survey”, and then click Save. You can now identify to SPSS the complex design variables coding for the strata, clusters, and weights. Note: if you are working with data that come from a complex sample using fewer than all three of these design features you can of course just identify those that apply.

  • In the Analysis Preparation Wizard simply add the strata variable into the Strata: box, the cluster variable into the Clusters: box, and the survey_weight variable into the Sample Weight: box. You can then just click Finish, because the default options that you could set on the subsequent pages (by clicking Next) are suitable.

You have now created your sampling plan file, which can be reused for any analysis of this dataset. We can now run any analysis that can account for a complex design using this file, and SPSS will adjust for the stratification, clustering, and weights.
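For reference, the analysis plan the wizard creates corresponds broadly to the following syntax sketch (file name as chosen above; the with-replacement (WR) estimator shown here is what the wizard typically uses by default, but check the syntax SPSS pastes for you if unsure):

CSPLAN ANALYSIS
  /PLAN FILE='water_survey.csaplan'
  /PLANVARS ANALYSISWEIGHT=survey_weight
  /DESIGN STRATA=strata CLUSTER=cluster
  /ESTIMATOR TYPE=WR.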

Step 3: describe the sample

Video instructions: describe the sample

Written instructions: describe the sample

Read/hide

Note: if you feel you are short on time then skip this step and move on to the next one, which focuses on producing the inferential results.

We will calculate the mean for time_water and age_hhh and the frequency and percentage for each category level of location and wealth. Let’s start with the numerical variables.

  • From the main menu go: Analyse > Complex Samples > Descriptives, then in the Complex Samples Plan for Descriptives Analysis tool just click Continue as the complex samples plan file will be automatically selected. If for any reason it is not click Browse and locate and select it before clicking Continue.

  • Now in the Complex Samples Descriptives tool add time_water and age_hhh to the Measures: box. Then click the Statistics button, ensure that the Mean box is ticked, and click Continue and then OK. For some reason there is no option to calculate the standard deviation or median. The range isn’t affected by the complex design features though, so you could still calculate that using the standard tool.
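The menu steps above correspond roughly to the following syntax (the plan file is the one created in Step 2):

CSDESCRIPTIVES
  /PLAN FILE='water_survey.csaplan'
  /SUMMARY VARIABLES=time_water age_hhh
  /MEAN
  /STATISTICS SE.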

You’ll see that time_water has a mean of 24.61 and age_hhh has a mean of 45.48. These are both adjusted for the complex design features of the sample. To compare with an analysis ignoring those features, right click on the time_water variable and select the Descriptive Statistics option. You’ll see that the mean for time_water is 23.74 when you ignore the complex design features. In this instance it’s not a big difference, but this is the effect of accounting for the weights variable.

Now let’s calculate the frequency and percentage for each category level of location and wealth.

  • From the main menu go: Analyse > Complex Samples > Frequencies, then in the plan file selection window just click Continue as the complex samples plan file will be automatically selected. If for any reason it is not, click Browse and locate and select it before clicking Continue.

  • Now in the Complex Samples Frequencies tool add location and wealth to the Frequency Tables: box, then click the Statistics button and tick the Population size, Table percent, and Unweighted count boxes, then click Continue and then OK.
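A rough syntax equivalent is sketched below (the unweighted counts are requested via the dialog tick box; the cell keywords here cover the weighted frequencies and percentages):

CSTABULATE
  /PLAN FILE='water_survey.csaplan'
  /TABLES VARIABLES=location wealth
  /CELLS POPSIZE TABLEPCT.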

The resulting tables are not helpfully laid out! The Estimate column gives the weight-adjusted frequencies in the top half of the table and the corresponding weight-adjusted percentages in the bottom half of the table. The Unweighted Count column gives the unadjusted frequencies in both the top and bottom halves of the table, i.e. the same values are simply repeated for some unclear reason.

For both the means and the frequencies and percentages you would usually want to report the weight-adjusted descriptive statistics, because they reflect the target population that you are making inferences about: by the nature of the sampling design, the unadjusted sample results will not be representative of that target population, while the adjusted results aim to be.

Step 4: run the complex design adjusted linear regression model

Video instructions: run the complex design adjusted linear regression model

Written instructions: run the complex design adjusted linear regression model

Read/hide

We are now ready to produce our analytical inferential statistics via our linear regression model. Again we can just use our existing complex samples plan file to adjust for the complex design features using the appropriate complex design linear regression tool.

  • From the main menu go: Analyse > Complex Samples > General Linear Model. Remember a general linear model is a synonym for a linear regression.

  • On the Complex Samples Plan for General Linear Model tool click Continue. Now add time_water into the Dependent Variable: box and location into the Factors: box.

  • Next click the Statistics button and in the Model Parameters area ensure the Estimate and Confidence interval boxes are ticked, along with the Model Fit and Sample design information boxes below. Then click Continue.

  • We can ignore the Hypothesis Tests button and the Estimated Means button, but click the Save button and in the Save Variables area ensure the Predicted Values and Residuals boxes are ticked, then click Continue.

  • We can also ignore the Options box so just click OK.
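For reference, the pasted syntax for this model should look broadly like the sketch below (the exact subcommands may differ slightly depending on which boxes you ticked):

CSGLM time_water BY location
  /PLAN FILE='water_survey.csaplan'
  /MODEL location
  /INTERCEPT INCLUDE=YES SHOW=YES
  /STATISTICS PARAMETER SE CINTERVAL
  /PRINT SUMMARY SAMPLEINFO
  /SAVE PRED RESID.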

Step 5: check the assumptions of the complex design linear regression model

However, as always, before interpreting the results we should check that the model’s assumptions are not violated. In particular, we should check that the residuals are approximately normally distributed and not heteroscedastic. To check the first of these, just plot a histogram of the saved residual values (the new variable should be called Residual).
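For example, via syntax (assuming the saved variable is indeed called Residual; check the new column at the end of your dataset, as SPSS may add a numeric suffix such as Residual_1):

GRAPH
  /HISTOGRAM=Residual.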

What do you see?

Read/hide

The residuals are clearly strongly right-skewed!

Therefore, we can either try to transform the outcome or use an alternative analysis. For now let’s see how we would interpret the results assuming that the assumptions were not violated, and then we will leave it as an exercise for you to re-run the model with a transformed outcome to see if that solves the problem, and then interpret the back-transformed results.
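As a pointer for that exercise, one common choice (an illustration, not the only valid option) is a natural log transformation of the outcome; a small constant such as 1 is often added first in case any reported times are 0 minutes, since the log of 0 is undefined:

COMPUTE log_time_water = LN(time_water + 1).
EXECUTE.

You would then re-run the complex samples general linear model with log_time_water as the dependent variable, re-check the residuals, and back-transform (exponentiate) the coefficients when interpreting them.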

Note: in a real analysis you should check all assumptions are not violated. Just follow the instructions in the “Step 2: check the assumptions of the linear regression” section in the linear regression practical for guidance, and use the Residuals and Predicted variables produced by the complex design adjusted linear regression we just ran accordingly.

Step 6: understand the results tables

After running the model you will see four tables (scroll up past the residual histogram you should have just created, and anything else you’ve produced since running the model).

The Sample Design Information table tells us some important things about our sample and design. In particular, it tells us how many cases (in our case women) were included in or missing from the analysis, i.e. our analytic sample size: these are the valid and invalid N values, along with the total unweighted sample size (Total N). It also gives an estimate of the size of the target population, but this is only valid if the weights have not been “normalised” to have a mean of 1, which DHS weights have been, so you can ignore this. It also tells us the number of strata (Strata) and clusters (Units), which serves as a check of whether all strata and clusters were included in the analysis or whether any were left out due to missing observations in those strata/clusters.

The Model Summary table tells us the R² value for the model, although frustratingly not the more robust and useful adjusted R² value!

We can ignore the Test of Model Effects table as this is an ANOVA type table and not really useful.

Then we have our familiar Parameter Estimates table where we get our key inferential results about the relationships between the independent variables and the outcome in the model. These are interpreted in exactly the same way as the linear regression coefficients from a linear regression that does not adjust for any complex design features. Therefore, refer back to the “Step 6: report and interpret the results” section of the linear regression practical, and in particular the sub-sections “Numerical variables” and “Categorical variables”, for a reminder if needed.

Exercise 2: describe the mean time (in minutes) women report it takes them to collect water

  • Use the “Descriptives” tool within the “Complex Samples” menu to estimate the mean time women report it takes them to collect water. It should now be pretty straightforward what to do, but make sure to click on the “Statistics” button to ensure you tell SPSS to calculate the necessary statistics to address this descriptive inferential question.
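If you want to check your menu choices against syntax, a sketch of the equivalent command (assuming the plan file and variable names used earlier) is:

CSDESCRIPTIVES
  /PLAN FILE='water_survey.csaplan'
  /SUMMARY VARIABLES=time_water
  /MEAN
  /STATISTICS SE CIN(95).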

What are plausible values given the confidence intervals of the estimate?

Read/hide

It looks like the likely population mean is between 23.6 and 25.6 minutes. So we appear to have an extremely precise estimate of the population mean, assuming we have no/minimal bias in the study!