Variable name | Variable type | Variable description (units/coding of levels)
---|---|---
id | Numerical: discrete | Observation ID
sbp | Numerical: continuous | Systolic blood pressure (mmHg)
htn | Categorical: nominal (binary) | Hypertension (SBP >= 140 mmHg; 0 = no/1 = yes)
age | Numerical: discrete | Age (years)
sex | Categorical: nominal (binary) | Sex (1 = male/2 = female)
ses | Categorical: nominal | Socio-economic status (1 = low/2 = medium/3 = high)
salt | Numerical: discrete | Salt consumption (g/day)
bmi | Numerical: continuous | BMI (kg/m²)
ace | Categorical: nominal (binary) | ACE inhibitor usage for at least 3 months (1 = yes/2 = no)
Preparing a dataset for analysis
In this practical we will work through some basic steps in processing, cleaning and exploring a dataset. These steps should be carried out before doing any analyses (even if you are working with data collected in another study that is supposedly “ready for analysis”).
Rationale
The first step before doing any analysis is to understand and prepare the dataset. If you are working with data from someone else’s study that has been made available for others to use (known as secondary data, or sometimes existing data), the data may have been prepared already. Even then, you should still go through these processes to ensure you understand the dataset in detail and that the data really are ready for analysis. You may see this stage referred to as data preparation, data processing, data munging, data cleaning, data cleansing, etc. These terms can mean different things in different contexts and there is no universally accepted terminology, but the key idea is that, before analysing a dataset to produce results, we want to ensure the following:
- All variables are suitably named and labelled.
- All variables are set to the appropriate variable type.
- All categorical variables are coded in the way you want.
- All variables are free from obvious errors.
In addition, there are often other key features of a dataset that must be checked prior to analysis, but these are dataset specific and we don’t have time to go into the detail here. Broadly speaking, you should think creatively about any ways you can check the accuracy of each variable. For example, if you have variables for age and education level you could check that, based on individuals’ ages, their education levels are plausible (e.g. all infants are recorded as having no education).
To save time we will carry out these steps for only some of the variables in the “SBP Excel data” dataset to see how they are done, and then use a fully cleaned and prepared dataset for the analyses.
First we’ll load a typical basic Excel dataset. It lacks the extra formatting features of an SPSS format dataset, and it contains errors typical of raw datasets, such as wrongly coded categorical variable observations and clear errors in some numerical variables, so we can see how to deal with these common issues when preparing datasets.
Practice
Scenario
You and your colleagues have been tasked by the Kapilvastu district authorities in Nepal to help them understand the problem of high blood pressure and hypertension, and the associations between socio-demographic and health-related characteristics and blood pressure level/hypertension. You have carried out a cross-sectional survey to address these aims, and collected data on systolic blood pressure, common socio-demographic characteristics, and some additional health-related characteristics. As per your statistical analysis plan you first need to clean and prepare the data for analysis.
Exercise 1: loading Excel data
SPSS can load all the main types of data you are likely to need, including CSV files, Excel spreadsheets, and data files from various other statistical programs such as Stata and SAS. During these practical sessions we will use datasets saved in SPSS’s own format (.sav), which preserves all the useful information on things like variable names and labels, categorical variable coding labels, etc., as explained below. However, if you use SPSS in the future you will likely have to load data in Excel format at some point. Therefore, the first thing we’ll learn is how to load an Excel spreadsheet.
Video instructions: load Excel data into SPSS
Written instructions: load Excel data into SPSS
Go back to the main window in SPSS and in the main menu click File > Open > Data. Then in the Open Data tool window that opens, navigate to the folder where you unzipped the MSc & MPH statistics computer sessions practical files and go into the “Datasets” folder in the same way you would locate a folder in Windows Explorer. To allow you to see Excel files, look for the Files of type drop-down menu beneath the area in the Open Data tool where the folders and files are displayed, and select the All Files option. Then click on the “SBP Excel data.xlsx” dataset and click Open (or double click on the file). A tool window will pop up called Opening Excel Data Source. Ensure the Read variable names from the first row of data box is ticked and the Worksheet drop-down menu is on “SBP data [A1:I557]”, and then click OK. You will now see the raw data displayed in the Data View with the variable names in the top row and the data values in the cells. You will see there are nine variables.
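Incidentally, every SPSS dialog has a Paste button that writes out the equivalent syntax instead of running the command, which is handy for keeping a record of your data preparation. As a rough sketch of what the import looks like in syntax (the file path below is a placeholder you would need to edit; the sheet name is taken from the dialog above):

```
* Load the Excel file; edit the placeholder path to match your own folders.
GET DATA
  /TYPE=XLSX
  /FILE='C:\practicals\Datasets\SBP Excel data.xlsx'
  /SHEET=NAME 'SBP data'
  /CELLRANGE=FULL
  /READNAMES=ON.
DATASET NAME sbp_raw WINDOW=FRONT.
```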
Exercise 2: set variable properties, code categorical variables, and check categorical variables for obvious errors
Now using our “raw” Excel dataset (which just contains the variable names and their values), let’s see how to set key variable properties, check that all categorical variables are correctly coded and free from obvious errors, and check that all numerical variables are free from obvious errors. For this we need to understand the key details of our variables. If you are working with data that is not your own then, assuming you are using a high-quality, well-documented dataset, this information will often be presented in the form of a “codebook”. The codebook for this dataset is presented in the table at the top of this page.
Written instructions: set variable properties, code categorical variables, and check categorical variables for obvious errors
First we’ll make sure our variables are correctly named, labelled and set to the correct data type, and that all categorical variables are correctly coded. Note: for some of these exercises we will just edit a few of the variables for practice and leave the rest alone. This is fine because when we move on we will be using a fully updated version of the dataset, so we only need to edit enough variables to understand the process. To make these updates we need to know what variables are in our dataset, what they each measure, and what their units/category level coding should be. This information would either be available to you if you were the one who collected the data, or it would be made available in a “codebook”, which lists every variable in a dataset along with basic information such as the variable’s name, units/category levels, often the amount of missing data for the variable (if any), and sometimes also key descriptive statistics. Refer to the codebook table above for the key details of all variables in the “SBP Excel data” dataset when updating the variables’ details.
Variable names
- Variable names are short terms used to quickly identify each variable. They should be short, clear, understandable and unambiguous. They cannot contain spaces, but you can join words into a short phrase with underscores (e.g. “age_at_death”). On the Variable View tab look at the Name column, where you can see the variables’ names. We’ll leave these as they are, but just so you can see how to edit them, click on the variable name “id”, delete it and re-enter it.
Variable labels
Variable labels are fuller (but still concise) descriptions of each variable, and can contain spaces. Good practice is to include a description of what the variable measures, plus the units if it’s a numerical variable or the category levels if it’s categorical. Next, in the Variable View look along to the Label column, where the variable labels can be stored (there are none in this dataset yet, as Excel cannot store such information). Let’s add some for practice:
For the variable sbp click on the variable label cell and enter “Systolic blood pressure (mmHg)”, i.e. the brief description of the variable with the units included, taken from the codebook table.
Now click on the variable label for ses and enter the description provided in the codebook table (you can just copy and paste it from the table).
Feel free to update the variable labels for some other variables if you wish, but once you know how to do it you can move on (a syntax version is sketched below).
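For reference, the same labels can be attached with syntax. A minimal sketch using the two labels from the codebook:

```
* Attach descriptive labels (taken from the codebook) to sbp and ses.
VARIABLE LABELS
  sbp 'Systolic blood pressure (mmHg)'
  ses 'Socio-economic status (1 = low/2 = medium/3 = high)'.
```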
Variable types
It is important that variable types are correctly set in SPSS (and in any stats package), because the type affects how analyses and other things like graphs work, and incorrectly defined variable types can result in errors. To set the variable type go to the Variable View and click on the relevant cell under Measure (it will say either “Scale” or “Nominal”) and select the correct type from the drop-down menu. SPSS sets the Measure based on the values it finds: for any variable containing non-numeric values (i.e. letters or special characters) the Type will be String and the Measure will be Nominal, while for any variable with only numeric (discrete or continuous) values the Type will be Numeric and the Measure will be Scale. However, these automatic choices may not be correct for your dataset!
For our dataset let’s start with sbp. What do you notice about its Type and Measure? The Type is String and the Measure is Nominal. And if you click on the Measure cell? Only the nominal and ordinal options are available! This means that at least one value in the variable is non-numeric, i.e. contains a letter or special character. Unfortunately there is no easy way to check which values are non-numeric (there are ways using more complicated functions, but we don’t have time for that here), but if you create a frequency table (see later for how to do this) and scroll through all the values you may be able to spot the problem values. To save time: the issue here is observation number 93, which has a value of “1o3”, i.e. with a letter “o” in place of the zero.
To update the erroneous value go to the Data View and either scroll down to observation 93 and update it directly, or click on the sbp column heading to select the whole variable and press Ctrl+F to open the Find and Replace - Data View tool (below the Find tab it should say “Column: sbp”). Then click on the Replace tab, enter “1o3” (without the quotes) in the Find box and “103” (without the quotes) in the Replace with box, and press the Replace button at the bottom to correct the value.
Once the error has been corrected, go back to the Variable View and update the variable Type for sbp to Numeric, because only Numeric type variables can be Scale variables (sorry for the confusing terminology, I don’t like SPSS’s choices!). Click on the relevant cell for sbp under Type and click the little box with three dots that appears to the right. Then select Numeric and click OK. As there are no longer any non-numeric values in the variable, you should now be able to update its Measure to Scale! (A syntax sketch of this fix follows below.)
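If you prefer to make this fix with syntax, a sketch of the same steps might look like the following (RECODE fixes the string value, ALTER TYPE converts the variable to numeric, and VARIABLE LEVEL sets the Measure; the F4.0 format is an assumption suitable for whole-number blood pressures):

```
* Replace the mistyped string value '1o3' with '103'.
RECODE sbp ('1o3'='103').
EXECUTE.
* Convert sbp from a string to a numeric variable, then set its Measure.
ALTER TYPE sbp (F4.0).
VARIABLE LEVEL sbp (SCALE).
```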
You can go through all the other variables and ensure their Type and Measure are set correctly if you wish, but once you’re happy with how this is done feel free to move on.
Value labels: coding categorical variables
- Categorical variables must use some type of coding scheme to be interpretable. This just refers to how a categorical variable’s levels are stored (e.g. our sex variable has two levels: male and female). There are two approaches:
String/character coding: each categorical level is referred to via a string (i.e. a set of letters, numbers or special characters). E.g. sex might take either the value “male” or “female”.
Value/numerical coding: each categorical level is referred to via a value, i.e. a number, with the number linked to a “value label” that then describes what that level represents. E.g. sex might take either the value 1 or 2, where 1 is linked to the value label “male” and 2 is linked to the value label “female”.
Here you can see why it can be critical to tell SPSS the type of each variable. If you record sex as male = 1 and female = 2, then your variable’s values will be 1s and 2s. If you don’t tell SPSS that this is actually a categorical variable then it will automatically detect the numerical values and treat it as a discrete/continuous variable, which can lead to various mistakes when running analyses.
You can use either approach in SPSS and you generally shouldn’t have any problems, as long as you include value labels. String coding can be easier and safer because you don’t have to keep referring back to what each numerical code means, and you’re less likely to accidentally type/select the wrong category level (e.g. “male” when you meant “female”) than to type/select the wrong numerical code, although if you set up value labels and make sure they are displayed in SPSS this is generally not an issue. However, as value coding is often used and requires more understanding of SPSS to add/edit, we will follow this approach so you can get the practice. The datasets we will use after this session, though, will use string coding for the categorical variables.
So let’s set up value labels for our categorical variables sex and ses. To do this we must ensure that they are correctly coded with numbers and that suitable value labels are attached to those numbers. In “real life” we would repeat this process for all other categorical variables. Before making any changes, though, we must first check each categorical variable to find out what levels exist and whether there are any errors (you might be surprised). The easiest way to do this is to generate a frequency table: a table listing the number and percentage of cases with each level.
From the main menu go: Analyze > Descriptive Statistics > Frequencies. Note that SPSS tries to work out the type of each variable based on the values present, e.g. if there are any non-numerical characters it will assume a categorical variable. In SPSS categorical variables are represented by three coloured circles, and numerical variables by a small ruler symbol. However, these assumptions will not necessarily be correct. Let’s look at sex and ses. Either double click on the sex and ses variables in turn, or click once on each while holding the shift key, and then click the right-pointing arrow to add them. Then ensure the Display frequency tables box is ticked and click OK. The results will appear in a new output window, which is the other main window in SPSS, where all results are displayed along with any error messages.
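The equivalent syntax (essentially what the dialog’s Paste button produces) is simply:

```
* Frequency tables for the two categorical variables we want to check.
FREQUENCIES VARIABLES=sex ses.
```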
Now look at the frequency tables to see what levels are present for sex and ses and how they are coded. Note that sex has been entered as a string variable (with levels recorded as words) and ses as a numeric variable (with levels recorded as numbers). Now you can see why frequency tables are so useful for data cleaning. For sex you can see there are five levels, three of which are clearly errors (both the obvious spelling mistakes/odd values and the fact that each occurs only once indicate they are mistakes). Specifically, the levels “1”, “FM” and “mal” need correcting.
We therefore need to first clean/correct the erroneous values, then convert all values to numerical codes, and finally add the value labels. Look at the “Variable description (units/coding of levels)” column of the codebook table above for the correct coding to apply to each level. To correct the mistakes go back to the Data View and click the top of the sex column (click on the word sex) to highlight the whole column. Then press Ctrl+F to open the Find and Replace - Data View tool. In the Find text box enter “1” (without quotes), in the Replace with text box enter “male” (without quotes), and click Replace All. Repeat this find and replace process for the other erroneous levels with the obvious correct words. Then use the same process to change all “male” values to the number 1 and all “female” values to the number 2. We can now add value labels to these numerical codes. Be sure to attach the labels to the correct numerical values or they won’t display.
Now go back to the main window Variable View and click on the Values cell for sex, where it currently says “None”. You’ll see a small box with three dots appear to the right of the cell: click on this to open the Value Labels tool. Click the Value text entry box and type 1, then click the Label text entry box and type “male”, then click Add. Repeat for Value = 2 and Label = “female” (make sure you remember to click Add). Once both value labels are added, click OK.
Next let’s correct (if necessary) and code ses. Go back and look at the frequency table for ses. We can see there are four levels, with one obviously incorrectly coded level of “11” observed only once. Clearly this should be coded as “1”. Use the find and replace process detailed above to replace level “11” with “1”, and then attach the correct value labels to the levels 1, 2 and 3 based on the information provided in the codebook table (1 = low, 2 = medium and 3 = high). A syntax sketch covering both variables follows below.
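As a hedged sketch of the whole cleaning and coding process for sex and ses in syntax (the mappings come from the frequency tables and the codebook; ALTER TYPE is used to turn the corrected string codes into numbers):

```
* Fix the three mistyped levels of sex, then recode the words to numbers.
RECODE sex ('1'='male') ('FM'='female') ('mal'='male').
RECODE sex ('male'='1') ('female'='2').
EXECUTE.
* Convert sex to a numeric variable and attach the value labels.
ALTER TYPE sex (F1.0).
VALUE LABELS sex 1 'male' 2 'female'.
* Fix the mistyped ses level (ses is already numeric) and label its levels.
RECODE ses (11=1).
EXECUTE.
VALUE LABELS ses 1 'low' 2 'medium' 3 'high'.
```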
Note: variable labels and value labels are not stored in Excel files, so to retain them you must save your dataset in SPSS “.sav” format. Do this from the main menu via File > Save As, then select a folder and file name. The .sav format should be selected by default, but make sure you save your updated dataset in this format.
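Or in syntax (again the path is a placeholder to edit):

```
* Save in SPSS format so variable and value labels are retained.
SAVE OUTFILE='C:\practicals\Datasets\SBP data cleaned.sav'.
```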
Exercise 3: check numerical variables for obvious errors
Video instructions: check numerical variables for obvious errors
Written instructions: check numerical variables for obvious errors
When checking numerical variables, all we can really hope to find are clearly erroneous values, such as values outside the plausible range for the variable given what it measures or how the study design restricted it; more modest errors won’t be detectable.
If you have a numerical variable with relatively few distinct/unique values, such as age, then probably the easiest check is another frequency table. Produce a frequency table for age: what do you notice about the extreme values (i.e. the lower and upper ends)? One person has an age of 4. This is clearly an error, as the study only recruited participants aged 18+. What to do? There are three choices: 1) leave it as it is, 2) update it, or 3) delete it. Option 1 is not a good idea because we know the value must be an error. Option 2 would be ideal, but you can’t always do this: an age of 4 could actually be an age of 24, 34, 40, 41, etc., so we would have to be able to go back and contact the participant to check their age. Where this is not possible you could conduct a sensitivity analysis by running your analysis with this age value set to 40 and then again set to 49, so you can see what happens when you assume the true value was at either end of a plausible range. Or you may take option 3 and delete the value (but this should be reported in your methods). Here, however, we’ll assume we were able to check and the correct age should have been 40. To update this go to the Data View, right click on the age variable name and select Sort Ascending. The data should now be ordered with the participant with age 4 at the top (or you could use the find and replace process detailed above). Update their age to 40.
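A syntax sketch of the same correction (sorting is optional here; the IF statement changes only the single erroneous age of 4, which we are assuming should be 40 as described above):

```
* Sort so the smallest ages appear first, then correct the known error.
SORT CASES BY age (A).
IF (age = 4) age = 40.
EXECUTE.
```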
When numerical variables have many unique values, such as when they include decimals, a frequency table becomes less practical. A better option is to produce a dotplot (also called a dotchart or Cleveland dotplot/dotchart), which is like a histogram but displays each value as a dot, with dots from observations in a similar range (i.e. the same “bin”) stacked up. As BMI has very few (maybe no) repeated values, a frequency table would be huge, so let’s produce a dotplot instead. From the main menu go: Graphs > Legacy Dialogs > Scatter/Dot. Then click Simple Dot > Define. In the Simple Dot graph tool add the variable bmi into the X-Axis Variable box and click OK. Looking at the graph you’ll see a large stack of dots for similar values, and far out to the right on the x-axis you should clearly see an anomalous value (a circle) that must be an error because it’s so high. Double click on the graph to open it in a new window where you can edit it. Now click on the error value, which should highlight all circles in yellow, then click on it again so that only the error circle is highlighted. Then right click on the circle and choose Go to Case. This will take you to the relevant observation in the Data View where you can update it. Again we have three choices; let’s assume we could go back and get the correct value either from the participant or an earlier dataset (e.g. if weight and height were measured separately), and it was actually 28.175.
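For reference, the Paste button in the Simple Dot dialog generates GGRAPH/GPL syntax along these lines (treat this as a sketch; the exact GPL pasted may differ slightly between SPSS versions):

```
* Dotplot of bmi, with each observation drawn as a stacked dot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=bmi MISSING=LISTWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s = userSource(id("graphdataset"))
  DATA: bmi = col(source(s), name("bmi"))
  GUIDE: axis(dim(1), label("BMI (kg/m2)"))
  ELEMENT: point.dodge.asymmetric(position(bin.dot(bmi)))
END GPL.
```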
Now re-run the dotplot for BMI: what do you see? There is still a value far to the right of (i.e. higher than) any other BMI value. This may be considered an “outlier”. Outliers are extreme values that are far greater/smaller than any other values in a numerical variable. There are no clear thresholds beyond which a value is considered an outlier, but as you can see, probably the easiest way to spot possible outliers is with a dotplot. Once you’ve identified any possible outliers, such as this very high BMI value, you should check whether they are obvious errors and correct them if possible. If they appear to be genuine values, however, the recommended approach is to conduct any analyses involving the relevant variable(s) with and without the outliers present, i.e. repeat your analysis with the observations (e.g. individuals) that have the outliers deleted and see what changes. Then clearly report the results of both analyses and any implications for the interpretation of your results. This is known as a sensitivity analysis. Broadly speaking, if removing outliers makes an important difference to your results then your results are probably not very robust to start with. Where you have plenty of good quality data, removing the odd outlier will rarely make much difference unless it is hugely larger/smaller than the rest. Beyond this one high (but plausible) BMI value, the dotplot shows no other outliers in this dataset.
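As an illustration of how such a sensitivity analysis could be run in syntax, you can compare a simple summary with and without the extreme value; the cut-off of 45 below is purely illustrative, not a recommended outlier threshold:

```
* Summary of bmi using all observations.
DESCRIPTIVES VARIABLES=bmi.
* TEMPORARY makes the SELECT IF apply to the next procedure only.
TEMPORARY.
SELECT IF (bmi < 45).
DESCRIPTIVES VARIABLES=bmi.
```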
A final note on data cleaning
We will end our data cleaning there, having seen how to clean and prepare variables as best we can in isolation. With a real dataset, however, these steps would just be the basics, and we would also want to check for errors and anomalous values by looking at relationships between variables. For example, a scatter plot of sbp vs age might reveal a suspiciously high sbp value for a very young participant, i.e. one that would be very unlikely in such a young individual. We could then check this against any prior data, or with the participant if possible.
Note: generally speaking you should not delete any observations (e.g. patients) from any variables when preparing your data. When it comes to analysing the data, missing values in some variables may be a problem, and in the most simplistic approach you may have to conduct “complete case” analyses by removing any observations with missing values in the analysis variables. However, this is not without implications for possible bias (https://stefvanbuuren.name/fimd/ch-introduction.html), and you should always compute and present the frequency of missing data if present.
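One quick way to get the valid and missing counts for every variable, without printing the full frequency tables, is:

```
* Statistics table only (valid and missing N), no frequency tables.
FREQUENCIES VARIABLES=ALL /FORMAT=NOTABLE.
```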
Note: if you are concerned that one or more observations for one or more variables are errors but you cannot confirm this, then a good approach is to run your analyses with and without those data points, and then in your results explain whether this made any difference to your conclusions and discuss the implications accordingly. This, too, is known as a sensitivity analysis.