Multiple linear regression interactions & non-linear associations
In this practical we will look at how we can use linear regression with interactions to describe more complicated associations between two covariates and a continuous outcome.
Rationale
Uses
Previously we saw how we can use linear regression to describe associations between continuous, discrete or categorical covariates and continuous outcomes (and sometimes also discrete outcomes, if the modelling assumptions are not too badly violated), but we only looked at describing bivariate associations. Here, we look at how we can describe interactions between two covariates or non-linear associations to describe more complicated associations that often exist.
Key terminology and concepts
Interaction = When the size and/or direction of an association between one covariate and an outcome depends on the value of another covariate (or vice versa).
Non-linear association = An association between a continuous/discrete covariate and a continuous (or potentially discrete) outcome that is not linear (i.e. not a straight line), or that you do not want to assume is linear. Within a linear regression modelling context such an association cannot be modelled using a single coefficient and instead requires additional terms, such as polynomial terms for the covariate of interest.
Polynomial terms = A continuous or discrete covariate with an exponent/power, e.g. age-squared, age-cubed etc.
Practice
Scenario
We wish to describe important and potentially complex associations between key socio-demographic and relevant health related characteristics and individuals’ systolic blood pressure using the data collected in the “SBP int & nl” dataset. Interactions and non-linear associations are often subtle/weak so the interactions and non-linear associations in this dataset have been created to clearly illustrate these types of associations.
As these are descriptive research questions and we want to describe the associations of interest as they exist (to inform policy and practice), we will use a series of separate linear regression models where we model each association of interest without adjusting for any other covariates.
Exercise 1: to explore how we can describe categorical by categorical interactions, we will describe the population-level association between sex (female/male) and smoking status (no/yes) combined subgroups (our covariates) and systolic blood pressure (mmHg - our outcome) using linear regression.
- Load the “SBP int & nl.sav” dataset.
Step 1: explore the data
Written instructions: explore the data
Read/hide
Again, it is recommended to thoroughly explore your data in terms of: 1) the distribution of your outcome and covariates, to ensure you understand them well and are taking a suitable modelling approach (e.g. linear regression rather than another type of regression), and 2) what functional form of model makes good sense given the associations between your covariate(s) and outcome variable(s) - particularly whether there are any clear/strong interactions or non-linear associations, which we will look at here.
Univariate exploration
To understand the distribution of your variables you can use histograms and boxplots for numerical variables and barcharts for categorical variables. Let’s run a histogram for our outcome variable.
- From the main menu go: Graphs > Histograms. Add sbp_sex_smoke into the Variable: box, tick the Display normal curve box and click OK.
The overall distribution appears to be bimodal - i.e. there are two peaks. This is often indicative of there being two separate and distinct groups within the data that systematically differ in terms of their outcome values.
Next, let’s look at a bar chart for our sex and smoke variables.
- From the main menu go: Graphs > Bar. Then click the Simple option and Define. Add the variable sex to the Category axis: box, and within the Bars represent area tick the % of cases option and click OK. What do you see? Repeat for the smoke variable.
Sex is perfectly equally distributed, i.e. 50:50, but only 20% of participants are smokers.
Bivariate exploration
First, let’s explore the distribution of sex and smoking status together. This is visualising a cross-tabulation of the two variables.
- From the main menu go: Graphs > Bar. Then click the Simple option and Define. Add the variable sex_smoke to the Category axis: box, and within the Bars represent area tick the % of cases option and click OK. Note: creating such combined variables directly in SPSS is fiddly, so this one was actually created in another software package, but you could do it easily enough in Excel too using the CONCAT function.
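If you prefer to build such a combined variable outside SPSS or Excel, here is a minimal Python sketch of the same concatenation idea. The category codes and values below are illustrative, not taken from the real dataset:

```python
# Build a combined sex-by-smoking variable by joining the two category
# codes with an underscore, mirroring what Excel's CONCAT would do.
sex = ["fm", "fm", "m", "m"]      # illustrative sex codes
smoke = ["no", "yes", "no", "yes"]  # illustrative smoking codes

sex_smoke = [f"{s}_{k}" for s, k in zip(sex, smoke)]
print(sex_smoke)  # ['fm_no', 'fm_yes', 'm_no', 'm_yes']
```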
The female smoker group is half the size of the male smoker group.
Next let’s look at the distribution of the outcome in relation to the four subgroups of interest (female-non-smoker, female-smoker, male-non-smoker, male-smoker). We will use a boxplot to display the median, interquartile range and outliers for systolic blood pressure values in each subgroup.
- From the main menu go: Graphs > Boxplot. Then select the Simple option and click Define. Add the sbp_sex_smoke variable into the Variable: box and the sex_smoke variable into the Category Axis: box and then click OK.
It looks like median systolic blood pressure is lowest and roughly equal for female and male non-smokers and higher for both female and male smokers, but with male smokers having slightly higher median systolic blood pressure than female smokers.
Step 2: run the linear regression model
Sorry there are no video-based instructions.
Written instructions: run the linear regression model
Read/hide
Remember linear regression allows you to look at associations between a numerical outcome variable and any number of numerical or categorical covariates, assuming all the assumptions of the method are met (we’ll check these out shortly). So let’s see how we build, run and estimate our linear regression model in SPSS.
From the main menu go: Analyze > General Linear Model > Univariate.
Next, in the Univariate tool we add our outcome variable sbp_sex_smoke to the Dependent Variable: box. Then we add our covariate of sex_smoke to the Fixed factors: box. We can ignore everything else.
Next, click the Contrasts button and in the Change Contrast area set Contrast: to “Repeated”.
Next, click the Save button and under Predicted Values tick the Unstandardized box, and then under Residuals tick the Unstandardized box, and then under Diagnostics tick the box for Cook’s distance.
Lastly, click the Options button, and then under Display tick the Parameter estimates box and click Continue and then OK. The output window will then pop up with the results, but first…
Step 3: check the assumptions of linear regression
Before we look at the results that appear in the output window we must first check whether we can treat the results as robust and valid. Our results are only potentially valid if the assumptions of the linear regression model have been met/hold, i.e. if they have not been violated. When it comes to interpretation we would of course have to consider all other potential sources of bias. Below we’ll just list the modelling assumptions (see the previous linear regression practical for more details) and how to check them.
1. Continuous outcome
We know by definition the outcome is continuous.
2. Independent observations
We know based on the study design, specifically the sampling design, that we have a simple random sample.
3. Normally distributed residuals
We can again plot a histogram of our residuals to assess whether they are approximately normal.
- From the main menu go: Graphs > Histogram. Then add the RES_1 variable into the Variable: box (note: if you run further models the subsequent residual variables will be automatically named RES_2, RES_3 and so on - we will therefore refer to subsequent residual variables as RES_X, with X being an unknown number depending on how many models you have run), tick the Display normal curve box and click OK.
The residuals appear to be very reasonably approximately normal so we can safely assume this assumption has not been violated in our model.
4. Linearity of the associations between the residuals and the numerical variable(s)
This assumption does not apply to this model as we only have a single categorical covariate. It would only apply when continuous/discrete covariates are involved.
5. Homoscedasticity: constant variance of the residuals across model predicted values
We can again plot the residuals against predicted/fitted values, but as we just have one covariate with four groups we won’t see a cloud of points but instead four separate lines of points. This is because the model only includes one categorical covariate with four groups and so the model can only predict four possible distinct values depending on the group we are predicting a mean outcome for. However, we can still compare whether the spread of the residuals is approximately equal for each group.
- From the main menu go: Graphs > Scatter/Dot, then select Simple Scatter and click Define. Add the RES_1 variable into the Y Axis: box and the PRE_1 variable into the X Axis: box and click OK.
There does not appear to be any clear and substantial difference in the spread of the residuals at each predicted value.
Step 4: consider additional possible issues (before interpreting the results)
No extremely influential observations
Again, we will create a scatterplot of Cook’s D against the observation ID variable (which is just a simple count from the first to the last observation).
- Remember in the main menu we go: Graphs > Scatter/Dot. Then select the Simple Scatter option and click Define. Then add the Cook’s D variable COO_1 to the Y Axis: box and the id variable to the X Axis: box and click OK.
There are no clear outliers, i.e. observations that are much more influential on the results than most observations.
Step 5: understand the results tables and extract the key results
So now that we have verified that the assumptions of the model are not clearly violated, we can finally interpret our results.
The “Between-Subjects Factors” table just shows us the sample size per group for our covariate. We will again ignore the “Tests of Between-Subjects Effects” and focus just on the final table called “Parameter Estimates”. See the first exercise of the previous linear regression practical for more information on what this table contains.
First we will consider the point estimates for the various coefficients. The name/categorical level is given in the first column called “Parameter” and the point estimates of the coefficients are given in the second column called “B”.
Remember that the intercept is the expected mean of the outcome when all covariates = 0. We just have one categorical covariate, and each category group will be represented within the model via separate dummy coded binary (0/1) variables. Therefore, the intercept represents the expected mean of the outcome when the sex_smoke variable = 0. We can work out which level has been coded as 0 (or the reference) easily. Look at the very bottom of the table and it says “a. This parameter is set to zero because it is redundant.” We can then look for the covariate in the “B” column that is 0 and has a superscript “a” next to it. Under the “Parameter” column this is the covariate level m_yes (listed as “sex_smoke=m_yes”). Therefore, the intercept represents the expected mean of the outcome (i.e. systolic blood pressure in units of mmHg) for male smokers.
Note, for string-coded categorical variables, SPSS by default sets the reference category to the level whose name comes last alphabetically. So if you have levels called “a” and “b” then “b” would be the reference. As we just have the one categorical covariate, the other coefficients therefore represent the difference between the expected mean of the outcome for each group compared to the reference group, which again is m_yes, i.e. male smokers.
For example, we can see that male non-smokers have an expected mean systolic blood pressure that is 19.4 mmHg lower (coefficient = -19.4) than the expected mean systolic blood pressure for male smokers (best point estimate), and the 95% confidence interval tells us that the likely value for this mean difference in the target population is between -21.3 and -17.5 mmHg (and if we want to frame this in terms of a null-hypothesis significance test then we can see the two-sided p-value for the null hypothesis that the coefficient = 0 under the “Sig.” column, which is <0.001).
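To see why the coefficients work out this way, here is a minimal Python sketch showing that, with a single categorical covariate, the intercept equals the reference group’s mean and each coefficient equals a group mean minus that reference mean. The SBP values are made up purely for illustration, not taken from the dataset:

```python
from statistics import mean

# Made-up SBP values (mmHg) per sex-by-smoking subgroup.
groups = {
    "fm_no": [118.0, 122.0, 120.0],
    "fm_yes": [133.0, 137.0],
    "m_no": [119.0, 121.0],
    "m_yes": [139.0, 141.0, 140.0],
}

ref = "m_yes"  # reference level, as SPSS chose in this exercise
intercept = mean(groups[ref])  # expected mean SBP for male smokers
coefs = {g: mean(v) - intercept for g, v in groups.items() if g != ref}

print(round(intercept, 1))  # 140.0
print({g: round(b, 1) for g, b in coefs.items()})
# Negative coefficients mean those groups' means sit below the reference.
```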
Then if we want to explore comparisons to groups other than the reference group we can look below at the “Contrast Results (K Matrix)” table. Here the first column refers to the contrast or comparison of interest in terms of levels. These are as per the “Parameter Estimates” table, so level 1 = fm_no, level 2 = fm_yes, level 3 = m_no and level 4 = m_yes (the reference level for the “Parameter Estimates” table). So we can see that the final comparison of level 3 to level 4 matches that seen in the “Parameter Estimates” table for level 3 (m_no). As always, if you want a comparison the other way around just flip the signs of the point estimate and confidence limits, and swap which limit is the lower and which is the upper bound.
Practical importance
As always, when it comes to making practical interpretations we need to consider whether the size and direction of plausible associations would be important clinically or for public health considerations.
Exercise 2: to explore how we can describe continuous by categorical interactions, we will describe how the population-level association between age (years) and systolic blood pressure (mmHg - our outcome) varies by sex (female/male) using linear regression. Note, you could equally frame this question as looking at how the association between sex and systolic blood pressure varies by age. It’s “two sides of the same coin”.
Step 1: explore the data
Univariate
First, let’s explore the distribution of the outcome via a histogram.
- From the main menu go: Graphs > Histograms. Add sbp_age_sex into the Variable: box, tick the Display normal curve box and click OK.
The overall distribution appears to be approximately normal but potentially there is some multi-modality, i.e. more than one peak/mean.
Next, let’s explore the distribution of age via a histogram.
- From the main menu go: Graphs > Histograms. Add age into the Variable: box, tick the Display normal curve box and click OK.
It’s hard to interpret but age appears more uniformly distributed, or perhaps a complex mix of a few peaks. Remember that linear regression can model continuous/discrete covariates of any distribution; it’s the distribution of the model residuals that must be approximately normal. It’s always good if the sample size is reasonably equally distributed across the range of any continuous/discrete covariates, as this gives the model more information to estimate the association more accurately.
We’ve already explored the distribution of the sex variable, so no need to do so again.
Bivariate
Let’s visualise the distribution of systolic blood pressure values by age and sex.
- From the main menu go: Graphs > Scatter/Dot. Then ensure Simple Scatter is selected and click OK. Then add the sbp_age_sex variable into the Y Axis: box, the age variable into the X Axis: box, and add the sex variable into the Set Markers By: box (which will colour points by sex), and then click OK.
It looks like there is a linear association between age and systolic blood pressure for both sexes, but the association is steeper for men.
Step 2: run the regression model
Sorry there are no video-based instructions.
Written instructions: run the linear regression model
Read/hide
From the main menu go: Analyze > General Linear Model > Univariate.
Next, in the Univariate tool we add our outcome variable sbp_age_sex to the Dependent Variable: box. Then this time we add our covariate of age to the Covariate(s): box and our covariate of sex to the Fixed factors: box.
Next, click the Model button and in the Specify Model area click the Build terms: button. Then in the Build Term(s): area under Type: click on the drop-down menu and choose Main effects. Then click on sex and drag it into the Model: area on the right and release the click (or you can also click the blue, right-pointing arrow below the drop-down menu to add it). Repeat for the age covariate. Both covariates should now be in the Model: area. Now click on the same drop-down menu again and choose Interaction. Then click on sex, hold shift, then click on age (so they are both highlighted) and, while still holding the left mouse button down, drag them both into the Model: area and release. Under the age and sex covariates you should now see the interaction term “age*sex”. Now click Continue at the bottom of the tool.
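Conceptually, adding the age*sex interaction just adds a product column to the model’s design matrix: a dummy (0/1) column for sex plus that dummy multiplied by age. A minimal Python sketch, with illustrative rows rather than the real data:

```python
# Sketch of the design matrix the GLM builds for a continuous-by-
# categorical interaction. With male as the reference, fm is coded 1.
rows = [(25, "fm"), (40, "m"), (60, "fm")]  # (age, sex), illustrative

design = []
for age, sex in rows:
    d = 1 if sex == "fm" else 0           # dummy code for sex
    design.append([1, age, d, age * d])   # intercept, age, sex, age*sex

print(design)  # [[1, 25, 1, 25], [1, 40, 0, 0], [1, 60, 1, 60]]
```

For male rows the interaction column is 0, which is why the age coefficient alone gives the male slope.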
Next, click the Save button and under Predicted Values tick the Unstandardized box, and then under Residuals tick the Unstandardized box, and then under Diagnostics tick the box for Cook’s distance.
Lastly, click the Options button, and then under Display tick the Parameter estimates box and click Continue and then OK. The output window will then pop up with the results, but first…
Step 3: check the assumptions
We would again need to verify all the assumptions. We only provide details where things are slightly different from the last exercise.
1. Continuous outcome; 2. Independent observations; 3. Normally distributed residuals
4. Linearity of the associations between the residuals and the numerical variable(s)
As we now have a continuous/discrete covariate we would need to plot the residuals against the age covariate to check there are no trends. We can do this using a scatter plot.
- From the main menu go: Graphs > Scatter/Dot, then select Simple Scatter and click Define. Add the RES_X variable into the Y Axis: box and the age variable into the X Axis: box and click OK.
There are no clear trends.
5. Homoscedasticity: constant variance of the residuals across model predicted values
Step 4: consider additional possible issues
No extremely influential observations
These can be checked as before, but there are again no concerns.
Step 5: understand the results tables and extract the key results
Again, we will focus on the “Parameter estimates” table. For our categorical covariate of sex we can see that the male level has been set as the reference level. Again, we can determine this by looking for which level = 0 and has the superscript “a” next to it, which links to the statement “The parameter is set to zero because it is redundant” at the bottom of the table.
Therefore, the intercept now has no real-world meaning because it represents the expected mean outcome when all covariates are 0, which here is for males aged 0. We could “centre” age so it is the difference from the mean of age and then the intercept would represent the expected mean outcome for males at the mean age (across all participants), but we are interested in the association between age and systolic blood pressure and how it varies by sex, so there is no benefit.
As we included an interaction between age and sex in the model the coefficient for age represents the slope (or more precisely the expected change in the mean of the outcome for every 1-unit increase in the covariate, i.e. for every year older) when sex = 0, which here represents males. We can therefore see that the model predicts that for every year older a male is their mean systolic blood pressure is expected to increase by 0.51 mmHg. The coefficient named “[sex=fm]*age” can then be interpreted as the model-predicted difference in the association (slope) between age and systolic blood pressure in females compared to the corresponding association (slope) in men. We can therefore see that the association (slope) between age and systolic blood pressure in females is 0.29 mmHg lower (because the coefficient is negative - the best point estimate) than the corresponding association (slope) in males, and the likely value for this difference in the target population is between -0.34 and -0.24 (with the p-value for the null-hypothesis that this difference in slopes = 0 given in the “Sig.” column).
Another way to understand this is that the model predicts that the association (slope) between age and systolic blood pressure in males is 0.51 while in females it is 0.51 + (-0.29) = 0.22, i.e. very roughly half as steep. If we wanted to get the predicted association (slope) for females and its 95% confidence interval we could just recode the sex variable so that the female level was set as the reference level and re-run the model. Again, SPSS sets the reference to the level whose name comes last alphabetically, so you could e.g. recode it as “z_female” to ensure that female becomes the reference level. Then the age coefficient would represent the association (slope) for females.
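The slope arithmetic above can be sketched directly, using the coefficients reported in the output:

```python
# Recover the female age slope from the reported model coefficients.
b_age = 0.51   # age slope for the reference level (males)
b_int = -0.29  # [sex=fm]*age interaction coefficient

female_slope = b_age + b_int
print(round(female_slope, 2))  # 0.22 mmHg per year - roughly half the male slope
```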
Practical importance
As always, when it comes to making practical interpretations we need to consider whether the size and direction of plausible associations would be important clinically or for public health considerations.
Exercise 3: to explore how we can describe continuous by continuous interactions, we will describe how the population-level association between age (years) and systolic blood pressure (mmHg - our outcome) varies by BMI (kg/m2), or equivalently how the population-level association between BMI (kg/m2) and systolic blood pressure (mmHg - our outcome) varies by age (years), using linear regression. Again, this is “two sides of the same coin”.
Step 1: explore the data
Univariate
First, let’s explore the distribution of the outcome via a histogram.
- From the main menu go: Graphs > Histograms. Add sbp_age_bmi into the Variable: box, tick the Display normal curve box and click OK.
The overall distribution appears to be approximately normal but with a little right-skew.
Next, let’s explore the distribution of age and BMI via a histogram.
- From the main menu go: Graphs > Histograms. Add bmi into the Variable: box, tick the Display normal curve box and click OK.
BMI appears fairly uniformly distributed.
We’ve already explored the distribution of the age variable, so no need to do so again.
Bivariate
Let’s visualise the distribution of systolic blood pressure values by BMI. We’ve already explored the distribution of systolic blood pressure by age, so no need to do so again.
- From the main menu go: Graphs > Scatter/Dot. Then ensure Simple Scatter is selected and click OK. Then add the sbp_age_bmi variable into the Y Axis: box, the bmi variable into the X Axis: box and then click OK.
It looks like there is a linear association between BMI and systolic blood pressure.
Step 2: run the regression model
Sorry there are no video-based instructions.
Written instructions: run the linear regression model
Read/hide
From the main menu go: Analyze > General Linear Model > Univariate.
Next, in the Univariate tool we add our outcome variable sbp_age_bmi to the Dependent Variable: box. Then this time we add our covariates of age and bmi to the Covariate(s): box.
Next, click the Model button and in the Specify Model area click the Build terms: button. Then in the Build Term(s): area under Type: click on the drop-down menu and choose Main effects. Then click on age and drag it into the Model: area on the right and release the click (or you can also click the blue, right-pointing arrow below the drop-down menu to add them). Repeat for the bmi covariate. Both covariates should now be in the Model: area. Now click on the same drop-down menu again and choose Interaction. Then click on age, hold shift, then click on bmi (so they are both highlighted) and, while still holding the left-button on the mouse down, drag them both into the Model: area and release. Under the age and bmi covariates you should now see the interaction term “age*bmi”. Now click Continue at the bottom of the tool.
Next, click the Save button and under Predicted Values tick the Unstandardized box, and then under Residuals tick the Unstandardized box, and then under Diagnostics tick the box for Cook’s distance.
Lastly, click the Options button, and then under Display tick the Parameter estimates box and click Continue and then OK. The output window will then pop up with the results, but first…
Step 3: check the assumptions
We would again need to verify all the assumptions. We only provide details where things are slightly different from the last exercise.
1. Continuous outcome; 2. Independent observations; 3. Normally distributed residuals
4. Linearity of the associations between the residuals and the numerical variable(s)
As we now have two continuous/discrete covariates we would need to plot the residuals against both to check there are no trends. We can do this using a scatter plot.
- From the main menu go: Graphs > Scatter/Dot, then select Simple Scatter and click Define. Add the RES_X variable into the Y Axis: box and the age variable into the X Axis: box and click OK. Repeat for bmi.
There are no clear trends.
5. Homoscedasticity: constant variance of the residuals across model predicted values
Step 4: consider additional possible issues
No extremely influential observations
These can be checked as before, but there are again no concerns.
Step 5: understand the results tables and extract the key results
Again, we will focus on the “Parameter estimates” table. The intercept now represents the expected mean of the outcome when both age and BMI = 0, so clearly it has no real-world interpretation (but again, we could centre both covariates and then it would be interpretable).
The bmi and age coefficients then represent the model-predicted association (slope) between BMI and systolic blood pressure when age = 0 and between age and systolic blood pressure when BMI = 0, so similarly they are not very meaningful.
The “bmi * age” coefficient represents either the difference in the association (slope) between BMI and systolic blood pressure for a 1-unit increase in age, or equivalently the difference in the association (slope) between age and systolic blood pressure for a 1-unit increase in BMI. Therefore, we can see that the association (slope) between age and systolic blood pressure becomes more steeply positive by 0.031 mmHg for every 1-unit increase in BMI (best point estimate), with the likely value for this increase in the target population being between 0.027 and 0.035; equivalently, the association (slope) between BMI and systolic blood pressure becomes more steeply positive by 0.031 mmHg for every 1-year increase in age, with the same confidence interval.
It is usually most helpful to interpret one or both associations at key values of the other covariate. For example, the association (slope) between age and systolic blood pressure at key values of BMI. This can be done by using the model to predict expected mean values of the outcome (and confidence intervals for those values) for representative values of age and key values of BMI (or vice versa). However, in SPSS this is quite tricky and time consuming, so we won’t look at this further in this session.
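As a sketch of what “the age slope at key values of BMI” means, the slope at a given BMI is just the age main effect plus the interaction coefficient times that BMI. Note that b_age below (the slope when BMI = 0) is a hypothetical value for illustration, not taken from the SPSS output; only the 0.031 interaction coefficient is from the results:

```python
# Evaluate the age slope at key BMI values.
b_age = 0.10   # HYPOTHETICAL age slope at BMI = 0 (not from the output)
b_int = 0.031  # age*bmi interaction coefficient from the output

for bmi in (20, 25, 30, 35):
    slope = b_age + b_int * bmi  # age slope (mmHg/year) at this BMI
    print(bmi, round(slope, 3))
```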
Practical importance
As always, when it comes to making practical interpretations we need to consider whether the size and direction of plausible associations would be important clinically or for public health considerations.
Exercise 4: to explore how we can describe non-linear associations between continuous/discrete covariates and continuous outcomes, we will describe the population-level non-linear association between age (years) and systolic blood pressure (mmHg - our outcome).
Step 1: explore the data
Univariate
First, let’s explore the distribution of the outcome via a histogram.
- From the main menu go: Graphs > Histograms. Add sbp_age2 into the Variable: box, tick the Display normal curve box and click OK.
The overall distribution appears to be approximately normal.
We’ve already explored the distribution of age previously, which is approximately uniform.
Bivariate
Let’s visualise the distribution of systolic blood pressure values by age, which will now be non-linear by design for this practical.
- From the main menu go: Graphs > Scatter/Dot. Then ensure Simple Scatter is selected and click OK. Then add the sbp_age2 variable into the Y Axis: box, the age variable into the X Axis: box and then click OK.
It’s hard to see, but there is a hint of a slight inverted-U shape to the trend in the points.
Step 2: run the regression model
Sorry there are no video-based instructions.
Written instructions: run the linear regression model
Read/hide
- First we need to create our age squared variable. From the main menu go: Transform > Compute Variable. Then in the Target Variable: box add a name for our age squared variable. I will use age2. Then in the Numeric Expression: box add age ** 2. The “**” indicates you are raising the variable age to a power, in this case 2, i.e. squaring it. Then click OK.
Now we can run the model.
From the main menu go: Analyze > General Linear Model > Univariate.
Next, in the Univariate tool we add our outcome variable sbp_age2 to the Dependent Variable: box. Then this time we add our covariates of age and age2 to the Covariate(s): box.
Next, click the Model button and in the Specify Model area click the Build terms: button. Then in the Build Term(s): area under Type: click on the drop-down menu and choose Main effects. Then click on age and drag it into the Model: area on the right and release the click (or you can also click the blue, right-pointing arrow below the drop-down menu to add them). Repeat for the age2 covariate.
Next, click the Save button and under Predicted Values tick the Unstandardized and Standard Error boxes, and then under Residuals tick the Unstandardized box, and then under Diagnostics tick the box for Cook’s distance.
Lastly, click the Options button, and then under Display tick the Parameter estimates box and click Continue and then OK. The output window will then pop up with the results, but first…
Step 3: check the assumptions
We would again need to verify all the assumptions. We only provide details where things are slightly different from the last exercise.
1. Continuous outcome; 2. Independent observations; 3. Normally distributed residuals
4. Linearity of the associations between the residuals and the numerical variable(s)
Note: even if there is a non-linear trend in the association between age and systolic blood pressure, if our model adequately captures this via the quadratic term then the residuals should show no trends, as they should be approximately evenly distributed around the model-predicted mean (which itself will have a non-linear association with age). Let’s check.
- From the main menu go: Graphs > Scatter/Dot, then select Simple Scatter and click Define. Add the RES_X variable into the Y Axis: box and the age variable into the X Axis: box and click OK.
There are no clear trends.
5. Homoscedasticity: constant variance of the residuals across model predicted values
Step 4: consider additional possible issues
No extremely influential observations
These can be checked as before, but there are again no concerns.
Step 5: understand the results tables and extract the key results
Again, we will focus on the “Parameter estimates” table. The intercept now represents the expected mean of the outcome when both age and age2 are 0, so clearly it has no real-world interpretation (but again, we could centre both covariates and then it would be interpretable).
We cannot usefully interpret the coefficient for age on its own because its meaning depends on the age2 term, i.e. the association is non-linear and cannot be described by a single slope value. Therefore, typically we would compute predicted values for a range of values of our covariate using the model (along with their confidence intervals) and plot those to visualise the non-linear association and allow inference for practical interpretation.
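As a sketch of what such a plot of predicted values shows, here is a Python example tracing a quadratic over a range of ages. The coefficients b0, b1 and b2 are hypothetical, NOT the values from the SPSS output:

```python
# Trace a fitted quadratic: predicted SBP = b0 + b1*age + b2*age^2.
b0, b1, b2 = 100.0, 1.2, -0.01  # HYPOTHETICAL intercept, age, age-squared

preds = [(age, b0 + b1 * age + b2 * age ** 2) for age in range(20, 81, 10)]
for age, p in preds:
    print(age, round(p, 1))
# A negative age-squared coefficient gives an inverted-U curve,
# peaking at age = -b1 / (2 * b2), i.e. around 60 here.
```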
It is again not straightforward to do this in SPSS, but it’s not too complicated to at least plot the predicted outcome values from the model against the observed covariate values, so let’s do that.
- From the main menu go: Graphs > Scatter/Dot, then select Simple Scatter and click Define. Add the PRE_X variable into the Y Axis: box (where X may be 4 if you’ve made one model per exercise but it may be higher - use the highest number as that will be the most recent predicted values variable) and the age variable into the X Axis: box and click OK.
Remember to pay attention to the axes. Without further editing SPSS uses default axis ranges, and a y-axis restricted to the observed range will make any curve appear more curved than it would from a more honest perspective looking across a wider range of possible outcome values.
Practical importance
As always, when it comes to making practical interpretations we need to consider whether the size and direction of plausible associations would be important clinically or for public health considerations.