Multiple binary logistic regression
In this practical we will look at using logistic regression to estimate relationships between any type of independent variable and a binary outcome.
Rationale
Overview
Like multiple linear regression, multiple binary logistic regression allows us to analyse the relationships between one or more continuous and/or categorical independent variables and an outcome, but unlike multiple linear regression the outcome in multiple binary logistic regression must be binary. As with linear regression, “simple logistic regression” may be used to refer to logistic regression with one independent variable only, while “multiple logistic regression” refers to models including more than one independent variable, as would be the case for any analysis aiming at causal inference by adjusting for confounding. There is no qualitative difference between them, however, and we’ll just refer to “logistic regression” from here on.
Also, as touched on in the lecture, there are different versions or extensions of binary logistic regression, such as multinomial logistic regression for categorical outcomes with any number of category levels, but as these are more advanced we will not be looking at them further. Because binary logistic regression is far more commonly used than any other form of logistic regression we’ll simply refer to it as logistic regression from here on, which is common practice (i.e. if you see “logistic regression” in the literature you can usually assume it refers to binary logistic regression).
You can actually analyse binary outcomes using linear regression, and you will sometimes see such analyses in the literature (sometimes called a “linear probability model”), and you can often get reasonably useful (i.e. reasonably accurate/unbiased) results. However, this is generally not recommended because of the often substantial disadvantages and problems with this approach. For example, the model predictions may range outside 0-1 (the only values your outcome can take in reality when viewed as a probability of the event occurring), and your inferences are likely to be biased because the assumptions of linear regression cannot be met in such a situation, producing biased confidence intervals and p-values. Although logistic regression is more complex mathematically, the basic idea of creating a linear model (where the terms are additive) with numerical or categorical independent variables is the same, although the interpretation of the terms differs substantially.
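To make the contrast concrete, here is a sketch (our notation, not from the lecture) of the two model forms for a single independent variable x, writing p for the probability that the outcome equals 1:

Linear probability model: $p = \beta_0 + \beta_1 x$ (the right-hand side is unbounded, so predicted probabilities can fall outside 0 and 1)

Logistic model: $\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$, equivalently $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$ (predicted probabilities always lie between 0 and 1)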
Practice
Scenario
We wish to describe important associations between key socio-demographic and relevant health-related characteristics and the probability that an individual has hypertension, using the data collected in the SBP final dataset. As these are descriptive research questions we will use repeated logistic regression models in which we model the relationship between the outcome and each characteristic (independent variable) of interest without adjusting for any other independent variables. This will produce unadjusted or crude associations that reflect the relationships as they appear, without any assumption that they reflect causal relationships. One or more of these associations may of course reflect causal relationships, but it’s unlikely they will be good (accurate) estimates of causal effects: this is an observational study, so adjusting for the inevitable confounding effectively, without making potentially worse mistakes such as introducing collider bias, is a huge challenge that is beyond the scope of this module. See the materials in the linear regression lecture on Minerva for an introduction to the complex world of causal inference using observational studies/data.
Exercise 1: describe the population-level relationship between socio-economic status and the probability of having hypertension using logistic regression
- Load the “SBP data final.sav” dataset.
Step 1: explore the data
Written instructions: explore the data for a logistic regression
The same basic reasoning applies to our data exploration process as with the linear regression modelling process, so refer back to the “Step 1: explore the data” section in the linear regression practical if you want a reminder of the theory/justification for the following data exploration process.
In practical terms though we would examine the distribution of each variable on its own using the same methods discussed in the linear modelling practical, but when examining the relationships between the outcome and each independent variable we would have to use different graphical approaches.
Categorical independent variables
For exploring the relationship between categorical independent variables like socio-economic status and binary outcomes we can just use a normal barchart where the Y-axis is the proportion of the outcome in each category level on the X-axis. Let’s see how to do this for socio-economic status:
- From the main menu go: Graphs > Chart Builder. If an information box appears saying “Before you use this dialogue…” just click OK. Now in the Gallery tab at the middle left click on Bar and then double click on the top leftmost picture of a barchart (Simple Bar) to add it to the plot. Next drag the htn variable from the Variables: box onto the Y-Axis? dotted box on the chart image. Similarly, drag the ses variable onto the X-Axis? box. Look to the right of the tool window at the Element Properties tab in the Statistics box. You should see a drop down menu under Statistic:. This should already read “Mean”, but if not click on it and select Mean. This tells SPSS to calculate the mean of the Y-axis variable, htn, for each level of the X-axis variable. As htn is binary and coded 1 and 0, the mean will be the proportion of 1s, i.e. the proportion of individuals with hypertension. Now just click OK.
What can you see?
What does the barchart of socio-economic status vs proportion of individuals with hypertension show?
In a real analysis you would explore all bivariate relationships this way first but we’ll move onto quantifying the relationship via logistic regression now.
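If you prefer to check this kind of chart in code, here is a minimal Python sketch of the same idea (an illustration only, not part of the SPSS practical); it assumes the dataset file name and the variable names htn (coded 0/1) and ses (coded 1/2/3) described above, and reading .sav files requires the pyreadstat package:

import pandas as pd
import matplotlib.pyplot as plt

# Keep the numeric codes (0/1 for htn, 1/2/3 for ses) rather than the value labels
df = pd.read_spss("SBP data final.sav", convert_categoricals=False)

# The mean of a 0/1 variable within each ses level is the proportion with hypertension
prop_htn = df.groupby("ses")["htn"].mean()
prop_htn.plot(kind="bar")
plt.ylabel("Proportion with hypertension")
plt.show()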
Cell frequencies
Unlike with linear regression there is one more data exploration step that we must do. When looking at relationships for a logistic regression model an important issue to look out for is so-called “sparse data”, which means we have small (or empty) “cell sizes”: few (or no) observations of one of the levels of our binary outcome (e.g. either htn = yes or htn = no) within one or more levels of one or more categorical variables (e.g. if, say, all socio-economic status = low individuals were classed as htn = yes). This is explained in more detail in the “Additional potential problems” section below, but briefly: if there are zero observations of one outcome level within one or more independent categorical variable levels the model will fail; if there are just a few you may see very large, biased coefficients and confidence intervals. How few is few? There’s no set number, but <5 is often cited as a cause for concern.
We can explore this issue using cross tabulations of the outcome variable against the categorical independent variables.
- Go: Analyze > Descriptive Statistics > Crosstabs. Add htn to the Row(s): box and ses to the Col(s): box then click OK.
In the table that appears look at each cross-tabulation “cell” to see the number of observations in that cell, i.e. each possible combination of the independent categorical variable’s levels and the outcome’s levels, which for ses vs htn would be 1) low-yes, 2) low-no, 3) medium-yes, 4) medium-no, 5) high-yes and 6) high-no. What can you see?
What does the cross-tabulation show?
There appear to be reasonable numbers of observations of both levels of the outcome for all levels of ses. See the “Additional potential problems” section for more discussion of the possible problems you can face if the number of observations in a cell is too low.
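For reference, the same cell-count check could be sketched in Python (illustrative only; the same assumptions about file and variable names as before):

import pandas as pd

df = pd.read_spss("SBP data final.sav", convert_categoricals=False)
# Rows are the outcome levels (htn = 0/1), columns are the ses levels (1/2/3);
# look for any cell with zero or very few (<5) observations
print(pd.crosstab(df["htn"], df["ses"]))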
Step 2: run the logistic regression model
Written instructions: run the logistic regression model
We will use SPSS’s Generalized Linear Model tool to build our logistic regression model. The generalised linear modelling framework is a more comprehensive and coherent way of (mathematically) representing the full range of possible regression type models, of which linear regression and logistic regression are specific types. SPSS somewhat confusingly and redundantly also has a Binary Logistic tool under the Regression menu, which will also create a logistic regression model, but like the Linear (Regression) tool in the Regression menu it is also more restrictive because it won’t accept categorical variables without us first converting them to separate dummy variables, and it also does not offer all the options available in the Generalized Linear Model tool.
From the menu go: Analyze > Generalized Linear Models. This will open the Generalized Linear Models tool which has multiple tabs along the top of the window.
In the first tab Type of Model look for the “Binary Response or Events/Trials Data” set of options and select Binary logistic. This tells SPSS we want to create a logistic regression model from among the many other possible generalised linear models (note: linear regression is also a specific type of generalised linear model, and could be run from this tool via the “Scale Response” Linear option).
Then click on the Response tab and add the htn variable to the Dependent Variable: box. Then click on the Reference Category… button and under “Reference Category” select First (lowest value), which sets the lowest value in the htn variable, 0, as the reference category (i.e. htn = no as the reference category) and click Continue (without recoding the outcome variable by changing this option we could make our results refer to the likelihood of not having hypertension if this suited our needs better).
Then click on the Predictors tab and add the categorical variable ses into the Factors: box.
Then click on the Model tab and highlight the ses variable in the Factors and Covariates: box. Then under the “Build Term(s)” options ensure the Type: is set to Main effects (this should be the default) and click the right facing arrow to add the variables into the Model: box. This just tells SPSS we want to have the independent variable as part of the model. If we don’t do this we just get an “intercept-only” model, which effectively just gives us the mean of the outcome.
We will ignore the Estimation tab and just accept its defaults (this is where you can alter how the model is estimated).
Next click on the Statistics tab and in the “Print” options tick the Include exponential parameter estimates box. Then untick the Model information, Goodness of fit statistics and Model summary statistics boxes. These are all results that relate to, or are computed from, the model but are not necessarily of any use for our purposes of describing associations.
We will ignore the EM Means tab and just accept its defaults (this is where you can tell SPSS to calculate certain predicted outcomes given certain contrasts between categorical variable levels other than those displayed by default via dummy coding, i.e. where each level is compared to a reference level).
Next click on the Save tab and tick the Cook’s distance option to save a variable of the Cook’s distance values for each observation in the dataset. We’ll come back to this, but it provides a measure of influence of each observation on the values of the regression coefficients.
Finally click OK.
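If you ever want to check or reproduce this model outside SPSS, a minimal Python (statsmodels) sketch of an equivalent logistic regression is shown below. It is illustrative only and assumes the same file and variable names as before, with ses coded 1/2/3 so that level 3 (high) can be set as the reference, mirroring the SPSS setup:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("SBP data final.sav", convert_categoricals=False)

# Logistic regression of htn (0/1) on ses, treating ses as categorical
# with level 3 (high) as the reference category
model = smf.logit("htn ~ C(ses, Treatment(reference=3))", data=df)
res = model.fit()

print(res.summary())            # coefficients on the log-odds scale (the "B" column)
print(np.exp(res.params))       # exponentiated coefficients, i.e. Exp(B) / odds ratios
print(np.exp(res.conf_int()))   # 95% confidence intervals on the odds ratio scale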
As with the linear regression practical, before interpreting the results we should check the assumptions of the model are not violated, and if they are make any necessary changes before re-running the model.
Step 3: check the assumptions of the logistic regression
Compared with linear regression, logistic regression has fewer assumptions, and they are not all easily checked. Specifically, logistic regression no longer assumes linearity in the relationship between the independent variables and the outcome, nor does it assume normality or homoscedasticity of the residuals. However, it does have some assumptions, as discussed below.
1. Binary outcome
Self-explanatory! However, note that we can code the variable either way around depending on whether we want our results in relation to the likelihood of observing a given “event” or the “non-event”. For example, for our hypertension variable we could obtain results in relation to the likelihood of having hypertension or of not having hypertension by coding the outcome as 1 = hypertension or 1 = no hypertension respectively, or equivalently we can use the relevant SPSS model building options, as explained in the “Step 2: run the logistic regression model” section, to change the reference level.
2. Independent observations
As with linear regression logistic regression assumes observations are independent of one another, which we can assume for our study data here as it came from a simple random (cross-sectional) sample. However, if our study design implies this is not the case then we need to either use a method that can cope with this (beyond the scope of this course) or modify our data.
The alternatives aren’t exactly the same as for linear regression though. For longitudinal (multi-time point) data we can again use logistic regression if we just use data from one time point. However, unlike with linear regression we cannot calculate a change value between two time points and analyse that with logistic regression because outcomes of 0-0 and 1-1 at two time points will both give us the same change score, but mean very different things. Similarly if we have nested or clustered observations, such as patients within clinics, we cannot calculate summary outcome values for each cluster, such as means, and use logistic regression as the values will no longer be binary. We may however calculate the proportion of events per cluster and, assuming the assumptions are met, use linear regression to analyse this cluster-level summary (proportion) outcome. Like with linear regression there are multi-level logistic regression models that can explicitly model clustered (or multi-level) data.
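As a quick illustration of that last point (using a made-up clustering variable, clinic, which is not in our dataset), computing a cluster-level proportion outcome might look like this in Python:

import pandas as pd

# Hypothetical clustered data: 'clinic' is an assumed clustering variable
df = pd.DataFrame({"clinic": [1, 1, 1, 2, 2, 3, 3, 3],
                   "htn":    [0, 1, 1, 0, 0, 1, 0, 1]})

# Proportion with the event per cluster: a continuous outcome (one row per cluster)
# that could then be analysed with linear regression if its assumptions are met
cluster_props = df.groupby("clinic")["htn"].mean()
print(cluster_props)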
3. Linearity of relationships between the log-odds and numerical independent variables
This is similar to the linear regression linearity assumption, but differs in a key way. With logistic regression the model estimates the predicted log-odds of the outcome rather than the actual value of the outcome (0/1), also called the logit transformation of the outcome, for each observation (e.g. individual), given/conditional on the values of the independent variable(s) for that observation. The logistic regression model assumes that there is a linear relationship between any numerical independent variable(s) and the log-odds of the outcome variable. In theory we can test this by looking at the relationship between the residuals and the predicted/fitted values or numerical independent variables, but the resulting plots can be difficult to interpret usefully as you can still get curved patterns but due to the nature of logistic regression not have a model that has violated this assumption. See the following for more discussion:
https://stats.stackexchange.com/questions/121490/interpretation-of-plot-glm-model
https://stats.stackexchange.com/questions/45050/diagnostics-for-logistic-regression
Hence, we won’t produce such plots here and instead recommend that you think very carefully about the model you are building in terms of whether you have allowed for any strong/important non-linear relationships and/or interactions that evidence suggests may be present.
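For reference, written as an equation the assumption is that the log-odds (logit) of the outcome is a linear, additive function of the numerical independent variables, i.e. for observation i with probability p_i of the outcome being 1:

$\ln\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}$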
Step 4: consider additional problems
1. Adequate sample size
Although you should always consider your sample size when conducting any statistical analysis, logistic regression requires more data than a hypothetically equivalent linear regression to achieve the same level of precision in its estimates, because there is less statistical information in an outcome that can only take two values. There are various rules of thumb. A common one is having at least 10 observations (e.g. individuals) with the least frequent outcome level (i.e. either having the event or not having it, whichever is less common in the sample) for each independent variable in your model. For example, suppose you are including three independent variables in your model to adjust for confounding (e.g. with a goal of estimating a causal relationship). If the overall proportion of your least frequent outcome, such as not having hypertension, is 0.1, then you would need a minimum sample size of (10 x 3) / 0.1 = 300. All such rules are just rough guides, and having less data doesn’t mean you can’t build a model, but they give you a good idea of the likely amount of data you need for decent precision. If you have a much smaller sample size than these guides suggest you’ll typically find you have very wide confidence intervals for all your coefficients, making meaningful inference difficult, and it increases your chances of getting “sparse data” or “complete separation” (see next).
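Written out, that rule of thumb is (our notation):

$n_{\min} \approx \frac{10 \times k}{p_{\text{least frequent outcome}}}$

where $k$ is the number of independent variables and $p_{\text{least frequent outcome}}$ is the overall proportion of the less common outcome level, e.g. $\frac{10 \times 3}{0.1} = 300$ in the example above.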
2. No complete separation/sparse data
Complete separation refers to the situation where one or more levels of one or more categorical variables have 100% 1s or 100% 0s observed for the outcome (e.g. every male had hypertension, or every male did not have hypertension). In such a situation a standard logistic regression model cannot estimate the coefficient or standard error for that categorical variable level because there is no statistical variation in the outcome. The solutions are: 1) remove the entire variable, or 2) if possible, recode the variable so that you “collapse” or merge two or more category levels together so that all levels have at least one (ideally more) of each possible outcome value (i.e. a 1 or a 0); the newly merged category levels must make logical sense for this to be a viable solution.
Sparse data refers to the situation where there are very few 1s or 0s observed in the outcome variable within one or more levels of one or more categorical variables (these combinations are also referred to in this context as “cells”, e.g. the male and female cells for the variable sex). In such a situation the estimated coefficients and/or standard errors for that categorical variable level in a standard logistic regression model will often be very large and biased.
In the data exploration stage we checked the cell counts for each level of the outcome against ses and found them to be adequate (in this dataset there are, for example, only 3 htn = yes observations in the ace = yes cell, so ace would need care if it were included in a model). There are rules of thumb for judging whether a cell has sparse data, but really it’s often easiest to just run the model and look at the results: if problems have occurred you will see extremely large (positive or negative) coefficient(s) and standard errors/confidence intervals. You can also check the cell counts for the other categorical variables such as sex if you wish, and you should do this for all categorical variables in a real data analysis.
3. No extremely influential observations
Lastly we again need to check that there are no observations that have an excessively influential effect on the results of the logistic regression model. We can again use a version of the Cook’s distance statistic, adapted for logistic regression, to check which observation(s), if removed, would change the coefficients substantially. When we ran the logistic regression via the Generalized Linear Model tool we told SPSS to calculate and save the Cook’s distance values for each observation in a new variable, so as with the linear regression let’s graph these against the observation ID. Go: Graphs > Legacy Dialogues > Scatter/Dot, select the Simple Scatter option and click Define. Add the CooksDistance variable into the Y Axis: box and the id variable into the X Axis: box and click OK.
What do you see?
What do you see on the Cook’s Distance plot?
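If you later want to reproduce this check outside SPSS, a rough Python (statsmodels) sketch is below; it is illustrative only, assumes the same file and variable names as before, and relies on a reasonably recent statsmodels version (the influence attribute names may differ in older versions):

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_spss("SBP data final.sav", convert_categoricals=False)

# Fit the same logistic regression via the GLM interface (binomial family, logit link)
res = smf.glm("htn ~ C(ses, Treatment(reference=3))", data=df,
              family=sm.families.Binomial()).fit()

# Cook's distance for each observation (first element of the returned tuple holds the distances)
# Assumes no missing data; otherwise align ids with the rows actually used in the model
influence = res.get_influence()
cooks = influence.cooks_distance[0]

plt.scatter(df["id"], cooks)
plt.xlabel("id")
plt.ylabel("Cook's distance")
plt.show()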
Step 5: understand the results tables and extract the key results
As we are now satisfied that the assumptions for our logistic regression model are not violated we can interpret the results. You can either scroll up in your output window or re-run the model. By default SPSS produces a whopping 8 tables, most of which are not that useful for us (you will see fewer if you unticked some of the Print options as instructed above).
“Model information” provides some basic information about our model.
“Case Processing Summary” provides information on how many observations were included and excluded from the model (all those with any missing outcome and/or independent variable data are automatically excluded).
The “Categorical Variable Information” and “Continuous Variable Information” tables provide descriptive information about all the categorical and continuous (also discrete numerical) variables included.
Finally, we get to the key “Parameter Estimates” table, which provides analogous information to the linear regression parameter estimates table.
Parameter estimates table columns explained
Parameter
- Again, each row is for a different “parameter” or “logistic regression coefficient”, which means a separate term in the logistic regression model. For numerical variables this means one row per variable. However, because each level of a categorical variable is actually treated as a separate “dummy variable” (coefficient) as standard in a logistic regression model each categorical variable level has its own row.
B
B stands for beta, because when the logistic regression model is written mathematically the effect (or coefficient) of each variable is usually represented by the Greek letter beta (β). The betas are more commonly referred to as the parameter estimates or the (logistic regression) coefficients. They tell us the estimated direction (positive or negative) and size of the effect each parameter (i.e. variable) in the regression model has on the outcome variable. However, with logistic regression the parameters as originally estimated by the model are on the log-odds scale. Therefore, for numerical independent variables they represent the expected mean change in the log-odds of the outcome variable being 1 for every 1-unit increase in the independent variable. For categorical variables with the default dummy coding the coefficients represent the expected mean difference in the log-odds of the outcome variable being 1 between each level and the reference level (e.g. male compared to female). Remember that by default SPSS sets the category level coded with the highest value as the reference level, and the reference level always has a coefficient value of 0.
A note on the intercept: here the intercept represents the log-odds of the outcome variable being 1 when all numerical independent variables are set to 0 and all categorical variables are at their reference levels. This will often not have a useful interpretation and is often not reported in a logistic regression results table.
Std. Error
- These are the standard errors for each coefficient, which estimate the sampling variability of the coefficients in the wider population. This is used when calculating the 95% confidence intervals and p-value.
95% Wald Confidence Interval (Lower and Upper)
- These are the lower and upper 95% confidence intervals for each coefficient based on the Wald method of calculation (essentially assuming a normal distribution).
Hypothesis Test (Wald Chi-Square, df and Sig.)
- This section of the table provides hypothesis tests for each coefficient based on the Wald chi-square statistic and the given degrees of freedom. The p-value is in the “Sig.” column and is again a two-tailed p-value. Assuming the true value of the coefficient is 0, it represents the probability of obtaining a coefficient at least as far from 0 as that observed (in either direction) due to sampling error alone.
Exp(B)
These are the exponentiated coefficients, i.e. e^β. Try typing exp(x) into Google, where x is one of the coefficients on the log-odds scale; you will see the result is the corresponding Exp(B) value. By exponentiating the coefficients we transform them from the log-odds scale to the odds ratio scale, which allows for easier and more intuitive interpretation. They now have the following interpretations. For numerical variables they represent the multiplicative change in the odds of the outcome variable being 1 for every 1-unit increase in the independent variable. For categorical variables they represent the multiplicative difference in the odds of the outcome variable being 1 for each level of the categorical variable compared to the reference level. By multiplicative we mean on the multiplicative scale, so an odds ratio of 2 means the odds are multiplied by 2 (i.e. doubled) for every 1-unit increase in a numerical variable or compared to the reference level of a categorical variable. Similarly, an odds ratio of 0.5 means the odds are multiplied by 0.5 (i.e. halved). Therefore, on the multiplicative scale the null or no-effect value is 1.
A note on the intercept: here the intercept represents the odds of the outcome variable being 1 when all numerical independent variables are set to 0 and all categorical variables are at their reference levels. Again, this will often not have a useful interpretation and is often not reported in a logistic regression results table.
95% Wald Confidence Interval for Exp(B) (Lower and Upper)
- These are the (Wald-based) 95% confidence intervals for the exponentiated coefficients, i.e. for the estimated odds ratios.
Note: typically, only the exponentiated coefficients (and their 95% confidence intervals) are presented in the results from a logistic regression model, due to them being much easier and more intuitive (but still not that intuitive!) to interpret than those on the original log-odds scale.
Step 6: report and interpret the results
Let’s see how we interpret the results in the parameter estimates table. We will only look at the exponentiated coefficients, i.e. the odds ratio scale regression coefficients, and their confidence intervals, as they are more easily interpreted and, as explained above, typically only the odds ratio scale results are presented rather than the original log-odds scale results. We’ll also keep things briefer than for the same linear regression sections as most of the concepts apply here in the same way; we’re just dealing with ratio measures of relationships rather than differences/increases/decreases.
Understanding and interpreting odds ratios
Odds ratios are just ratios, like risk ratios, but when interpreting their direction and size mistakes can be easily made. First of all, unlike linear regression coefficients which can range (in theory) from negative infinity to positive infinity, odds ratios can range (in theory) from 0 to positive infinity. Because they are ratios the null value or no relationship/no effect value is 1, not 0 like for linear regression coefficients. This is because ratios are the result of dividing two numbers, so if those numbers are the same then the result is 1, whereas with linear regression the coefficients are differences, where a difference of 0 represents no difference. Therefore, odds ratios <1 represent a negative relationship and odds ratios >1 represent a positive relationship.
As odds ratios are ratios they are on the “multiplicative scale”, and they therefore represent factors by which one set of odds is multiplicatively related to another. For example, an odds ratio of 2 indicates that the odds of an outcome in the group of interest are 2 times the odds of the outcome occurring in the reference or comparison group. Take care though: it is not correct to say that an odds ratio of 2 indicates that the odds in the group of interest are 2 times higher than the odds in the reference group (strictly, “2 times higher” would mean the odds are tripled). To talk about odds ratios in terms of the odds for the group of interest being relatively higher or lower than the odds for the reference group you should first convert the odds ratio to a percentage using the following formulae:
When the odds ratio is >1 you can calculate the % increase in odds as:
(OR – 1) x 100
For example, for an odds ratio of 4.2:
(4.2 – 1) x 100 = 320% increase in odds
When the odds ratio is <1 you can calculate the % decrease in odds as:
(1 – OR) x 100
For example, for an odds ratio of 0.45:
(1 – 0.45) x 100 = 55% decrease in odds
Note: you should of course refer to the upper and lower odds ratio confidence interval range when discussing results, and those values can be similarly transformed as above if desired.
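Purely as an illustration (this helper is hypothetical, not something from SPSS or the lecture), the two conversions could be wrapped in a small Python function:

def odds_ratio_to_percent_change(odds_ratio):
    """Convert an odds ratio into a % change in odds (positive = increase, negative = decrease)."""
    if odds_ratio >= 1:
        return (odds_ratio - 1) * 100   # % increase in odds
    return -(1 - odds_ratio) * 100      # % decrease in odds, returned as a negative number

print(round(odds_ratio_to_percent_change(4.2)))    # 320, i.e. a 320% increase in odds
print(round(odds_ratio_to_percent_change(0.45)))   # -55, i.e. a 55% decrease in odds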
Let’s look at some examples to make this all clearer. We’ll start with categorical independent variables as these are arguably easier to interpret than numerical independent variables.
Categorical variables
With the standard dummy coding of categorical variables (https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/) logistic regression coefficients for categorical variables represent the model-predicted mean (or more loosely the average) ratio between the odds of the outcome for the categorical level (or group) of interest compared to the odds of the outcome for the reference or comparison level (or group), while holding the effect of all other independent variables constant, i.e. they measure the mean independent relationship. The choice of which categorical level is set as the reference group is up to you.
In the parameter estimates table we see that in the “B” column ses=3 (high) has a value of 0 and the associated footnote (as with the linear regression results) indicates that this level has been set as the reference (we could change this if we wanted to). Therefore, the other ses level coefficients are odds ratios in relation to this level. Specifically, when we look at the “Exp(B)” column of exponentially transformed “B” values we can see that for ses=1 (low) the model indicates that individuals with low socio-economic status are expected to have mean odds of hypertension that are 1.4 times the expected mean odds of hypertension for individuals in the high socio-economic group, but the confidence interval includes the null value of 1 (the lower confidence limit is 0.826). Therefore, in the target population we have no clear evidence of whether the odds of hypertension are greater or less for the low socio-economic group compared to the high one. Similarly, the model indicates that individuals with medium socio-economic status are expected to have mean odds of hypertension that are 1.017 times the expected mean odds of hypertension for individuals in the high socio-economic group, and the confidence interval again includes the null value of 1. Therefore, in the target population we have no clear evidence of whether the odds of hypertension are greater or less for the medium socio-economic group compared to the high one.
Practical importance
Similar considerations apply when trying to interpret the practical importance of the results of a logistic regression analysis of a study as those discussed for a linear regression. Therefore, see the discussion about interpreting the practical importance of numerical independent variables and categorical independent variables in the linear regression practical notes above (in particular see the “Numerical variables” section).
However, as shown in the lecture on logistic regression, it is also much more difficult to interpret the results of a logistic regression in terms of the real world importance for clinical/public health practice and policy etc than those from a linear regression for two reasons.
First, odds ratios are relative measures of a relationship and you cannot interpret their absolute impact without knowing what the odds (or probability/prevalence) of the outcome are in the reference group. Therefore, an odds ratio of 100 might not represent much of an absolute increase in the probability of occurrence of some event if that event is rare, while an odds ratio of 2 might represent a large absolute increase in the probability of occurrence of some event if that event is common. For example, if the probability of developing a rare cancer is 0.001% then a risk factor that increased the odds of developing the cancer 100 times (i.e. an odds ratio of 100) would only result in a probability of developing the cancer of about 0.1%. Whereas if the probability of developing a common cancer is 25% then a risk factor that increased the odds of developing the cancer just 2 times (i.e. an odds ratio of 2) would result in a probability of developing the cancer of 40%.
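To see where those numbers come from, you can convert the baseline probability to odds, multiply by the odds ratio, and convert back to a probability; a quick illustrative Python sketch (not part of the practical):

def probability_after_odds_ratio(p0, odds_ratio):
    """Probability implied by applying an odds ratio to a baseline probability p0."""
    odds0 = p0 / (1 - p0)           # baseline odds
    odds1 = odds0 * odds_ratio      # odds after applying the odds ratio
    return odds1 / (1 + odds1)      # convert back to a probability

print(probability_after_odds_ratio(0.00001, 100))  # ~0.001, i.e. about 0.1%
print(probability_after_odds_ratio(0.25, 2))       # ~0.4, i.e. 40%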
Second, odds ratios are ratios of odds, and for most people odds are not as intuitive as probabilities of an outcome. However, we can use the following formula to approximate the relative risk (or risk ratio), which is arguably more easily interpreted as it’s in terms of the probabilities of the outcome:
Approximate relative risk = adjusted or crude odds ratio / (1 − p0 + (p0 x adjusted or crude odds ratio))
As explained in the lecture, the adjusted/crude odds ratio is the odds ratio for the independent variable of interest (it will only be a crude estimate if the independent variable of interest is the only independent variable in the model), while p0 is the risk in the baseline/reference/control/comparison group. Also as explained in the lecture, in an observational study with multiple variables in a logistic regression model p0 will vary depending on both the value of the independent variable of interest and all the other independent variables. Therefore, in such a situation we can estimate a range (lower and upper value) of plausible baseline risks to use in the calculation, along with the range of adjusted odds ratios implied by the 95% confidence interval of the adjusted odds ratio. We will not look at this further here, but be aware of the challenges with interpreting results from logistic regression models in practice. If you plan to use logistic regression (and indeed sophisticated analyses more generally) frequently in the future then the best advice is to learn how to do them in R (free) or Stata, where you can use their functions for estimating predicted probabilities of the outcome at different values of the independent variables. This makes it very easy to calculate risk ratios, or even better risk differences, along with the actual predicted probabilities, making for much more intuitive practical interpretation of your model results!
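As a rough illustration of using a range of plausible baseline risks (the numbers below are arbitrary, purely for demonstration), the approximation above could be explored like this in Python:

def approx_relative_risk(odds_ratio, p0):
    """Approximate relative risk from an odds ratio and a baseline risk p0 (formula above)."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

# See how the implied relative risk changes across a range of plausible baseline risks
for p0 in (0.05, 0.10, 0.20, 0.40):
    print(p0, round(approx_relative_risk(2.0, p0), 2))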
“Non-significant results”
The same advice applies when reporting and discussing non-significant results from a logistic regression as for a linear regression. Therefore, see the “‘Non-significant results’” section in the linear regression practical session for details.
Regression tables
The same advice applies when reporting and presenting results from a logistic regression as for a linear regression in terms of presentation via a table. Therefore, see the “Regression tables” section in the linear regression practical session for details. However, an additional comment would be that it is most common for logistic regression results tables to present just the exponentiated regression coefficients, i.e. the odds ratio scale regression coefficients, rather than the original log-odds scale regression coefficients, which are harder to usefully interpret. You should always make it explicit and clear what form of results you are presenting, and you can always include the log-odds scale results in the same table or a separate table (e.g. as supplementary materials) if you need to. Also, there are various pseudo-R² statistics for logistic regression, but they are often criticised for their less than ideal meaning and interpretability, so unless required you do not need to present an “equivalent” R² value. However, it is usually recommended to present the likelihood ratio (chi-square) test p-value, which compares your model to an intercept-only model, as this at least indicates whether your model explains more variation in the data than a simple intercept-only model.
Methods
As usual in a methods section you should clearly explain why you used a logistic regression analysis and exactly what you did, including how all the variables were coded/what units they were in, if you modified any variables, how you dealt with any missing data etc.
Exercise 2: describe the population-level relationship between bmi and the probability of having hypertension using logistic regression
Step 1: explore the data
Numerical independent variables
Specifically, as we have a binary outcome when exploring the relationship between numerical independent variables and our binary outcome we have to use a form of barchart, where we first create a “binned” or grouped version of our numerical independent variable (i.e. a new variable which takes a single value for a range of values of the original independent variable).
Let’s see how to do this for bmi. First create the binned independent variable.
- From the main menu go: Transform > Visual Binning. Add bmi to the Variables to Bin: box and click Continue. At the top of the window that appears look for the Name: fields and in the editable field called Binned Variable: give your new binned variable a name, e.g. bmi_bin. Then at the bottom right click on the Make Cutpoints… button. In the middle of the tool window click the Equal Percentiles Based on Scanned Cases option, then click on the Number of Cutpoints: box and enter a suitable number of cutpoints. It’s hard to know what is a suitable number: typically the more bins the better, as you can see relationships at a finer scale, but more bins need more data (sample size). For bmi let’s choose 15 (you can always remake this variable with more/fewer bins as needed), so enter 15 in this box and then click Apply at the bottom. Then click the Make Labels button just below the Make Cutpoints… button and then click OK. Click OK on the information box that appears.
Now we can plot the mean of our binary outcome, which is equivalent to the proportion/percentage, for every bin we’ve just created, using the same type of barchart as before.
- From the main menu go: Graphs > Chart Builder. If an information box appears saying “Before you use this dialogue…” just click OK. Now in the Gallery tab at the middle left click on Bar and then double click on the leftmost picture of a graph (Simple Bar) to add it to the plot. Next drag the htn variable from the Variables: box onto the Y-Axis? dotted box on the chart image. Similarly, drag the bmi_bin variable onto the X-Axis? box. Look to the right of the tool window at the Element Properties tab in the Statistics box. You should see a drop down menu under Statistic:. This should already read “Mean”, but if not click on it and select Mean. This tells SPSS to calculate the mean of the Y-axis variable, htn, for each value of the X-axis variable, i.e. for each bin of bmi we just created. Now just click OK.
What can you see?
What does the barchart of BMI vs hypertension show?
There appears to be a fairly linear increase in the proportion of individuals with hypertension as BMI values increase.
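For reference, the binning and binned bar chart steps above could be sketched in Python as follows (illustrative only; the same assumptions about file and variable names as in exercise 1, with 15 cutpoints corresponding to 16 roughly equal-sized bins):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_spss("SBP data final.sav", convert_categoricals=False)

# 16 percentile-based bins of bmi (equivalent to 15 equal-percentile cutpoints)
df["bmi_bin"] = pd.qcut(df["bmi"], q=16)

# Proportion with hypertension in each bmi bin
df.groupby("bmi_bin", observed=True)["htn"].mean().plot(kind="bar")
plt.ylabel("Proportion with hypertension")
plt.show()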
Step 2: run the logistic regression
- Repeat the instructions from exercise 1, but replace the ses independent variable with the bmi variable and make sure you add it to the Covariates: box, not the Factors: box, otherwise you’ll get estimated differences in the odds of hypertension between each distinct value of bmi and a reference value!
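For completeness, the equivalent Python (statsmodels) sketch for this model (illustrative only, same assumptions as before) would enter bmi as a numerical covariate:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("SBP data final.sav", convert_categoricals=False)
res = smf.logit("htn ~ bmi", data=df).fit()   # bmi entered as a numerical covariate

print(np.exp(res.params))       # odds ratio per 1-unit increase in bmi
print(np.exp(res.conf_int()))   # 95% confidence interval on the odds ratio scale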
Step 3: check the assumptions
- See exercise 1, but now we don’t have to worry about cell sizes as there are no cells: we’re working with a continuous independent variable.
Step 4: consider additional problems
- See exercise 1.
Step 5 and 6: understand the results tables and extract the key results and report the results
- See exercise 1 for full details. In brief though, we can see our key result in the “Parameter estimates” table “Exp(B)” and associated 95% confidence intervals columns.
Specifically, we can see that the odds ratio for the slope between bmi and hypertension is 1.7. Therefore, the model indicates that for every 1-unit increase in bmi the mean odds of having hypertension are expected to be multiplied by 1.7 (i.e. a 70% increase in the odds), and the 95% confidence interval indicates that in the target population this multiplicative increase per 1-unit increase in bmi is likely to be between 1.5 and 1.9.