Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.
Note that this is a parametric test; if the assumptions of this test are not met, you could / should transform your data (and check all assumptions again). For a refresher on how to check normality, read the “Explore” procedure on the Descriptive Statistics SPSS LibGuide page. For this test, normality is calculated on the RESIDUALS (see below for more details). Ideally, your p-value should be > .05, your histogram should approximate a normal distribution (i.e., a standard “bell-shaped curve”), and the points on your P-P plot (see below) should fall fairly close to the line.
Reminder! Normality of the residuals is one of the assumptions of a linear regression. By running the above steps, we should have a new column called “ZRE_1” (standardized residuals) in our data view; this is the column you should run through the “Explore” procedure to check normality of the residuals.
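SPSS creates the ZRE_1 column for you, but as a rough illustration of what a standardized residual is, here is a short Python sketch with made-up numbers (the data, intercept, and slope below are hypothetical; SPSS standardizes using the standard error of the estimate, so its values can differ slightly from this sketch):

```python
import math

# Hypothetical illustration of what SPSS's "ZRE_1" column contains:
# residuals from the fitted regression line, divided by their standard
# deviation. The data, intercept, and slope below are made up.
x = [1, 2, 3, 4, 5]
y_observed = [2.0, 4.1, 5.9, 8.2, 9.8]
intercept, slope = 0.1, 2.0  # pretend these came from the fitted model

residuals = [yo - (intercept + slope * xi) for yo, xi in zip(y_observed, x)]
mean_r = sum(residuals) / len(residuals)
sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in residuals) / (len(residuals) - 1))
zre = [r / sd_r for r in residuals]  # roughly what "ZRE_1" holds
print([round(z, 2) for z in zre])
```

It is this standardized-residual column (not the raw outcome variable) that gets run through the “Explore” procedure.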
Running the above steps will generate the following output: a Model Summary table, an ANOVA table, a Coefficients table, and a Charts section (with a normal P-P plot). Additionally, running the Explore procedure will generate the Tests of Normality table.
The Model Summary table indicates the R-Square and Adjusted R-Square values. These are two methods to determine the proportion of the outcome variable’s variance that is being accounted for by the use of the predictor variable(s) included within the model. The R-squared value provides an indication of the goodness of fit of the model produced; we can see that roughly 9.1% of the variance of the outcome variable (fake_data1) is being accounted for by this model (gender and fake_data2).
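The Model Summary values come straight from SPSS, but the underlying arithmetic is simple. As a hedged illustration with made-up numbers (not the fake_data1 values), R-squared can be computed by hand in Python:

```python
# Illustrative sketch (made-up numbers): R-squared is the share of the
# outcome's variance the model accounts for, computed as
# 1 - (residual sum of squares / total sum of squares).
y = [3.0, 5.0, 4.0, 8.0, 10.0]
y_hat = [3.5, 4.5, 5.5, 7.5, 9.0]  # hypothetical model predictions

y_mean = sum(y) / len(y)
ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
r_squared = 1 - ss_resid / ss_total
print(round(r_squared, 3))
```

A value near 1 means the predictions track the outcome closely; the 9.1% in this guide's example corresponds to an R-squared of about .091.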
The ANOVA table indicates whether the model you specified significantly predicts the outcome variable. Here, a p-value less than .05 in the “Sig.” column is required; if your p-value is greater than .05, this indicates poor model fit (and the regression model should NOT be interpreted further).
The Coefficients table is the regression proper. This includes a constant row (generally ignored unless you are creating a prediction equation), one row for each continuous predictor variable, and n - 1 rows for any categorical predictor variables (where n is the number of categories of the categorical predictor variable). If you do not have the correct / expected number of rows in this table, your regression has been run incorrectly; ensure you have created the appropriate dummy variables for your categorical predictor(s).
The Charts section includes a Normal P-P Plot, which can be used to visually assess the normality of the residuals assumption. To pass this assumption, the data points should fall close to or on the line; if the data fall far from the line, the normality of the residuals assumption has failed and you could / should consider transforming your data (and checking all assumptions again).
The Tests of Normality table returns both the Kolmogorov-Smirnov statistic and the Shapiro-Wilk statistic of the standardized residual. If p > .05 (in the “Sig.” column), you have passed the normality of the residuals assumption for linear regression; if p < .05 (in the “Sig.” column), you have failed the normality of the residuals assumption and you could / should consider transforming your data (and re-checking all assumptions).
Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.
There is an assumption that must be met if you are including one or more continuous predictor variables in your logistic regression model: linearity. The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and any continuous predictor variable(s) is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.
To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as normal, but we also include an interaction term for each continuous predictor. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X * ln(X)” (the interaction).
To pass the linearity assumption, the interaction term(s) (e.g., X * ln(X)) must be non-significant (e.g., the value in X * ln(X)’s “Sig.” column must be > .05); if one or more of the interaction terms are significant (e.g., p < .05), you have failed the assumption of linearity.
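As a small illustration of building the Box-Tidwell interaction term (outside SPSS; the ages below are made up), the X * ln(X) column for a continuous predictor could be computed like this in Python:

```python
import math

# Sketch of a Box-Tidwell term: for each continuous predictor X, add a
# column holding X * ln(X) to the model. The ages below are made up.
# Note this only works for strictly positive X values.
age = [22.0, 35.0, 41.0, 58.0]
age_ln_age = [x * math.log(x) for x in age]
print([round(v, 1) for v in age_ln_age])
```

The regression is then run with both the original predictor and this interaction column, and only the interaction's significance is inspected.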
To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation. [Expert tip: start by transforming the predictor variable; if that doesn’t work, you can also try transforming the outcome variable. If a transformation is required to produce a linear relationship between a predictor variable and the outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable.] Here, we have failed linearity since the Age * ln(Age) interaction term returned p = .007. To continue with this example, we will use the age_binned variable (0 = age < 18; 1 = age 18+).
When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met: multicollinearity. The multicollinearity assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity that exists between predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. VIF values < 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above 8 – 10 indicate high correlation (and likely multicollinearity).
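As a rough sketch of where VIF comes from (with made-up data, not this guide's variables): for a model with exactly two predictors, VIF reduces to 1 / (1 - r²), where r is the Pearson correlation between them. In Python:

```python
# Sketch: with exactly two predictors, VIF = 1 / (1 - r^2), where r is
# the Pearson correlation between them. The data below are made up and
# deliberately highly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.9]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
var1 = sum((a - m1) ** 2 for a in x1)
var2 = sum((b - m2) ** 2 for b in x2)
r = cov / (var1 * var2) ** 0.5
vif = 1 / (1 - r ** 2)
print(round(r, 3), round(vif, 1))
```

With more than two predictors, each VIF uses the R-squared from regressing that predictor on all the others; SPSS reports this for you in the Coefficients table.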
Next, let’s check the VIF scores for sex and age_binned before we check whether these can be used to predict survival. We check multicollinearity like so:
Looking at the Coefficients table, in the VIF column we see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.
Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.
If you have passed all of your assumptions, you can move on to the logistic regression.
Running the above steps will generate the following output: the Omnibus Tests of Model Coefficients table, the Model Summary table, the Hosmer and Lemeshow Test table, and the Variables in the Equation table.
Before looking at the Variables in the Equation table (the regression proper), we first look at the other three listed tables. We expect the Step 1: Model line in the omnibus table to be statistically significant (p < .05), indicating our model (i.e., the predictor variable or variables we have chosen) does a good job predicting the outcome variable. The pseudo R2 values (Cox & Snell and / or Nagelkerke) indicate approximately how much of the variance in your outcome variable is explained by the model (predictor variables) you have selected: here we see fairly high values of 26.2% – 35.4%. And we expect the Hosmer-Lemeshow goodness-of-fit test to be non-significant (p > .05), indicating good model fit.
If your Step 1: Model p-value is < .05, you can move on to interpret your regression using the variables in the equation table. Here, we have ONE line for each continuous predictor, and n-1 lines for each categorical predictor (where n is the number of groups in that variable).
Important to note: the beta (“B” column) values are in log-odds units. For logistic regression, we use the Exp(B) column (as this reports the odds ratios) and the Sig. column (as this reports the p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
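As a tiny illustration of the relationship between the B and Exp(B) columns (the B value below is hypothetical, not taken from this output):

```python
import math

# Sketch: SPSS's Exp(B) column is simply e raised to the B (log-odds)
# coefficient. The B value below is made up for illustration.
b = 0.7                   # log odds, from the "B" column
odds_ratio = math.exp(b)  # what the "Exp(B)" column reports
# An odds ratio above 1 means higher odds of the outcome per one-unit
# increase in the predictor; below 1 means lower odds.
print(round(odds_ratio, 3))
```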
Ordinal logistic regression is used when we want to make predictions about an ordinal dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (3 or more groups with inherent order; e.g., small / medium / large or strongly disagree to strongly agree) outcome variable, and one (simple ordinal logistic regression) or more (multiple ordinal logistic regression) predictor variables.
In this example, we investigate the relationship between the log-transformed weight of an animal’s brain and whether they would be a low, medium, or high-category sleeper.
Ordinal logistic regression requires an ordinal outcome variable (i.e., the data must be categorical, with an inherent rank or order to the groups). Here, we will use animals’ time spent sleeping as our outcome variable: we can look at the “sleep_total” variable and see that it is continuous. To use this variable in an ordinal logistic regression, we can categorize the animals’ sleep into “low”, “medium”, and “high” groups:
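As an illustrative Python sketch of this categorization step (the cut points below are made up; choose yours based on your data and the literature):

```python
# Hypothetical binning of a continuous sleep variable into three ordered
# groups. The cut points (< 8 = low, 8-12 = medium, > 12 = high) and the
# sleep values are invented for illustration.
sleep_total = [5.5, 9.1, 14.0, 7.9, 12.5]

def sleep_category(hours):
    if hours < 8:
        return 1  # low
    elif hours <= 12:
        return 2  # medium
    return 3      # high

sleep_numeric = [sleep_category(h) for h in sleep_total]
print(sleep_numeric)
```

The ordered numeric codes (1, 2, 3) preserve the rank of the groups, which is what makes the variable ordinal rather than merely nominal.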
Expert tip: create two copies of the categorized sleep variable, using ordered numbers (e.g., 1, 2, 3) to denote the low, medium, and high sleep categories. The order of the groups matters (e.g., you can set this to low then medium then high, or high then medium then low, but you cannot use medium then low then high, as the order would be broken). In Variable View, ensure the “Measure” column is set to Scale for one copy (sleep_numeric, for assumption checking purposes) and to Ordinal for the other (sleep_cat, for conducting the analysis):
We will use sleep_cat (i.e., the ordinal version of sleep_total) for the ordinal logistic regression, which is an ordinal outcome variable, so we have passed this assumption.
We can check our predictor variable (brain weight) and plot it using our ordinal outcome variable (sleep_cat) using Graphs > Chart Builder. Select “Boxplot” from the bottom-left, and drag the first graph option to the blue text in the top middle section of the screen. Take your continuous predictor variable and place it on the y-axis, take your categorical outcome variable and place it on the x-axis. Then press OK. The Chart Builder dialogue box should look something like this:
The resulting boxplot (see below) shows some values that lie far from the rest of the data; the data points marked with a star indicate that these rows of our dataset are considered extreme outliers on this scale.
Maybe the scale we are using isn’t the best choice! Let’s try log-transforming the brain weight variable using the ln() function in Transform > Compute Variable. Write your new variable name in the “Target Variable” box, and include your formula for calculating the log of brain weight in the “Numeric Expression” box. Click OK.
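As a small sketch of what this Compute Variable step does (the brain weights below are made up; the natural log used here behaves like SPSS's LN() function):

```python
import math

# Sketch of the Transform > Compute Variable step: a new variable
# holding the natural log of brain weight. The brain weights (kg)
# below are invented for illustration.
brainwt = [0.0155, 0.070, 1.320, 4.603]
log_brainwt = [math.log(w) for w in brainwt]
print([round(v, 2) for v in log_brainwt])
```

Note that the log transform compresses large values and spreads out small ones, which is why the re-drawn boxplot is no longer squished.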
When we create the boxplot again (Graphs > Chart Builder) with the log-transformed variable, we see a log scale does a much better job displaying our results; the boxplot is no longer squished, and there do not appear to be any obvious outliers. To keep all of our data in the analysis, let’s proceed using the log-transformed brain weight for this analysis.
Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data; we can see that each observation is a unique species of animal (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.
The assumption of proportional odds (sometimes called the test of parallel lines), briefly, assumes that the effect of each predictor variable must be identical at each partition of the data. In other words, the ‘slope’ value for a predictor variable within an ordinal logistic regression model must not change across the different categorized levels of the outcome variable. We can easily check this assumption when we run the regression model in SPSS (see “Interpreting the Output” section, below).
In the following example, we want to add an additional predictor variable (body weight) to the previous model (predictor: log brain size; outcome: categorized sleep). Body weight is in the “bodywt” column, and is a continuous variable.
When producing an ordinal logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity. The multicollinearity assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity that exists between predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. VIF values < 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above 8 – 10 indicate high correlation (and likely multicollinearity).
Next, let’s check the VIF scores for log_bodywt and log_brainwt before we check whether these can be used to predict sleep. We check multicollinearity like so:
Looking at the Coefficients table, in the VIF column we see very high (>10) VIF scores; this makes sense, as animals with larger bodies often have larger brains. Based on the cutoff values specified above, this indicates that we have failed the assumption of multicollinearity; in order to appropriately run the regression, we would need to remove one of these two predictor variables. For the purposes of demonstration, we will continue with the analysis, pretending we have passed the assumption of multicollinearity.
If you have passed all of your assumptions (note: we still need to check the test of parallel lines), you can move on to running the simple ordinal logistic regression.
Running the above steps will generate the following output: Case Processing Summary, Model Fitting Information, Goodness-of-Fit, Pseudo R-Square, Parameter Estimates, and Test of Parallel Lines.
Before looking at the Parameter Estimates table (the regression proper), we first look at the other tables. Let’s start with the Test of Parallel Lines table (i.e., the assumption of proportional odds), which is at the very bottom of the output:
Here, a non-significant (p > .05) value in the “Sig.” column indicates we have passed the assumption of proportional odds. If this was statistically significant (p < .05), this would indicate we have failed the assumption. If we fail this assumption, we should not use or interpret the results of the ordinal logistic regression.
Next, we can quickly look at the other tables:
We see a Warning, indicating we have several combinations with missing frequencies (i.e., not every log brain weight value is represented in each sleep category). Next, the Case Processing Summary table shows us the breakdown of our data; here we can see some missing data, as some animals did not have a reported brain weight (and were excluded from the analysis). The Model Fitting Information table shows us whether the model is doing a good job (p < .05 in the “Sig.” column) or not (p > .05 in the “Sig.” column) in fitting the data; here we require a significant value (p < .05) in order to interpret the regression. In the Goodness-of-Fit table, however, we expect non-significant values (p > .05) in order to interpret the regression. The Pseudo R-Square table gives us three different estimates of approximately how much variance in the outcome variable is explained by the model you have built (i.e., the predictor variables); here, we see that log brain weight explains ~12.6 – 26.6% of the variance of sleep category.
After investigating all of the other tables, we can read the results of the simple ordinal logistic regression in the Parameter Estimates table:
We generally ignore any “Threshold” rows, and instead look at the “Location” row(s). For continuous predictors, we should have one row per predictor; for categorical predictors, we should have one row for each category, with the “blank” row serving as the reference category.
The “Sig.” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable. The “Estimate” value is the LOG ODDS change (i.e., the slope) associated with that coefficient, accounting for the other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation. When we exponentiate the Estimate (log odds) value of -.422, we get an odds ratio of 0.656. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
Here, we see the odds ratio (i.e., the exponentiated log odds) for the log-transformed brain weight predictor variable is .656. This value is less than 1, meaning we can find the percentage decrease in the odds like so: [(1 - .656) * 100] = 34.4%. As our variable was statistically significant, we can interpret the regression as follows: for each one-unit increase in the log-transformed brain weight variable (i.e., the predictor variable), the odds of being in a higher sleep category (i.e., the outcome variable) decreased by 34.4%, on average. To say this in plainer English: animals with higher log brain weights, on average, needed less sleep than animals with lower log brain weights.
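The arithmetic behind this interpretation can be checked by hand. Using the Estimate value of -.422 reported above, in Python:

```python
import math

# Converting the reported log-odds Estimate (-.422) into an odds ratio
# and a percentage change in the odds.
estimate = -0.422                      # from the "Estimate" column
odds_ratio = math.exp(estimate)        # 0.656, as reported in the text
pct_decrease = (1 - odds_ratio) * 100  # ~34.4% lower odds per unit
print(round(odds_ratio, 3), round(pct_decrease, 1))
```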
Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.
To report this result, we would say something along the lines of: On average, animals with a larger log-transformed brain weight were more likely (34.4%) to be in a lower sleep category (p < .001).
If you have passed all of your assumptions (note: we still need to check the test of parallel lines), you can move on to running the multiple ordinal logistic regression.
Running the above steps will generate the following output: Case Processing Summary, Model Fitting Information, Goodness-of-Fit, Pseudo R-Square, Parameter Estimates, and Test of Parallel Lines.
Before looking at the Parameter Estimates table (the regression proper), we first look at the other tables. Let’s start with the Test of Parallel Lines table (i.e., the assumption of proportional odds), which is at the very bottom of the output:
Here, a non-significant (p > .05) value in the “Sig.” column indicates we have passed the assumption of proportional odds. If this was statistically significant (p < .05), this would indicate we have failed the assumption. If we fail this assumption, we should not use or interpret the results of the ordinal logistic regression.
Next, we can quickly look at the other tables:
We see a Warning, indicating we have several combinations with missing frequencies (i.e., not every log brain weight value and / or log body weight value is represented in each sleep category). Next, the Case Processing Summary table shows us the breakdown of our data; here we can see some missing data, as some animals did not have a reported brain weight and / or a reported body weight (and were excluded from the analysis). The Model Fitting Information table shows us whether the model is doing a good job (p < .05 in the “Sig.” column) or not (p > .05 in the “Sig.” column) in fitting the data; here we require a significant value (p < .05) in order to interpret the regression. In the Goodness-of-Fit table, however, we expect non-significant values (p > .05) in order to interpret the regression. The Pseudo R-Square table gives us three different estimates of approximately how much variance in the outcome variable is explained by the model you have built (i.e., the predictor variables); here, we see that the model explains ~13.1 – 27.7% of the variance of sleep category.
After investigating all of the other tables, we can read the results of the multiple ordinal logistic regression in the Parameter Estimates table:
We generally ignore any “Threshold” rows, and instead look at the “Location” row(s). For continuous predictors, we should have one row per predictor; for categorical predictors, we should have one row for each category, with the “blank” row serving as the reference category.
The “Sig.” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable. The “Estimate” value is the LOG ODDS change (i.e., the slope) associated with that coefficient, accounting for the other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation. When we exponentiate the Estimate (log odds) values of -.105 and -.267, we get odds ratios of 0.899 and 0.766. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
Here, we see the odds ratio (i.e., the exponentiated log odds) for the log-transformed brain weight predictor variable is .899. This value is less than 1, meaning we can find the percentage decrease in the odds like so: [(1 - .899) * 100] = 10.1%. If this variable were statistically significant, we could interpret the regression as follows: for each one-unit increase in the log-transformed brain weight variable (i.e., the predictor variable), the odds of being in a higher sleep category (i.e., the outcome variable) decreased by 10.1%, on average, accounting for the other variables in the model. Similarly, the log-transformed body weight predictor variable was associated with a 23.4% decrease in the odds. As our variables were not statistically significant, we would not interpret these results further.
Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.
To report this result, had it been statistically significant, we would say something along the lines of: On average, and accounting for the log-transformed body weight, animals with a larger log-transformed brain weight were more likely (10.1%) to be in a lower sleep category (p = .793).
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.