Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.
Note that this is a parametric test; if the assumptions of this test are not met, you could / should transform your data (and check all assumptions again). For a refresher on how to check normality, read the “Explore” procedure on the Descriptive Statistics SPSS LibGuide page. For this test, normality is assessed on the RESIDUALS (see below for more details). Ideally, your p-value should be > .05, your histogram should approximate a normal distribution (i.e., a standard “bell-shaped curve”), and the points on your P-P plot (see below) should fall fairly close to the line.
Reminder! Normality of the residuals is one of the assumptions of a linear regression. By running the above steps, we should have a new column called “ZRE_1” (standardized residuals) in our data view; this is the column you should run through the “Explore” procedure to check normality of the residuals.
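If you prefer syntax to the menus, below is a minimal sketch of equivalent commands. It assumes the example variable names used on this page (fake_data1 as the continuous outcome, with gender, coded 0/1, and fake_data2 as the predictors); adjust the names to match your own data.

* Linear regression: request the normal P-P plot of the standardized
* residuals and save them to the data view as a new column (ZRE_1).
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT fake_data1
  /METHOD=ENTER gender fake_data2
  /RESIDUALS NORMPROB(ZRESID)
  /SAVE ZRESID.

* Explore procedure on the saved standardized residuals (ZRE_1).
EXAMINE VARIABLES=ZRE_1
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.

The EXAMINE command is the syntax version of the Explore procedure; running it on ZRE_1 produces the Tests of Normality table discussed below.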
Running the above steps will generate the following output: a Model Summary table, an ANOVA table, a Coefficients table, and a Charts section (with a normal P-P plot). Additionally, running the Explore procedure will generate the Tests of Normality table.
The Model Summary table reports the R-squared and Adjusted R-squared values. Both estimate the proportion of the outcome variable’s variance that is accounted for by the predictor variable(s) included in the model, and so provide an indication of the goodness of fit of the model produced; here, roughly 9.1% of the variance of the outcome variable (fake_data1) is accounted for by this model (gender and fake_data2).
The ANOVA table indicates whether the model you specified significantly predicts the outcome variable. Here, a p-value less than .05 in the “Sig.” column is required; if your p-value is greater than .05, this indicates poor model fit (and the regression model should NOT be interpreted further).
The Coefficients table is the regression proper. This includes a constant row (generally ignored unless you are creating a prediction equation), one row for each continuous predictor variable, and n - 1 rows for any categorical predictor variables (where n is the number of categories of the categorical predictor variable). If you do not have the correct / expected number of rows in this table, your regression has been run incorrectly; ensure you have created the appropriate dummy variables for your categorical predictor(s).
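For illustration only, here is a hedged sketch of dummy coding in syntax, assuming a hypothetical three-category predictor named group (coded 1, 2, 3) with category 1 as the reference; the two new variables (rather than group itself) would then be entered as predictors.

* Hypothetical example: create n - 1 = 2 dummy variables for a
* three-category predictor, keeping missing values missing.
RECODE group (MISSING=SYSMIS) (2=1) (ELSE=0) INTO group_d2.
RECODE group (MISSING=SYSMIS) (3=1) (ELSE=0) INTO group_d3.
EXECUTE.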
The Charts section includes a Normal P-P Plot, which can be used to visually assess the normality of the residuals assumption. To pass this assumption, the data points should fall close to or on the line; if the data fall far from the line, the normality of the residuals assumption has failed and you could / should consider transforming your data (and checking all assumptions again).
The Tests of Normality table returns both the Kolmogorov-Smirnov statistic and the Shapiro-Wilk statistic for the standardized residuals. If p > .05 (in the “Sig.” column), you have passed the normality of the residuals assumption for linear regression; if p < .05, you have failed the normality of the residuals assumption and you could / should consider transforming your data (and re-checking all assumptions).
Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.
There is an assumption that must be met if you are including one or more continuous predictor variables in your logistic regression model: linearity. The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and each continuous predictor variable is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.
To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as usual, but for each continuous predictor we also include an interaction term between that predictor and its natural logarithm. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X * ln(X)” (the interaction).
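As a hedged sketch of this check in syntax, using the continuous predictor age and the categorical predictor sex from the example below, and assuming the binary outcome variable is named survived:

* Compute the interaction between age and its natural logarithm.
COMPUTE age_ln_age = age * LN(age).
EXECUTE.

* Re-run the model with the interaction term added; only the Sig.
* value for age_ln_age is inspected for this assumption check.
LOGISTIC REGRESSION VARIABLES survived
  /METHOD=ENTER sex age age_ln_age
  /CONTRAST (sex)=Indicator.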
To pass the linearity assumption, the interaction term(s) (e.g., X * ln(X)) must be non-significant (i.e., the value in X * ln(X)’s “Sig.” column must be > .05); if one or more of the interaction terms are significant (i.e., p < .05), you have failed the assumption of linearity.
To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation. [Expert tip: start by transforming the predictor variable, and if that doesn’t work, you can also try transforming the outcome variable; if a transformation is required to produce a linear relationship between a predictor variable and the outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable.] Here, we have failed linearity, since the Age * ln(Age) interaction term returned p = .007. To continue with this example, we will use the age_binned variable (0 = age < 18; 1 = age 18+).
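A hedged sketch of how this binned variable could be created in syntax, assuming the continuous variable is named age:

* Bin age: 0 = under 18, 1 = 18 and older; missing stays missing.
RECODE age (MISSING=SYSMIS) (18 THRU HI=1) (ELSE=0) INTO age_binned.
EXECUTE.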
When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met: multicollinearity. This is the assumption that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity among the predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. VIF values below 3 indicate very low correlation between predictor variables, values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and values above 8 – 10 indicate high correlation (and likely multicollinearity).
Next, let’s check the VIF scores for sex and age_binned before we check whether these can be used to predict survival. We check multicollinearity like so:
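One way to obtain the VIF values is to run a linear regression with collinearity diagnostics requested; a hedged syntax sketch is below (the outcome name survived is assumed, and the linear model itself is not interpreted, only the Collinearity Statistics columns of its Coefficients table).

* Run solely to obtain the Collinearity Statistics (Tolerance and VIF)
* columns in the Coefficients table.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA TOL
  /DEPENDENT survived
  /METHOD=ENTER sex age_binned.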
Looking at the Coefficients table, in the VIF column we see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.
Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.
If you have passed all of your assumptions, you can move on to the logistic regression.
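As a hedged sketch of the final model in syntax (again assuming the outcome variable is named survived):

* Binary logistic regression of survival on sex and binned age.
* GOODFIT requests the Hosmer and Lemeshow goodness-of-fit test.
LOGISTIC REGRESSION VARIABLES survived
  /METHOD=ENTER sex age_binned
  /CONTRAST (sex)=Indicator
  /CONTRAST (age_binned)=Indicator
  /PRINT=GOODFIT
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).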
Running the above steps will generate the following output: the Omnibus Tests of Model Coefficients table, the Model Summary table, the Hosmer and Lemeshow Test table, and the Variables in the Equation table.
Before looking at the Variables in the Equation table (the regression proper), we first look at the other three listed tables. We expect the Step 1: Model line in the omnibus table to be statistically significant (p < .05), indicating our model (i.e., the predictor variable or variables we have chosen) does a good job of predicting the outcome variable. The pseudo R² values (Cox & Snell and / or Nagelkerke) indicate approximately how much of the variance in your outcome variable is explained by the model (predictor variables) you have selected: here we see fairly high values of 26.2% – 35.4%. And we expect the Hosmer and Lemeshow goodness-of-fit test to be non-significant (p > .05), indicating good model fit.
If your Step 1: Model p-value is < .05, you can move on to interpret your regression using the Variables in the Equation table. Here, we have ONE line for each continuous predictor, and n - 1 lines for each categorical predictor (where n is the number of groups in that variable).
Important to note: the beta coefficients (the “B” column) are in log-odds units. For logistic regression, we use the Exp(B) column (as this reports the odds ratios) and the Sig. column (as this reports the p-values) to interpret the impact of the predictor variable(s) on the outcome variable. (Exp(B) is simply e raised to the power of B; for example, a B of 0.69 corresponds to an odds ratio of roughly 2.0.)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.