
Analyze Data: SPSS

Contributors: Lindsay Plater and Narjes Mosavari

Linear regression

Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.

Note that this is a parametric test; if the assumptions of this test are not met, you could / should transform your data (and check all assumptions again). For a refresher on how to check normality, read the “Explore” procedure on the Descriptive Statistics SPSS LibGuide page. For this test, normality is calculated on the RESIDUALS (see below for more details). Ideally, your p-value should be > .05, your histogram should approximate a normal distribution (i.e., a standard “bell-shaped curve”), and the points on your P-P plot (see below) should be fairly close to the line.

How to run a linear regression

  1. Click on Analyze. Select Regression. Select Linear.
  2. Place the continuous outcome variable in the “Dependent” box.
  3. Place one or more categorical predictor variables (NOTE: you must create n - 1 dummy variables, where n is the number of categories of the categorical predictor variable) and / or one or more continuous predictor variables in the “Independent(s)” box.
  4. If running multiple linear regression, click the “Statistics” button and ensure “Collinearity diagnostics” is checked.
  5. Click the “Plots” button. Move “*ZRESID” to the “Y” box, move “*ZPRED” to the “X” box, and ensure “Normal probability plot” is checked in the “Standardized Residual Plots” section.
  6. Click the “Save” button and ensure “Standardized” is checked in the “Residuals” section.
  7. Click OK to run the test (results will appear in the output window).

SPSS in data view, with the Analyze > Regression > Linear dialogue box open. "Fake_data1" is the outcome variable, Gender and "Fake_data2" are the predictor variables.
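For those who prefer pasted syntax, the steps above correspond roughly to the REGRESSION command below. This is a minimal sketch assuming fake_data1 is the outcome, fake_data2 is a continuous predictor, and Gender is a two-category predictor coded 1 and 2 (the RECODE line creating a hypothetical gender_dummy variable is only needed if your categorical predictor is not already coded 0/1); adjust the variable names to match your own data.

  * Hypothetical dummy coding (assumes Gender is coded 1 = male, 2 = female).
  RECODE Gender (1=0) (2=1) INTO gender_dummy.
  EXECUTE.

  * Linear regression with collinearity diagnostics, residual plots, and saved standardized residuals.
  REGRESSION
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
    /DEPENDENT fake_data1
    /METHOD=ENTER gender_dummy fake_data2
    /SCATTERPLOT=(*ZRESID ,*ZPRED)
    /RESIDUALS NORMPROB(ZRESID)
    /SAVE ZRESID.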

Reminder! Normality of the residuals is one of the assumptions of a linear regression. By running the above steps, we should have a new column called “ZRE_1” (standardized residuals) in our data view; this is the column you should run through the “Explore” procedure to check normality of the residuals.
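If you are working in syntax, the Explore procedure on the saved residuals can be run with the EXAMINE command; this is a sketch assuming the saved standardized residuals are in the default ZRE_1 column.

  * Check normality of the standardized residuals (Explore procedure).
  EXAMINE VARIABLES=ZRE_1
    /PLOT HISTOGRAM NPPLOT
    /STATISTICS DESCRIPTIVES.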

Interpreting the output

Running the above steps will generate the following output: a Model Summary table, an ANOVA table, a Coefficients table, and a Charts section (with a normal P-P plot). Additionally, running the Explore procedure will generate the Tests of Normality table.

The navigation pane after running linear regression and Explore (normality) in SPSS.

The Model Summary table indicates the R-Square and Adjusted R-Square values. These are two methods to determine the proportion of the outcome variable’s variance that is being accounted for by the use of the predictor variable(s) included within the model. The R-squared value provides an indication of the goodness of fit of the model produced; we can see that roughly 9.1% of the variance of the outcome variable (fake_data1) is being accounted for by this model (gender and fake_data2).

The Model summary table from the linear regression procedure in SPSS, showing a low R-square value (9.1%).

The ANOVA table indicates whether the model you specified significantly predicts the outcome variable. Here, a p-value less than .05 in the “Sig.” column is required; if your p-value is greater than .05, this indicates poor model fit (and the regression model should NOT be interpreted further).

The ANOVA table from the linear regression procedure in SPSS, showing a non-significant model (p > .05).

The Coefficients table is the regression proper. This includes a constant row (generally ignored unless you are creating a prediction equation), one row for each continuous predictor variable, and n - 1 rows for any categorical predictor variables (where n is the number of categories of the categorical predictor variable). If you do not have the correct / expected number of rows in this table, your regression has been run incorrectly; ensure you have created the appropriate dummy variables for your categorical predictor(s).

  • The Unstandardized Coefficients “B” column and the “Sig.” column are used to interpret the influence of the specified predictor variable on the outcome variable.
  • For continuous predictor variables, we interpret significant values as follows: holding all other variables constant, on average, a one-unit increase in the predictor variable resulted in an increase [positive unstandardized B value] or a decrease [negative unstandardized B value] in the outcome variable equal to the unstandardized B value.
    • Here, fake_data2 is non-significant, but each one-unit increase in fake_data2 resulted in (on average and accounting for Gender) a 0.102-unit increase in fake_data1.
  • For categorical predictor variables, we interpret significant values as follows: holding all other variables constant, on average, category 1 [the coded category] resulted in an increase [positive unstandardized B value] or a decrease [negative unstandardized B value] in the outcome variable, relative to category 0 [the reference category], equal to the unstandardized B value.
    • Here, Gender is non-significant, but (on average and accounting for fake_data2), females scored 0.691 units lower on fake_data1 than males.
  • Embedded within the Coefficients table is the Collinearity Statistics section, which reports the variance inflation factor (VIF) scores for each variable in a multiple linear regression. We can use the VIF scores to assess the assumption of multicollinearity.
    • As VIF values increase, the likelihood of multicollinearity being present within the model increases; VIF values below 3 indicate very low correlation between predictor variables, values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and values above 8 – 10 indicate high correlation (and likely multicollinearity).
    • If you have high multicollinearity, this indicates that your continuous predictor variables are explaining the same variance (and thus variables with high VIF scores should be removed from the regression).

The Coefficients table from the linear regression procedure in SPSS, showing non-significant Gender and Fake_data2 predictor variables (p > .05).
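If you did want a prediction equation from this (purely illustrative) output, it would take the form: predicted fake_data1 = constant + (-0.691 × gender dummy, where 0 = male and 1 = female) + (0.102 × fake_data2), with the constant taken from the “(Constant)” row of the Coefficients table. Because neither predictor is significant here, this equation is shown only to illustrate how the unstandardized B values are used.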

The Charts section includes a Normal P-P Plot, which can be used to visually assess the normality of the residuals assumption. To pass this assumption, the data points should fall close to or on the line; if the data fall far from the line, the normality of the residuals assumption has failed and you could / should consider transforming your data (and checking all assumptions again).

The Normal P-P plot from the linear regression procedure in SPSS, showing data-points that fall roughly close to the line.

The Tests of Normality table returns both the Kolmogorov-Smirnov statistic and the Shapiro-Wilk statistic of the standardized residual. If p > .05 (in the “Sig.” column), you have passed the normality of the residuals assumption for linear regression; if p < .05 (in the “Sig.” column), you have failed the normality of the residuals assumption and you could / should consider transforming your data (and re-checking all assumptions).

The Tests of Normality table from the Explore procedure in SPSS, indicating whether the residuals have passed normality (p > .05) or failed normality (p < .05).

Logistic regression

Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.

How to assess linearity

There is an assumption that must be met if you are including one or more continuous predictor variables in your logistic regression model: linearity. The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and each continuous predictor variable is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.

To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as normal, but we also include an interaction term for each continuous predictor. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X * ln(X)” (the interaction).

  1. Click Transform. Select Compute Variable.
  2. Add a name for your interaction term in the “Target Variable” box.
  3. In the “Numeric Expression” box, use the following format: NAMEOFPREDICTOR * LN(NAMEOFPREDICTOR)
  4. Click OK (your new column of data will be the rightmost column of data).
  5. Repeat this procedure (steps 1-4) for any other continuous variables.

SPSS data view. Transform, Compute Variable has been selected. An age_interaction term has been created using Age * ln(Age).
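In syntax, the same interaction term can be created with the COMPUTE command; this sketch assumes a continuous predictor called Age.

  * Create the Box-Tidwell interaction term Age * ln(Age).
  COMPUTE age_interaction = Age * LN(Age).
  EXECUTE.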

  6. Create your logistic regression model as normal (Analyze > Regression > Binary Logistic) with ALL predictor variables you plan to include, then add the interaction term(s) you computed. Click OK.

SPSS in Data View. Analyze, Regression, Binary Logistic has been selected. A multiple regression with linearity checking is in the pop-up box.
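The equivalent pasted syntax would look roughly like the following; this is a sketch assuming a binary outcome called survived and predictors Age and sex (with sex treated as categorical), so the variable names are illustrative.

  * Box-Tidwell linearity check: the model includes Age AND its interaction with ln(Age).
  LOGISTIC REGRESSION VARIABLES survived
    /METHOD=ENTER Age sex age_interaction
    /CONTRAST (sex)=Indicator
    /PRINT=CI(95)
    /CRITERIA=PVALUE(0.05) ITERATE(20) CUT(0.5).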

To pass the linearity assumption, the interaction term(s) (e.g., X * ln(X)) must be non-significant (e.g., the value in X * ln(X)’s “Sig.” column must be > .05); if one or more of the interaction terms are significant (e.g., p < .05), you have failed the assumption of linearity.

Logistic regression output from SPSS. The age variable has failed linearity (age_interaction p < .05).

To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation [expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable; if a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable]. Here, we have failed linearity since the Age * ln(Age) interaction term returned p = .007. To continue with this example, we will use the age_binned variable (0 = age < 18; 1 = age 18+).
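As a sketch, a continuous Age variable (assumed here to be measured in whole years) could be binned into the age_binned variable used below with the RECODE command.

  * Bin age: 0 = under 18, 1 = 18 and older.
  RECODE Age (LO THRU 17=0) (18 THRU HI=1) INTO age_binned.
  EXECUTE.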

How to assess multicollinearity

When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met: multicollinearity. The multicollinearity assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between predictor variables, we can use the variance inflation factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. VIF values below 3 indicate very low correlation between predictor variables, values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and values above 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for sex and age_binned before we check whether these can be used to predict survival. We check multicollinearity like so:

  1. Click on Analyze. Select Regression. Select Linear. (yes, linear; this isn’t a typo!)
  2. Build your model (move your outcome variable to the dependent box, and your two or more predictor variables to the independents box).
  3. Click the Statistics button. Check the “Collinearity diagnostics” box.
  4. Click Continue. Click OK.

Linear regression in SPSS, checking for multicollinearity of a logistic regression.
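In syntax, this multicollinearity check is just an ordinary REGRESSION command with collinearity diagnostics requested; this sketch assumes the outcome is survived and the predictors are age_binned and sex.

  * Linear regression run only to obtain VIF values for the logistic model's predictors.
  REGRESSION
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
    /DEPENDENT survived
    /METHOD=ENTER age_binned sex.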

Looking at the Coefficients table, in the VIF column we see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.

Linear regression output in SPSS to check multicollinearity for a logistic regression. Here, VIF scores are low (around 1); no problem with multicollinearity.

Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.

How to run a logistic regression

If you have passed all of your assumptions, you can move on to the logistic regression.

  1. Click on Analyze. Select Regression. Select Binary Logistic.
  2. Place the binary outcome variable in the “Dependent” box.
  3. Place one or more predictor variables in the “Block 1 of 1” box [the independents box].
  4. If you have one or more categorical predictors, click the Categorical button and move your categorical predictor(s) to the “Categorical Covariates” box.
  5. Click Options. Ensure you have selected what you want for output [likely classification plots, Hosmer-Lemeshow goodness of fit, and CI for Exp(B) at 95%].
  6. Click OK.

SPSS in data view. Analyze, Regression, Binary Logistic has been selected. A multiple logistic regression with two categorical variables has been created.
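The equivalent pasted syntax is shown below; this is a sketch assuming the binary outcome survived and two categorical predictors, age_binned and sex, so adjust the names and options to your own data.

  * Multiple logistic regression with two categorical predictors.
  LOGISTIC REGRESSION VARIABLES survived
    /METHOD=ENTER age_binned sex
    /CONTRAST (age_binned)=Indicator
    /CONTRAST (sex)=Indicator
    /CLASSPLOT
    /PRINT=GOODFIT CI(95)
    /CRITERIA=PVALUE(0.05) ITERATE(20) CUT(0.5).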

Interpreting the output

Running the above steps will generate the following output: the Omnibus Tests of Model Coefficients table, the Model Summary table, the Hosmer and Lemeshow Test table, and the Variables in the Equation table.

Before looking at the Variables in the Equation table (the regression proper), we first look at the other three listed tables. We expect the Step 1: Model line in the omnibus table to be statistically significant (p < .05), indicating that our model (i.e., the predictor variable or variables we have chosen) does a good job of predicting the outcome variable. The pseudo R-square values (Cox & Snell and / or Nagelkerke) indicate approximately how much of the variance in your outcome variable is explained by the model (predictor variables) you have selected: here we see fairly high values of 26.2% - 35.4%. And we expect the Hosmer-Lemeshow goodness of fit test to be non-significant (p > .05), indicating good model fit.

Logistic regression output, showing a significant model and 26.2 - 35.4% of outcome variable variance explained.

If your Step 1: Model p-value is < .05, you can move on to interpret your regression using the variables in the equation table. Here, we have ONE line for each continuous predictor, and n-1 lines for each categorical predictor (where n is the number of groups in that variable).

Logistic regression output, showing a significant effect of sex and a trending effect of age.

Important to note: the beta (“B” column) values are in log-odds units. For logistic regression, we use the Exp(B) column (as this reports the odds ratios) and the Sig. column (as this reports the p-values) to interpret the impact of the predictor variable(s) on the outcome variable.

  1. On average, and accounting for sex, children were 1.584 times as likely to survive the Titanic as adults (p = .061). This can also be written as a percentage; on average, children were 58.4% more likely to survive the Titanic than adults (p = .061).
  2. On average, and accounting for age, men were 0.086 times as likely to survive the Titanic as women (p < .001). This can (and should) be flipped for interpretability; women were roughly 11.6 times as likely to survive the Titanic as men. This can also be written as a percentage; on average, and accounting for age, women were roughly 1060% more likely to survive the Titanic than men (p < .001).
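The “flip” in the second point is simply the reciprocal of the odds ratio: 1 / 0.086 ≈ 11.6, and (11.6 − 1) × 100% ≈ 1060% more likely. The same arithmetic applies whenever you want to express an Exp(B) value below 1 in the opposite (more interpretable) direction.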


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.