Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.
In this example, we investigate the relationship between a vehicle’s horsepower and fuel efficiency, to ultimately assess if we can use a vehicle’s horsepower (hp) to predict its fuel efficiency (mpg).
Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique vehicle model (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The assumption of linearity assesses whether the relationship between the continuous outcome variable and each continuous predictor variable is, in fact, linear.
To test the assumption of linearity, start by plotting a scatterplot to perform visual inspection (i.e., does the relationship look linear), using the ggplot() function from the ggplot2 package:
ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point()
If you’re not sure if the data look linear, you could add a loess curve to the scatterplot:
ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
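For example, a minimal sketch assuming the built-in mtcars data set (whose hp and mpg columns match the horsepower and fuel-efficiency variables described here):
library(ggplot2)   # provides ggplot(), geom_point(), and stat_smooth()
# Scatterplot of horsepower (predictor) against fuel efficiency (outcome), with a loess curve
ggplot(data = mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE)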

Here, we can see that the relationship between a vehicle’s horsepower and fuel efficiency appears to be exponentially decreasing; as the loess curve does not look like a straight line, we have failed the assumption of linearity.
Since we failed linearity, we can transform the predictor variable and re-check the linearity assumption. Expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable. Let’s try a log transform on just the predictor variable first:
NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_LOG <- log(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN)
Plot the data again, this time using the transformed predictor variable:
ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN_LOG, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
In addition to visual inspection, we must include the non-transformed variable and the transformed variable in the regression to assess model fit:
NAMEOFREGRESSIONMODEL = lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN_LOG, data = NAMEOFDATAFRAME)
summary(NAMEOFREGRESSIONMODEL)
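For example, continuing the hedged mtcars sketch (hp_log and hp_check_model are illustrative names):
mtcars$hp_log <- log(mtcars$hp)                          # log-transform the predictor
hp_check_model <- lm(mpg ~ hp + hp_log, data = mtcars)
summary(hp_check_model)                                  # check the p-value on the hp_log term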

A scatterplot that is still non-linear, and / or a non-significant (p > .05) NAMEOFPREDICTORCOLUMN_LOG term in the regression output, indicates that we have failed linearity. You could try a different transformation, and / or you could try also transforming the outcome variable (i.e., a “double transform”, as both the predictor and the outcome will be transformed). Here, we have passed linearity, as our scatterplot looks roughly like a line and our hp_log p-value is p < .05.
Note: If a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable.
Normality of a model’s residuals is a fundamental assumption of linear regression. You have two options for assessing normality of the residuals: statistically with the Shapiro-Wilk test and / or by visual inspection with a histogram. In both cases, you first create your regression model, and then you assess normality using the residuals (which you can extract from the model with the resid() function).
Shapiro-Wilk:
shapiro.test(resid(NAMEOFREGRESSIONMODEL))
Histogram:
hist(resid(NAMEOFREGRESSIONMODEL))
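For example, continuing the hedged mtcars sketch (mpg_model is an illustrative name; hp_log was created in the linearity step above):
mpg_model <- lm(mpg ~ hp + hp_log, data = mtcars)
shapiro.test(resid(mpg_model))   # p > .05 suggests the residuals are approximately normal
hist(resid(mpg_model))           # should look roughly bell-shaped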

A significant Shapiro-Wilk statistic (p < .05) and / or a histogram that does not look like a standard bell-shaped curve indicate that you have failed the test of normality, and should either transform your data (and check all assumptions again) or switch to a non-parametric test (e.g., some researchers use a Spearman correlation if normality for linear regression has failed). Here, we have passed normality as our Shapiro-Wilk statistic p-value is p > .05 and our histogram looks like a bell-shaped curve.
The homogeneity of variance assumption (also known as homoskedasticity) assesses whether the variances are equal, either between groups or across the range of a continuous predictor; in the case of a linear regression, we are testing whether the variance of the model’s residuals is constant across the range of the fitted values from the model.
With categorical predictor variables, homogeneity of variance is typically assessed using either a Levene’s test or a Bartlett test. With continuous predictor variables, homogeneity of variance is typically assessed using a Breusch-Pagan test.
Let’s use the Breusch-Pagan test (the bptest() function from the lmtest package) and visual inspection to assess homogeneity.
Breusch-Pagan test:
bptest(NAMEOFREGRESSIONMODEL)
Residual versus fitted plot:
plot(NAMEOFREGRESSIONMODEL, 1)
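For example, continuing the hedged mtcars sketch (mpg_model as created in the normality step; bptest() requires the lmtest package):
library(lmtest)
bptest(mpg_model)      # p > .05 suggests homogeneity of variance
plot(mpg_model, 1)     # residuals vs fitted plot; look for a cone or fan shape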

A significant Levene’s, Bartlett, or Breusch-Pagan test (p < .05) and / or a residuals versus fitted plot that makes a “cone” or “fan” shape (i.e., a shape that starts wide and then narrows off) indicates that you have failed the homogeneity of variance assumption, and should either transform your data (and check all assumptions again), switch to a non-parametric test (e.g., some researchers use a Mann-Whitney U test for categorical predictor variables if homogeneity for linear regression has failed), or apply an adjustment. Here, we have passed homogeneity as our Breusch-Pagan test p-value is p > .05, and our residuals versus fitted plot has a band of data-points that do not make a cone or fan shape.
In the following example, we want to add an additional predictor variable (the weight of the vehicle) to the previous model (predictor: horsepower; outcome: fuel efficiency). We start as before by checking the linearity assumption of the new predictor variable (weight) by creating a plot. This example produces a non-linear scatterplot, so we add a transformation and re-check both the plot and the model using the transformed variable, like so:
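A minimal sketch of that process, again assuming the mtcars data (wt = vehicle weight; wt_log and the model name are illustrative):
# Visual check of the new predictor on its own
ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
# The plot looks non-linear, so transform and re-check both the plot and the model
mtcars$wt_log <- log(mtcars$wt)
ggplot(data = mtcars, aes(x = wt_log, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
wt_check_model <- lm(mpg ~ hp + hp_log + wt + wt_log, data = mtcars)
summary(wt_check_model)   # check the p-value on the wt_log term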

Here, we see an approximately straight loess line in the scatterplot when using the log weight variable, and we also see in the regression model a significant (p < .05) log weight variable. This means we can proceed with the regression, as we have met the linearity assumption (Reminder: we must add both the regular weight term and the log weight term to the model moving forward).
When producing a linear regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity (strictly speaking, the absence of multicollinearity). This assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function from the car package in R to identify multicollinearity; VIF values below 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above 8 – 10 indicate high correlation (and likely multicollinearity).
Next, let’s check the VIF scores for weight and horsepower before we check whether these can be used to predict fuel efficiency. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):
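For example, continuing the hedged mtcars sketch (hp_log and wt_log were created in the linearity checks above; vif() is from the car package):
library(car)
vif_model <- lm(mpg ~ hp + hp_log + wt + wt_log, data = mtcars)
vif(vif_model)   # the handout reports values of roughly 18 to 27 here, i.e., high collinearity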

We can see the VIF scores for both horsepower and vehicle weight in the output, with horsepower and log horsepower having VIF values of 18.1 and 21.7, respectively, and weight and log weight having VIF values of 24.7 and 26.8, respectively. Based on the cutoff values specified above, this indicates that a vehicle’s weight and a vehicle’s horsepower are highly correlated, and are both accounting for much of the same variability within the model. Therefore, the assumption of multicollinearity fails, and one of the high VIF variables (either horsepower or weight) must be removed from the model.
Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of hp and hp_log), as they are inherently correlated.
To run a linear regression in R, we can use the lm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL = lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL):
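For example, a hedged sketch assuming the mtcars data, keeping horsepower (and its log term) as the predictors after dropping weight in the multicollinearity step:
mpg_model <- lm(mpg ~ hp + hp_log, data = mtcars)   # hp_log = log(mtcars$hp), created earlier
summary(mpg_model)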

Running the above code in R will produce the following output:

Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.
In this example, we investigate the relationship between a passenger’s age and the likelihood they survived the sinking of the Titanic, to ultimately assess if we can use age to predict survivability. First we will load in the Titanic dataset:
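The exact source of the Titanic data may differ from the one used here; as one hedged option, the titanic package provides a titanic_train data frame with Survived (0/1), Age, and Sex columns (titanic_df is an illustrative name):
# install.packages("titanic")   # only needed once
library(titanic)
titanic_df <- titanic_train
View(titanic_df)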

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique individual (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

EXPERT NOTE: in some cases, simply having a unique row per participant is not enough to guarantee independence (e.g., family studies where there are multiple family members included in the same analysis). Think critically about your design to ensure your data are independent!
The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and each continuous predictor variable is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.
To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as normal, but we also include an interaction term for each continuous predictor. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X:log(X)” (the interaction).
To test the assumption of linearity, start by creating the regression model:
NAMEOFREGRESSIONMODEL = glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN:log(NAMEOFPREDICTORCOLUMN), data = NAMEOFDATAFRAME, family = binomial)
summary(NAMEOFREGRESSIONMODEL)
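For example, continuing the hedged titanic_df sketch (Survived as the binary outcome, Age as the continuous predictor; bt_model is an illustrative name):
bt_model <- glm(Survived ~ Age + Age:log(Age), data = titanic_df, family = binomial)
summary(bt_model)   # check the Pr(>|z|) value for the Age:log(Age) interaction term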

To pass the linearity assumption, the interaction term(s) (e.g., X:log(X)) must be non-significant (e.g., the value in X:log(X)’s Pr(>|z|) column must be > .05); if one or more of these terms is significant (e.g., p < .05), you have failed the assumption of linearity. To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation [expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable; if a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable]. Here, we have failed linearity since the Age:log(Age) interaction term returned p = .0179. To continue with this example, we will categorize or “bin” this variable.
The selection of categorical cut-off points can vary depending on the research question, the nature of the predictor variable, and the predictor variable’s relationship with the outcome variable. In this case, we will categorize the age variable into two groups: children (Age < 18) and adults (Age 18+). The selection of these categories is based on the hypothesis that children were more likely to be given a lifeboat seat (and thus, survive) than adults. To categorize the data:
NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_CAT = ifelse(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN < VALUE, "CATEGORYIFTRUE", "CATEGORYIFFALSE")
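For example, continuing the hedged titanic_df sketch with the child / adult cut-off described above (Age_cat is an illustrative name):
titanic_df$Age_cat <- ifelse(titanic_df$Age < 18, "Child", "Adult")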

In the following example, we want to add an additional predictor variable (sex) to the previous model (predictor: age; outcome: survival), as it is possible that women were more likely to be given lifeboat seats than men. As before, we start with the continuous Age variable, and so we must re-check linearity: we create the model we wish to use, then add the interaction term for continuous predictors. Here, we have failed linearity again (e.g., the value in X:log(X)’s Pr(>|z|) column was .00684, which is < .05), so we will proceed with the categorized (binned) version of the age variable for our next step.
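A sketch of that linearity re-check, continuing the hedged titanic_df example (Sex is the added predictor):
bt_model2 <- glm(Survived ~ Age + Sex + Age:log(Age), data = titanic_df, family = binomial)
summary(bt_model2)   # the handout reports a significant Age:log(Age) term again, so we fall back to Age_cat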

When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity (strictly speaking, the absence of multicollinearity). This assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function from the car package in R to identify multicollinearity; VIF values below 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above 8 – 10 indicate high correlation (and likely multicollinearity).
Next, let’s check the VIF scores for sex and age_cat before we check whether these can be used to predict survival. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):
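For example, continuing the hedged titanic_df sketch (Age_cat as created above; vif() is from the car package):
library(car)
surv_model <- glm(Survived ~ Age_cat + Sex, data = titanic_df, family = binomial)
vif(surv_model)   # values near 1 indicate essentially no multicollinearity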

We see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.
Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.
To run a simple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variable, and the dataframe they come from: NAMEOFREGRESSIONMODEL = glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = "binomial"). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
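For example, predicting survival from the categorized age variable in the hedged titanic_df sketch:
age_model <- glm(Survived ~ Age_cat, data = titanic_df, family = binomial)
summary(age_model)     # coefficients are on the log-odds scale
exp(coef(age_model))   # exponentiate to get odds ratios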

Running the above code in R will produce the following output:

To run a multiple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL = glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = "binomial"). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
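For example, adding Sex to the model in the hedged titanic_df sketch:
surv_model <- glm(Survived ~ Age_cat + Sex, data = titanic_df, family = binomial)
summary(surv_model)
exp(coef(surv_model))   # odds ratios for each predictor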

Running the above code in R will produce the following output:

Ordinal logistic regression is used when we want to make predictions about an ordinal dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (3 or more groups with inherent order; e.g., small / medium / large or strongly disagree to strongly agree) outcome variable, and one (simple ordinal logistic regression) or more (multiple ordinal logistic regression) predictor variables.
In this example, we investigate the relationship between the log-transformed weight of a mammal’s brain and whether they would be a low, medium, or high-category sleeper. First, we will load in the msleep (mammals sleep) dataset (note: msleep ships with the ggplot2 package, which you can load with library(ggplot2) or as part of the tidyverse with library(tidyverse)):
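For example:
library(ggplot2)   # msleep ships with ggplot2 (also loaded as part of the tidyverse)
data(msleep)
View(msleep)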

Ordinal logistic regression requires an ordinal outcome variable (e.g., the data must be categorical, with an inherent rank or order to the groups). Here, we will use mammals’ time spent sleeping as our outcome variable: we can look at the “sleep_total” variable using the summary() function and the hist() function (or using the View() function, see the next section) to see that this variable is continuous. To use this variable in an ordinal logistic regression, we can categorize the mammals’ sleep into “low”, “medium”, and “high” groups. Expert tip: for R to treat this variable as ordinal, we also need to remember to treat it as an ordered factor using the factor() function with ordered = TRUE; the order of the levels matters (e.g., you can set this to low then medium then high, or high then medium then low, but you cannot use medium then low then high as the order would be broken):
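One hedged way to do this (the cut-off values of 8 and 12 hours and the sleep_cat name are illustrative, not prescribed):
summary(msleep$sleep_total)
hist(msleep$sleep_total)
# Illustrative cut-offs: under 8 hours = "low", 8 to 12 hours = "medium", over 12 hours = "high"
msleep$sleep_cat <- ifelse(msleep$sleep_total < 8, "low",
                    ifelse(msleep$sleep_total <= 12, "medium", "high"))
msleep$sleep_cat <- factor(msleep$sleep_cat,
                           levels = c("low", "medium", "high"),
                           ordered = TRUE)   # note the ordered = argument; the level order matters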

You can double-check that the outcome variable is treated as an “ordered factor” (e.g., ordinal data) by using the following line of code: class(NAMEOFDATAFRAME$NAMEOFOUTCOMECOLUMN).
We can check our predictor variable (brain weight) and plot it against our ordinal outcome variable (sleep_cat); we can use the geom_boxplot() function from the ggplot2 package for this purpose:
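For example, continuing the msleep sketch (brainwt = brain weight; sleep_cat as created above):
ggplot(data = msleep, aes(x = sleep_cat, y = brainwt)) +
  geom_boxplot()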

The resulting boxplot shows that some values are really far away from the rest of the data; these data points might be considered outliers, or maybe the scale we are using isn’t the best choice. Let’s try log-transforming the brain weight variable using the log() function to check this again:
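For example:
ggplot(data = msleep, aes(x = sleep_cat, y = log(brainwt))) +
  geom_boxplot()   # the log scale spreads out the heavily skewed brain weights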

Here, we see a log scale does a much better job displaying our results; the boxplot is no longer squished, and there do not appear to be any obvious outliers. To keep all of our data in the analysis, let’s proceed using the log-transformed brain weight for the analysis.
Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique species of animal (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The assumption of proportional odds (sometimes called the test of parallel lines), briefly, assumes that the effect of each predictor variable must be identical at each partition of the data. In other words, the ‘slope’ value for a predictor variable within an ordinal logistic regression model must not change across the different categorized levels of the outcome variable. We can easily check this assumption using the nominal_test() function from the ordinal package:
library(ordinal)   # provides clm() and nominal_test()
NAMEOFMODEL = clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME)
nominal_test(NAMEOFMODEL)
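For example, continuing the msleep sketch (sleep_model is an illustrative name; the ordinal package is loaded as above):
sleep_model <- clm(sleep_cat ~ log(brainwt), data = msleep)
nominal_test(sleep_model)   # look for p > .05 in the Pr(>Chi) column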

What we are looking for when assessing the test of parallel lines / proportional odds, is for each variable (here, there is only one) to have a non-significant p-value. If we look in the column labelled “Pr(>Chi)”, we can see the p-value for the log-transformed brain weight is p = .8218. As this value is greater than .05, we have passed this assumption. If one or more of your lines is statistically significant (p < .05), you have failed this assumption and should not use this test / model.
In the following example, we want to add an additional predictor variable (body weight) to the previous model (predictor: log brain size; outcome: categorized sleep). Body weight is in the “bodywt” column, and is a continuous variable.
When producing an ordinal logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity (strictly speaking, the absence of multicollinearity). This assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.
To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function from the car package in R to identify multicollinearity; VIF values below 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above 8 – 10 indicate high correlation (and likely multicollinearity).
Next, let’s check the VIF scores for log(bodywt) and log(brainwt) before we check whether these can be used to predict sleep. We build the regression model using the lm() or the glm() function (be sure to add all predictor variables and any transformation terms, if appropriate), and then run: vif(NAMEOFREGRESSIONMODEL):
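For example, one hedged workaround: because VIF is computed from the predictors alone, we can fit an ordinary lm() with the continuous sleep_total as a stand-in outcome and pass that model to vif() from the car package:
library(car)
vif_check <- lm(sleep_total ~ log(brainwt) + log(bodywt), data = msleep)
vif(vif_check)   # the handout reports values above 10 here, i.e., high collinearity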

Here, we see very high (>10) VIF scores; this makes sense, as animals with larger bodies often have larger brains. Based on the cutoff values specified above, this indicates that we have failed the assumption of multicollinearity; in order to appropriately run the regression, we would need to remove one of the two predictor variables. For the purposes of demonstration, we will continue with the analysis, pretending we have passed the assumption of multicollinearity.
There are several ways to run a simple ordinal logistic regression, but one of the easiest is to use the clm() function from the ordinal package. This function requires a formula specifying the ordinal outcome (dependent) variable, the single predictor (independent) variable, and the dataframe they come from: NAMEOFREGRESSIONMODEL = clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis: summary(NAMEOFREGRESSIONMODEL).
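For example, continuing the msleep sketch:
sleep_model <- clm(sleep_cat ~ log(brainwt), data = msleep)
summary(sleep_model)              # the Pr(>|z|) column gives the p-value for each predictor
exp(sleep_model$coefficients)     # convert log odds to odds ratios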

To interpret the regression model, look at the “Coefficients:” section of the table. This will have one line for each predictor variable you used (note: n – 1 lines for categorical predictor variables), and the “Pr(>|z|)” column indicates the p-value for each. Here, we see p = .000369 for log(brainwt), indicating that the log-transformed brain weight variable significantly predicted mammals’ sleep category. Note: The “Estimate” column for an ordinal logistic regression presents the log odds, which we should exponentiate. By exponentiating the log odds, we instead get the odds ratios, which are more interpretable and are generally more widely used when reporting the results of an ordinal logistic regression. We can do this easily using exp(NAMEOFREGRESSIONMODEL$coefficients) like so:

Here, we see the odds ratio (e.g., the exponentiated log odds) for the log-transformed brain weight predictor variable is .656. As this value is less than 1, we can find the percentage decrease in the odds like so: [(1 - .656) * 100] = 34.4%. As our variable was statistically significant, we can interpret the regression as follows: for each one-unit increase in the log-transformed brain weight variable (e.g., the predictor variable), the odds of being in a higher sleep category (e.g., the outcome variable) decreased by 34.4%, on average. To say this in plainer English, mammals with higher log brain weights, on average, needed less sleep than mammals with lower log brain weights.
Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.
Running the above code in R will produce the following output:

There are several ways to run a multiple ordinal logistic regression, but one of the easiest is to use the clm() function from the ordinal package. This function requires a formula specifying the ordinal outcome (dependent) variable, the two or more predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL = clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis: summary(NAMEOFREGRESSIONMODEL).
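For example, continuing the msleep sketch:
sleep_model2 <- clm(sleep_cat ~ log(brainwt) + log(bodywt), data = msleep)
summary(sleep_model2)
exp(sleep_model2$coefficients)   # odds ratios for each predictor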

To interpret the regression model, look at the “Coefficients:” section of the table. This will have one line for each predictor variable you used (note: n – 1 lines for categorical predictor variables), and the “Pr(>|z|)” column indicates the p-value for each. Here, we see p = .793 for log(brainwt) and p = .415 for log(bodywt), indicating that neither the log-transformed brain weight variable nor the log-transformed body weight variable significantly predicted mammals’ sleep category. Note: The “Estimate” column for an ordinal logistic regression presents the log odds, which we should exponentiate. By exponentiating the log odds, we instead get the odds ratios, which are more interpretable and are generally more widely used when reporting the results of an ordinal logistic regression. We can do this easily using exp(NAMEOFREGRESSIONMODEL$coefficients) like so:

Here, we see the odds ratio (e.g., the exponentiated log odds) for the log-transformed brain weight predictor variable is .8999. As this value is less than 1, we can find the percentage decrease in the odds like so: [(1 - .8999) * 100] = 10.0%. Similarly, the log-transformed body weight predictor variable corresponds to a 23.4% decrease in the odds. As our variables were not statistically significant, we would not interpret these results further.
Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.
Running the above code in R will produce the following output:

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.