
Analyze Data: R and RStudio

Contributors: Lindsay Plater, Michelle Dollois, Angelica Nascimento de Oliveira and Riley Oremush

Linear regression

Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.

In this example, we investigate the relationship between a vehicle’s horsepower and fuel efficiency, to ultimately assess if we can use a vehicle’s horsepower (hp) to predict its fuel efficiency (mpg).

Testing assumption(s)

Independence of observations

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique vehicle model (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The mtcars dataset displayed in RStudio, with 13 columns and 32 rows of data showing specifications for different car models.
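If you would like to follow along, the screenshots in this section use R's built-in mtcars dataset; a minimal sketch to load and inspect it:

# Load the built-in mtcars dataset and open it in the data viewer
data(mtcars)
View(mtcars)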

Linearity 

The assumption of linearity assesses whether the relationship between the continuous outcome variable and each continuous predictor variable is, in fact, linear.

To test the assumption of linearity, start by plotting a scatterplot to perform visual inspection (i.e., does the relationship look linear), using the ggplot() function from the ggplot2 package:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point()

If you’re not sure if the data look linear, you could add a loess curve to the scatterplot:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)

The RStudio interface, showing code for creation of a scatterplot in the source window and a scatterplot with a loess curve in the plots window. The scatterplot shows a non-linear relationship.

Here, we can see that the relationship between a vehicle’s horsepower and fuel efficiency appears to be exponentially decreasing; as the loess curve does not look like a straight line, we have failed the assumption of linearity.
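For the mtcars example, the scatterplot with the loess curve shown above can be produced like so (a sketch, assuming the ggplot2 package is installed and loaded):

# Scatterplot of horsepower (hp) vs. fuel efficiency (mpg), with a loess curve overlaid
library(ggplot2)
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)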

Since we failed linearity, we can transform the predictor variable and re-check the linearity assumption. Expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable. Let’s try a log transform on just the predictor variable first:

NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_LOG <- log(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN)

Plot the data again, this time using the transformed predictor variable:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN_LOG, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)

In addition to visual inspection, we must include the non-transformed variable and the transformed variable in the regression to assess model fit:

summary(lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN_LOG, data = NAMEOFDATAFRAME))
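For the mtcars example, the transform, re-plot, and model-fit check might look like the following (a sketch; hp_log matches the naming used in the screenshots):

# Log-transform the predictor, re-plot, then fit a model with both terms
mtcars$hp_log <- log(mtcars$hp)
ggplot(data = mtcars, aes(x = hp_log, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
summary(lm(mpg ~ hp + hp_log, data = mtcars))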

The RStudio interface, showing code for log-transforming a variable, running a regression, and the creation of a scatterplot in the source window and a scatterplot with a loess curve in the plots window. The scatterplot shows a linear relationship, and the regression shows a significant p-value for the log-transformed variable.

A scatterplot that is still non-linear, and / or a non-significant (p > .05) NAMEOFPREDICTOR_LOG regression output indicates that we have failed linearity. You could try a different transformation, and / or you could try also transforming the outcome variable (i.e., a “double transform” as both the predictor and the outcome will be transformed). Here, we have passed linearity as our scatterplot looks roughly like a line and our hp_log p-value is p < .05.

Note: If a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable.

Normality of residuals

Normality of a model’s residuals is a fundamental assumption of a linear regression. You have two options for assessing normality of the residuals: statistically with the Shapiro-Wilk test and / or visually with a histogram. In both cases, you first create your regression model, and then you assess normality using the model’s residuals (which you can extract with the resid() function).

Shapiro-Wilk:

shapiro.test(resid(NAMEOFREGRESSIONMODEL))

Histogram:

hist(resid(NAMEOFREGRESSIONMODEL))
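For the mtcars example, both checks might look like this (a sketch; hp_model is a hypothetical name for a model built with the hp and hp_log predictors created earlier):

# Fit the regression, then check its residuals statistically and visually
hp_model <- lm(mpg ~ hp + hp_log, data = mtcars)
shapiro.test(resid(hp_model))
hist(resid(hp_model))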

The RStudio interface, showing code for normality (Shapiro-Wilk and a histogram) in the source window. The console shows a non-significant Shapiro-Wilk statistic, and the plot window shows a bell-shaped histogram (i.e., normality passed).

A significant Shapiro-Wilk statistic (p < .05) and / or a histogram that does not look like a standard bell-shaped curve indicate that you have failed the test of normality, and should either transform your data (and check all assumptions again) or switch to a non-parametric test (e.g., some researchers use a Spearman correlation if normality for linear regression has failed). Here, we have passed normality as our Shapiro-Wilk statistic p-value is p > .05 and our histogram looks like a bell-shaped curve.

Homogeneity of variance (homoskedasticity)

The homogeneity of variance assumption (also known as homoskedasticity) is used to assess whether the variances between groups are equal; in the case of a linear regression, we are testing whether the variance of the model’s residuals is constant across the range of the model’s fitted values.

With categorical predictor variables, homogeneity of variance is typically assessed using either a Levene’s test or a Bartlett test. With continuous predictor variables, homogeneity of variance is typically assessed using a Breusch-Pagan test.

Let’s use the Breusch-Pagan test (the bptest() function from the lmtest package) and visual inspection to assess homogeneity.

Breusch-Pagan test:

bptest(NAMEOFREGRESSIONMODEL)

Residual versus fitted plot:

plot(NAMEOFREGRESSIONMODEL, 1)
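For the mtcars example (a sketch, reusing the hypothetical hp_model from the normality step and assuming the lmtest package is installed):

# Breusch-Pagan test and residuals vs. fitted plot for the same model
library(lmtest)
bptest(hp_model)
plot(hp_model, 1)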

The RStudio interface, showing code for a Breusch-Pagan test and code for the creation of a residual versus fitted plot in the source window. The Breusch-Pagan test result in the console was non-significant (p > .05) and the scatterplot in the plot window shows a pretty uniform band, both indicating that the homogeneity of variance assumption passed.

A significant Levene’s, Bartlett, or Breusch-Pagan test (p < .05) and / or a residuals versus fitted plot that makes a “cone” or “fan” shape (i.e., a shape that starts wide and then narrows off) indicates that you have failed the homogeneity of variance assumption, and should either transform your data (and check all assumptions again), switch to a non-parametric test (e.g., some researchers use a Mann-Whitney U test for categorical predictor variables if homogeneity for linear regression has failed), or apply an adjustment. Here, we have passed homogeneity as our Breusch-Pagan test p-value is p > .05, and our residuals versus fitted plot has a band of data-points that do not make a cone or fan shape.

Multivariable linear regression: Testing assumptions

In the following example, we want to add an additional predictor variable (the weight of the vehicle) to the previous model (predictor: horsepower; outcome: fuel efficiency). We start as before by checking the linearity assumption of the new predictor variable (weight) by creating a plot. This example produces a non-linear scatterplot, so we add a transformation and re-check both the plot and the model using the transformed variable, like so:

The RStudio interface, showing code for a linear regression and code for the creation of a scatterplot in the source window. The scatterplot shows an approximately linear relationship, and the regression indicates a significant log-transformed weight variable; together, these indicate the regression passes the linearity assumption.

Here, we see an approximately straight loess line in the scatterplot when using the log weight variable, and we also see in the regression model a significant (p < .05) log weight variable. This means we can proceed with the regression, as we have met the linearity assumption (Reminder: we must add both the regular weight term and the log weight term to the model moving forward).
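A sketch of this re-check for the mtcars example (wt_log is a hypothetical name for the log-transformed weight column):

# Log-transform weight, inspect the scatterplot, and check the model fit
mtcars$wt_log <- log(mtcars$wt)
ggplot(data = mtcars, aes(x = wt_log, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
summary(lm(mpg ~ wt + wt_log, data = mtcars))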

Multicollinearity

When producing a linear regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity. Multicollinearity is the assumption that the predictor variables within a multivariable model are predicting a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables accounting for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function (e.g., from the car package) in R to identify multicollinearity; VIF values < 3 indicate very low correlation between predictor variables, while VIF values between 3 – 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values > 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for weight and horsepower before we check whether these can be used to predict fuel efficiency. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):

The RStudio interface, showing code for a linear regression and code for the creation of a scatterplot in the source window. The scatterplot shows an approximately linear relationship, and the regression indicates high VIF scores. The high VIF scores indicate the regression failed the multicollinearity assumption.
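For the mtcars example, this check might look like the following (a sketch; assumes the car package for vif() and the hp_log / wt_log columns created earlier):

# Fit the full model with both predictors (and their log terms), then check VIFs
library(car)
multi_model <- lm(mpg ~ hp + hp_log + wt + wt_log, data = mtcars)
vif(multi_model)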

We can see the VIF scores for both horsepower and vehicle weight in the output, with horsepower and log horsepower having VIF values of 18.1 and 21.7, respectively, and weight and log weight having VIF values of 24.7 and 26.8, respectively. Based on the cutoff values specified above, this indicates that a vehicle’s weight and a vehicle’s horsepower are highly correlated, and are both accounting for much of the same variability within the model. Therefore, the assumption of multicollinearity fails, and one of the high VIF variables (either horsepower or weight) must be removed from the model.

Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of hp and hp_log), as they are inherently correlated.

How to run a linear regression

To run a linear regression in R, we can use the lm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL <- lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL):

The RStudio interface, showing code for a linear regression in the source window. The significant hp_log variable (p = .0002) and the high R^2 value (0.75) indicate that a vehicle's horsepower significantly predicts fuel efficiency.
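For the mtcars example (a sketch; hp_model is a hypothetical model name):

# Fit the final model and view the regression output
hp_model <- lm(mpg ~ hp + hp_log, data = mtcars)
summary(hp_model)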

Interpreting the output

Running the above code in R will produce the following output:

A close-up of the linear regression output. The significant hp_log variable (p = .0002) and the high R^2 value (0.75) indicate that a vehicle's horsepower significantly predicts fuel efficiency.

  1. Call: The code for the regression model.
  2. Residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the slope, i.e., the change in the outcome variable associated with a one-unit change in that predictor, accounting for the other variables in the model.
    1. For every one-unit increase in the log-transformed horsepower variable, on average, there is a 19.96 unit reduction in a vehicle’s fuel efficiency (holding the untransformed horsepower variable constant).
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|t|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
  6. Multiple R-squared and adjusted R-squared: two methods to determine the proportion of the outcome variable’s variance that is being accounted for by the use of the predictor variable(s) included within the model.
    1. The R-squared value provides an indication of the goodness of fit of the model produced. From the R-squared values given above, we can see that roughly 75% of the variance of the outcome variable (fuel efficiency) is being accounted for by this model (horsepower).

Logistic regression

Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.

In this example, we investigate the relationship between a passenger’s age and the likelihood they survived the sinking of the Titanic, to ultimately assess if we can use age to predict survivability. First we will load in the Titanic dataset:

The RStudio interface, showing some packages being loaded and the Titanic dataset being loaded as a dataframe.

Testing assumption(s)

Independence of observations

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique individual (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The Titanic dataframe open in RStudio using the View() function.

EXPERT NOTE: in some cases, simply having a unique row per participant is not enough to guarantee independence (e.g., family studies where there are multiple family members included in the same analysis). Think critically about your design to ensure your data are independent!

Linearity

The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and any continuous predictor variable(s) is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.

To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as normal, but we also include an interaction term for each continuous predictor. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X:log(X)” (the interaction).

To test the assumption of linearity, start by creating the regression model:

NAMEOFREGRESSIONMODEL = glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN:log(NAMEOFPREDICTORCOLUMN), data = NAMEOFDATAFRAME, family = binomial)

Assessing linearity in RStudio by using the Box-Tidwell test.
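A concrete sketch of this check for the Titanic example (the dataframe name titanic and the column names Survived and Age are assumptions; adjust them to match your own data):

# Box-Tidwell-style check: include Age and its interaction with log(Age)
bt_model <- glm(Survived ~ Age + Age:log(Age), data = titanic, family = binomial)
summary(bt_model)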

To pass the linearity assumption, the interaction term(s) (e.g., X:log(X)) must be non-significant (e.g., the value in X:log(X)’s Pr(>|z|) column must be > .05); if one or more of these terms is significant (e.g., p < .05), you have failed the assumption of linearity. To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation [expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable; if a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable]. Here, we have failed linearity since the Age:log(Age) interaction term returned p = .0179. To continue with this example, we will categorize or “bin” this variable.

The selection of categorical cut-off points can vary depending on the research question, the nature of the predictor variable, and the predictor variable’s relationship with the outcome variable. In this case, we will categorize the age variable into two groups: children (Age < 18) and adults (Age 18+). The selection of these categories is based on the hypothesis that children were more likely to be given a lifeboat seat (and thus, survive) than adults. To categorize the data:

NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_CAT = ifelse(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN < VALUE, "CATEGORYIFTRUE", "CATEGORYIFFALSE")

RStudio code; using the ifelse() function to categorize age into child and adult.
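For the Titanic example, the binning step might look like this (hypothetical names; Age_cat, child, and adult are illustrative labels):

# Passengers under 18 become "child", everyone else becomes "adult"
titanic$Age_cat <- ifelse(titanic$Age < 18, "child", "adult")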

Multivariable logistic regression: Testing assumptions

In the following example, we want to add an additional predictor variable (sex) to the previous model (predictor: age; outcome: survival), as it is possible that women were more likely to be given lifeboat seats than men. As before, we start with the continuous Age variable, and so we must re-check linearity: we create the model we wish to use, then add the interaction term for continuous predictors. Here, we have failed linearity again (e.g., the value in X:log(X)’s Pr(>|z|) column was .00684, which is < .05), so we will proceed with the categorized (binned) version of the age variable for our next step.

Assessing linearity in RStudio by using the Box-Tidwell test; the output (looking at the Age:log(Age) line) shows that Age has failed linearity with p = .00684.

Multicollinearity

When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity. Multicollinearity is the assumption that the predictor variables within a multivariable model are predicting a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables accounting for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function (e.g., from the car package) in R to identify multicollinearity; VIF values < 3 indicate very low correlation between predictor variables, while VIF values between 3 – 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values > 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for sex and age_cat before we check whether these can be used to predict survival. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):

Variance inflation factors (VIF scores) from a logistic regression in RStudio. These values are around 1, showing little to no multicollinearity.
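A sketch for the Titanic example (hypothetical dataframe / column names; assumes the car package for vif()):

# Fit the two-predictor model, then check the VIF values
library(car)
surv_model <- glm(Survived ~ Age_cat + Sex, data = titanic, family = binomial)
vif(surv_model)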

We see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.

Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.

How to run a simple logistic regression

To run a simple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variable, and the dataframe they come from: NAMEOFREGRESSIONMODEL <- glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = "binomial"). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
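A sketch for the Titanic example, using the binned age variable (hypothetical dataframe / column names):

# Simple logistic regression, then odds ratios from the exponentiated coefficients
surv_model <- glm(Survived ~ Age_cat, data = titanic, family = "binomial")
summary(surv_model)
exp(coef(surv_model))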

Running a simple logistic regression in RStudio (one predictor and a binary outcome variable).

Interpreting the output

Running the above code in R will produce the following output:

The output of the simple logistic regression in RStudio (one predictor and a binary outcome variable). Age significantly predicts survivability.

  1. Call: The code for the regression model.
  2. Deviance residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the change in the LOG ODDS of the outcome associated with a one-unit change in that predictor, accounting for any other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 7 below).
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
    1. Note: R often provides values in scientific notation. Here, p = 7.64e-09 = 0.00000000764 for the intercept.
  6. The Null Deviance value provides an indication of how well the outcome variable is predicted by the model using only the model’s intercept. The Residual Deviance value provides an indication of how well the outcome variable is predicted by the model, given the inclusion of the predictor variable(s). When residual deviance is LOWER than null deviance, this generally indicates the model is providing a good fit.
  7. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. On average, children were 1.906 times more likely to survive the Titanic than adults (p = .0018). This can also be written as a percentage; on average, children were 90.6% more likely to survive the Titanic than adults (p = .0018).

How to run a multiple logistic regression

To run a multiple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL <- glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = "binomial"). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
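A sketch for the Titanic example (hypothetical dataframe / column names):

# Multiple logistic regression with two predictors, then odds ratios
surv_model_multi <- glm(Survived ~ Age_cat + Sex, data = titanic, family = "binomial")
summary(surv_model_multi)
exp(coef(surv_model_multi))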

Running a multiple logistic regression in RStudio (more than one predictor variable and a binary outcome variable).

Interpreting the output

Running the above code in R will produce the following output:

The output of the multiple logistic regression in RStudio (more than one predictor variable and a binary outcome variable). Age was trending towards significantly predicting survivability, but sex significantly predicted survivability (p < 2e-16).

  1. Call: The code for the regression model.
  2. Deviance residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the change in the LOG ODDS of the outcome associated with a one-unit change in that predictor, accounting for any other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 7 below).
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
    1. Note: R often provides values in scientific notation. Here, p = 5.3e-12 = 0.0000000000053 for the intercept.
  6. The Null Deviance value provides an indication of how well the outcome variable is predicted by the model using only the model’s intercept. The Residual Deviance value provides an indication of how well the outcome variable is predicted by the model, given the inclusion of the predictor variable(s). When residual deviance is LOWER than null deviance, this generally indicates the model is providing a good fit.
  7. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. On average, and accounting for sex, children were 1.584 times more likely to survive the Titanic than adults (p = .061). This can also be written as a percentage; on average, children were 58.4% more likely to survive the Titanic than adults (p = .061).
    2. On average, and accounting for age, men were 0.086 times as likely to survive the Titanic as women (p = 0.0000000000000002). This can (and should) be flipped for interpretability; women were 11.6 times more likely to survive the Titanic than men. This can also be written as a percentage; on average, and accounting for age, women were 1060% more likely to survive the Titanic than men (p = .0000000000000002).

Ordinal logistic regression

Ordinal logistic regression is used when we want to make predictions about an ordinal dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (3 or more groups with inherent order; e.g., small / medium / large or strongly disagree to strongly agree) outcome variable, and one (simple ordinal logistic regression) or more (multiple ordinal logistic regression) predictor variables.

In this example, we investigate the relationship between the log-transformed weight of a mammal’s brain and whether they would be a low, medium, or high-category sleeper. First, we will load in the msleep (mammals sleep) dataset (note: this requires ggplot2 to be loaded, which you can do with library(ggplot2) or by loading the tidyverse with library(tidyverse)):

RStudio interface, loading the msleep dataset (note: you need to install tidyverse or ggplot2 to get this dataset).

Testing assumption(s)

Ordinal outcome variable

Ordinal logistic regression requires an ordinal outcome variable (e.g., the data must be categorical, with an inherent rank or order to the groups). Here, we will use mammals’ time spent sleeping as our outcome variable: we can look at the “sleep_total” variable using the summary() function and the hist() function (or using the View() function, see the next section) to see that this variable is continuous. To use this variable in an ordinal logistic regression, we can categorize the mammals’ sleep into “low”, “medium”, and “high” groups. Expert tip: for RStudio to treat this variable as ordinal, we also need to remember to treat it as an ordered factor using the factor() function with ordered = TRUE; the order of the levels here matters (e.g., you can set this to low then medium then high or high then medium then low, but you cannot use medium then low then high as the order would be broken):

RStudio interface, showing how to categorize a continuous variable for use in ordinal logistic regression.
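A sketch of this step for the msleep example (the cut-off values below are illustrative and may differ from those used in the screenshots):

# Bin total sleep into three groups, then make it an ordered factor (low < medium < high)
msleep$sleep_cat <- cut(msleep$sleep_total, breaks = c(0, 8, 12, Inf), labels = c("low", "medium", "high"))
msleep$sleep_cat <- factor(msleep$sleep_cat, levels = c("low", "medium", "high"), ordered = TRUE)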

You can double-check that the outcome variable is treated as an “ordered factor” (e.g., ordinal data) by using the following line of code: class(NAMEOFDATAFRAME$NAMEOFOUTCOMECOLUMN).

A note about the predictor variable

We can check our predictor variable (brain weight) and plot it against our ordinal outcome variable (sleep_cat); we can use the geom_boxplot() function from the ggplot2 package for this purpose:

RStudio interface, showing the brain weights of different mammals in a boxplot. Some of the values are far away from the rest of the data, making the boxplot appear squished.

The resulting boxplot shows that some values are really far away from the rest of the data; these data points might be considered outliers, or maybe the scale we are using isn’t the best choice. Let’s try log-transforming the brain weight variable using the log() function to check this again:

An RStudio boxplot, showing log-transformed brain weight categorized by level of sleep for different mammals.
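A sketch of both boxplots for the msleep example (assuming ggplot2 is loaded and sleep_cat was created as above):

# Raw brain weight, then log-transformed brain weight, by sleep category
ggplot(data = msleep, aes(x = sleep_cat, y = brainwt)) + geom_boxplot()
ggplot(data = msleep, aes(x = sleep_cat, y = log(brainwt))) + geom_boxplot()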

Here, we see a log scale does a much better job displaying our results; the boxplot is no longer squished, and there do not appear to be any obvious outliers. To keep all of our data in the analysis, let’s proceed using the log-transformed brain weight for the analysis.

Independence of observations

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique species of animal (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

RStudio interface, showing the msleep data sheet. There is one row per each animal type.

Proportional odds or parallel lines

The assumption of proportional odds (sometimes called the test of parallel lines), briefly, assumes that the effect of each predictor variable must be identical at each partition of the data. In other words, the ‘slope’ value for a predictor variable within an ordinal logistic regression model must not change across the different categorized levels of the outcome variable. We can easily check this assumption using the nominal_test() function from the ordinal package:

NAMEOFMODEL = clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME)

nominal_test(NAMEOFMODEL)
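For the msleep example, this might look like the following (a sketch; sleep_model is a hypothetical model name and the ordinal package is assumed to be installed):

# Fit the ordinal model, then test the proportional odds / parallel lines assumption
library(ordinal)
sleep_model <- clm(sleep_cat ~ log(brainwt), data = msleep)
nominal_test(sleep_model)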

RStudio interface, showing the test of parallel lines (proportional odds) for an ordinal logistic regression.

What we are looking for when assessing the test of parallel lines / proportional odds is for each variable (here, there is only one) to have a non-significant p-value. If we look in the column labelled “Pr(>Chi)”, we can see the p-value for the log-transformed brain weight is p = .8218. As this value is greater than .05, we have passed this assumption. If one or more of your lines is statistically significant (p < .05), you have failed this assumption and should not use this test / model.

Multivariable ordinal logistic regression: Testing assumptions

In the following example, we want to add an additional predictor variable (body weight) to the previous model (predictor: log brain size; outcome: categorized sleep). Body weight is in the “bodywt” column, and is a continuous variable.

Multicollinearity

When producing an ordinal logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: multicollinearity. Multicollinearity is the assumption that the predictor variables within a multivariable model are predicting a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables accounting for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between two predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function (e.g., from the car package) in R to identify multicollinearity; VIF values < 3 indicate very low correlation between predictor variables, while VIF values between 3 – 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values > 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for log(bodywt) and log(brainwt) before we check whether these can be used to predict sleep. We build the regression model using the lm() or the glm() function (be sure to add all predictor variables and any transformation terms, if appropriate), and then run: vif(NAMEOFREGRESSIONMODEL):

The RStudio interface, showing the code for calculating the variance inflation factor (VIF) in the source window and the output (VIF = 14.67) in the console.
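One way to sketch this check for the msleep example (here the continuous sleep_total column stands in as the outcome purely so we can get VIF values for the two predictors; assumes the car package):

# Check collinearity between the two log-transformed predictors
library(car)
vif_model <- lm(sleep_total ~ log(brainwt) + log(bodywt), data = msleep)
vif(vif_model)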

Here, we see very high (>10) VIF scores; this makes sense, as animals with larger bodies often have larger brains. Based on the cutoff values specified above, this indicates that we have failed the assumption of multicollinearity; in order to appropriately run the regression, we would need to remove one of the two predictor variables. For the purposes of demonstration, we will continue with the analysis, pretending we have passed the assumption of multicollinearity.

How to run a simple ordinal logistic regression

There are several ways to run a simple ordinal logistic regression, but one of the easiest is to use the clm() function from the ordinal package. This function requires a formula specifying the ordinal outcome (dependent) variable, the single predictor (independent) variable, and the dataframe they come from: NAMEOFREGRESSIONMODEL <- clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis: summary(NAMEOFREGRESSIONMODEL).
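For the msleep example (a sketch; sleep_model is a hypothetical model name):

# Simple ordinal logistic regression with one predictor
library(ordinal)
sleep_model <- clm(sleep_cat ~ log(brainwt), data = msleep)
summary(sleep_model)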

RStudio interface, showing the output from the simple ordinal logistic regression (note: estimates reflect log odds).

To interpret the regression model, look at the “Coefficients:” section of the table. This will have one line for each predictor variable you used (note: n – 1 lines for categorical predictor variables), and the “Pr(>|z|)” column indicates the p-value for each. Here, we see p = .000369 for log(brainwt), indicating that the log-transformed brain weight variable significantly predicted mammals’ sleep category. Note: The “Estimate” column for ordinal logistic regression presents the log odds, which we should exponentiate. By exponentiating the log odds, we instead get the odds ratios, which are more interpretable and are generally more widely used when reporting the results of an ordinal logistic regression. We can do this easily using exp(NAMEOFREGRESSIONMODEL$coefficients) like so:

RStudio interface, showing the exponentiated coefficients from the simple ordinal logistic regression.

Here, we see the odds ratio (e.g., the exponentiated log odds) for the log-transformed brain weight predictor variable is .656. This value is less than 1, meaning we can find the percentage decrease in the odds like so: [(1 - .656) * 100] = 34.4%. As our variable was statistically significant, we can interpret the regression as follows: for each one-unit increase in the log-transformed brain weight variable (e.g., the predictor variable), the odds of being in a higher sleep category (e.g., the outcome variable) decreased by 34.4%, on average. To say this in plain-er English, mammals with higher log brain weights, on average, needed less sleep than mammals with lower log brain weights.

Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.

Interpreting the output

Running the above code in R will produce the following output:

The simple ordinal logistic regression output in RStudio, showing a statistically significant effect of the log-transformed brain weight variable on mammals' sleep category.

  1. formula: The (outcome ~ predictor) call used for the regression model.
  2. data: The dataframe used for the regression model.
  3. Coefficients: The table we look at to get the results of the simple ordinal logistic regression.
  4. Coefficients: the “Estimate” value is the change in the LOG ODDS of the outcome associated with a one-unit change in that predictor, accounting for any other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 6 below).
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
  6. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. On average, each one-unit increase in log-transformed brain weight decreased the odds of being in a higher sleep category by 34.4% (p = .0004).

How to run a multiple ordinal logistic regression

There are several ways to run a multiple ordinal logistic regression, but one of the easiest is to use the clm() function from the ordinal package. This function requires a formula specifying the ordinal outcome (dependent) variable, the two or more predictor (independent) variables, and the dataframe they come from: NAMEOFREGRESSIONMODEL <- clm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis: summary(NAMEOFREGRESSIONMODEL).
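For the msleep example (a sketch; sleep_model_multi is a hypothetical model name):

# Multiple ordinal logistic regression with two predictors
library(ordinal)
sleep_model_multi <- clm(sleep_cat ~ log(brainwt) + log(bodywt), data = msleep)
summary(sleep_model_multi)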

RStudio interface, showing the output from the multiple ordinal logistic regression (note: estimates reflect log odds).

To interpret the regression model, look at the “Coefficients:” section of the table. This will have one line for each predictor variable you used (note: n – 1 lines for categorical predictor variables), and the “Pr(>|z|)” column indicates the p-value for each. Here, we see p = .793 for log(brainwt) and p = .415 for log(bodywt), indicating that neither the log-transformed brain weight variable nor the log-transformed body weight variable significantly predicted mammals’ sleep category. Note: The “Estimate” column for ordinal logistic regression presents the log odds, which we should exponentiate. By exponentiating the log odds, we instead get the odds ratios, which are more interpretable and are generally more widely used when reporting the results of an ordinal logistic regression. We can do this easily using exp(NAMEOFREGRESSIONMODEL$coefficients) like so:

RStudio interface, showing the exponentiated coefficients from the multiple ordinal logistic regression.

Here, we see the odds ratio (e.g., the exponentiated log odds) for the log-transformed brain weight predictor variable is .8999. This value is less than 1, meaning we can find the percentage decrease in the odds like so: [(1 - .8999) * 100] = 10.0%. Similarly, the log-transformed body weight predictor variable corresponded to a 23.4% decrease in the odds. As our variables were not statistically significant, we would not interpret these results further.

Expert tip: if your odds ratio is greater than 1, you would instead see a percentage increase in the odds of being in a different category of the outcome variable.

Interpreting the output

Running the above code in R will produce the following output:

The multiple ordinal logistic regression output in RStudio, showing a statistically NONsignificant effect of the log-transformed brain weight variable and the log-transformed body weight variable on mammals' sleep category.

  1. formula: The (outcome ~ predictor + predictor) call used for the regression model.
  2. data: The dataframe used for the regression model.
  3. Coefficients: The table we look at to get the results of the multiple ordinal logistic regression.
  4. Coefficients: the “Estimate” value is the change in the LOG ODDS of the outcome associated with a one-unit change in that predictor, accounting for any other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 6 below).
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variables are statistically significant (p < .05) in their ability to predict the value of the outcome variable.
  6. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. For example, when reporting the brain weight line: On average, and accounting for body weight, each one-unit increase in log-transformed brain weight was associated with a 10.0% decrease in the odds of being in a higher sleep category (p = .793). Note this value is not statistically significant, so we wouldn’t want to over-interpret the findings.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.