
Analyze Data: R and RStudio

Contributors: Lindsay Plater, Michelle Dollois, Angelica Nascimento de Oliveira and Riley Oremush

Linear regression

Linear regression is used when we want to make predictions about a continuous dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single continuous outcome variable, and one (simple linear regression) or more (multiple linear regression) predictor variables.

In this example, we investigate the relationship between a vehicle’s horsepower and fuel efficiency, to ultimately assess if we can use a vehicle’s horsepower (hp) to predict its fuel efficiency (mpg).

Testing assumption(s)

Independence of observations

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique vehicle model (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The mtcars dataset displayed in RStudio, with 13 columns and 32 rows of data showing specifications for different car models.

Linearity 

The assumption of linearity assesses whether the relationship between the continuous outcome variable and the continuous predictor variable(s) is, in fact, linear.

To test the assumption of linearity, start by creating a scatterplot for visual inspection (i.e., does the relationship look linear?), using the ggplot() function from the ggplot2 package:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point()

If you’re not sure if the data look linear, you could add a loess curve to the scatterplot:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
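As a concrete sketch of the two plots above, using the mtcars data from this example (hp as the predictor, mpg as the outcome):

# load ggplot2 (install.packages("ggplot2") if needed)
library(ggplot2)

# scatterplot of fuel efficiency (mpg) against horsepower (hp)
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()

# the same scatterplot with a loess curve overlaid (no confidence band)
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)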

The RStudio interface, showing code for creation of a scatterplot in the source window and a scatterplot with a loess curve in the plots window. The scatterplot shows a non-linear relationship.

Here, we can see that the relationship between a vehicle’s horsepower and fuel efficiency appears to be exponentially decreasing; as the loess curve does not look like a straight line, we have failed the assumption of linearity.

Since we failed linearity, we can transform the predictor variable and re-check the linearity assumption. Expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable. Let’s try a log transform on just the predictor variable first:

NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_LOG <- log(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN)

Plot the data again, this time using the transformed predictor variable:

ggplot(data = NAMEOFDATAFRAME, aes(x = NAMEOFPREDICTORCOLUMN_LOG, y = NAMEOFOUTCOMECOLUMN)) + geom_point() + stat_smooth(method = "loess", se = FALSE)
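Continuing the mtcars sketch, the log transform and the re-plot might look like this (hp_log is a column name chosen here for illustration):

# create a log-transformed copy of the predictor
mtcars$hp_log <- log(mtcars$hp)

# re-check linearity using the transformed predictor
library(ggplot2)
ggplot(data = mtcars, aes(x = hp_log, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)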

In addition to visual inspection, we must include the non-transformed variable and the transformed variable in the regression to assess model fit:

lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN_LOG, data = NAMEOFDATAFRAME)
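For the mtcars sketch, storing the model and wrapping it in summary() displays the coefficient table, including the p-value for the log term discussed below:

# fit the model with both the untransformed and log-transformed predictor
linearity_check <- lm(mpg ~ hp + hp_log, data = mtcars)

# the hp_log row of the coefficient table gives the relevant p-value
summary(linearity_check)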

The RStudio interface, showing code for log-transforming a variable, running a regression, and the creation of a scatterplot in the source window and a scatterplot with a loess curve in the plots window. The scatterplot shows a linear relationship, and the regression shows a significant p-value for the log-transformed variable.

A scatterplot that is still non-linear, and / or a non-significant (p > .05) NAMEOFPREDICTORCOLUMN_LOG term in the regression output, indicates that we have failed linearity. You could try a different transformation, and / or you could try also transforming the outcome variable (i.e., a “double transform”, as both the predictor and the outcome will be transformed). Here, we have passed linearity, as our scatterplot looks roughly like a line and our hp_log p-value is p < .05.

Note: If a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable.

Normality of residuals

Normality of a model’s residuals is a fundamental assumption of linear regression. You have two options for assessing normality of the residuals: statistically with the Shapiro-Wilk test and / or visually with a histogram. In both cases, you first create your regression model and then assess normality using its residuals (which can be extracted from the model with the resid() function).

Shapiro-Wilk:

shapiro.test(resid(NAMEOFREGRESSIONMODEL))

Histogram:

hist(resid(NAMEOFREGRESSIONMODEL))
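A minimal sketch for the mtcars example, assuming the hp_log column created earlier:

# fit the regression model, then assess normality of its residuals
mpg_model <- lm(mpg ~ hp + hp_log, data = mtcars)

# Shapiro-Wilk test on the residuals (p > .05 suggests normality)
shapiro.test(resid(mpg_model))

# histogram of the residuals (look for a roughly bell-shaped curve)
hist(resid(mpg_model))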

The RStudio interface, showing code for normality (Shapiro-Wilk and a histogram) in the source window. The console shows a non-significant Shapiro-Wilk statistic, and the plot window shows a bell-shaped histogram (i.e., normality passed).

A significant Shapiro-Wilk statistic (p < .05) and / or a histogram that does not look like a standard bell-shaped curve indicate that you have failed the test of normality, and should either transform your data (and check all assumptions again) or switch to a non-parametric test (e.g., some researchers use a Spearman correlation if normality for linear regression has failed). Here, we have passed normality as our Shapiro-Wilk statistic p-value is p > .05 and our histogram looks like a bell-shaped curve.

Homogeneity of variance (homoskedasticity)

The homogeneity of variance assumption (also known as homoskedasticity) is used to assess whether the variances between groups are equal; in the case of a linear regression with continuous predictors, we are testing whether the variance of the model’s residuals is constant across the range of the model’s fitted values.

With categorical predictor variables, homogeneity of variance is typically assessed using either a Levene’s test or a Bartlett test. With continuous predictor variables, homogeneity of variance is typically assessed using a Breusch-Pagan test.

Let’s use the Breusch-Pagan test (the bptest() function from the lmtest package) and visual inspection to assess homogeneity.

Breusch-Pagan test:

bptest(NAMEOFREGRESSIONMODEL)

Residual versus fitted plot:

plot(NAMEOFREGRESSIONMODEL, 1)
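A sketch for the mtcars example (mpg_model as created in the normality step), assuming the lmtest package is installed:

# load lmtest for the Breusch-Pagan test (install.packages("lmtest") if needed)
library(lmtest)

# Breusch-Pagan test (p > .05 suggests homogeneity of variance)
bptest(mpg_model)

# residuals versus fitted plot (look for a roughly even band of points)
plot(mpg_model, 1)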

The RStudio interface, showing code for a Breusch-Pagan test and code for the creation of a residual versus fitted plot in the source window. The Breusch-Pagan test result in the console was non-significant (p > .05) and the scatterplot in the plot window shows a pretty uniform band, both indicating that the homogeneity of variance assumption passed.

A significant Levene’s, Bartlett, or Breusch-Pagan test (p < .05) and / or a residuals versus fitted plot that makes a “cone” or “fan” shape (i.e., a band of points that starts narrow and spreads out, or starts wide and narrows) indicates that you have failed the homogeneity of variance assumption, and you should either transform your data (and check all assumptions again), switch to a non-parametric test (e.g., some researchers use a Mann-Whitney U test for categorical predictor variables if homogeneity for linear regression has failed), or apply an adjustment. Here, we have passed homogeneity, as our Breusch-Pagan test p-value is p > .05 and our residuals versus fitted plot shows a band of data points that does not make a cone or fan shape.

Multivariable linear regression: Testing assumptions

In the following example, we want to add an additional predictor variable (the weight of the vehicle) to the previous model (predictor: horsepower; outcome: fuel efficiency). We start as before by checking the linearity assumption of the new predictor variable (weight) by creating a plot. This example produces a non-linear scatterplot, so we add a transformation and re-check both the plot and the model using the transformed variable, like so:

The RStudio interface, showing code for a linear regression and code for the creation of a scatterplot in the source window. The scatterplot shows an approximately linear relationship, and the regression indicates a significant log-transformed weight variable; together, these indicate the regression passes the linearity assumption.

Here, we see an approximately straight loess line in the scatterplot when using the log weight variable, and we also see in the regression model a significant (p < .05) log weight variable. This means we can proceed with the regression, as we have met the linearity assumption (Reminder: we must add both the regular weight term and the log weight term to the model moving forward).
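A sketch of this step for the mtcars example (wt is the weight column in mtcars; wt_log is an illustrative name for its log transform; the exact transformation needed in your own data may differ):

# log-transform the new predictor (vehicle weight)
mtcars$wt_log <- log(mtcars$wt)

# visual check of linearity for the transformed weight variable
library(ggplot2)
ggplot(data = mtcars, aes(x = wt_log, y = mpg)) + geom_point() + stat_smooth(method = "loess", se = FALSE)

# model-fit check: look for a significant (p < .05) wt_log term
summary(lm(mpg ~ wt + wt_log, data = mtcars))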

Multicollinearity

When producing a linear regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: the absence of multicollinearity. This assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function (e.g., from the car package) in R to identify multicollinearity; VIF values < 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above roughly 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for weight and horsepower before we check whether these can be used to predict fuel efficiency. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):
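A sketch for the mtcars example, assuming the car package for vif() and the transformed columns created above:

# load the car package for vif() (install.packages("car") if needed)
library(car)

# model with both predictors and their log-transform terms
full_model <- lm(mpg ~ hp + hp_log + wt + wt_log, data = mtcars)

# variance inflation factors for each term in the model
vif(full_model)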

The RStudio interface, showing code for a linear regression and code for the creation of a scatterplot in the source window. The scatterplot shows an approximately linear relationship, and the regression output indicates high VIF scores. The high VIF scores indicate the regression failed the multicollinearity assumption.

We can see the VIF scores for both horsepower and vehicle weight in the output, with horsepower and log horsepower having VIF values of 18.1 and 21.7, respectively, and weight and log weight having VIF values of 24.7 and 26.8, respectively. Based on the cutoff values specified above, this indicates that a vehicle’s weight and a vehicle’s horsepower are highly correlated, and are both accounting for much of the same variability within the model. Therefore, the assumption of multicollinearity fails, and one of the high VIF variables (either horsepower or weight) must be removed from the model.

Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of hp and hp_log), as they are inherently correlated.

How to run a linear regression

To run a linear regression in R, we can use the lm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: lm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL):
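For the mtcars example, assuming the final model keeps horsepower (plus its log transform, as required by the linearity check above), the code might look like this:

# fit the linear regression model
mpg_model <- lm(mpg ~ hp + hp_log, data = mtcars)

# full output: coefficients, p-values, and R-squared
summary(mpg_model)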

The RStudio interface, showing code for a linear regression in the source window. The significant hp_log variable (p = .0002) and the high R^2 value (0.75)  indicate that a vehicle's horsepower significantly predicts fuel efficiency.

Interpreting the output

Running the above code in R will produce the following output:

A close-up of the linear regression output. The significant hp_log variable (p = .0002) and the high R^2 value (0.75)  indicate that a vehicle's horsepower significantly predicts fuel efficiency.

  1. Call: The code for the regression model.
  2. Residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the slope for that predictor, i.e., the change in the outcome associated with a one-unit change in that predictor, accounting for the other variables in the model.
    1. For every one-unit change in the horsepower variable (and the log-transformed variable), on average, there is a 19.96-unit reduction in a vehicle’s fuel efficiency.
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|t|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
  6. Multiple R-squared and adjusted R-squared: two methods to determine the proportion of the outcome variable’s variance that is being accounted for by the use of the predictor variable(s) included within the model.
    1. The R-squared value provides an indication of the goodness of fit of the model produced. From the R-squared values given above, we can see that roughly 75% of the variance of the outcome variable (fuel efficiency) is being accounted for by this model (horsepower). The short sketch after this list shows how to pull these values out of the model in code.
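If it is useful, the values described above can also be pulled out of the model programmatically; a short sketch for the mtcars example (mpg_model as fitted above):

# coefficient table: estimates, standard errors, t-values, and p-values
coef(summary(mpg_model))

# multiple and adjusted R-squared
summary(mpg_model)$r.squared
summary(mpg_model)$adj.r.squared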

Logistic regression

Logistic regression is used when we want to make predictions about a binary dependent variable (also called an outcome variable) based on one or more independent variables (also called predictor variables). This requires a single categorical (binary) outcome variable, and one (simple logistic regression) or more (multiple logistic regression) predictor variables.

In this example, we investigate the relationship between a passenger’s age and the likelihood they survived the sinking of the Titanic, to ultimately assess if we can use age to predict survivability. First we will load in the Titanic dataset:

The RStudio interface, showing some packages being loaded and the Titanic dataset being loaded as a dataframe.
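The screenshot loads the data from a package; one possible way to do this (an assumption on our part, as several Titanic datasets exist) is the CRAN titanic package, whose titanic_train dataframe includes Survived, Sex, and Age columns:

# one option for obtaining Titanic passenger data (install.packages("titanic") if needed)
library(titanic)

# copy into a working dataframe; titanic_df is an illustrative name
titanic_df <- titanic_train

# inspect the data
View(titanic_df)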

Testing assumption(s)

Independence of observations

Independence of observations means that each observation must be entirely independent of the others. To assess independence, we can look at our data using View(NAMEOFDATAFRAME); here, we can see that each observation is a unique individual (i.e., each observation is independent of the others), thereby passing the assumption of independence of observations.

The Titanic dataframe open in RStudio using the View() function.

EXPERT NOTE: in some cases, simply having a unique row per participant is not enough to guarantee independence (e.g., family studies where there are multiple family members included in the same analysis). Think critically about your design to ensure your data are independent!

Linearity

The assumption of linearity assesses whether the relationship between the logit transformation of the outcome variable and any continuous predictor variable(s) is, in fact, linear. NOTE: you only need to check linearity for continuous predictor variables, not for categorical predictor variables.

To test this assumption, we will use the Box-Tidwell test. To do this, we include all predictors in the model as normal, but we also include an interaction term for each continuous predictor. For example, if your model includes continuous predictor variable “X”, when you check for linearity your model must include “X” as well as “X:log(X)” (the interaction).

To test the assumption of linearity, start by creating the regression model:

NAMEOFREGRESSIONMODEL <- glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN:log(NAMEOFPREDICTORCOLUMN), data = NAMEOFDATAFRAME, family = binomial)
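A sketch, assuming the titanic_df dataframe from above; rows with missing or zero ages are dropped first, since log(Age) requires positive, non-missing values:

# keep only rows with a usable (positive, non-missing) age
titanic_bt <- subset(titanic_df, !is.na(Age) & Age > 0)

# Box-Tidwell check: the model plus an Age:log(Age) interaction term
bt_model <- glm(Survived ~ Age + Age:log(Age), data = titanic_bt, family = binomial)

# look at the p-value in the Age:log(Age) row of the output
summary(bt_model)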

Assessing linearity in RStudio by using the Box-Tidwell test.

To pass the linearity assumption, the interaction term(s) (e.g., X:log(X)) must be non-significant (e.g., the value in X:log(X)’s Pr(>|z|) column must be > .05); if one or more of these terms is significant (e.g., p < .05), you have failed the assumption of linearity. To proceed with the logistic regression if you have failed linearity, you could categorize (e.g., “bin”) your continuous variable(s), or you could try a transformation [expert tip: start by transforming the predictor variable, and if that doesn’t work, you can try also transforming the outcome variable; if a transformation is required to produce a linear relationship between a predictor variable and outcome variable, all subsequent models must incorporate both the untransformed predictor variable and the transformed predictor variable]. Here, we have failed linearity since the Age:log(Age) interaction term returned p = .0179. To continue with this example, we will categorize or “bin” this variable.

The selection of categorical cut-off points can vary depending on the research question, the nature of the predictor variable, and the predictor variable’s relationship with the outcome variable. In this case, we will categorize the age variable into two groups: children (Age < 18) and adults (Age 18+). The selection of these categories is based on the hypothesis that children were more likely to be given a lifeboat seat (and thus, survive) than adults. To categorize the data:

NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN_CAT <- ifelse(NAMEOFDATAFRAME$NAMEOFPREDICTORCOLUMN < VALUE, "CATEGORYIFTRUE", "CATEGORYIFFALSE")
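For the Titanic sketch, using 18 as the cut-off and Age_cat as an illustrative column name:

# categorize age: under 18 = "Child", 18 and over = "Adult"
titanic_df$Age_cat <- ifelse(titanic_df$Age < 18, "Child", "Adult")

# check the result (passengers with missing ages stay missing)
table(titanic_df$Age_cat, useNA = "ifany")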

RStudio code; using the ifelse() function to categorize age into child and adult.

Multivariable logistic regression: Testing assumptions

In the following example, we want to add an additional predictor variable (sex) to the previous model (predictor: age; outcome: survival), as it is possible that women were more likely to be given lifeboat seats than men. As before, we start with the continuous Age variable, and so we must re-check linearity: we create the model we wish to use, then add the interaction term for continuous predictors. Here, we have failed linearity again (e.g., the value in X:log(X)’s Pr(>|z|) column was .00684, which is < .05), so we will proceed with the categorized (binned) version of the age variable for our next step.
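A sketch of this re-check, continuing the titanic_df assumption, with Sex added as a second predictor:

# re-check linearity for Age with Sex also in the model
bt_model2 <- glm(Survived ~ Age + Sex + Age:log(Age), data = subset(titanic_df, !is.na(Age) & Age > 0), family = binomial)

# inspect the p-value in the Age:log(Age) row of the output
summary(bt_model2)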

Assessing linearity in RStudio by using the Box-Tidwell test; the output (looking at the Age:log(Age) line) shows that Age has failed linearity with p = .00684.

Multicollinearity

When producing a logistic regression model with multiple predictor (independent) variables, there is an additional assumption that must be met prior to analyzing the model’s output: the absence of multicollinearity. This assumption requires that the predictor variables within a multivariable model each predict a sufficiently different aspect of the outcome (dependent) variable, so that we are not including variables that account for the same variability. Essentially, this assumption checks that each predictor in the model is actually explaining / accounting for unique variance within the model.

To measure the amount of multicollinearity that exists between predictor variables, we can use the Variance Inflation Factor (VIF). As VIF values increase, the likelihood of multicollinearity being present within the model increases. We can use the vif() function (e.g., from the car package) in R to identify multicollinearity; VIF values < 3 indicate very low correlation between predictor variables, VIF values between 3 and 8 indicate some correlation (and a potential risk of multicollinearity), and VIF values above roughly 8 – 10 indicate high correlation (and likely multicollinearity).

Next, let’s check the VIF scores for sex and age_cat before we check whether these can be used to predict survival. We build the regression model (be sure to add all predictor variables and any transformation terms, if appropriate), and then run vif(NAMEOFREGRESSIONMODEL):
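A sketch, again assuming the car package and the titanic_df columns created above:

# load the car package for vif() (install.packages("car") if needed)
library(car)

# logistic model with both predictors
surv_model <- glm(Survived ~ Age_cat + Sex, data = titanic_df, family = binomial)

# variance inflation factors (values near 1 indicate little multicollinearity)
vif(surv_model)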

Variance inflation factors (VIF scores) from a logistic regression in RStudio. These values are around 1, showing little to no multicollinearity.

We see low values (approximately 1.0) for both the categorized age variable and sex. Based on the cutoff values specified above, this indicates we have passed the assumption of multicollinearity, and can proceed with the regression.

Note: We do not consider the VIF values between the untransformed and transformed versions of the same variable (e.g., comparing the VIF of age and age_squared, if you went that route), as they are inherently correlated.

How to run a simple logistic regression

To run a simple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variable, and the dataframe they come from: glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = “binomial”). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
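For the Titanic sketch, using the binned age variable created earlier:

# simple logistic regression: does age category predict survival?
age_model <- glm(Survived ~ Age_cat, data = titanic_df, family = "binomial")

# regression output (log-odds estimates and p-values)
summary(age_model)

# exponentiate the coefficients to get odds ratios
exp(coef(age_model))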

Running a simple logistic regression in RStudio (one predictor and a binary outcome variable).

Interpreting the output

Running the above code in R will produce the following output:

The output of the simple logistic regression in RStudio (one predictor and a binary outcome variable). Age significantly predicts survivability.

  1. Call: The code for the regression model.
  2. Deviance residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the LOG ODDS change in the relationship, or slope, produced by the specific coefficient accounting for other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 7 below).
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
    1. Note: R often provides values in scientific notation. Here, p = 7.64e-09 = 0.00000000764 for the intercept.
  6. The Null Deviance value provides an indication of how well the outcome variable is predicted by the model using only the model’s intercept. The Residual Deviance value provides an indication of how well the outcome variable is predicted by the model, given the inclusion of the predictor variable(s). When residual deviance is LOWER than null deviance, this generally indicates the model is providing a good fit.
  7. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. On average, children were 1.906 times more likely to survive the Titanic than adults (p = .0018). This can also be written as a percentage; on average, children were 90.6% more likely to survive the Titanic than adults (p = .0018).

How to run a multiple logistic regression

To run a multiple logistic regression in R, we can use the glm() function to create the model. This function requires a formula specifying the outcome (dependent) and predictor (independent) variables, and the dataframe they come from: glm(NAMEOFOUTCOMECOLUMN ~ NAMEOFPREDICTORCOLUMN + NAMEOFPREDICTORCOLUMN, data = NAMEOFDATAFRAME, family = “binomial”). Finally, use the summary() function to see the output of the regression analysis, summary(NAMEOFREGRESSIONMODEL), and run the line: exp(coef(NAMEOFREGRESSIONMODEL)).
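For the Titanic sketch, adding sex to the model:

# multiple logistic regression: age category and sex predicting survival
surv_model <- glm(Survived ~ Age_cat + Sex, data = titanic_df, family = "binomial")

# regression output (log-odds estimates and p-values)
summary(surv_model)

# odds ratios for interpretation
exp(coef(surv_model))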

Running a multiple logistic regression in RStudio (more than one predictor variable and a binary outcome variable).

Interpreting the output

Running the above code in R will produce the following output:

The output of the multiple logistic regression in RStudio (more than one predictor variable and a binary outcome variable). Age was trending towards significantly predicting survivability, but sex significantly predicted survivability (p < 2e-16).

  1. Call: The code for the regression model.
  2. Deviance residuals: The range of residual values produced by the regression model (i.e., the quartiles and median values).
  3. Coefficients: the “Estimate” value is the LOG ODDS change in the relationship, or slope, produced by the specific coefficient accounting for other variables in the model. As log odds are unintuitive, we generally exponentiate them into odds ratios for reporting / interpretation (see 7 below).
  4. Coefficients: the “Std. Error” value is the average amount by which a coefficient’s estimate may vary from our “Estimate” value.
  5. Coefficients: the “Pr(>|z|)” value is the p-value indicating whether the predictor variable is statistically significant (p < .05) in its ability to predict the value of the outcome variable.
    1. Note: R often provides values in scientific notation. Here, p = 5.3e-12 = 0.0000000000053 for the intercept.
  6. The Null Deviance value provides an indication of how well the outcome variable is predicted by the model using only the model’s intercept. The Residual Deviance value provides an indication of how well the outcome variable is predicted by the model, given the inclusion of the predictor variable(s). When residual deviance is LOWER than null deviance, this generally indicates the model is providing a good fit.
  7. The exponentiated coefficients: when we exponentiate the Estimates (log odds), we get odds ratios. We use odds ratios (in combination with p-values) to interpret the impact of the predictor variable(s) on the outcome variable.
    1. On average, and accounting for sex, children were 1.584 times more likely to survive the Titanic than adults (p = .061). This can also be written as a percentage; on average, children were 58.4% more likely to survive the Titanic than adults (p = .061).
    2. On average, and accounting for age, men were 0.086 times as likely to survive the Titanic as women (p < .0000000000000002). This can (and should) be flipped for interpretability; women were 11.6 times more likely to survive the Titanic than men. This can also be written as a percentage; on average, and accounting for age, women were 1060% more likely to survive the Titanic than men (p < .0000000000000002).


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.