Analyze Data: R and RStudio

Contributors: Lindsay Plater and Michelle Dollois

Descriptives

RStudio has many functions that accomplish the same task. In the previous section, there was an example where three columns of data were added together, and the result was divided by 3. This is the long-hand way to calculate the mean, which is manageable if you only have a few columns of data. If you have many columns of data, typing out every column by hand quickly becomes tedious and error-prone.
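To make the contrast concrete, here is a minimal sketch that computes a per-row mean both ways. The dataframe `scores` and its columns `test1` to `test3` are hypothetical stand-ins for your own data, and `rowMeans()` is shown as one possible base R shortcut, not necessarily the approach used in the previous section.

```r
# Hypothetical data: three test scores for three participants
scores <- data.frame(
  test1 = c(70, 85, 90),
  test2 = c(80, 75, 95),
  test3 = c(90, 95, 85)
)

# Long-hand: add the columns together and divide by the number of columns
scores$mean_longhand <- (scores$test1 + scores$test2 + scores$test3) / 3

# Shortcut: rowMeans() averages across the selected columns for each row
scores$mean_rowmeans <- rowMeans(scores[, c("test1", "test2", "test3")])

print(scores)
```

Both new columns should hold the same values; the shortcut simply scales better as the number of columns grows.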

There are many ways to calculate descriptive statistics in RStudio, including procedures that calculate measures such as central tendency (mean, median, mode), dispersion (range, standard deviation, variance, minimum and maximum), and kurtosis and skewness. These kinds of descriptive statistics are best suited to describe continuous variables.

How to run descriptives

  1. summary(NAMEOFDATAFRAME)
    • where NAMEOFDATAFRAME is what you have called your data. This function returns several summary statistics at once for each column of data: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. If you want these values for just one column, use summary(NAMEOFDATAFRAME$NAMEOFCOLUMN).

A screenshot of the RStudio interface where several summary statistics (min, 1st and 3rd quartiles, median, mean, max) were calculated for each column using the summary() function.
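As a rough illustration of the summary() step, the sketch below builds a small made-up dataframe and summarises it. The name `survey` and its columns (`gender`, `height`, `weight`) are assumptions for the example only.

```r
# Hypothetical data; one weight is missing (NA)
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female"),
  height = c(160, 175, 165, 180, 170),
  weight = c(55, 80, 60, 85, NA)
)

# Summary statistics for every column in the dataframe
print(summary(survey))

# Summary statistics for a single column, using the $ format
height_summary <- summary(survey$height)
print(height_summary)
```

For numeric columns that contain missing values, summary() also reports how many NA values were found.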

  1. sapply(NAMEOFDATAFRAME, FUNCTION, na.rm = TRUE)
    • where NAMEOFDATAFRAME is what you have called your data, FUNCTION is the descriptive statistic you would like (e.g., mean, sd, var, min, max, median, range, quantile), and na.rm specifies whether you wish to remove “NA” (missing) values before calculating (TRUE = yes, FALSE = no). This function will attempt to return the specified value (e.g., the mean) for ALL columns in your dataframe.
    • NOTE: when working with categorical data (e.g., “gender” and “colour” in the screenshot), some summary statistics (e.g., the mean) don’t make sense, and the output will return “NA” for those columns. If a grouping column (e.g., “gender”, “colour”) is coded as numbers (e.g., 0, 1), sapply will apply the specified function and return a value, which may not be appropriate or interpretable.

A screenshot of the RStudio interface where the mean and sd were calculated using the sapply() function. The means and sds of all columns are displayed in the console.
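The sapply() step might look like the following sketch. The dataframe `survey` is hypothetical, and keeping only the numeric columns first is one possible way to sidestep the categorical-column issue described in the note above; it is an assumption of this example, not a required step.

```r
# Hypothetical data with one categorical column and one missing value
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female"),
  height = c(160, 175, 165, 180, 170),
  weight = c(55, 80, 60, 85, NA)
)

# Keep only the numeric columns so the mean and sd are meaningful
numeric_cols <- survey[sapply(survey, is.numeric)]

# Apply a function to every remaining column, ignoring missing values
column_means <- sapply(numeric_cols, mean, na.rm = TRUE)
column_sds   <- sapply(numeric_cols, sd, na.rm = TRUE)

# Without na.rm = TRUE, a column containing NA returns NA
column_means_with_na <- sapply(numeric_cols, mean)

print(column_means)
print(column_sds)
print(column_means_with_na)
```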

  1. NAMEOFDATAFRAME %>% summarise(FUNCTION(NAMEOFCOLUMN, na.rm = TRUE))
    • where NAMEOFDATAFRAME is what you have called your data, na.rm specifies whether you wish to remove “NA” (missing) values (TRUE = yes, FALSE = no), FUNCTION is the function (mean, sd, etc.), and NAMEOFCOLUMN is the specific column you would like the function applied to.
    • NOTE: summarise() comes from the dplyr package, which is part of the tidyverse; the tidyverse library must be installed and loaded for this to run. The code uses the pipe operator (%>%), and can be read as “take my dataframe and then summarise it using the specified function on the specified column”.

A screenshot of the RStudio interface where the mean and sd were calculated using the summarise() function. The means and sds of one specified column are displayed in the console.
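A minimal sketch of the summarise() step follows. It assumes the dplyr package is installed (loading the full tidyverse with library(tidyverse) would work as well), and the dataframe `survey` and the output names `mean_weight` and `sd_weight` are hypothetical.

```r
library(dplyr)  # part of the tidyverse; assumed to be installed

# Hypothetical data; one weight is missing (NA)
survey <- data.frame(
  height = c(160, 175, 165, 180, 170),
  weight = c(55, 80, 60, 85, NA)
)

# "Take survey, and then summarise the weight column"
weight_stats <- survey %>%
  summarise(
    mean_weight = mean(weight, na.rm = TRUE),
    sd_weight   = sd(weight, na.rm = TRUE)
  )

print(weight_stats)
```

Naming each result inside summarise() (here `mean_weight` and `sd_weight`) gives the output readable column headers.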

  1. Numerous packages provide their own functions for calculating descriptive statistics, including (but not limited to): Hmisc, pastecs, psych, and doBy.

Frequencies and crosstabs

There are many ways to calculate frequency and contingency tables in RStudio. These kinds of descriptive statistics are best suited to describe categorical variables.

How to run frequencies

  1. table(NAMEOFDATAFRAME$COLUMNS) or table(NAMEOFDATAFRAME$ROWS, NAMEOFDATAFRAME$COLUMNS) to generate frequency tables
    • where NAMEOFDATAFRAME is what you have called your dataframe, ROWS is the name of the column in the dataframe you would like to be the rows in your table, and COLUMNS is the name of the column in the dataframe you would like to be the columns in your table.
    • it is possible to get a three-way contingency table by adding another argument to this function: [, NAMEOFDATAFRAME$NAMEOFCOLUMN].
    • if you wish to see if you have missing values, add the argument: [, useNA = "ifany"].

A screenshot of the RStudio interface where the frequencies were calculated for Gender and the Gender-Colour crosstabs using the table() function. The frequencies for the specified groups are displayed in the console.
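The table() variations described above can be sketched as follows. The dataframe `survey` and its columns (`gender`, `colour`, `year`) are made up for illustration, and one gender value is deliberately missing to show the effect of useNA.

```r
# Hypothetical data; the last gender value is missing (NA)
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", NA),
  colour = c("blue", "red", "blue", "green", "red", "blue"),
  year   = c("first", "second", "first", "second", "first", "second")
)

# One-way frequency table (missing values are dropped by default)
gender_counts <- table(survey$gender)

# Two-way contingency table: gender as rows, colour as columns
gender_by_colour <- table(survey$gender, survey$colour)

# Three-way contingency table: add a third column as another argument
three_way <- table(survey$gender, survey$colour, survey$year)

# Include a count of missing values, if there are any
gender_with_na <- table(survey$gender, useNA = "ifany")

print(gender_counts)
print(gender_by_colour)
print(gender_with_na)
```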

  1. prop.table(NAMEOFTABLE) to generate tables of proportions
    • where NAMEOFTABLE is the name of the table you create using the table() function. This returns the contingency table (i.e., the crosstabs showing the proportion of responses for each combination of variables).
    • by default, prop.table(NAMEOFTABLE) divides each cell by the grand total, so all of the proportions in the table add up to 1. You can instead request row proportions using prop.table(NAMEOFTABLE, 1) or column proportions using prop.table(NAMEOFTABLE, 2). If you use the “1” format, the crosstabs will show the proportion of responses within each category of the row variable (here, “Gender”: the “female” proportions will add up to 1, and the “male” proportions will add up to 1). If you use the “2” format, the crosstabs will show the proportion of responses within each category of the column variable (here, “Colour”: the “blue” proportions will add up to 1, etc.). Multiply the result by 100 if you would like percentages.

A screenshot of the RStudio interface where the proportional frequencies were calculated for Gender and Colour using the prop.table() function. The proportional frequencies for the specified groups are displayed in the console.
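The three forms of prop.table() can be compared with a short sketch. The dataframe `survey` is hypothetical; the checks at the end print what each set of proportions sums to.

```r
# Hypothetical data
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  colour = c("blue", "red", "blue", "green", "red", "blue")
)

# Build the contingency table first: gender as rows, colour as columns
gender_by_colour <- table(survey$gender, survey$colour)

# Proportions of the grand total (the whole table sums to 1)
overall_props <- prop.table(gender_by_colour)

# Row proportions (each gender row sums to 1)
row_props <- prop.table(gender_by_colour, 1)

# Column proportions (each colour column sums to 1)
col_props <- prop.table(gender_by_colour, 2)

print(sum(overall_props))
print(rowSums(row_props))
print(colSums(col_props))
```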

  1. margin.table(NAMEOFTABLE) to generate marginal frequencies
    • where NAMEOFTABLE is the name of the table you create using the table() function. With no further arguments, this returns the sum of all cell counts in the table (i.e., the number of observations that were tabulated).
    • you can ask for the row totals using margin.table(NAMEOFTABLE, 1) or the column totals using margin.table(NAMEOFTABLE, 2). If you use the “1” format, the marginal totals will show the count of responses for each category of the row variable. If you use the “2” format, the marginal totals will show the count of responses for each category of the column variable.

A screenshot of the RStudio interface where the marginal frequencies were calculated for Gender and Colour using the margin.table() function. The marginal frequencies for the specified groups are displayed in the console.
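A small sketch of the margin.table() options, again using a hypothetical `survey` dataframe:

```r
# Hypothetical data
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  colour = c("blue", "red", "blue", "green", "red", "blue")
)

# Build the contingency table first: gender as rows, colour as columns
gender_by_colour <- table(survey$gender, survey$colour)

# Grand total: the sum of every cell in the table
grand_total <- margin.table(gender_by_colour)

# Row totals: one count per gender
row_totals <- margin.table(gender_by_colour, 1)

# Column totals: one count per colour
col_totals <- margin.table(gender_by_colour, 2)

print(grand_total)
print(row_totals)
print(col_totals)
```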

  1. NAMEOFDATAFRAME %>% group_by(NAMEOFCOLUMN) %>% summarise(group_frequency = n()) to generate frequencies
    • where NAMEOFDATAFRAME is what you have called your data, NAMEOFCOLUMN is the specific column you would like the frequencies for, group_frequency is the header that will be used in the output for the summary, and n() is a function that counts observations.
    • NOTE: group_by(), summarise(), and n() come from the dplyr package, which is part of the tidyverse; the tidyverse library must be installed and loaded for this to run. The code uses the pipe operator (%>%), and can be read as “take my dataframe and then group by this column and then summarise it using the specified function”.

A screenshot of the RStudio interface where the counts were calculated for Gender and Colour using the group_by() function. The counts for the specified groups are displayed in the console.
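The grouped-frequency pipeline might look like this in practice. The sketch assumes dplyr is installed (library(tidyverse) would also load it), and `survey` is a hypothetical dataframe.

```r
library(dplyr)  # part of the tidyverse; assumed to be installed

# Hypothetical data
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  colour = c("blue", "red", "blue", "green", "red", "blue")
)

# "Take survey, then group by gender, then count the rows in each group"
gender_freq <- survey %>%
  group_by(gender) %>%
  summarise(group_frequency = n())

# The same pattern works for any other grouping column
colour_freq <- survey %>%
  group_by(colour) %>%
  summarise(group_frequency = n())

print(gender_freq)
print(colour_freq)
```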

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.