How to use RStudio: Descriptive Statistics

Descriptives

R, the language that runs inside RStudio, has many functions that accomplish the same task. In the previous section, there was an example where three columns of data were added together and the result was divided by 3. This is the long-hand way to calculate a mean, and it is manageable when you only have a few columns of data. With many columns, it quickly becomes tedious and error-prone.
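As a rough illustration of why the long-hand approach does not scale, the sketch below computes a per-row mean both ways. It assumes the earlier example averaged across columns within each row. The data frame `scores` and its columns `test1`, `test2`, and `test3` are hypothetical names invented for this example.

```r
# Hypothetical data: three test scores for four people
scores <- data.frame(
  test1 = c(10, 20, 30, 40),
  test2 = c(20, 30, 40, 50),
  test3 = c(30, 40, 50, 60)
)

# Long-hand: add the three columns and divide by 3
longhand_mean <- (scores$test1 + scores$test2 + scores$test3) / 3

# Built-in alternative: rowMeans() averages across the listed columns,
# so adding more columns only means adding more names to the vector
builtin_mean <- rowMeans(scores[, c("test1", "test2", "test3")])

print(longhand_mean)
print(builtin_mean)
```

Both approaches give the same values; the built-in version simply avoids typing out every column in the sum.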

There are many ways to calculate descriptive statistics in RStudio, including procedures that calculate measures of central tendency (mean, median, mode), dispersion (range, standard deviation, variance, minimum and maximum), and shape (kurtosis and skewness). These kinds of descriptive statistics are best suited to describing continuous variables.

How to run descriptives

summary(NAMEOFDATAFRAME)
- where NAMEOFDATAFRAME is what you have called your data. This function returns several summary statistics at once for each column of data. For numeric columns, it returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. If you want these values for just one column, use the format summary(NAMEOFDATAFRAME$NAMEOFCOLUMN).
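A minimal sketch of summary() on a whole data frame and on a single column. The data frame `survey` and its columns `age` and `income` are hypothetical, and the values are invented for illustration.

```r
# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  age    = c(21, 35, 28, 42, 30),
  income = c(30, 52, 41, NA, 45)
)

# Summary statistics for every column (missing values are counted separately)
print(summary(survey))

# Summary statistics for a single column, using the $ format
age_summary <- summary(survey$age)
print(age_summary)
```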

sapply(NAMEOFDATAFRAME, FUNCTION, na.rm = TRUE)
- where NAMEOFDATAFRAME is what you have called your data, na.rm specifies whether you wish to remove “NA” (missing) values (TRUE = yes, FALSE = no), and FUNCTION is the descriptive statistic you would like, such as: mean, sd, var, min, max, median, range, quantile. This function will attempt to return the specified value (e.g., the mean) for ALL columns in your dataframe.
- NOTE: when working with categorical data (e.g., columns such as “gender” or “colour”), some summary statistics (e.g., the mean) don’t make sense, and the output for those columns will be “NA”, usually with a warning. If a grouping column (e.g., “gender”, “colour”) is coded as numbers (e.g., 0, 1), sapply will apply the specified function and return a value (which may not be appropriate / interpretable).
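One way to sidestep the categorical-column problem is to keep only the numeric columns before calling sapply(). The sketch below assumes a hypothetical data frame `survey` with invented columns `age`, `income`, and `gender`.

```r
# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  age    = c(21, 35, 28, 42, 30),
  income = c(30, 52, 41, NA, 45),
  gender = c("female", "male", "female", "male", "female")
)

# Keep only the numeric columns so that the mean is meaningful
numeric_columns <- survey[sapply(survey, is.numeric)]

# Apply a function to every remaining column; na.rm = TRUE is passed on
# to mean() and sd() so the missing income value is ignored
column_means <- sapply(numeric_columns, mean, na.rm = TRUE)
column_sds   <- sapply(numeric_columns, sd, na.rm = TRUE)

print(column_means)
print(column_sds)
```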

NAMEOFDATAFRAME %>% summarise(FUNCTION(NAMEOFCOLUMN, na.rm = TRUE))
- where NAMEOFDATAFRAME is what you have called your data, na.rm specifies whether you wish to remove “NA” (missing) values (TRUE = yes, FALSE = no), FUNCTION is the function (mean, sd, etc.), and NAMEOFCOLUMN is the specific column you would like the function applied to.
- NOTE: summarise() comes from the dplyr package, which is part of the tidyverse, so the tidyverse library must be installed and loaded before you run it. The code uses the pipe operator (%>%) and can be read as “take my dataframe and then summarise it using the specified function on the specified column”.
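A short sketch of the piped form, assuming the dplyr package (part of the tidyverse) is installed. The data frame `survey`, the column `income`, and the output names `mean_income` and `sd_income` are hypothetical.

```r
library(dplyr)  # loaded on its own here; library(tidyverse) also provides it

# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  age    = c(21, 35, 28, 42, 30),
  income = c(30, 52, 41, NA, 45)
)

# Several statistics can be requested in one summarise() call;
# the names on the left become the column headers in the output
income_summary <- survey %>%
  summarise(
    mean_income = mean(income, na.rm = TRUE),
    sd_income   = sd(income, na.rm = TRUE)
  )

print(income_summary)
```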

Numerous packages provide their own functions for calculating descriptive statistics, including (but not limited to): Hmisc, pastecs, psych, and doBy.

Frequencies and crosstabs

There are many ways to calculate frequency and contingency tables in RStudio. These kinds of descriptive statistics are best suited to describe categorical variables.

How to run frequencies

table(NAMEOFDATAFRAME$COLUMNS) or table(NAMEOFDATAFRAME$ROWS, NAMEOFDATAFRAME$COLUMNS) to generate frequency tables
- where NAMEOFDATAFRAME is what you have called your dataframe, ROWS is the name of the column in the dataframe you would like to be the rows in your table, and COLUMNS is the name of the column in the dataframe you would like to be the columns in your table.
- it is possible to get a three-way contingency table by adding another argument to this function: [, NAMEOFDATAFRAME$NAMEOFCOLUMN].
- if you wish to see whether you have missing values, add the argument: [, useNA = "ifany"].
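The following sketch shows a one-way table, a two-way table, and the effect of the useNA argument. The data frame `survey` and its columns `gender` and `colour` are hypothetical, with one missing gender value added on purpose.

```r
# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", NA),
  colour = c("blue", "blue", "red", "red", "blue", "red")
)

# One-way frequency table
colour_counts <- table(survey$colour)

# Two-way table: gender as rows, colour as columns
# (rows with a missing value are dropped by default)
gender_by_colour <- table(survey$gender, survey$colour)

# Same table, with an extra NA row because gender has a missing value
gender_by_colour_na <- table(survey$gender, survey$colour, useNA = "ifany")

print(colour_counts)
print(gender_by_colour)
print(gender_by_colour_na)
```

Note that the quotation marks around ifany must be plain straight quotes when typed into R.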

prop.table(NAMEOFTABLE) to generate tables of proportions
- where NAMEOFTABLE is the name of the table you create using the table() function. This returns the contingency table expressed as proportions (i.e., crosstabs showing the proportion of responses for each combination of variables).
- you can request proportions of the overall total using prop.table(NAMEOFTABLE), row proportions using prop.table(NAMEOFTABLE, 1), or column proportions using prop.table(NAMEOFTABLE, 2). With no second argument, all cells in the table add up to 1 (i.e., 100%). If you use the “1” format, the crosstabs will show the proportion of responses within each category of the row variable (e.g., if the row variable is “Gender”, the proportions in the “female” row will add up to 1, and the proportions in the “male” row will add up to 1). If you use the “2” format, the crosstabs will show the proportion of responses within each category of the column variable (e.g., if the column variable is “Colour”, the proportions in the “blue” column will add up to 1, etc.). Multiply the result by 100 if you want percentages.
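A brief sketch of the three forms of prop.table(). Note that prop.table() returns proportions on a 0 to 1 scale, so the sums below are 1 rather than 100. The data frame `survey` and the table name `gender_by_colour` are hypothetical.

```r
# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  colour = c("blue", "blue", "red", "red", "blue", "red")
)
gender_by_colour <- table(survey$gender, survey$colour)

overall_props <- prop.table(gender_by_colour)     # all cells sum to 1
row_props     <- prop.table(gender_by_colour, 1)  # each row sums to 1
col_props     <- prop.table(gender_by_colour, 2)  # each column sums to 1

# Multiply by 100 to display percentages instead of proportions
print(round(row_props * 100, 1))
```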

margin.table(NAMEOFTABLE) to generate marginal frequencies
- where NAMEOFTABLE is the name of the table you create using the table() function. With no second argument, this returns the sum of all cell counts in the table (i.e., the number of observations included in the table).
- you can request the grand total using margin.table(NAMEOFTABLE), the row totals using margin.table(NAMEOFTABLE, 1), or the column totals using margin.table(NAMEOFTABLE, 2). If you use the “1” format, the marginal totals will show the count of responses for each category of the row variable. If you use the “2” format, the marginal totals will show the count of responses for each category of the column variable.
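A short sketch of margin.table() with and without the second argument. The data frame `survey` and the table name `gender_by_colour` are hypothetical.

```r
# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  colour = c("blue", "blue", "red", "red", "blue", "red")
)
gender_by_colour <- table(survey$gender, survey$colour)

total_count <- margin.table(gender_by_colour)     # grand total of all cells
row_totals  <- margin.table(gender_by_colour, 1)  # one total per gender
col_totals  <- margin.table(gender_by_colour, 2)  # one total per colour

print(total_count)
print(row_totals)
print(col_totals)
```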

NAMEOFDATAFRAME %>% group_by(NAMEOFCOLUMN) %>% summarise(group_frequency = n()) to generate frequencies
- where NAMEOFDATAFRAME is what you have called your data, NAMEOFCOLUMN is the specific column you would like the frequencies for, group_frequency is the header that will be used in the output for the summary, and n() is a function that counts observations.
- NOTE: this function requires the tidyverse library to be installed and loaded to run. This function uses the pipe operator (%>%), and can be read as “take my dataframe and then group by this column and then summarise it using the specified function”.
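A minimal sketch of the grouped frequency count, assuming the dplyr package (part of the tidyverse) is installed. The data frame `survey` and its column `colour` are hypothetical.

```r
library(dplyr)  # loaded on its own here; library(tidyverse) also provides it

# Hypothetical data frame; all names and values are invented
survey <- data.frame(
  colour = c("blue", "blue", "red", "red", "blue", "green")
)

# One output row per colour, with the number of observations in each group
colour_frequencies <- survey %>%
  group_by(colour) %>%
  summarise(group_frequency = n())

print(colour_frequencies)
```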
