Automate the creation of a summary table with gtsummary



This post uses the gtsummary package to create a table that summarizes a dataset with descriptive statistics, inferential statistics, and more. We’ll go through several examples with reproducible code.


Packages


For this post, we need to load the following library:

# install.packages("gtsummary")
library(gtsummary)


Default output for summary table


The gtsummary package uses the tbl_summary() function to generate the summary table, and it works well with the %>% pipe operator.

It automatically detects the type of each column and uses it to decide which statistics to compute. By default, these are:
- median, 1st and 3rd quartiles for numeric columns
- number of observations and proportion for categorical columns

library(gtsummary)

# create dataset: one row per Class/Sex/Age/Survived combination (32 rows),
# with the passenger count in Freq
data("Titanic")
df = as.data.frame(Titanic)

# create the table
df %>%
  tbl_summary()
Characteristic    N = 32¹
Class
    1st           8 (25%)
    2nd           8 (25%)
    3rd           8 (25%)
    Crew          8 (25%)
Sex
    Male          16 (50%)
    Female        16 (50%)
Age
    Child         16 (50%)
    Adult         16 (50%)
Survived          16 (50%)
Freq              14 (1, 77)
¹ n (%); Median (IQR)
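
The defaults described above can be overridden. As a minimal sketch, assuming the statistic and digits arguments of tbl_summary() and the all_continuous() / all_categorical() selectors are available in your installed gtsummary version, the code below reports the mean and standard deviation for numeric columns and n / N (%) for categorical ones. The object name tbl is only there for illustration.

```r
library(gtsummary)

# same dataset as above
data("Titanic")
df = as.data.frame(Titanic)

# override the default statistics:
# mean (sd) for numeric columns, n / N (%) for categorical columns
tbl = df %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 1 # one decimal for numeric summaries
  )

# print the table
tbl
```

The strings in curly braces are placeholders that gtsummary fills in with the computed statistics, so the layout of each cell can be changed without writing a custom function.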

Add p-values and statistical details


If you want to add p-values to the table, you have to pass by=variable_name to the tbl_summary() function. A p-value comes from comparing groups, so the table needs a grouping variable.

The variable in the by argument is used to split the dataset into sub-samples (2 if it’s dichotomous, 3 if it has 3 distinct labels, etc.). Those sub-samples are then compared for each column in the dataset, and the test used depends on the type of data.

In this case, we add:
- add_p() to create a new column for p-values
- add_overall() to add a new column with descriptive statistics for the whole sample

library(gtsummary)

# create dataset
data("Titanic")
df = as.data.frame(Titanic)

# create the table
df %>%
  tbl_summary(by=Survived) %>%
  add_overall() %>%
  add_p()
Characteristic    Overall, N = 32¹    No, N = 16¹    Yes, N = 16¹    p-value²
Class                                                                >0.9
    1st           8 (25%)             4 (25%)        4 (25%)
    2nd           8 (25%)             4 (25%)        4 (25%)
    3rd           8 (25%)             4 (25%)        4 (25%)
    Crew          8 (25%)             4 (25%)        4 (25%)
Sex                                                                  >0.9
    Male          16 (50%)            8 (50%)        8 (50%)
    Female        16 (50%)            8 (50%)        8 (50%)
Age                                                                  >0.9
    Child         16 (50%)            8 (50%)        8 (50%)
    Adult         16 (50%)            8 (50%)        8 (50%)
Freq              14 (1, 77)          9 (0, 96)      14 (10, 75)     0.6
¹ n (%); Median (IQR)
² Fisher’s exact test; Pearson’s Chi-squared test; Wilcoxon rank sum test
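
To make the link between the column type and the test more concrete, here is a minimal sketch. It first runs the Wilcoxon rank sum test on Freq by hand with base R, which should correspond to the p-value reported for that row, and then asks add_p() for a t-test on numeric columns instead. It assumes that the test argument of add_p() accepts the "t.test" name in your installed gtsummary version. The names w and tbl are only for illustration.

```r
library(gtsummary)

# same dataset as above
data("Titanic")
df = as.data.frame(Titanic)

# Wilcoxon rank sum test on Freq between the two Survived groups, run by hand
# (R may warn that an exact p-value cannot be computed because of ties)
w = wilcox.test(Freq ~ Survived, data = df)
w$p.value

# request a t-test for numeric columns instead of the default test
tbl = df %>%
  tbl_summary(by = Survived) %>%
  add_p(test = all_continuous() ~ "t.test")

# print the table
tbl
```

Categorical columns keep their default test here, since only all_continuous() is overridden.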

Add a column based on a custom function


Thanks to the add_stat() function, we can create a new column based on our own functions.

Below, we define a my_anova() function that returns the p-value of a one-way ANOVA and pass it to the add_stat() function.

library(gtsummary)

# create dataset
data("iris")
df = as.data.frame(iris)

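# custom statistic: p-value of a one-way ANOVA of `variable` across the `by` groups
my_anova = function(data, variable, by, ...) {
  # build the formula, e.g. Sepal.Length ~ Species
  result = aov(as.formula(paste(variable, "~", by)), data = data)
  summary(result)[[1]][["Pr(>F)"]][1] # p-value for the group effect
}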

# create the table
df %>%
  tbl_summary(by=Species) %>%
  add_overall() %>%
  add_p() %>%
  add_stat(fns = everything() ~ my_anova) %>%
  modify_header(
    list(
      add_stat_1 ~ "**p-value**",
      all_stat_cols() ~ "**{level}**"
    )
  ) %>%
  modify_footnote(
    add_stat_1 ~ "ANOVA")
Characteristic    Overall¹             setosa¹              versicolor¹          virginica¹           p-value²    p-value³
Sepal.Length      5.80 (5.10, 6.40)    5.00 (4.80, 5.20)    5.90 (5.60, 6.30)    6.50 (6.23, 6.90)    <0.001      0.000
Sepal.Width       3.00 (2.80, 3.30)    3.40 (3.20, 3.68)    2.80 (2.53, 3.00)    3.00 (2.80, 3.18)    <0.001      0.000
Petal.Length      4.35 (1.60, 5.10)    1.50 (1.40, 1.58)    4.35 (4.00, 4.60)    5.55 (5.10, 5.88)    <0.001      0.000
Petal.Width       1.30 (0.30, 1.80)    0.20 (0.20, 0.30)    1.30 (1.20, 1.50)    2.00 (1.80, 2.30)    <0.001      0.000
¹ Median (IQR)
² Kruskal-Wallis rank sum test
³ ANOVA
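
Before plugging a custom function into add_stat(), it can help to call it directly on one column. The sketch below uses only base R and repeats the ANOVA function so that it runs on its own. The argument values mirror what add_stat() is assumed to pass for the Sepal.Length row. The raw p-value is a very small number, which is presumably why the custom column displays 0.000. As one possible fix, format.pval() from base R can format it like the other p-value column.

```r
data("iris")

# same custom function as above: p-value of a one-way ANOVA
my_anova = function(data, variable, by, ...) {
  result = aov(as.formula(paste(variable, "~", by)), data = data)
  summary(result)[[1]][["Pr(>F)"]][1]
}

# call the function by hand for a single column
p = my_anova(iris, variable = "Sepal.Length", by = "Species")
p

# format the raw number with a "<0.001" style threshold
p_label = format.pval(p, eps = 0.001, digits = 1)
p_label
```

Checking the function this way also makes it easier to spot a wrong column name or formula before building the whole table.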

Conclusion

This post explained how to create a summary table using the gtsummary package. For more on this package, see the dedicated section or the table section.

Contact

This document is a work by Yan Holtz. Any feedback is highly encouraged. You can file an issue on Github, drop me a message on Twitter, or send an email pasting yan.holtz.data with gmail.com.