A closer look at the ggpubr package for R

Prerequisites

Load required packages:

Code
library(pacman)
p_load(tidyverse, rio, ggpubr, rstatix, scales, flextable)

Introduction

If you are using R, there is a very good chance you are creating plots using the ggplot2 package.

What you might not know is that there are now over 100 registered extensions that support or extend ggplot2 [1].

One of my favourites is ggpubr, a package that provides easy-to-use functions for creating clean, publication-ready plots.

Let’s see what ggpubr has to offer using the sports car price data set.

Are faster sports cars more expensive?

The question we want to answer is: does spending more money on a sports car get you a faster car?

To answer this question we will create a plot and leverage some of the functions provided by ggpubr.

Start by downloading and importing the data set:

Code
data <- rio::import("sport_car_price.csv") %>% 
  as_tibble()
flextable::flextable(data %>% head)

| Car Make | Car Model | Year | Engine Size (L) | Horsepower | Torque (lb-ft) | 0-60 MPH Time (seconds) | Price (in USD) |
|---|---|---|---|---|---|---|---|
| Porsche | 911 | 2,022 | 3 | 379 | 331 | 4 | 101,200 |
| Lamborghini | Huracan | 2,021 | 5.2 | 630 | 443 | 2.8 | 274,390 |
| Ferrari | 488 GTB | 2,022 | 3.9 | 661 | 561 | 3 | 333,750 |
| Audi | R8 | 2,022 | 5.2 | 562 | 406 | 3.2 | 142,700 |
| McLaren | 720S | 2,021 | 4 | 710 | 568 | 2.7 | 298,000 |
| BMW | M8 | 2,022 | 4.4 | 617 | 553 | 3.1 | 130,000 |

Then clean up the data and convert time taken to reach 60 mph (in seconds) from a continuous to a categorical variable:

Code
# Drop electric and hybrid cars and rows without a usable engine size,
# then remove duplicate entries for the same car
data <- data %>%
  filter(!grepl("Electric", `Engine Size (L)`)) %>%
  filter(`Engine Size (L)` != "N/A") %>%
  filter(`Engine Size (L)` != "0") %>%
  filter(`Engine Size (L)` != "-") %>%
  filter(!grepl("Hybrid", `Engine Size (L)`)) %>%
  distinct(`Car Make`, `Car Model`, Year, `Engine Size (L)`, Horsepower,
           `Torque (lb-ft)`, .keep_all = TRUE)

# Strip thousands separators (this converts every column to character)
data <- data %>%
  mutate(across(everything(), ~ str_replace_all(., ",", "")))

# Convert the measurement columns to numeric
data <- data %>%
  mutate(across(c(`Price (in USD)`, Horsepower, `Torque (lb-ft)`,
                  `Engine Size (L)`, `0-60 MPH Time (seconds)`),
                as.numeric))

# Bin the 0-60 mph time into three groups: [2, 2.9], (2.9, 4] and (4, Inf]
data <- data %>%
  mutate(`time(0-60mph)` = cut(`0-60 MPH Time (seconds)`,
                               breaks = c(2, 2.9, 4, Inf),
                               labels = c("< 3", "3-4", "> 4"),
                               include.lowest = TRUE))

data <- data %>%
  rename(price = `Price (in USD)`)

# Preview the columns used in the plots
flextable::flextable(data %>%
                       select(-`Engine Size (L)`, -Horsepower, -`Torque (lb-ft)`,
                              -`0-60 MPH Time (seconds)`) %>%
                       head()) %>%
  flextable::align_text_col(align = "right") %>%
  flextable::set_table_properties(layout = "autofit", width = 1)

| Car Make | Car Model | Year | price | time(0-60mph) |
|---|---|---|---|---|
| Porsche | 911 | 2022 | 101,200 | 3-4 |
| Lamborghini | Huracan | 2021 | 274,390 | < 3 |
| Ferrari | 488 GTB | 2022 | 333,750 | 3-4 |
| Audi | R8 | 2022 | 142,700 | 3-4 |
| McLaren | 720S | 2021 | 298,000 | < 3 |
| BMW | M8 | 2022 | 130,000 | 3-4 |
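To make the cleaning and binning steps concrete, here is a minimal sketch that applies the same two transformations to a handful of made-up values: stripping the thousands separators before converting to numeric, and cutting the 0-60 mph times with the same breaks as above. The vectors `raw_price` and `times` are invented for illustration, and base R's `gsub()` stands in for `str_replace_all()` so the snippet runs without extra packages.

```r
# Illustrative values only; these are not rows from the data set
raw_price <- c("101,200", "274,390", "1,305,436")
times     <- c(1.9, 2.0, 2.9, 3.0, 4.0, 4.1)

# Remove the thousands separators, then convert to numeric
price <- as.numeric(gsub(",", "", raw_price, fixed = TRUE))

# Same breaks and labels as in the cleaning step:
# [2, 2.9] -> "< 3", (2.9, 4] -> "3-4", (4, Inf] -> "> 4"
time_group <- cut(times,
                  breaks = c(2, 2.9, 4, Inf),
                  labels = c("< 3", "3-4", "> 4"),
                  include.lowest = TRUE)

print(price)
print(data.frame(times, time_group))
```

Note that with these breaks a time of exactly 4 seconds lands in the 3-4 group, and a time below 2 seconds falls outside the first interval and becomes NA.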

Then create boxplots comparing price across the 0-60 mph groups using ggplot2's default settings:

Code
plot <- 
  ggplot(data, aes(x=`time(0-60mph)`, y=price)) + 
  geom_boxplot(aes( fill = `time(0-60mph)`)) +
  ylab("Price (USD)") +
  xlab("Time 0-60 mph (sec)")

plot

The default ggplot2 look and feel is OK, but could definitely be improved.

This is where ggpubr can help.

Theme

First add the theme_pubr() theme to the plot:

Code
plot <- plot + ggpubr::theme_pubr()
plot

Then apply the colour palette taken from the New England Journal of Medicine with the set_palette function, and hide the legend, which only repeats the x-axis groups:

Code
plot <- plot %>% 
  ggpubr::set_palette(palette = "nejm") +
  theme(legend.position = 'none')

plot

This is a much better-looking plot, but the prices are shown in scientific notation, and the price range across groups is so large that the boxplots appear compressed.

Let’s further improve the plot.

Scales

Using the scales package, set the y-axis to a logarithmic scale and format the labels as dollar amounts:

Code
plot <- plot +
  scale_y_continuous(
    trans = "log10",
    labels = scales::label_dollar(scale_cut = scales::cut_short_scale()),
    breaks = scales::breaks_log(n = 5),
    expand = expansion(mult = c(0, 0.1))
  )

plot
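As a rough illustration of what the labeller and the break function do on their own, the sketch below calls them directly outside of a plot. It assumes a version of scales recent enough to support `scale_cut` (the same requirement as the plot code); the `prices` vector and the price range passed to `breaks_log()` are invented for the example.

```r
library(scales)

# Made-up prices spanning roughly the range seen in the data
prices <- c(101200, 274390, 1305437)

# label_dollar() returns a function; scale_cut abbreviates thousands and
# millions with the short-scale suffixes K and M
dollar_short <- label_dollar(scale_cut = cut_short_scale())
labels <- dollar_short(prices)
print(labels)

# breaks_log() also returns a function, here asked for about five breaks
# over a hypothetical price range
log_breaks <- breaks_log(n = 5)(c(1e4, 1e7))
print(log_breaks)
```

Calling the helpers this way is a convenient check before committing to a scale, since the labels can be inspected without redrawing the plot.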

Statistics

We can also improve the plot by adding statistics.

In combination with the rstatix package, ggpubr can make adding statistical labels to a plot relatively easy.

To add statistics, use the t_test function from the rstatix package to create a data frame of the pairwise comparisons [2]:

Code
stat_test <- data %>% 
  rename(time = `time(0-60mph)`) %>% 
  rstatix::t_test(price ~ time, detailed = TRUE) %>%
  rstatix::adjust_pvalue(method = "bonferroni") %>%
  rstatix::add_significance("p.adj")

flextable::flextable(stat_test)

| estimate | estimate1 | estimate2 | .y. | group1 | group2 | n1 | n2 | statistic | p | df | conf.low | conf.high | method | alternative | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,116,868.26 | 1,305,436.8 | 188,568.5 | price | < 3 | 3-4 | 76 | 173 | 6.979559 | 8.18e-10 | 78.99424 | 798,356.27 | 1,435,380.3 | T-test | two.sided | 2.454e-09 | **** |
| 1,192,797.13 | 1,305,436.8 | 112,639.6 | price | < 3 | > 4 | 76 | 66 | 7.506716 | 9.12e-11 | 76.81129 | 876,379.71 | 1,509,214.5 | T-test | two.sided | 2.736e-10 | **** |
| 75,928.87 | 188,568.5 | 112,639.6 | price | 3-4 | > 4 | 173 | 66 | 2.449132 | 0.015 | 235.15664 | 14,851.02 | 137,006.7 | T-test | two.sided | 0.045 | * |
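The Bonferroni adjustment in the table can be checked by hand: each raw p-value is multiplied by the number of comparisons (three here) and capped at 1. The sketch below does this with base R's `p.adjust()` using the p-values reported above, then maps the adjusted values to significance symbols. The cutpoints are assumed to match the `add_significance()` defaults, so treat that part as illustrative.

```r
# Raw p-values from the table (< 3 vs 3-4, < 3 vs > 4, 3-4 vs > 4)
p_raw <- c(8.18e-10, 9.12e-11, 0.015)

# Bonferroni: multiply by the number of comparisons and cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(p_adj)

# Map adjusted p-values to symbols
# (cutpoints assumed to match the rstatix defaults)
signif_symbol <- cut(p_adj,
                     breaks = c(0, 1e-4, 0.001, 0.01, 0.05, 1),
                     labels = c("****", "***", "**", "*", "ns"),
                     include.lowest = TRUE)
print(as.character(signif_symbol))
```

The adjusted values agree with the p.adj column, and the 3-4 versus > 4 comparison stays just under the 0.05 threshold after correction.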

We can now add the significance values to our plot:

Code
# Compute the bracket y positions on the log10 scale used by the plot
stat_test <- stat_test %>%
  rstatix::add_y_position(y.trans = log10, step.increase = 0.5)

# Add the adjusted significance symbols to the boxplots
plot <- plot +
  ggpubr::stat_pvalue_manual(
    stat_test,
    label = "{p.adj.signif}",
    tip.length = 0.005,
    size = 6
  )

plot

Now make some final adjustments to the plot [3]:

Code
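# The y-axis is log10-transformed, so stat_summary() passes log10(price) to f1.
# Back-transform, take the arithmetic mean, then return to the log10 scale.
f1 <- function(x) {
  log10(mean(10^x))
}

plot <- plot +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun = f1, colour = "grey", size = 1.5) +
  theme(axis.text = element_text(size = 20),
        axis.title = element_text(size = 24))
plot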
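The helper `f1` can look odd at first, so here is a small sketch of why it is needed. Because the y-axis is log10-transformed, `stat_summary()` receives log10(price) rather than price; a plain `mean()` of those values would correspond to the geometric mean once read off the axis. Back-transforming, averaging, and transforming again gives the arithmetic mean instead. The two prices below are made up for illustration.

```r
# Same helper as in the plot code: arithmetic mean computed from log10 values
f1 <- function(x) {
  log10(mean(10^x))
}

# Two made-up prices and their log10 values (what stat_summary() would see)
prices <- c(1e5, 1e6)
x <- log10(prices)

arithmetic_mean <- 10^f1(x)    # back-transform f1's result to dollars
geometric_mean  <- 10^mean(x)  # what a plain mean() on the log scale implies

print(c(arithmetic = arithmetic_mean, geometric = geometric_mean))
```

With skewed data such as car prices the two can differ considerably, so which one to draw is a design choice; the plot here shows the arithmetic mean.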

And there we have a really clean looking plot that is easy to read and shows statistical significance.

Conclusion

ggpubr is a fantastic extension to ggplot2 and makes it easy to turn good plots into great plots.

So does spending more money on a sports car get you a faster car?

Thanks to ggpubr, we can see that on average the > 4 second cars are the least expensive (about $113,000), the 3-4 second cars are in the middle (about $189,000), and the < 3 second cars are the most expensive (about $1.3 million).

Thus, spending more money on a sports car does appear to buy you a faster car.

Footnotes

  1. ggplot2 extensions gallery

  2. I applied a Bonferroni correction to the p-values to control the probability of committing a type I error across the three comparisons. I also had to rename the time(0-60mph) variable to time, because the t_test function could not find the variable when it was supplied in backticks (`), which is the only way I know to refer to a name that contains special characters such as parentheses and hyphens.

  3. Increase the axis text and title sizes, and add the average price to each boxplot as a grey dot.