Load required packages:
library(pacman)
p_load(tidyverse, rio, ggpubr, rstatix, scales, flextable)
If you are using R there is a very good chance you are creating plots using the ggplot2 package.
What you might not know is that there are now over 100 registered extensions available that support or extend ggplot2.
One of my favourites is ggpubr, a package that provides easy-to-use functions for creating clean, publication-ready plots.
Let’s see what ggpubr has to offer using the sports car price data set.
The question we want to answer is: does spending more money on a sports car get you a faster car?
To answer this question we will create a plot and leverage some of the functions provided by ggpubr.
Start by downloading and importing the data set:
data <- rio::import("sport_car_price.csv") %>%
  as_tibble()
flextable::flextable(data %>% head())
| Car Make | Car Model | Year | Engine Size (L) | Horsepower | Torque (lb-ft) | 0-60 MPH Time (seconds) | Price (in USD) |
|---|---|---|---|---|---|---|---|
| Porsche | 911 | 2,022 | 3 | 379 | 331 | 4 | 101,200 |
| Lamborghini | Huracan | 2,021 | 5.2 | 630 | 443 | 2.8 | 274,390 |
| Ferrari | 488 GTB | 2,022 | 3.9 | 661 | 561 | 3 | 333,750 |
| Audi | R8 | 2,022 | 5.2 | 562 | 406 | 3.2 | 142,700 |
| McLaren | 720S | 2,021 | 4 | 710 | 568 | 2.7 | 298,000 |
| BMW | M8 | 2,022 | 4.4 | 617 | 553 | 3.1 | 130,000 |
Then clean up the data and convert time taken to reach 60 mph (in seconds) from a continuous to a categorical variable:
# Drop rows with electric, hybrid or missing engine sizes, then remove duplicates
data <- data %>%
  filter(!grepl("Electric", `Engine Size (L)`)) %>%
  filter(`Engine Size (L)` != "N/A") %>%
  filter(`Engine Size (L)` != "0") %>%
  filter(`Engine Size (L)` != "-") %>%
  filter(!grepl("Hybrid", `Engine Size (L)`)) %>%
  distinct(`Car Make`, `Car Model`, Year, `Engine Size (L)`, Horsepower, `Torque (lb-ft)`, .keep_all = TRUE)
# Strip thousands separators (this also converts every column to character)
data <- data %>%
  mutate(across(everything(), ~ str_replace_all(., ",", "")))
# Convert the measurement columns back to numeric
data <- data %>%
  mutate(across(c(`Price (in USD)`, Horsepower, `Torque (lb-ft)`, `Engine Size (L)`, `0-60 MPH Time (seconds)`),
                as.numeric))
# Bin the 0-60 mph time into [2, 2.9], (2.9, 4] and (4, Inf);
# times below 2 seconds fall outside the breaks and become NA
data <- data %>%
  mutate(`time(0-60mph)` = cut(`0-60 MPH Time (seconds)`,
                               breaks = c(2, 2.9, 4, Inf),
                               labels = c("< 3", "3-4", "> 4"),
                               include.lowest = TRUE))
data <- data %>%
  rename(price = `Price (in USD)`)
flextable::flextable(data %>%
    select(-`Engine Size (L)`, -Horsepower, -`Torque (lb-ft)`, -`0-60 MPH Time (seconds)`) %>%
    head()) %>%
  align_text_col(align = "right") %>%
  flextable::set_table_properties(layout = "autofit", width = 1)
| Car Make | Car Model | Year | price | time(0-60mph) |
|---|---|---|---|---|
| Porsche | 911 | 2022 | 101,200 | 3-4 |
| Lamborghini | Huracan | 2021 | 274,390 | < 3 |
| Ferrari | 488 GTB | 2022 | 333,750 | 3-4 |
| Audi | R8 | 2022 | 142,700 | 3-4 |
| McLaren | 720S | 2021 | 298,000 | < 3 |
| BMW | M8 | 2022 | 130,000 | 3-4 |
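The comma stripping and the binning are worth checking on a few hand-picked values, because the group labels and the break points do not line up exactly. The sketch below uses base R only; `raw_price` and `times` are made-up illustrative values, not rows from the data set.

```r
# Illustrative values only; these are not taken from sport_car_price.csv
raw_price <- c("101,200", "274,390")
times <- c(1.9, 2.0, 2.9, 2.95, 4.0, 4.1)

# Strip the thousands separator before converting to numeric
price <- as.numeric(gsub(",", "", raw_price, fixed = TRUE))

# Same breaks and labels as the cleaning step
bins <- cut(times,
            breaks = c(2, 2.9, 4, Inf),
            labels = c("< 3", "3-4", "> 4"),
            include.lowest = TRUE)

print(price)
print(data.frame(times, bins))
```

With these breaks a time of 2.95 seconds lands in the "3-4" group and a time below 2 seconds becomes NA, so it may be worth checking the data for such values before relying on the group labels.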
Then create boxplots comparing the price of the 0-to-60 mph groups using ggplot2 basic settings:
plot <-
  ggplot(data, aes(x = `time(0-60mph)`, y = price)) +
  geom_boxplot(aes(fill = `time(0-60mph)`)) +
  ylab("Price (USD)") +
  xlab("Time 0-60 mph (sec)")
plot
The default ggplot2 look and feel is OK, but could definitely be improved.
This is where ggpubr can help.
First add the theme_pubr() theme to the plot:
plot <- plot + ggpubr::theme_pubr()
plot
Then add the colour palette taken from the New England Journal of Medicine with the set_palette function:
plot <- plot %>%
  ggpubr::set_palette(palette = "nejm") +
  theme(legend.position = 'none')
plot
This is a much better looking plot, but the price axis is labelled in scientific notation, and the price range across groups is so large that the boxplots appear compressed.
Let’s further improve the plot.
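Before changing the plot, the two scales helpers used in the next step can be tried on their own. This is a minimal sketch, assuming scales 1.2.0 or later (where the `scale_cut` argument is available); the prices and the names `dollar_short`, `price_labels` and `price_breaks` are made up for illustration.

```r
library(scales)

# Made-up prices spanning the kind of range seen in the data
prices <- c(25000, 150000, 1200000, 5000000)

# Labeller that formats numbers as dollars with short-scale suffixes
dollar_short <- label_dollar(scale_cut = cut_short_scale())
price_labels <- dollar_short(prices)

# Break function that proposes roughly n breaks on a log10 scale
log_breaks <- breaks_log(n = 5)
price_breaks <- log_breaks(range(prices))

print(price_labels)
print(price_breaks)
```

Both helpers return functions rather than values, which is why they can be passed directly to the labels and breaks arguments of scale_y_continuous().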
Using the scales package, set the y-axis to a logarithmic scale and convert it to dollar labels:
plot <- plot +
  scale_y_continuous(
    trans = 'log10',
    labels = scales::label_dollar(scale_cut = cut_short_scale()),
    breaks = scales::breaks_log(n = 5),
    expand = expansion(mult = c(0, 0.1))
  )
plot
We can also improve the plot by adding statistics.
In combination with the rstatix package, ggpubr can make adding statistical labels to a plot relatively easy.
To add statistics, use the t_test function from the rstatix package to create a data frame with the required comparisons [2]:
stat_test <- data %>%
  rename(time = `time(0-60mph)`) %>%
  rstatix::t_test(price ~ time, detailed = TRUE) %>%
  rstatix::adjust_pvalue(method = "bonferroni") %>%
  rstatix::add_significance("p.adj")
flextable::flextable(stat_test)
| estimate | estimate1 | estimate2 | .y. | group1 | group2 | n1 | n2 | statistic | p | df | conf.low | conf.high | method | alternative | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,116,868.26 | 1,305,436.8 | 188,568.5 | price | < 3 | 3-4 | 76 | 173 | 6.979559 | 0.0000000008180 | 78.99424 | 798,356.27 | 1,435,380.3 | T-test | two.sided | 0.0000000024540 | **** |
| 1,192,797.13 | 1,305,436.8 | 112,639.6 | price | < 3 | > 4 | 76 | 66 | 7.506716 | 0.0000000000912 | 76.81129 | 876,379.71 | 1,509,214.5 | T-test | two.sided | 0.0000000002736 | **** |
| 75,928.87 | 188,568.5 | 112,639.6 | price | 3-4 | > 4 | 173 | 66 | 2.449132 | 0.0150000000000 | 235.15664 | 14,851.02 | 137,006.7 | T-test | two.sided | 0.0450000000000 | * |
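The p.adj and p.adj.signif columns can be reproduced by hand, which makes it clearer what the Bonferroni step does: each p-value is multiplied by the number of comparisons (three here) and capped at 1. The sketch below uses base R and the p-values from the table; the significance cutpoints are the conventional ones and are assumed, not confirmed, to match the rstatix defaults.

```r
# Unadjusted p-values copied from the p column of the table
p <- c(8.18e-10, 9.12e-11, 0.015)

# Bonferroni: multiply by the number of tests and cap at 1
p_adj <- p.adjust(p, method = "bonferroni")
p_manual <- pmin(1, p * length(p))

# Map adjusted p-values to stars (cutpoints assumed to mirror rstatix defaults)
stars <- cut(p_adj,
             breaks = c(0, 1e-4, 1e-3, 1e-2, 0.05, 1),
             labels = c("****", "***", "**", "*", "ns"),
             include.lowest = TRUE)

print(data.frame(p, p_adj, stars))
```

The adjusted values agree with the p.adj column, and the last comparison stays just under 0.05 after adjustment, which is why it keeps a single star.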
We can now add the significance values to our plot:
# Add the adjusted significance labels onto the box plots
stat_test <-
  stat_test %>%
  add_y_position(y.trans = log10,
                 step.increase = 0.5)
plot <- plot +
  stat_pvalue_manual(
    stat_test,
    label = "{p.adj.signif}",
    tip.length = 0.005,
    size = 6
  )
plot
Now make some final adjustments to the plot [3]:
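# The y-axis is log10-transformed, so back-transform before averaging
# to get the arithmetic mean price, then return to the log scale
f1 <- function(x) {
  log10(mean(10 ^ x))
}
# fun replaces fun.y, which is deprecated since ggplot2 3.3.0
plot <- plot +
  stat_summary(fun = f1, colour = 'grey', size = 1.5) +
  theme(axis.text = element_text(size = 20),
        axis.title = element_text(size = 24))
plot
And there we have a really clean-looking plot that is easy to read and shows statistical significance.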
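One detail from the final chunk deserves a short explanation: the helper f1. Because the y-axis is log10-transformed, stat_summary() receives log10(price), so a plain mean() would correspond to the geometric mean once read back off the axis. Back-transforming first gives the arithmetic mean instead. A small base R check, using made-up prices for illustration:

```r
f1 <- function(x) {
  log10(mean(10 ^ x))
}

# Made-up prices for illustration
prices <- c(100000, 200000, 1500000)
log_prices <- log10(prices)

arithmetic <- 10 ^ f1(log_prices)    # mean on the original dollar scale
geometric <- 10 ^ mean(log_prices)   # what mean() on the log scale implies

print(c(arithmetic = arithmetic, geometric = geometric))
```

For skewed prices like these the geometric mean is noticeably lower than the arithmetic mean, so the choice changes where the grey dot sits.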
ggpubr is a fantastic extension to ggplot2 and makes it easy to turn good plots into great plots.
So does spending more money on a sports car get you a faster car?
Thanks to ggpubr, we can see that on average the > 4 second cars are the least expensive, the 3-4 second cars are in the middle, and the < 3 second cars are the most expensive.
Thus, spending more money on a sports car does appear to buy you a faster car.
[2] I applied a Bonferroni correction to the p-values so that the probability of committing a type I error is controlled for across the three comparisons. I also had to rename the time(0-60mph) variable to time, because the t_test function could not find the variable when it was supplied in back ticks (`), which is the only way I know to refer to a non-syntactic name like this one.
[3] Increase the axis text and title size and add the average price to the boxplot, represented as a grey dot.