Code
::p_load(tidyverse, ggpubr, rstatix, emmeans, performance) pacman
Analysis of variance (ANOVA) is a popular statistical technique for comparing outcomes across multiple groups, but utilizing it effectively in R can be overwhelming with the multitude of functions and options available. This article serves as a personal playbook - showcasing the essential tools and workflows I’ve curated specifically for conducting (one-way) ANOVA in R, from coding to visualization and reporting.
::p_load(tidyverse, ggpubr, rstatix, emmeans, performance) pacman
In this hypothetical scenario1, we have a clinical trial involving eighteen participants who were randomly assigned to one of three drug groups: placebo, anxifree, or joyzepam. The aim of the trial was to determine if either of the two drugs could improve mood compared to the placebo. Each group consisted of six participants and the primary outcome was the change in mood over the course of the trial:
id | drug | mood_gain |
---|---|---|
1 | placebo | 0.5 |
2 | placebo | 0.3 |
3 | placebo | 0.1 |
4 | anxifree | 0.6 |
5 | anxifree | 0.4 |
6 | anxifree | 0.2 |
7 | joyzepam | 1.4 |
8 | joyzepam | 1.7 |
9 | joyzepam | 1.3 |
10 | placebo | 0.6 |
11 | placebo | 0.9 |
12 | placebo | 0.3 |
13 | anxifree | 1.1 |
14 | anxifree | 0.8 |
15 | anxifree | 1.2 |
16 | joyzepam | 1.8 |
17 | joyzepam | 1.3 |
18 | joyzepam | 1.4 |
Let’s plot the data using the ggboxplot()
function from the ggpubr
package:
%>%
data ::ggboxplot(
ggpubry = "mood_gain",
x = "drug",
color = "drug",
add = c("dotplot"),
add.params = list(
position = position_jitter(w = 0.09, h = 0.04)
),palette = c("#00AFBB", "#E7B800", "#FC4E07")
+
) theme(
legend.position = "none",
axis.title.x = element_blank()
)
Straight away, we can see that there appears to be a difference between the drug groups, with joyzepam showing higher mood gain on average than anxifree and placebo. However, while a raw difference in groups provides a sense of the magnitude of an effect, we need statistics to helps us determine if the difference is likely to represent a real effect rather than just chance variation.
To assess whether the observed differences between the three treatment groups are statistically significant, we will perform a one-way ANOVA2.
My preferred way to compute (one-way) ANOVA in R is with the rstatix
package (Kassambara 2023).
The first step is to calculate the ANOVA with the rstatix::anova_test
function:
<-
rstatixaov %>%
data ::anova_test(dv = mood_gain, between = drug, wid = id) rstatix
Then print the output:
%>%
rstatixaov ::as_tibble() %>%
tibblemutate(p = ifelse(p < 0.001, "< .001", signif(p, 2))) %>%
mutate(across(where(is.numeric), round,2)) %>%
select(-`p<.05`) %>%
::flextable() %>%
flextable::autofit() flextable
Effect | DFn | DFd | F | p | ges |
---|---|---|---|---|---|
drug | 2 | 15 | 18.61 | < .001 | 0.71 |
The results reveal a statistically significant effect of the drug treatments on changing mood. The small p-value (<0.001) indicates it is highly unlikely (< 0.1% chance) to obtain these results if the drugs had no real effect. Furthermore, the large generalized eta squared value (ges = 0.71) suggests the drug treatments accounted for 71% of the variance in mood scores. This large effect size provides evidence that the drugs substantially impacted mood.
While the overall analysis provides evidence for drug effect, we don’t yet know which specific drug treatments led to superior mood gains compared to others. Our previous visualization gives us some hints, but to explore formally, we need to run post-hoc pairwise comparisons.
Post-hoc comparisons are statistical tests performed to determine which specific pairs of drug groups differ from each other significantly, while ensuring the overall risk of false positives remains low3.
I like to run pairwise comparisons with the rstatix::emmeans_test()
function:
<-
rstatixcomp ::emmeans_test(mood_gain ~ drug,
rstatixdata = data,
p.adjust.method = "holm",
detailed = T
%>%
) mutate(across(where(is.numeric), signif, 2)) %>%
select(-.y., -term, -se, -null.value, -df, -p) %>%
arrange(p.adj)
%>%
rstatixcomp ::flextable() %>%
flextable::autofit() flextable
group1 | group2 | estimate | conf.low | conf.high | statistic | p.adj | p.adj.signif |
---|---|---|---|---|---|---|---|
joyzepam | placebo | 1.00 | 0.66 | 1.40 | 5.9 | 0.000091 | **** |
anxifree | joyzepam | -0.77 | -1.10 | -0.39 | -4.4 | 0.001100 | ** |
anxifree | placebo | 0.27 | -0.11 | 0.64 | 1.5 | 0.150000 | ns |
The results provide evidence of a difference in mood gain between joyzepam and anxifree (diff = 0.77), as well as between joyzepam and placebo (diff = 1), but not between anxifree and placebo (diff = 0.27).
To ensure the validity of the results from a one-way ANOVA, several key assumptions must be tested. Firstly, the residuals are assumed to be normally distributed, often assessed visually through residual plots. Secondly, the homogeneity of variances assumption requires equal population variances across groups, typically checked with Levene’s test and residual plots. Thirdly, observations must be independent, unrelated, and uninfluenced by each other, usually satisfied through proper experimental design and randomization (Navarro 2019).
The trial participants were not related (i think) and randomly allocated to each group so it is safe to assume each observation is independent.
However, let’s check the normality and equal variance assumptions using the performance::check_models
function:
# homogenity of variance
<-
assumptions ::check_model(lm(mood_gain ~ drug, data = data),
performancecheck = c("homogeneity", "normality")
)
assumptions
Here we see no major violations to the equal variance and normality assumptions, giving us confidence that our results are valid.
My favourite aspect of rstatix
is the ability to incorporate the statistical results into plots produced by the ggpubr
package4.
For example, here is how to put all the most relevant information into one plot:
# add xy position to rstatix dataframe
<-
stat_test %>%
rstatixcomp ::add_xy_position(x = "drug")
rstatix
# boxplot using ggpubr package add means, ANOVA, and pairwise results
%>%
data ::ggboxplot(
ggpubry = "mood_gain",
x = "drug",
color = "drug",
add = c("dotplot"),
add.params = list(
position = position_jitter(w = 0.09, h = 0.04),
size = .8
),palette = c("#00AFBB", "#E7B800", "#FC4E07")
+
) scale_y_continuous(breaks = seq(.5, 2, .5)) +
coord_cartesian(ylim = c(0.1, 2.3)) +
theme(
legend.position = "none",
axis.title.x = element_blank()
+
) labs(y = "Mood Gain") +
stat_summary(
geom = "point",
fun.y = "mean",
col = "black",
size = 2.5,
shape = 24,
+
) ::stat_pvalue_manual(stat_test,
ggpubrlabel = "p.adj.signif",
tip.length = 0.01,
hide.ns = T,
bracket.nudge.y = -.03
+
) ::stat_anova_test(label = "as_detailed_italic", label.y.npc = 1) ggpubr
Here is an example for writing up ANOVA and pairwise comparison results for publication in a scientific journal:
“A one-way ANOVA revealed a significant difference in mood gain between the three groups, F(2,15) = 18.611, p < 0.01, η² = 0.713. Post hoc pairwise comparisons were conducted using the Holm method to control for the family-wise error rate across the multiple comparisons. The Holm-adjusted p-values indicated that the joyzepan group had significantly higher mood gain (M = 1.483, SD = 0.214) compared to the placebo (M = 0.45, SD = 0.281) and anxifree (M = 0.717, SD = 0.392) group. There was no significant difference between the anxifree group and the placebo group.”
With so many options available running ANOVA in R can seem daunting at first, but breaking it down into manageable steps makes the process much more approachable. Leveraging the rstatix
package is my preferred approach as it not only simplifies the calculations required for ANOVA but also provides a range of functions to help communicate the results effectively.
This data set is taken from an article explaining how to calculate ANOVA in jamovi. This allowed me to directly compare the results generated by R, with those that can be recreated using the jamovi software. The data set can be accessed here.↩︎
Controlling the overall Type I error rate across the multiple comparisons.↩︎
A good article on how to incorporate pairwise comparisons into plots can be found here↩︎