tidybulk is a powerful R package designed for modular transcriptomic data analysis that brings transcriptomics to the tidyverse.
Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights.
Function | Description |
---|---|
scale_abundance() |
Scale abundance data |
quantile_normalise_abundance() |
Quantile normalization |
adjust_abundance() |
Adjust abundance for unwanted variation |
fill_missing_abundance() |
Fill missing abundance values |
impute_missing_abundance() |
Impute missing abundance values |
Function | Description |
---|---|
identify_abundant() |
Identify abundant transcripts without removing them |
keep_abundant() |
Keep abundant transcripts |
keep_variable() |
Keep variable transcripts |
filterByExpr() |
Filter by expression |
Function | Description |
---|---|
reduce_dimensions() |
Reduce dimensions with PCA/MDS/tSNE/UMAP |
rotate_dimensions() |
Rotate dimensions |
remove_redundancy() |
Remove redundant features |
Function | Description |
---|---|
cluster_elements() |
Cluster elements with various methods |
kmeans clustering |
K-means clustering |
SNN clustering |
Shared nearest neighbor clustering |
hierarchical clustering |
Hierarchical clustering |
DBSCAN clustering |
Density-based clustering |
Function | Description |
---|---|
test_differential_expression() |
Test differential expression with various methods |
Function | Description |
---|---|
deconvolve_cellularity() |
Deconvolve cellularity with various methods |
cibersort() |
CIBERSORT analysis |
Function | Description |
---|---|
test_gene_enrichment() |
Test gene enrichment |
test_gene_overrepresentation() |
Test gene overrepresentation |
test_gene_rank() |
Test gene rank |
Function | Description |
---|---|
describe_transcript() |
Describe transcript characteristics |
get_bibliography() |
Get bibliography |
resolve_complete_confounders_of_non_interest() |
Resolve confounders |
Function | Description |
---|---|
check_if_counts_is_na() |
Check if counts contain NA values |
check_if_duplicated_genes() |
Check for duplicated genes |
check_if_wrong_input() |
Validate input data |
log10_reverse_trans() |
Log10 reverse transformation |
logit_trans() |
Logit transformation |
All functions are directly compatible with SummarizedExperiment
objects and follow tidyverse principles for seamless integration with the tidyverse ecosystem.
Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T. Papenfuss. 2021. “Tidybulk: An R tidy framework for modular transcriptomic data analysis.” Genome Biology 22 (42). https://doi.org/10.1186/s13059-020-02233-7
Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis
In this example we will use the airway
dataset, a SummarizedExperiment
object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the airway package.
# Standard differential expression analysis
airway = airway |>
# Use QL method
test_differential_expression(~ dex, method = "edgeR_quasi_likelihood", prefix = "ql__") |>
# Use edger_robust_likelihood_ratio
test_differential_expression(~ dex, method = "edger_robust_likelihood_ratio", prefix = "lr_robust__") |>
# Use DESeq2 method
test_differential_expression(~ dex, method = "DESeq2", prefix = "deseq2__") |>
# Use limma_voom
test_differential_expression(~ dex, method = "limma_voom", prefix = "voom__") |>
# Use limma_voom_sample_weights
test_differential_expression(~ dex, method = "limma_voom_sample_weights", prefix = "voom_weights__")
## Warning: The `.abundance` argument of `test_differential_abundance()` is deprecated as
## of tidybulk 2.0.0.
## ℹ Please use the `abundance` argument instead.
## ℹ The deprecated feature was likely used in the tidybulk package.
## Please report the issue at <https://github.com/stemangiola/tidybulk/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## =====================================
## tidybulk says: All testing methods use raw counts, irrespective of if scale_abundance
## or adjust_abundance have been calculated. Therefore, it is essential to add covariates
## such as batch effects (if applicable) in the formula.
## =====================================
## tidybulk says: The design column names are "(Intercept), dexuntrt"
##
## tidybulk says: to access the DE object do `metadata(.)$tidybulk$edgeR_quasi_likelihood_object`
## tidybulk says: to access the raw results (fitted GLM) do `metadata(.)$tidybulk$edgeR_quasi_likelihood_fit`
## tidybulk says: The design column names are "(Intercept), dexuntrt"
##
## tidybulk says: to access the DE object do `metadata(.)$tidybulk$edger_robust_likelihood_ratio_object`
## tidybulk says: to access the raw results (fitted GLM) do `metadata(.)$tidybulk$edger_robust_likelihood_ratio_fit`
## estimating size factors
##
## estimating dispersions
##
## gene-wise dispersion estimates
##
## mean-dispersion relationship
##
## final dispersion estimates
##
## fitting model and testing
##
## tidybulk says: to access the DE object do `metadata(.)$tidybulk$DESeq2_object`
## tidybulk says: to access the raw results (fitted GLM) do `metadata(.)$tidybulk$DESeq2_fit`
## tidybulk says: The design column names are "(Intercept), dexuntrt"
##
## tidybulk says: to access the DE object do `metadata(.)$tidybulk$limma_voom_object`
## tidybulk says: to access the raw results (fitted GLM) do `metadata(.)$tidybulk$limma_voom_fit`
## tidybulk says: The design column names are "(Intercept), dexuntrt"
##
## tidybulk says: to access the DE object do `metadata(.)$tidybulk$limma_voom_sample_weights_object`
## tidybulk says: to access the raw results (fitted GLM) do `metadata(.)$tidybulk$limma_voom_sample_weights_fit`
## This message is displayed once per session.
Inspection of the raw p-value histogram provides a rapid check of differential-expression results. When no gene is truly differentially expressed, the p-values follow a uniform U(0,1) distribution across the interval 0–1, so the histogram appears flat Source. In a more realistic scenario where only a subset of genes changes, this uniform background is still present but an obvious spike emerges close to zero, created by the genuine signals.
Thanks to the modularity of the tidybulk
workflow, that can multiplex different methods, we can easily compare the p-values across methods.
airway |>
rowData() |>
as_tibble() |>
select(
ql__PValue,
lr_robust__PValue,
voom__P.Value,
voom_weights__P.Value,
deseq2__pvalue
) |>
pivot_longer(everything(), names_to = "method", values_to = "pvalue") |>
ggplot(aes(x = pvalue, fill = method)) +
geom_histogram(binwidth = 0.01) +
facet_wrap(~method) +
my_theme +
labs(title = "Histogram of p-values across methods")
# Summary statistics
airway |> rowData() |> as_tibble() |> select(contains("ql|lr_robust|voom|voom_weights|deseq2")) |> select(contains("logFC")) |>
summarise(across(everything(), list(min = min, median = median, max = max), na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 1 × 0
airway |>
rowData() |>
as_tibble() |>
select(ql__PValue, lr_robust__PValue, voom__P.Value, voom_weights__P.Value, deseq2__pvalue) |>
ggpairs(columns = 1:5) +
scale_x_continuous(trans = tidybulk::log10_reverse_trans()) +
scale_y_continuous(trans = tidybulk::log10_reverse_trans()) +
my_theme +
labs(title = "Pairplot of p-values across methods")
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 30 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 30 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 30 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 30 rows containing missing values
## Warning in scale_y_continuous(trans = tidybulk::log10_reverse_trans()):
## log10_reverse transformation introduced infinite values.
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning in scale_y_continuous(trans = tidybulk::log10_reverse_trans()):
## log10_reverse transformation introduced infinite values.
library(GGally)
airway |>
rowData() |>
as_tibble() |>
select(ql__logFC, lr_robust__logFC, voom__logFC, voom_weights__logFC, deseq2__log2FoldChange) |>
ggpairs(columns = 1:5) +
my_theme +
labs(title = "Pairplot of effect sizes across methods")
It is important to check the quality of the fit. All methods produce a fit object that can be used for quality control. The fit object produced by each underlying method are stored in as attributes of the airway_mini
object. We can use them for example to perform quality control of the fit.
Plot the biological coefficient of variation (BCV) trend. This plot is helpful to understant the dispersion of the data.
metadata(airway)$tidybulk$edgeR_quasi_likelihood_object |>
plotBCV()
Plot the log-fold change vs mean plot.
Visualising the significance and effect size of the differential expression results as a volcano plots we appreciate that some methods have much lower p-values distributions than other methods, for the same model and data.
# Create volcano plots
airway |>
# Select the columns we want to plot
rowData() |>
as_tibble(rownames = ".feature") |>
select(
.feature,
ql__logFC, ql__PValue,
lr_robust__logFC, lr_robust__PValue,
voom__logFC, voom__P.Value,
voom_weights__logFC, voom_weights__P.Value,
deseq2__log2FoldChange, deseq2__pvalue
) |>
# Pivot longer to get a tidy data frame
pivot_longer(
- .feature,
names_to = c("method", "stat"),
values_to = "value", names_sep = "__"
) |>
# Harmonize column names
mutate(stat = case_when(
stat %in% c("logFC", "log2FoldChange") ~ "logFC",
stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue"
)) |>
pivot_wider(names_from = "stat", values_from = "value") |>
unnest(c(logFC, PValue)) |>
# Plot
ggplot(aes(x = logFC, y = PValue)) +
geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) +
scale_y_continuous(trans = tidybulk::log10_reverse_trans()) +
scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) +
facet_wrap(~method) +
my_theme +
labs(title = "Volcano Plots by Method")
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
Plotting independent y-axis scales for the p-values and effect sizes allows us to compare the top genes across methods.
Methods:
edgeR quasi-likelihood: Quasi-likelihood F-tests for differential expression
edgeR robust likelihood ratio: Robust likelihood ratio tests
DESeq2: Negative binomial distribution with dispersion estimation
limma-voom: Linear modeling with empirical Bayes moderation
limma-voom with sample weights: Enhanced voom with quality weights References:
Robinson et al. (2010) edgeR: a Bioconductor package for differential expression analysis
Chen et al. (2016) From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Law et al. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts
Liu et al. (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses
tidybulk
allows you to get the bibliography of all methods used in our workflow.
# Get bibliography of all methods used in our workflow
airway |> get_bibliography()
## @Article{tidybulk,
## title = {tidybulk: an R tidy framework for modular transcriptomic data analysis},
## author = {Stefano Mangiola and Ramyar Molania and Ruining Dong and Maria A. Doyle & Anthony T. Papenfuss},
## journal = {Genome Biology},
## year = {2021},
## volume = {22},
## number = {42},
## url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7},
## }
## @article{wickham2019welcome,
## title={Welcome to the Tidyverse},
## author={Wickham, Hadley and Averick, Mara and Bryan, Jennifer and Chang, Winston and McGowan, Lucy D'Agostino and Francois, Romain and Grolemund, Garrett and Hayes, Alex and Henry, Lionel and Hester, Jim and others},
## journal={Journal of Open Source Software},
## volume={4},
## number={43},
## pages={1686},
## year={2019}
## }
## @article{robinson2010edger,
## title={edgeR: a Bioconductor package for differential expression analysis of digital gene expression data},
## author={Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K},
## journal={Bioinformatics},
## volume={26},
## number={1},
## pages={139--140},
## year={2010},
## publisher={Oxford University Press}
## }
## @article{robinson2010scaling,
## title={A scaling normalization method for differential expression analysis of RNA-seq data},
## author={Robinson, Mark D and Oshlack, Alicia},
## journal={Genome biology},
## volume={11},
## number={3},
## pages={1--9},
## year={2010},
## publisher={BioMed Central}
## }
## @incollection{smyth2005limma,
## title={Limma: linear models for microarray data},
## author={Smyth, Gordon K},
## booktitle={Bioinformatics and computational biology solutions using R and Bioconductor},
## pages={397--420},
## year={2005},
## publisher={Springer}
## }
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2020},
## url = {https://www.R-project.org/},
## }
## @article{lund2012detecting,
## title={Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates},
## author={Lund, Steven P and Nettleton, Dan and McCarthy, Davis J and Smyth, Gordon K},
## journal={Statistical applications in genetics and molecular biology},
## volume={11},
## number={5},
## year={2012},
## publisher={De Gruyter}
## }
## @article{zhou2014robustly,
## title={Robustly detecting differential expression in RNA sequencing data using observation weights},
## author={Zhou, Xiaobei and Lindsay, Helen and Robinson, Mark D},
## journal={Nucleic acids research},
## volume={42},
## number={11},
## pages={e91--e91},
## year={2014},
## publisher={Oxford University Press}
## }
## @article{love2014moderated,
## title={Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2},
## author={Love, Michael I and Huber, Wolfgang and Anders, Simon},
## journal={Genome biology},
## volume={15},
## number={12},
## pages={550},
## year={2014},
## publisher={Springer}
## }
## @article{law2014voom,
## title={voom: Precision weights unlock linear model analysis tools for RNA-seq read counts},
## author={Law, Charity W and Chen, Yunshun and Shi, Wei and Smyth, Gordon K},
## journal={Genome biology},
## volume={15},
## number={2},
## pages={R29},
## year={2014},
## publisher={Springer}
## }
## @article{liu2015weight,
## title={Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses},
## author={Liu, Ruijie and Holik, Aliaksei Z and Su, Shian and Jansz, Natasha and Chen, Kelan and Leong, Huei San and Blewitt, Marnie E and Asselin-Labat, Marie-Liesse and Smyth, Gordon K and Ritchie, Matthew E},
## journal={Nucleic acids research},
## volume={43},
## number={15},
## pages={e97--e97},
## year={2015},
## publisher={Oxford University Press}
## }
## @article{leek2012sva,
## title={The sva package for removing batch effects and other unwanted variation in high-throughput experiments},
## author={Leek, Jeffrey T and Johnson, W Evan and Parker, Hilary S and Jaffe, Andrew E and Storey, John D},
## journal={Bioinformatics},
## volume={28},
## number={6},
## pages={882--883},
## year={2012},
## publisher={Oxford University Press}
## }
## @article{newman2015robust,
## title={Robust enumeration of cell subsets from tissue expression profiles},
## author={Newman, Aaron M and Liu, Chih Long and Green, Michael R and Gentles, Andrew J and Feng, Weiguo and Xu, Yue and Hoang, Chuong D and Diehn, Maximilian and Alizadeh, Ash A},
## journal={Nature methods},
## volume={12},
## number={5},
## pages={453--457},
## year={2015},
## publisher={Nature Publishing Group}
## }