Brings transcriptomics to the tidyverse!
The code is released under the version 3 of the GNU General Public License.
website: stemangiola.github.io/tidybulk/ Third party tutorials Please have a look also to
Function | Description |
---|---|
aggregate_duplicates |
Aggregate abundance and annotation of duplicated transcripts in a robust way |
identify_abundant keep_abundant
|
identify or keep the abundant genes |
keep_variable |
Filter for top variable features |
scale_abundance |
Scale (normalise) abundance for RNA sequencing depth |
reduce_dimensions |
Perform dimensionality reduction (PCA, MDS, tSNE, UMAP) |
cluster_elements |
Labels elements with cluster identity (kmeans, SNN) |
remove_redundancy |
Filter out elements with highly correlated features |
adjust_abundance |
Remove known unwanted variation (Combat) |
test_differential_abundance |
Differential transcript abundance testing (DESeq2, edgeR, voom) |
deconvolve_cellularity |
Estimated tissue composition (Cibersort, llsr, epic, xCell, mcp_counter, quantiseq |
test_differential_cellularity |
Differential cell-type abundance testing |
test_stratification_cellularity |
Estimate Kaplan-Meier survival differences |
test_gene_enrichment |
Gene enrichment analyses (EGSEA) |
test_gene_overrepresentation |
Gene enrichment on list of transcript names (no rank) |
test_gene_rank |
Gene enrichment on list of transcript (GSEA) |
impute_missing_abundance |
Impute abundance for missing data points using sample groupings |
Utilities | Description |
---|---|
get_bibliography |
Get the bibliography of your workflow |
tidybulk |
add tidybulk attributes to a tibble object |
tidybulk_SAM_BAM |
Convert SAM BAM files into tidybulk tibble |
pivot_sample |
Select sample-wise columns/information |
pivot_transcript |
Select transcript-wise columns/information |
rotate_dimensions |
Rotate two dimensions of a degree |
ensembl_to_symbol |
Add gene symbol from ensembl IDs |
symbol_to_entrez |
Add entrez ID from gene symbol |
describe_transcript |
Add gene description from gene symbol |
All functions are directly compatibble with SummarizedExperiment
object.
From Bioconductor
BiocManager::install("tidybulk")
From Github
devtools::install_github("stemangiola/tidybulk")
We will use a SummarizedExperiment
object
counts_SE
## [90m# A SummarizedExperiment-tibble abstraction: 408,624 × 8[39m
## [90m# [90mFeatures=8513 | Samples=48 | Assays=count[0m[39m
## .feature .sample count Cell.type time condition batch factor_of_interest
## [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<lgl>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<lgl>[39m[23m
## [90m 1[39m A1BG SRR1740034 153 b_cell 0 d TRUE 0 TRUE
## [90m 2[39m A1BG-AS1 SRR1740034 83 b_cell 0 d TRUE 0 TRUE
## [90m 3[39m AAAS SRR1740034 868 b_cell 0 d TRUE 0 TRUE
## [90m 4[39m AACS SRR1740034 222 b_cell 0 d TRUE 0 TRUE
## [90m 5[39m AAGAB SRR1740034 590 b_cell 0 d TRUE 0 TRUE
## [90m 6[39m AAMDC SRR1740034 48 b_cell 0 d TRUE 0 TRUE
## [90m 7[39m AAMP SRR1740034 [4m1[24m257 b_cell 0 d TRUE 0 TRUE
## [90m 8[39m AANAT SRR1740034 284 b_cell 0 d TRUE 0 TRUE
## [90m 9[39m AAR2 SRR1740034 379 b_cell 0 d TRUE 0 TRUE
## [90m10[39m AARS2 SRR1740034 685 b_cell 0 d TRUE 0 TRUE
## [90m# ℹ 40 more rows[39m
Loading tidySummarizedExperiment
will automatically abstract this object as tibble
, so we can display it and manipulate it with tidy tools. Although it looks different, and more tools (tidyverse) are available to us, this object is in fact a SummarizedExperiment
object.
class(counts_SE)
First of all, you can cite all articles utilised within your workflow automatically from any tidybulk tibble
counts_SE |> get_bibliography()
transcripts
tidybulk provide the aggregate_duplicates
function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. aggregate_duplicates
takes a tibble and column names (as symbols; for sample
, transcript
and count
) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters.
TidyTranscriptomics
rowData(counts_SE)$gene_name = rownames(counts_SE)
counts_SE.aggr = counts_SE |> aggregate_duplicates(.transcript = gene_name)
Standard procedure (comparative purpose)
counts
We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). scale_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a method as arguments and returns a tibble with additional columns with scaled data as <NAME OF COUNT COLUMN>_scaled
.
TidyTranscriptomics
counts_SE.norm = counts_SE.aggr |> identify_abundant(factor_of_interest = condition) |> scale_abundance()
Standard procedure (comparative purpose)
We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.
counts_SE.norm |>
ggplot(aes(count_scaled + 1, group=.sample, color=`Cell.type`)) +
geom_density() +
scale_x_log10() +
my_theme
variable transcripts
We may want to identify and filter variable transcripts.
TidyTranscriptomics
counts_SE.norm.variable = counts_SE.norm |> keep_variable()
Standard procedure (comparative purpose)
dimensions
We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. reduce_dimensions
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.
MDS (Robinson et al., 10.1093/bioinformatics/btp616)
TidyTranscriptomics
counts_SE.norm.MDS =
counts_SE.norm |>
reduce_dimensions(method="MDS", .dims = 6)
Standard procedure (comparative purpose)
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
counts_SE.norm.MDS |> pivot_sample() |> select(contains("Dim"), everything())
## [90m# A tibble: 48 × 14[39m
## Dim1 Dim2 Dim3 Dim4 Dim5 Dim6 .sample Cell.type time
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<fct>[39m[23m
## [90m 1[39m -[31m1[39m[31m.[39m[31m46[39m 0.220 -[31m1[39m[31m.[39m[31m68[39m 0.055[4m3[24m 0.065[4m8[24m -[31m0[39m[31m.[39m[31m126[39m SRR1740034 b_cell 0 d
## [90m 2[39m -[31m1[39m[31m.[39m[31m46[39m 0.226 -[31m1[39m[31m.[39m[31m71[39m 0.030[4m0[24m 0.045[4m4[24m -[31m0[39m[31m.[39m[31m137[39m SRR1740035 b_cell 1 d
## [90m 3[39m -[31m1[39m[31m.[39m[31m44[39m 0.193 -[31m1[39m[31m.[39m[31m60[39m 0.089[4m0[24m 0.050[4m3[24m -[31m0[39m[31m.[39m[31m121[39m SRR1740036 b_cell 3 d
## [90m 4[39m -[31m1[39m[31m.[39m[31m44[39m 0.198 -[31m1[39m[31m.[39m[31m67[39m 0.089[4m1[24m 0.054[4m3[24m -[31m0[39m[31m.[39m[31m110[39m SRR1740037 b_cell 7 d
## [90m 5[39m 0.243 -[31m1[39m[31m.[39m[31m42[39m 0.182 0.006[4m4[24m[4m2[24m -[31m0[39m[31m.[39m[31m503[39m -[31m0[39m[31m.[39m[31m131[39m SRR1740038 dendritic_mye… 0 d
## [90m 6[39m 0.191 -[31m1[39m[31m.[39m[31m42[39m 0.195 0.018[4m0[24m -[31m0[39m[31m.[39m[31m457[39m -[31m0[39m[31m.[39m[31m130[39m SRR1740039 dendritic_mye… 1 d
## [90m 7[39m 0.257 -[31m1[39m[31m.[39m[31m42[39m 0.152 0.013[4m0[24m -[31m0[39m[31m.[39m[31m582[39m -[31m0[39m[31m.[39m[31m0[39m[31m92[4m7[24m[39m SRR1740040 dendritic_mye… 3 d
## [90m 8[39m 0.162 -[31m1[39m[31m.[39m[31m43[39m 0.189 0.023[4m2[24m -[31m0[39m[31m.[39m[31m452[39m -[31m0[39m[31m.[39m[31m109[39m SRR1740041 dendritic_mye… 7 d
## [90m 9[39m 0.516 -[31m1[39m[31m.[39m[31m47[39m 0.240 -[31m0[39m[31m.[39m[31m251[39m 0.457 -[31m0[39m[31m.[39m[31m119[39m SRR1740042 monocyte 0 d
## [90m10[39m 0.514 -[31m1[39m[31m.[39m[31m41[39m 0.231 -[31m0[39m[31m.[39m[31m219[39m 0.458 -[31m0[39m[31m.[39m[31m131[39m SRR1740043 monocyte 1 d
## [90m# ℹ 38 more rows[39m
## [90m# ℹ 5 more variables: condition <lgl>, batch <fct>, factor_of_interest <lgl>,[39m
## [90m# TMM <dbl>, multiplier <dbl>[39m
counts_SE.norm.MDS |>
pivot_sample() |>
GGally::ggpairs(columns = 6:(6+5), ggplot2::aes(colour=`Cell.type`))
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## [1m[22m`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1m[22m`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1m[22m`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1m[22m`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1m[22m`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PCA
TidyTranscriptomics
counts_SE.norm.PCA =
counts_SE.norm |>
reduce_dimensions(method="PCA", .dims = 6)
Standard procedure (comparative purpose)
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
counts_SE.norm.PCA |> pivot_sample() |> select(contains("PC"), everything())
## [90m# A tibble: 48 × 14[39m
## PC1 PC2 PC3 PC4 PC5 PC6 .sample Cell.type time condition
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<lgl>[39m[23m
## [90m 1[39m -[31m12[39m[31m.[39m[31m6[39m -[31m2[39m[31m.[39m[31m52[39m -[31m14[39m[31m.[39m[31m9[39m -[31m0[39m[31m.[39m[31m424[39m -[31m0[39m[31m.[39m[31m592[39m -[31m1[39m[31m.[39m[31m22[39m SRR17400… b_cell 0 d TRUE
## [90m 2[39m -[31m12[39m[31m.[39m[31m6[39m -[31m2[39m[31m.[39m[31m57[39m -[31m15[39m[31m.[39m[31m2[39m -[31m0[39m[31m.[39m[31m140[39m -[31m0[39m[31m.[39m[31m388[39m -[31m1[39m[31m.[39m[31m30[39m SRR17400… b_cell 1 d TRUE
## [90m 3[39m -[31m12[39m[31m.[39m[31m6[39m -[31m2[39m[31m.[39m[31m41[39m -[31m14[39m[31m.[39m[31m5[39m -[31m0[39m[31m.[39m[31m714[39m -[31m0[39m[31m.[39m[31m344[39m -[31m1[39m[31m.[39m[31m10[39m SRR17400… b_cell 3 d TRUE
## [90m 4[39m -[31m12[39m[31m.[39m[31m5[39m -[31m2[39m[31m.[39m[31m34[39m -[31m14[39m[31m.[39m[31m9[39m -[31m0[39m[31m.[39m[31m816[39m -[31m0[39m[31m.[39m[31m427[39m -[31m1[39m[31m.[39m[31m00[39m SRR17400… b_cell 7 d TRUE
## [90m 5[39m 0.189 13.0 1.66 -[31m0[39m[31m.[39m[31m0[39m[31m26[4m9[24m[39m 4.64 -[31m1[39m[31m.[39m[31m35[39m SRR17400… dendriti… 0 d FALSE
## [90m 6[39m -[31m0[39m[31m.[39m[31m293[39m 12.9 1.76 -[31m0[39m[31m.[39m[31m0[39m[31m72[4m7[24m[39m 4.21 -[31m1[39m[31m.[39m[31m28[39m SRR17400… dendriti… 1 d FALSE
## [90m 7[39m 0.407 13.0 1.42 -[31m0[39m[31m.[39m[31m0[39m[31m52[4m9[24m[39m 5.37 -[31m1[39m[31m.[39m[31m0[39m[31m1[39m SRR17400… dendriti… 3 d FALSE
## [90m 8[39m -[31m0[39m[31m.[39m[31m620[39m 13.0 1.73 -[31m0[39m[31m.[39m[31m201[39m 4.17 -[31m1[39m[31m.[39m[31m0[39m[31m7[39m SRR17400… dendriti… 7 d FALSE
## [90m 9[39m 2.56 13.5 2.32 2.03 -[31m4[39m[31m.[39m[31m32[39m -[31m1[39m[31m.[39m[31m22[39m SRR17400… monocyte 0 d FALSE
## [90m10[39m 2.65 13.1 2.21 1.80 -[31m4[39m[31m.[39m[31m29[39m -[31m1[39m[31m.[39m[31m30[39m SRR17400… monocyte 1 d FALSE
## [90m# ℹ 38 more rows[39m
## [90m# ℹ 4 more variables: batch <fct>, factor_of_interest <lgl>, TMM <dbl>,[39m
## [90m# multiplier <dbl>[39m
counts_SE.norm.PCA |>
pivot_sample() |>
GGally::ggpairs(columns = 11:13, ggplot2::aes(colour=`Cell.type`))
tSNE
TidyTranscriptomics
counts_SE.norm.tSNE =
breast_tcga_mini_SE |>
identify_abundant() |>
reduce_dimensions(
method = "tSNE",
perplexity=10,
pca_scale =TRUE
)
Standard procedure (comparative purpose)
Plot
counts_SE.norm.tSNE |>
pivot_sample() |>
select(contains("tSNE"), everything())
## [90m# A tibble: 251 × 4[39m
## tSNE1 tSNE2 .sample Call
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<fct>[39m[23m
## [90m 1[39m 4.25 8.51 TCGA-A1-A0SD-01A-11R-A115-07 LumA
## [90m 2[39m -[31m5[39m[31m.[39m[31m57[39m -[31m2[39m[31m.[39m[31m96[39m TCGA-A1-A0SF-01A-11R-A144-07 LumA
## [90m 3[39m 8.64 -[31m3[39m[31m.[39m[31m15[39m TCGA-A1-A0SG-01A-11R-A144-07 LumA
## [90m 4[39m 4.91 -[31m4[39m[31m.[39m[31m31[39m TCGA-A1-A0SH-01A-11R-A084-07 LumA
## [90m 5[39m 4.74 5.76 TCGA-A1-A0SI-01A-11R-A144-07 LumB
## [90m 6[39m -[31m4[39m[31m.[39m[31m0[39m[31m5[39m 7.13 TCGA-A1-A0SJ-01A-11R-A084-07 LumA
## [90m 7[39m -[31m32[39m[31m.[39m[31m2[39m -[31m5[39m[31m.[39m[31m23[39m TCGA-A1-A0SK-01A-12R-A084-07 Basal
## [90m 8[39m 4.22 -[31m13[39m[31m.[39m[31m2[39m TCGA-A1-A0SM-01A-11R-A084-07 LumA
## [90m 9[39m 4.88 -[31m12[39m[31m.[39m[31m1[39m TCGA-A1-A0SN-01A-11R-A144-07 LumB
## [90m10[39m 13.6 17.8 TCGA-A1-A0SQ-01A-21R-A144-07 LumA
## [90m# ℹ 241 more rows[39m
counts_SE.norm.tSNE |>
pivot_sample() |>
ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + my_theme
dimensions
We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. rotate_dimensions
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as <NAME OF DIMENSION> rotated <ANGLE>
by default, or as specified in the input arguments.
TidyTranscriptomics
counts_SE.norm.MDS.rotated =
counts_SE.norm.MDS |>
rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, action="get")
Standard procedure (comparative purpose)
Original On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.
counts_SE.norm.MDS.rotated |>
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type` )) +
geom_point() +
my_theme
Rotated On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.
counts_SE.norm.MDS.rotated |>
pivot_sample() |>
ggplot(aes(x=`Dim1_rotated_45`, y=`Dim2_rotated_45`, color=`Cell.type` )) +
geom_point() +
my_theme
differential abundance
We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). test_differential_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).
TidyTranscriptomics
counts_SE.de =
counts_SE |>
test_differential_abundance( ~ condition, action="get")
counts_SE.de
Standard procedure (comparative purpose)
library(edgeR)
dgList <- DGEList(counts=counts_m,group=group)
keep <- filterByExpr(dgList)
dgList <- dgList[keep,,keep.lib.sizes=FALSE]
dgList <- calcNormFactors(dgList)
design <- model.matrix(~group)
dgList <- estimateDisp(dgList,design)
fit <- glmQLFit(dgList,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf, n=Inf)
The functon test_differential_abundance
operated with contrasts too. The constrasts hve the name of the design matrix (generally
counts_SE.de =
counts_SE |>
identify_abundant(factor_of_interest = condition) |>
test_differential_abundance(
~ 0 + condition,
.contrasts = c( "conditionTRUE - conditionFALSE"),
action="get"
)
counts
We may want to adjust counts
for (known) unwanted variation. adjust_abundance
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as <COUNT COLUMN>_adjusted
. At the moment just an unwanted covariated is allowed at a time.
TidyTranscriptomics
counts_SE.norm.adj =
counts_SE.norm |> adjust_abundance( .factor_unwanted = batch, .factor_of_interest = factor_of_interest)
Standard procedure (comparative purpose)
library(sva)
count_m_log = log(count_m + 1)
design =
model.matrix(
object = ~ factor_of_interest + batch,
data = annotation
)
count_m_log.sva =
ComBat(
batch = design[,2],
mod = design,
...
)
count_m_log.sva = ceiling(exp(count_m_log.sva) -1)
count_m_log.sva$cell_type = counts[
match(counts$sample, rownames(count_m_log.sva)),
"Cell.type"
]
Cell type composition
We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). deconvolve_cellularity
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the adjusted cell type proportions.
TidyTranscriptomics
counts_SE.cibersort =
counts_SE |>
deconvolve_cellularity(action="get", cores=1, prefix = "cibersort__")
Standard procedure (comparative purpose)
With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting).
counts_SE.cibersort |>
pivot_longer(
names_to= "Cell_type_inferred",
values_to = "proportion",
names_prefix ="cibersort__",
cols=contains("cibersort__")
) |>
ggplot(aes(x=`Cell_type_inferred`, y=proportion, fill=`Cell.type`)) +
geom_boxplot() +
facet_wrap(~`Cell.type`) +
my_theme +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5)
We can also perform a statistical test on the differential cell-type abundance across conditions
counts_SE |>
test_differential_cellularity(. ~ condition )
## [90m# A tibble: 22 × 7[39m
## .cell_type cell_type_proportions `estimate_(Intercept)`
## [3m[90m<chr>[39m[23m [3m[90m<list>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m cibersort.B.cells.naive [90m<tibble [48 × 8]>[39m -[31m2[39m[31m.[39m[31m94[39m
## [90m 2[39m cibersort.B.cells.memory [90m<tibble [48 × 8]>[39m -[31m4[39m[31m.[39m[31m86[39m
## [90m 3[39m cibersort.Plasma.cells [90m<tibble [48 × 8]>[39m -[31m5[39m[31m.[39m[31m33[39m
## [90m 4[39m cibersort.T.cells.CD8 [90m<tibble [48 × 8]>[39m -[31m2[39m[31m.[39m[31m33[39m
## [90m 5[39m cibersort.T.cells.CD4.naive [90m<tibble [48 × 8]>[39m -[31m2[39m[31m.[39m[31m83[39m
## [90m 6[39m cibersort.T.cells.CD4.memory.re… [90m<tibble [48 × 8]>[39m -[31m2[39m[31m.[39m[31m46[39m
## [90m 7[39m cibersort.T.cells.CD4.memory.ac… [90m<tibble [48 × 8]>[39m -[31m3[39m[31m.[39m[31m67[39m
## [90m 8[39m cibersort.T.cells.follicular.he… [90m<tibble [48 × 8]>[39m -[31m5[39m[31m.[39m[31m68[39m
## [90m 9[39m cibersort.T.cells.regulatory..T… [90m<tibble [48 × 8]>[39m -[31m5[39m[31m.[39m[31m0[39m[31m4[39m
## [90m10[39m cibersort.T.cells.gamma.delta [90m<tibble [48 × 8]>[39m -[31m4[39m[31m.[39m[31m78[39m
## [90m# ℹ 12 more rows[39m
## [90m# ℹ 4 more variables: estimate_conditionTRUE <dbl>,[39m
## [90m# std.error_conditionTRUE <dbl>, statistic_conditionTRUE <dbl>,[39m
## [90m# p.value_conditionTRUE <dbl>[39m
We can also perform regression analysis with censored data (coxph).
# Add survival data
counts_SE_survival =
counts_SE |>
nest(data = -sample) |>
mutate(
days = sample(1:1000, size = n()),
dead = sample(c(0,1), size = n(), replace = TRUE)
) |>
unnest(data)
## Warning in is_sample_feature_deprecated_used(.data, .cols):
## tidySummarizedExperiment says: from version 1.3.1, the special columns
## including sample/feature id (colnames(se), rownames(se)) has changed to
## ".sample" and ".feature". This dataset is returned with the old-style
## vocabulary (feature and sample), however we suggest to update your workflow to
## reflect the new vocabulary (.feature, .sample)
## Warning in is_sample_feature_deprecated_used(.data, .cols):
## tidySummarizedExperiment says: from version 1.3.1, the special columns
## including sample/feature id (colnames(se), rownames(se)) has changed to
## ".sample" and ".feature". This dataset is returned with the old-style
## vocabulary (feature and sample), however we suggest to update your workflow to
## reflect the new vocabulary (.feature, .sample)
## Warning: [1m[22mThere were 96 warnings in `mutate()`.
## The first warning was:
## [1m[22m[36mℹ[39m In argument: `data = map2(...)`.
## Caused by warning in `is_sample_feature_deprecated_used()`:
## [33m![39m tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
## [1m[22m[36mℹ[39m Run `dplyr::last_dplyr_warnings()` to see the 95 remaining warnings.
# Test
counts_SE_survival |>
test_differential_cellularity(survival::Surv(days, dead) ~ .)
## [90m# A tibble: 22 × 6[39m
## .cell_type cell_type_proportions estimate std.error statistic p.value
## [3m[90m<chr>[39m[23m [3m[90m<list>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m cibersort.B.cells… [90m<tibble [48 × 9]>[39m -[31m1[39m[31m.[39m[31m46[39m 0.614 -[31m2[39m[31m.[39m[31m37[39m 0.017[4m6[24m
## [90m 2[39m cibersort.B.cells… [90m<tibble [48 × 9]>[39m -[31m1[39m[31m.[39m[31m15[39m 0.483 -[31m2[39m[31m.[39m[31m38[39m 0.017[4m2[24m
## [90m 3[39m cibersort.Plasma.… [90m<tibble [48 × 9]>[39m -[31m0[39m[31m.[39m[31m811[39m 0.472 -[31m1[39m[31m.[39m[31m72[39m 0.085[4m8[24m
## [90m 4[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m2[39m[31m.[39m[31m62[39m 1.28 -[31m2[39m[31m.[39m[31m0[39m[31m5[39m 0.040[4m5[24m
## [90m 5[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m 0.287 0.374 0.769 0.442
## [90m 6[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m0[39m[31m.[39m[31m744[39m 0.549 -[31m1[39m[31m.[39m[31m36[39m 0.175
## [90m 7[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m0[39m[31m.[39m[31m777[39m 0.829 -[31m0[39m[31m.[39m[31m937[39m 0.349
## [90m 8[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m2[39m[31m.[39m[31m0[39m[31m3[39m 0.779 -[31m2[39m[31m.[39m[31m61[39m 0.009[4m0[24m[4m9[24m
## [90m 9[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m0[39m[31m.[39m[31m794[39m 0.644 -[31m1[39m[31m.[39m[31m23[39m 0.217
## [90m10[39m cibersort.T.cells… [90m<tibble [48 × 9]>[39m -[31m1[39m[31m.[39m[31m95[39m 1.01 -[31m1[39m[31m.[39m[31m92[39m 0.054[4m7[24m
## [90m# ℹ 12 more rows[39m
We can also perform test of Kaplan-Meier curves.
counts_stratified =
counts_SE_survival |>
# Test
test_stratification_cellularity(
survival::Surv(days, dead) ~ .,
sample, transcript, count
)
counts_stratified
## [90m# A tibble: 22 × 6[39m
## .cell_type cell_type_proportions .low_cellularity_exp…¹
## [3m[90m<chr>[39m[23m [3m[90m<list>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m cibersort.B.cells.naive [90m<tibble [48 × 9]>[39m 8.42
## [90m 2[39m cibersort.B.cells.memory [90m<tibble [48 × 9]>[39m 11.6
## [90m 3[39m cibersort.Plasma.cells [90m<tibble [48 × 9]>[39m 12.8
## [90m 4[39m cibersort.T.cells.CD8 [90m<tibble [48 × 9]>[39m 8.05
## [90m 5[39m cibersort.T.cells.CD4.naive [90m<tibble [48 × 9]>[39m 11.6
## [90m 6[39m cibersort.T.cells.CD4.memory.re… [90m<tibble [48 × 9]>[39m 6.13
## [90m 7[39m cibersort.T.cells.CD4.memory.ac… [90m<tibble [48 × 9]>[39m 8.31
## [90m 8[39m cibersort.T.cells.follicular.he… [90m<tibble [48 × 9]>[39m 14.8
## [90m 9[39m cibersort.T.cells.regulatory..T… [90m<tibble [48 × 9]>[39m 9.36
## [90m10[39m cibersort.T.cells.gamma.delta [90m<tibble [48 × 9]>[39m 15.7
## [90m# ℹ 12 more rows[39m
## [90m# ℹ abbreviated name: ¹.low_cellularity_expected[39m
## [90m# ℹ 3 more variables: .high_cellularity_expected <dbl>, pvalue <dbl>,[39m
## [90m# plot <list>[39m
Plot Kaplan-Meier curves
counts_stratified$plot[[1]]
samples
We may want to cluster our data (e.g., using k-means sample-wise). cluster_elements
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.
k-means
TidyTranscriptomics
counts_SE.norm.cluster = counts_SE.norm.MDS |>
cluster_elements(method="kmeans", centers = 2, action="get" )
Standard procedure (comparative purpose)
We can add cluster annotation to the MDS dimension reduced data set and plot.
counts_SE.norm.cluster |>
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster_kmeans`)) +
geom_point() +
my_theme
SNN
Matrix package (v1.3-3) causes an error with Seurat::FindNeighbors used in this method. We are trying to solve this issue. At the moment this option in unaviable.
TidyTranscriptomics
counts_SE.norm.SNN =
counts_SE.norm.tSNE |>
cluster_elements(method = "SNN")
Standard procedure (comparative purpose)
library(Seurat)
snn = CreateSeuratObject(count_m)
snn = ScaleData(
snn, display.progress = TRUE,
num.cores=4, do.par = TRUE
)
snn = FindVariableFeatures(snn, selection.method = "vst")
snn = FindVariableFeatures(snn, selection.method = "vst")
snn = RunPCA(snn, npcs = 30)
snn = FindNeighbors(snn)
snn = FindClusters(snn, method = "igraph", ...)
snn = snn[["seurat_clusters"]]
snn$cell_type = tibble_counts[
match(tibble_counts$sample, rownames(snn)),
c("Cell.type", "Dim1", "Dim2")
]
counts_SE.norm.SNN |>
pivot_sample() |>
select(contains("tSNE"), everything())
counts_SE.norm.SNN |>
pivot_sample() |>
gather(source, Call, c("cluster_SNN", "Call")) |>
distinct() |>
ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + facet_grid(~source) + my_theme
# Do differential transcription between clusters
counts_SE.norm.SNN |>
mutate(factor_of_interest = `cluster_SNN` == 3) |>
test_differential_abundance(
~ factor_of_interest,
action="get"
)
redundant
transcripts
We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. remove_redundancy
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with redundant elements removed (e.g., samples). Two redundancy estimation approaches are supported:
Approach 1
TidyTranscriptomics
counts_SE.norm.non_redundant =
counts_SE.norm.MDS |>
remove_redundancy( method = "correlation" )
Standard procedure (comparative purpose)
library(widyr)
.data.correlated =
pairwise_cor(
counts,
sample,
transcript,
rc,
sort = TRUE,
diag = FALSE,
upper = FALSE
) |>
filter(correlation > correlation_threshold) |>
distinct(item1) |>
rename(!!.element := item1)
# Return non redudant data frame
counts |> anti_join(.data.correlated) |>
spread(sample, rc, - transcript) |>
left_join(annotation)
We can visualise how the reduced redundancy with the reduced dimentions look like
counts_SE.norm.non_redundant |>
pivot_sample() |>
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type`)) +
geom_point() +
my_theme
Approach 2
counts_SE.norm.non_redundant =
counts_SE.norm.MDS |>
remove_redundancy(
method = "reduced_dimensions",
Dim_a_column = `Dim1`,
Dim_b_column = `Dim2`
)
We can visualise MDS reduced dimensions of the samples with the closest pair removed.
counts_SE.norm.non_redundant |>
pivot_sample() |>
ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type`)) +
geom_point() +
my_theme
The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above.
We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts).
counts = tidybulk_SAM_BAM(
file_names,
genome = "hg38",
isPairedEnd = TRUE,
requireBothEndsMapped = TRUE,
checkFragLength = FALSE,
useMetaFeatures = TRUE
)
We can add gene symbols from ensembl identifiers. This is useful since different resources use ensembl IDs while others use gene symbol IDs. This currently works for human and mouse.
counts_ensembl |> ensembl_to_symbol(ens)
## [90m# A tibble: 119 × 8[39m
## ens iso `read count` sample cases_0_project_dise…¹ cases_0_samples_0_sa…²
## [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m
## [90m 1[39m ENSG… 13 144 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 2[39m ENSG… 13 72 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 3[39m ENSG… 13 0 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 4[39m ENSG… 13 [4m1[24m099 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 5[39m ENSG… 13 11 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 6[39m ENSG… 13 2 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 7[39m ENSG… 13 3 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 8[39m ENSG… 13 [4m2[24m678 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m 9[39m ENSG… 13 751 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m10[39m ENSG… 13 1 TARGE… Acute Myeloid Leukemia Primary Blood Derived…
## [90m# ℹ 109 more rows[39m
## [90m# ℹ abbreviated names: ¹cases_0_project_disease_type,[39m
## [90m# ²cases_0_samples_0_sample_type[39m
## [90m# ℹ 2 more variables: transcript <chr>, ref_genome <chr>[39m
We can add gene full name (and in future description) from symbol identifiers. This currently works for human and mouse.
counts_SE |>
describe_transcript() |>
select(feature, description, everything())
##
##
## Warning in is_sample_feature_deprecated_used(.data, .cols):
## tidySummarizedExperiment says: from version 1.3.1, the special columns
## including sample/feature id (colnames(se), rownames(se)) has changed to
## ".sample" and ".feature". This dataset is returned with the old-style
## vocabulary (feature and sample), however we suggest to update your workflow to
## reflect the new vocabulary (.feature, .sample)
## [90m# A SummarizedExperiment-tibble abstraction: 408,624 × 10[39m
## [90m# [90mFeatures=8513 | Samples=48 | Assays=count[0m[39m
## feature sample count Cell.type time condition batch factor_of_interest
## [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<lgl>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<lgl>[39m[23m
## [90m 1[39m A1BG SRR1740034 153 b_cell 0 d TRUE 0 TRUE
## [90m 2[39m A1BG-AS1 SRR1740034 83 b_cell 0 d TRUE 0 TRUE
## [90m 3[39m AAAS SRR1740034 868 b_cell 0 d TRUE 0 TRUE
## [90m 4[39m AACS SRR1740034 222 b_cell 0 d TRUE 0 TRUE
## [90m 5[39m AAGAB SRR1740034 590 b_cell 0 d TRUE 0 TRUE
## [90m 6[39m AAMDC SRR1740034 48 b_cell 0 d TRUE 0 TRUE
## [90m 7[39m AAMP SRR1740034 [4m1[24m257 b_cell 0 d TRUE 0 TRUE
## [90m 8[39m AANAT SRR1740034 284 b_cell 0 d TRUE 0 TRUE
## [90m 9[39m AAR2 SRR1740034 379 b_cell 0 d TRUE 0 TRUE
## [90m10[39m AARS2 SRR1740034 685 b_cell 0 d TRUE 0 TRUE
## [90m# ℹ 40 more rows[39m
## [90m# ℹ 2 more variables: description <chr>, gene_name <chr>[39m