R/aggregate_duplicates.R
aggregate_duplicates-methods.Rd
`aggregate_duplicates()` takes as input a `tbl` (with at least three columns for sample, feature, and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to a tibble with `library(tidySummarizedExperiment)`) and returns an object consistent with the input, in which duplicated transcripts have been aggregated.
aggregate_duplicates(
.data,
.transcript = NULL,
feature = NULL,
.abundance = NULL,
aggregation_function = sum,
keep_integer = TRUE,
...
)
# S4 method for class 'SummarizedExperiment'
aggregate_duplicates(
.data,
.transcript = NULL,
feature = NULL,
.abundance = NULL,
aggregation_function = sum,
keep_integer = TRUE,
...
)
# S4 method for class 'RangedSummarizedExperiment'
aggregate_duplicates(
.data,
.transcript = NULL,
feature = NULL,
.abundance = NULL,
aggregation_function = sum,
keep_integer = TRUE,
...
)
`.data`: A `tbl` (with at least three columns for sample, feature, and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to a tibble with `library(tidySummarizedExperiment)`)
`.transcript`: DEPRECATED. The name of the transcript/gene column (use `feature` instead)
`feature`: The name of the feature column as a character string
`.abundance`: The name of the transcript/gene abundance column
`aggregation_function`: A function for counts aggregation (e.g., sum, median, or mean)
`keep_integer`: A boolean. Whether to force the aggregated counts to integer
`...`: Additional arguments passed to the aggregation function
An object consistent with the input, with aggregated transcript abundance and annotation
A `SummarizedExperiment` object
Lifecycle: maturing
This function aggregates duplicated transcripts (e.g., isoforms or Ensembl identifiers). For example, we often have to convert Ensembl identifiers to gene/transcript symbols, and in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble in which transcripts with the same name are aggregated. All remaining columns are appended, and factor and boolean columns are appended as characters.
Underlying custom method: `data |> filter(n_aggr > 1) |> group_by(!!.sample, !!.transcript) |> dplyr::mutate(!!.abundance := !!.abundance |> aggregation_function())`
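The group-and-aggregate idea described above can be illustrated on a toy table. This is a minimal base-R sketch, not tidybulk's implementation: the data frame `counts` and its column names are made up for the example, and `aggregate()` stands in for the dplyr pipeline.

```r
# Toy long-format table in which geneA appears twice per sample
# (e.g. two identifiers that map to the same symbol).
# All object and column names here are hypothetical.
counts <- data.frame(
  sample  = c("s1", "s1", "s1", "s2", "s2", "s2"),
  feature = c("geneA", "geneA", "geneB", "geneA", "geneA", "geneB"),
  count   = c(10L, 5L, 7L, 3L, 4L, 8L),
  stringsAsFactors = FALSE
)

# Any function that reduces a vector to one value could be used here
aggregation_function <- sum

# Collapse rows that share the same sample/feature pair into one row
aggregated <- aggregate(
  count ~ sample + feature,
  data = counts,
  FUN  = aggregation_function
)

# Sort rows so the result has a stable order for display
aggregated <- aggregated[order(aggregated$sample, aggregated$feature), ]
rownames(aggregated) <- NULL
print(aggregated)
```

With `sum`, integer counts stay integers. Other choices such as `mean` can produce fractional values, which is the situation the `keep_integer` argument addresses.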
Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7
Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M. T., & Carey, V. J. (2013). Software for computing and annotating genomic ranges. PLoS Computational Biology, 9(8), e1003118. doi:10.1371/journal.pcbi.1003118
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.0. https://CRAN.R-project.org/package=dplyr
## Load airway dataset for examples
data('airway', package = 'airway')
# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex
# Create an aggregation column
SummarizedExperiment::rowData(airway)$gene_name <- rownames(airway)
aggregate_duplicates(
airway,
feature = "gene_name"
)
#> tidybulk says: your object does not have duplicates along the gene_name column. The input dataset is returned.
#> class: RangedSummarizedExperiment
#> dim: 63677 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
#> ENSG00000273493
#> rowData names(10): gene_id gene_name ... seq_coord_system symbol
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition
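The message in the output above appears because `gene_name` was copied from the row names, which contain no repeated values in this dataset, so there is nothing to aggregate. A quick way to check a feature column for duplicates before calling the function is sketched below in base R. The vectors are made-up stand-ins for a real annotation column.

```r
# Hypothetical feature annotation in which two rows share a symbol
gene_name <- c("TSPAN6", "TNMD", "DPM1", "TSPAN6")

# TRUE if at least one value occurs more than once
has_duplicates <- anyDuplicated(gene_name) > 0

# The values that occur more than once
duplicated_names <- unique(gene_name[duplicated(gene_name)])

# Unique identifiers, as when row names are copied into the column,
# contain no duplicates
unique_ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")
no_duplicates <- anyDuplicated(unique_ids) == 0
```

If `has_duplicates` is `FALSE` for the chosen column, aggregation has nothing to collapse and the input is returned unchanged, as in the example above.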