Scale the counts of transcripts/genes — scale

scale_abundance() takes as input A `tbl` (with at least three columns for sample, feature and transcript abundance) or `SummarizedExperiment` (more convenient if abstracted to tibble with library(tidySummarizedExperiment)) and Scales transcript abundance compansating for sequencing depth (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25).

scale_abundance(
  .data,
  abundance = assayNames(.data)[1],
  method = "TMM",
  reference_sample = NULL,
  .subset_for_scaling = NULL,
  suffix = "_scaled",
  reference_selection_function = NULL,
  ...,
  .abundance = NULL
)

# S4 method for class 'SummarizedExperiment'
scale_abundance(
  .data,
  abundance = assayNames(.data)[1],
  method = "TMM",
  reference_sample = NULL,
  .subset_for_scaling = NULL,
  suffix = "_scaled",
  reference_selection_function = NULL,
  ...,
  .abundance = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
scale_abundance(
  .data,
  abundance = assayNames(.data)[1],
  method = "TMM",
  reference_sample = NULL,
  .subset_for_scaling = NULL,
  suffix = "_scaled",
  reference_selection_function = NULL,
  ...,
  .abundance = NULL
)

Arguments

.data: A `tbl` (with at least three columns for sample, feature and transcript abundance) or `SummarizedExperiment` (more convenient if abstracted to tibble with library(tidySummarizedExperiment))
abundance: The name of the transcript/gene abundance column (character, preferred)
method: A character string. The scaling method passed to the back-end function (i.e., edgeR::calcNormFactors; "TMM","TMMwsp","RLE","upperquartile")
reference_sample: A character string. The name of the reference sample. If NULL the sample with highest total read count will be selected as reference.
.subset_for_scaling: A gene-wise quosure condition. This will be used to filter rows (features/genes) of the dataset. For example
suffix: A character string to append to the scaled abundance column name. Default is "_scaled".
reference_selection_function: DEPRECATED. please use reference_sample.
...: Further arguments.
.abundance: DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility)

Value

A tbl object with additional columns with scaled data as `<NAME OF COUNT COLUMN>_scaled`

A `SummarizedExperiment` object

Details

`r lifecycle::badge("maturing")`

Scales transcript abundance compensating for sequencing depth (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). Lowly transcribed transcripts/genes (defined with minimum_counts and minimum_proportion parameters) are filtered out from the scaling procedure. The scaling inference is then applied back to all unfiltered data.

Underlying method edgeR::calcNormFactors(.data, method = c("TMM","TMMwsp","RLE","upperquartile"))

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi:10.1186/gb-2010-11-3-r25

Examples

## Load airway dataset for examples

  data('airway', package = 'airway')
  # Ensure a 'condition' column exists for examples expecting it

    SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex




 airway |>
   identify_abundant() |>
   scale_abundance()
#> Warning: All samples appear to belong to the same group.
#> tidybulk says: the sample with largest library size SRR1039517 was chosen as reference for scaling
#> # A SummarizedExperiment-tibble abstraction: 509,416 × 28
#> # Features=63677 | Samples=8 | Assays=counts, counts_scaled
#>    .feature      .sample counts counts_scaled SampleName cell  dex   albut Run  
#>    <chr>         <chr>    <int>         <dbl> <fct>      <fct> <fct> <fct> <fct>
#>  1 ENSG00000000… SRR103…    679         940.  GSM1275862 N613… untrt untrt SRR1…
#>  2 ENSG00000000… SRR103…      0           0   GSM1275862 N613… untrt untrt SRR1…
#>  3 ENSG00000000… SRR103…    467         646.  GSM1275862 N613… untrt untrt SRR1…
#>  4 ENSG00000000… SRR103…    260         360.  GSM1275862 N613… untrt untrt SRR1…
#>  5 ENSG00000000… SRR103…     60          83.1 GSM1275862 N613… untrt untrt SRR1…
#>  6 ENSG00000000… SRR103…      0           0   GSM1275862 N613… untrt untrt SRR1…
#>  7 ENSG00000000… SRR103…   3251        4500.  GSM1275862 N613… untrt untrt SRR1…
#>  8 ENSG00000001… SRR103…   1433        1984.  GSM1275862 N613… untrt untrt SRR1…
#>  9 ENSG00000001… SRR103…    519         718.  GSM1275862 N613… untrt untrt SRR1…
#> 10 ENSG00000001… SRR103…    394         545.  GSM1275862 N613… untrt untrt SRR1…
#> # ℹ 40 more rows
#> # ℹ 19 more variables: avgLength <int>, Experiment <fct>, Sample <fct>,
#> #   BioSample <fct>, condition <fct>, TMM <dbl>, multiplier <dbl>,
#> #   gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
#> #   gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <int>,
#> #   seq_coord_system <int>, symbol <chr>, .abundant <lgl>, GRangesList <list>