Drop redundant elements (e.g., samples) for which feature (e.g., transcript/gene) abundances are correlated

remove_redundancy() takes as input a `tbl` (with at least three columns for sample, feature and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to tibble with library(tidySummarizedExperiment)). The correlation method works on the sample, feature and abundance columns; the reduced_dimensions method works on the reduced-dimension columns (| <DIMENSION 1> | <DIMENSION 2> | <...> |). It returns an object consistent with the input, with the redundant elements (e.g., samples) dropped.

remove_redundancy(
  .data,
  .element = NULL,
  .feature = NULL,
  .abundance = NULL,
  method,
  of_samples = TRUE,
  correlation_threshold = 0.9,
  top = Inf,
  transform = identity,
  Dim_a_column,
  Dim_b_column,
  log_transform = NULL
)

# S4 method for class 'SummarizedExperiment'
remove_redundancy(
  .data,
  .element = NULL,
  .feature = NULL,
  .abundance = NULL,
  method,
  of_samples = TRUE,
  correlation_threshold = 0.9,
  top = Inf,
  transform = identity,
  Dim_a_column = NULL,
  Dim_b_column = NULL,
  log_transform = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
remove_redundancy(
  .data,
  .element = NULL,
  .feature = NULL,
  .abundance = NULL,
  method,
  of_samples = TRUE,
  correlation_threshold = 0.9,
  top = Inf,
  transform = identity,
  Dim_a_column = NULL,
  Dim_b_column = NULL,
  log_transform = NULL
)

Arguments

.data: A `tbl` (with at least three columns for sample, feature and transcript abundance) or `SummarizedExperiment` (more convenient if abstracted to tibble with library(tidySummarizedExperiment))
.element: The name of the element column (normally samples).
.feature: The name of the feature column (normally transcripts/genes).
.abundance: The name of the column holding the numerical values the calculation is based on (normally transcript abundance).
method: A character string. The method to use; "correlation" and "reduced_dimensions" are available. The latter eliminates one element from each of the most proximal pairs of samples in PCA reduced dimensions.
of_samples: A boolean. If the input is a tidybulk object, it indicates whether the element column is the sample column or the transcript column.
correlation_threshold: A real number between 0 and 1. Used by the correlation-based calculation.
top: An integer. How many top genes to select for the correlation-based method.
transform: A function that transforms the counts. The default in the usage above is identity (no transformation); log1p is a common choice for RNA sequencing data.
Dim_a_column: A character string. Used by the reduced_dimensions-based calculation. The column of one principal component.
Dim_b_column: A character string. Used by the reduced_dimensions-based calculation. The column of another principal component.
log_transform: DEPRECATED - A boolean, whether the value should be log-transformed (e.g., TRUE for RNA sequencing data).
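
The interplay of `transform` and `top` can be pictured with a small base R sketch. It is an illustration under stated assumptions, not the package's implementation: the helper `select_top_variable()` and the toy matrix are hypothetical, and ranking features by variance is assumed from the "most variable genes" message printed in the example output.

```r
# Toy count matrix: 6 features (rows) x 4 samples (columns); values are made up
counts <- matrix(
  c( 10,  12,  11, 300,
      0,   0,   1,   0,
     50,  55,  52,  48,
      5,  80,   6,  90,
    100,  99, 101, 100,
      3,  40,   2,  45),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("s", 1:4))
)

# Hypothetical helper (not part of tidybulk): transform the counts,
# then keep the `top` features with the largest variance across samples
select_top_variable <- function(mat, top = Inf, transform = identity) {
  transformed <- transform(mat)
  feature_variance <- apply(transformed, 1, var)
  n_keep <- min(top, nrow(transformed))
  keep <- order(feature_variance, decreasing = TRUE)[seq_len(n_keep)]
  transformed[keep, , drop = FALSE]
}

# log1p-transform the counts and keep the 3 most variable features
top_log <- select_top_variable(counts, top = 3, transform = log1p)
rownames(top_log)
```

With `top = Inf` and `transform = identity`, the sketch keeps every feature untransformed, mirroring the defaults listed in the usage.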

Value

A tbl object with dropped redundant elements (e.g., samples).

A `SummarizedExperiment` object

Details

Lifecycle: maturing

This function removes redundant elements (e.g., samples or transcripts) from the original data set. This is useful, for example, when defining cell-type specific signatures with low sample redundancy. It returns an object consistent with the input, with the redundant elements (e.g., samples) dropped. Two redundancy estimation approaches are supported: (i) removal of highly correlated clusters of elements (keeping a representative) with method="correlation"; (ii) removal of the most proximal element pairs in a reduced dimensional space with method="reduced_dimensions".
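
The reduced-dimension approach can be sketched in base R as follows. This is a simplified illustration, not the package's code: the data frame `reduced`, its coordinates, and the choice to drop only one member of the single closest pair are assumptions made for the example.

```r
# Toy reduced-dimension table: one row per sample, two precomputed components
# (made-up coordinates; "s1" and "s2" sit almost on top of each other)
reduced <- data.frame(
  sample = c("s1", "s2", "s3", "s4", "s5"),
  PC1 = c(0.0, 0.1, 5.0, -4.0, 2.0),
  PC2 = c(0.0, 0.1, 1.0, 3.0, -6.0)
)

# Pairwise Euclidean distances between samples in the PC1/PC2 plane
coords <- as.matrix(reduced[, c("PC1", "PC2")])
rownames(coords) <- reduced$sample
d <- as.matrix(dist(coords))
diag(d) <- Inf  # ignore the zero distance of each sample to itself

# Locate the closest pair and drop the member that appears later in the table
closest <- which(d == min(d), arr.ind = TRUE)[1, ]
to_drop <- rownames(d)[max(closest)]
non_redundant <- reduced[reduced$sample != to_drop, ]
non_redundant$sample
```

In this toy table, the two columns play the role of `Dim_a_column` and `Dim_b_column`.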

Underlying method for correlation: widyr::pairwise_cor(sample, transcript, count, sort = TRUE, diag = FALSE, upper = FALSE)
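
The correlation-based idea can be sketched with base R `cor()` instead of widyr, so that it runs without extra packages. The toy matrix and the rule of dropping the later sample of each correlated pair are assumptions for illustration only; the package may choose representatives differently.

```r
# Toy abundance matrix: features in rows, samples in columns (made-up values)
abundance <- cbind(
  s1 = c(1, 5, 3, 8, 2, 7),
  s2 = c(1.1, 5.2, 2.9, 8.1, 2.2, 6.8),  # near copy of s1
  s3 = c(9, 1, 6, 2, 8, 3),
  s4 = c(4, 4, 9, 1, 5, 2)
)
correlation_threshold <- 0.9

# Pairwise Pearson correlation between samples (columns)
sample_cor <- cor(abundance)

# Keep each pair once and skip the diagonal,
# comparable to upper = FALSE and diag = FALSE in pairwise_cor()
sample_cor[upper.tri(sample_cor, diag = TRUE)] <- NA
redundant_pairs <- which(sample_cor > correlation_threshold, arr.ind = TRUE)

# Drop the later sample of every pair above the threshold
to_drop <- unique(rownames(sample_cor)[redundant_pairs[, "row"]])
non_redundant <- abundance[, !colnames(abundance) %in% to_drop, drop = FALSE]
colnames(non_redundant)
```

Raising `correlation_threshold` toward 1 makes the filter more permissive, so fewer samples are dropped.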

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

Examples

## Load airway dataset for examples
data('airway', package = 'airway')

# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex

airway |>
  identify_abundant() |>
  remove_redundancy(
    .element = sample,
    .feature = transcript,
    .abundance = count,
    method = "correlation"
  )
#> Warning: All samples appear to belong to the same group.
#> Getting the 14224 most variable genes
#> # A SummarizedExperiment-tibble abstraction: 63,677 × 25
#> # Features=63677 | Samples=1 | Assays=counts
#>    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
#>    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG00000000003 SRR10395…    572 GSM1275875 N061… trt   untrt SRR1…        98
#>  2 ENSG00000000005 SRR10395…      0 GSM1275875 N061… trt   untrt SRR1…        98
#>  3 ENSG00000000419 SRR10395…    508 GSM1275875 N061… trt   untrt SRR1…        98
#>  4 ENSG00000000457 SRR10395…    229 GSM1275875 N061… trt   untrt SRR1…        98
#>  5 ENSG00000000460 SRR10395…     60 GSM1275875 N061… trt   untrt SRR1…        98
#>  6 ENSG00000000938 SRR10395…      0 GSM1275875 N061… trt   untrt SRR1…        98
#>  7 ENSG00000000971 SRR10395…   7995 GSM1275875 N061… trt   untrt SRR1…        98
#>  8 ENSG00000001036 SRR10395…   1109 GSM1275875 N061… trt   untrt SRR1…        98
#>  9 ENSG00000001084 SRR10395…    704 GSM1275875 N061… trt   untrt SRR1…        98
#> 10 ENSG00000001167 SRR10395…    269 GSM1275875 N061… trt   untrt SRR1…        98
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> #   condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> #   gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> #   seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> #   .abundant <lgl>, GRangesList <list>