Filter to keep only abundant transcripts/genes — keep

Filters the data to keep only transcripts/genes that are consistently expressed above a threshold across samples. This is a filtering version of identify_abundant() that removes low-abundance features instead of just marking them.

keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

# S4 method for class 'spec_tbl_df'
keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

# S4 method for class 'tbl_df'
keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

# S4 method for class 'tidybulk'
keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

# S4 method for class 'SummarizedExperiment'
keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

# S4 method for class 'RangedSummarizedExperiment'
keep_abundant(
  .data,
  .sample = NULL,
  .transcript = NULL,
  .abundance = NULL,
  factor_of_interest = NULL,
  design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7
)

Arguments

.data: A `tbl` or `SummarizedExperiment` object containing transcript/gene abundance data
.sample: The name of the sample column
.transcript: The name of the transcript/gene column
.abundance: The name of the transcript/gene abundance column
factor_of_interest: The name of the column containing groups/conditions for filtering. Used by edgeR's filterByExpr to define sample groups.
design: A design matrix for more complex experimental designs. If provided, this is passed to filterByExpr instead of factor_of_interest.
minimum_counts: A positive number specifying the minimum counts per million (CPM) threshold for a transcript to be kept (default = 10)
minimum_proportion: A number between 0 and 1 specifying the minimum proportion of samples that must exceed the minimum_counts threshold (default = 0.7)

Value

Returns a filtered version of the input object containing only the features that passed the abundance threshold criteria.

A consistent object (to the input) with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).

A `SummarizedExperiment` object

Details

This function uses edgeR's filterByExpr() function to identify and keep consistently expressed features. A feature is kept if it has CPM > minimum_counts in at least minimum_proportion of samples in at least one experimental group (defined by factor_of_interest or design).

This function is similar to identify_abundant() but instead of adding an .abundant column, it filters out the low-abundance features directly.

References

McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. DOI: 10.1093/bioinformatics/btp616

Examples

# Basic usage
se_mini |> keep_abundant()
#> No group or design set. Assuming all samples belong to one group.
#> class: SummarizedExperiment 
#> dim: 182 5 
#> metadata(0):
#> assays(1): count
#> rownames(182): ACAP1 ACP5 ... ZNF286A ZNF324
#> rowData names(2): entrez .abundant
#> colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
#> colData names(5): Cell.type time condition days dead

# With custom thresholds
se_mini |> keep_abundant(
  minimum_counts = 5,
  minimum_proportion = 0.5
)
#> No group or design set. Assuming all samples belong to one group.
#> class: SummarizedExperiment 
#> dim: 228 5 
#> metadata(0):
#> assays(1): count
#> rownames(228): ABCB4 ABCB9 ... ZNF324 ZNF442
#> rowData names(2): entrez .abundant
#> colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
#> colData names(5): Cell.type time condition days dead

# Using a factor of interest
se_mini |> keep_abundant(factor_of_interest = condition)
#> class: SummarizedExperiment 
#> dim: 394 5 
#> metadata(0):
#> assays(1): count
#> rownames(394): ABCB4 ABCB9 ... ZNF324 ZNF442
#> rowData names(2): entrez .abundant
#> colnames(5): SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
#> colData names(5): Cell.type time condition days dead