Identifies transcripts/genes that are consistently expressed above a threshold across samples. This function adds a logical column `.abundant` to indicate which features pass the filtering criteria.

identify_abundant(
  .data,
  abundance = assayNames(.data)[1],
  design = NULL,
  formula_design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7,
  minimum_count_per_million = NULL,
  factor_of_interest = NULL,
  ...,
  .abundance = NULL
)

# S4 method for class 'SummarizedExperiment'
identify_abundant(
  .data,
  abundance = assayNames(.data)[1],
  design = NULL,
  formula_design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7,
  minimum_count_per_million = NULL,
  factor_of_interest = NULL,
  ...,
  .abundance = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
identify_abundant(
  .data,
  abundance = assayNames(.data)[1],
  design = NULL,
  formula_design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7,
  minimum_count_per_million = NULL,
  factor_of_interest = NULL,
  ...,
  .abundance = NULL
)

Arguments

.data

A `tbl` or `SummarizedExperiment` object containing transcript/gene abundance data

abundance

The name of the transcript/gene abundance column (character, preferred)

design

A design matrix for more complex experimental designs. If provided, this is passed to filterByExpr instead of factor_of_interest.

formula_design

...

minimum_counts

...

minimum_proportion

...

minimum_count_per_million

Minimum CPM cutoff to use for filtering (passed to CPM.Cutoff in filterByExpr). If provided, this will override the minimum_counts parameter. Default is NULL (uses edgeR default).

factor_of_interest

The name of the column containing groups/conditions for filtering. Used by edgeR's filterByExpr to define sample groups. DEPRECATED: Use 'design' or 'formula_design' instead. This argument will be removed in a future release.

...

Further arguments.

.abundance

DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility)

Value

Returns the input object with an additional logical column `.abundant` indicating which features passed the abundance threshold criteria.

A `SummarizedExperiment` object

A `SummarizedExperiment` object

Details

`r lifecycle::badge("maturing")`

This function uses edgeR's filterByExpr() function to identify consistently expressed features. A feature is considered abundant if it has CPM > minimum_counts in at least minimum_proportion of samples in at least one experimental group (defined by factor_of_interest or design).

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. DOI: 10.1093/bioinformatics/btp616

Examples

## Load airway dataset for examples

  data('airway', package = 'airway')
  # Ensure a 'condition' column exists for examples expecting it

    SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex


# Basic usage
airway |> identify_abundant()
#> Warning: All samples appear to belong to the same group.
#> # A SummarizedExperiment-tibble abstraction: 509,416 × 25
#> # Features=63677 | Samples=8 | Assays=counts
#>    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
#>    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG00000000005 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> #   condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> #   gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> #   seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> #   .abundant <lgl>, GRangesList <list>

# With custom thresholds
airway |> identify_abundant(
  minimum_counts = 5,
  minimum_proportion = 0.5
)
#> Warning: All samples appear to belong to the same group.
#> # A SummarizedExperiment-tibble abstraction: 509,416 × 25
#> # Features=63677 | Samples=8 | Assays=counts
#>    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
#>    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG00000000005 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> #   condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> #   gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> #   seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> #   .abundant <lgl>, GRangesList <list>

# Using a factor of interest
airway |> identify_abundant(factor_of_interest = condition)
#> Warning: The `factor_of_interest` argument of `identify_abundant()` is deprecated as of
#> tidybulk 2.0.0.
#>  Please use the `formula_design` argument instead.
#>  The argument 'factor_of_interest' is deprecated and will be removed in a
#>   future release. Please use the 'design' or 'formula_design' argument instead.
#> # A SummarizedExperiment-tibble abstraction: 509,416 × 25
#> # Features=63677 | Samples=8 | Assays=counts
#>    .feature        .sample   counts SampleName cell  dex   albut Run   avgLength
#>    <chr>           <chr>      <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG00000000003 SRR10395…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG00000000005 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG00000000419 SRR10395…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG00000000457 SRR10395…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG00000000460 SRR10395…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG00000000938 SRR10395…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG00000000971 SRR10395…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG00000001036 SRR10395…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG00000001084 SRR10395…    519 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG00000001167 SRR10395…    394 GSM1275862 N613… untrt untrt SRR1…       126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> #   condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> #   gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> #   seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> #   .abundant <lgl>, GRangesList <list>