Filters the data to keep only transcripts/genes that are consistently expressed above a threshold across samples. This is the filtering counterpart of identify_abundant(): rather than only marking low-abundance features in an .abundant column, it removes them from the data directly.
keep_abundant(
  .data,
  abundance = assayNames(.data)[1],
  design = NULL,
  formula_design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7,
  minimum_count_per_million = NULL,
  factor_of_interest = NULL,
  ...,
  .abundance = NULL
)
.data: A `tbl` or `SummarizedExperiment` object containing transcript/gene abundance data.
abundance: The name of the transcript/gene abundance column (character, preferred).
design: A design matrix for more complex experimental designs. If provided, this is passed to filterByExpr instead of factor_of_interest.
formula_design: A formula for creating the design matrix.
minimum_counts: The minimum count threshold for a feature to be considered abundant.
minimum_proportion: The minimum proportion of samples in which a feature must be abundant.
minimum_count_per_million: The minimum count per million threshold.
factor_of_interest: DEPRECATED. The name of the column containing groups/conditions for filtering. Use 'design' or 'formula_design' instead.
...: Further arguments.
.abundance: DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility).
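As an illustration of the design and formula_design arguments, a design matrix for a two-factor experiment can be built with base R's model.matrix(). This is a minimal sketch: the sample table and its column names (condition, batch) are hypothetical, and it assumes the rows of the matrix follow the same order as the samples in .data.

```r
# Hypothetical sample annotation: 6 samples, two conditions, two batches
samples <- data.frame(
  sample    = paste0("S", 1:6),
  condition = factor(c("untrt", "untrt", "untrt", "trt", "trt", "trt"),
                     levels = c("untrt", "trt")),
  batch     = factor(c("a", "b", "a", "b", "a", "b"))
)

# A formula describing the design: the kind of object `formula_design` takes
design_formula <- ~ condition + batch

# The corresponding design matrix: the kind of object `design` takes
design_matrix <- model.matrix(design_formula, data = samples)

# One row per sample, one column per model coefficient
dim(design_matrix)
colnames(design_matrix)
```

Either object could then be supplied to the function, for example keep_abundant(.data, formula_design = design_formula) or keep_abundant(.data, design = design_matrix).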
Returns a filtered version of the input object containing only the features that passed the abundance threshold criteria.
Filter to keep only abundant transcripts/genes
This function uses edgeR's filterByExpr() function to identify and keep consistently expressed features. A feature is kept if its counts per million (CPM) reach the cutoff implied by minimum_counts in enough samples, where "enough" is determined by minimum_proportion and the size of the smallest experimental group (defined by design, formula_design, or the deprecated factor_of_interest).
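The filtering rule can be sketched in base R. The function below is a simplified illustration of the CPM-based logic documented for edgeR's filterByExpr(), not the code that keep_abundant() runs. The helper name sketch_keep_abundant, the large_n argument, and the toy count matrix are hypothetical, and details such as the minimum total count filter and design-matrix handling are left out.

```r
# Simplified sketch of a filterByExpr()-style rule
# (hypothetical helper, not part of tidybulk or edgeR)
sketch_keep_abundant <- function(counts, group,
                                 minimum_counts = 10,
                                 minimum_proportion = 0.7,
                                 large_n = 10) {
  lib_size <- colSums(counts)

  # Counts per million for every feature in every sample
  cpm <- t(t(counts) / lib_size) * 1e6

  # Translate the count threshold into a CPM cutoff via the median library size
  cpm_cutoff <- minimum_counts / median(lib_size) * 1e6

  # Required number of samples: the smallest non-empty group size,
  # scaled by minimum_proportion only beyond large_n samples
  group_sizes <- table(group)
  min_n <- min(group_sizes[group_sizes > 0])
  if (min_n > large_n) {
    min_n <- large_n + (min_n - large_n) * minimum_proportion
  }

  # Keep features that reach the cutoff in at least min_n samples
  rowSums(cpm >= cpm_cutoff) >= min_n
}

# Toy data: 3 features, 4 samples in two groups of two
counts <- matrix(
  c(500, 480, 520, 510,   # gene_a: expressed in all samples
      0,   1,   0,   2,   # gene_b: essentially absent
     40,  35,   0,   1),  # gene_c: expressed in one group only
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), paste0("S", 1:4))
)
group <- c("untrt", "untrt", "trt", "trt")

keep <- sketch_keep_abundant(counts, group)
filtered_counts <- counts[keep, , drop = FALSE]
rownames(filtered_counts)
```

In this toy example gene_c is retained even though it is absent from one group, because it reaches the cutoff in as many samples as the smallest group contains.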
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. DOI: 10.1093/nar/gks042
## Load airway dataset for examples
data('airway', package = 'airway')
# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex
# Basic usage
airway |> keep_abundant()
#> Warning: All samples appear to belong to the same group.
#> # A SummarizedExperiment-tibble abstraction: 113,792 × 25
#> # Features=14224 | Samples=8 | Assays=counts
#> .feature .sample counts SampleName cell dex albut Run avgLength
#> <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
#> 2 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
#> 3 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
#> 4 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
#> 5 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
#> 6 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
#> 7 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
#> 8 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
#> 9 ENSG00000001460 SRR10395… 172 GSM1275862 N613… untrt untrt SRR1… 126
#> 10 ENSG00000001461 SRR10395… 2112 GSM1275862 N613… untrt untrt SRR1… 126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> # condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> # gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> # seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> # .abundant <lgl>, GRangesList <list>
# With custom thresholds
airway |> keep_abundant(
  minimum_counts = 5,
  minimum_proportion = 0.5
)
#> Warning: All samples appear to belong to the same group.
#> # A SummarizedExperiment-tibble abstraction: 123,488 × 25
#> # Features=15436 | Samples=8 | Assays=counts
#> .feature .sample counts SampleName cell dex albut Run avgLength
#> <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
#> 2 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
#> 3 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
#> 4 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
#> 5 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
#> 6 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
#> 7 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
#> 8 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
#> 9 ENSG00000001460 SRR10395… 172 GSM1275862 N613… untrt untrt SRR1… 126
#> 10 ENSG00000001461 SRR10395… 2112 GSM1275862 N613… untrt untrt SRR1… 126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> # condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> # gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> # seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> # .abundant <lgl>, GRangesList <list>
# Using a factor of interest (deprecated; triggers the warning shown below)
airway |> keep_abundant(factor_of_interest = condition)
#> Warning: The `factor_of_interest` argument of `keep_abundant()` is deprecated as of
#> tidybulk 2.0.0.
#> ℹ Please use the `formula_design` argument instead.
#> ℹ The argument 'factor_of_interest' is deprecated and will be removed in a
#> future release. Please use the 'design' or 'formula_design' argument instead.
#> # A SummarizedExperiment-tibble abstraction: 127,408 × 25
#> # Features=15926 | Samples=8 | Assays=counts
#> .feature .sample counts SampleName cell dex albut Run avgLength
#> <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1… 126
#> 2 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1… 126
#> 3 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1… 126
#> 4 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1… 126
#> 5 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1… 126
#> 6 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1… 126
#> 7 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1… 126
#> 8 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1… 126
#> 9 ENSG00000001460 SRR10395… 172 GSM1275862 N613… untrt untrt SRR1… 126
#> 10 ENSG00000001461 SRR10395… 2112 GSM1275862 N613… untrt untrt SRR1… 126
#> # ℹ 40 more rows
#> # ℹ 16 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> # condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> # gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> # seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> # .abundant <lgl>, GRangesList <list>
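Because factor_of_interest is deprecated, the same grouping can be expressed with formula_design, as the warning above suggests. The sketch below assumes the tidybulk and airway packages are installed (the call is skipped otherwise) and that the formula refers to the condition column created at the start of the examples. Output is omitted.

```r
# Design formula that replaces the deprecated `factor_of_interest = condition`
design_formula <- ~ condition

# Guarded so the snippet is skipped when the packages are not installed
if (requireNamespace("tidybulk", quietly = TRUE) &&
    requireNamespace("airway", quietly = TRUE)) {
  data("airway", package = "airway")
  SummarizedExperiment::colData(airway)$condition <-
    SummarizedExperiment::colData(airway)$dex

  # Same filtering as the factor_of_interest example, without the deprecation warning
  airway_abundant <- airway |>
    tidybulk::keep_abundant(formula_design = design_formula)
}
```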