test_gene_enrichment() takes as input a `tbl` (with at least three columns: sample, feature and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to a tibble with library(tidySummarizedExperiment)) and returns a `tbl` of gene set information.

test_gene_enrichment(
  .data,
  .formula,
  .entrez,
  .abundance = NULL,
  contrasts = NULL,
  methods = c("camera", "roast", "safe", "gage", "padog", "globaltest", "ora"),
  gene_sets = c("h", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "kegg_disease",
    "kegg_metabolism", "kegg_signaling"),
  species,
  cores = parallel::detectCores(),
  method = NULL,
  .contrasts = NULL
)

# S4 method for class 'SummarizedExperiment'
test_gene_enrichment(
  .data,
  .formula,
  .entrez,
  .abundance = NULL,
  contrasts = NULL,
  methods = c("camera", "roast", "safe", "gage", "padog", "globaltest", "ora"),
  gene_sets = c("h", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "kegg_disease",
    "kegg_metabolism", "kegg_signaling"),
  species,
  cores = parallel::detectCores(),
  method = NULL,
  .contrasts = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
test_gene_enrichment(
  .data,
  .formula,
  .entrez,
  .abundance = NULL,
  contrasts = NULL,
  methods = c("camera", "roast", "safe", "gage", "padog", "globaltest", "ora"),
  gene_sets = c("h", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "kegg_disease",
    "kegg_metabolism", "kegg_signaling"),
  species,
  cores = parallel::detectCores(),
  method = NULL,
  .contrasts = NULL
)

Arguments

.data

A `tbl` (with at least three columns: sample, feature and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to a tibble with library(tidySummarizedExperiment))

.formula

A formula with no response variable, representing the desired linear model

.entrez

The name of the column holding the Entrez IDs of the transcripts/genes

.abundance

The name of the transcript/gene abundance column

contrasts

This parameter takes the format of the contrast parameter of the method of choice. For edgeR and limma-voom it is a character vector. For DESeq2 it is a list containing a character vector of length three. The first covariate is the one the model is tested against (e.g., ~ factor_of_interest).
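
As a rough illustration of the two contrast formats described above, the sketch below builds a no-intercept design with base R and writes a matching contrast in each style. The `samples` data frame and the `trt`/`untrt` levels are assumptions chosen to mirror the airway example; the snippet only constructs the objects and does not call any testing function.

```r
# Hypothetical sample annotation (assumption: two levels, as in airway's 'dex')
samples <- data.frame(condition = factor(c("trt", "untrt", "trt", "untrt")))

# A no-intercept design yields one column per factor level
design <- model.matrix(~ 0 + condition, data = samples)
colnames(design)

# edgeR / limma-voom style: a character vector written over design column names
contrasts_limma <- c("conditiontrt - conditionuntrt")

# DESeq2 style: a list holding a character vector of length three
# (factor name, numerator level, denominator level)
contrasts_deseq2 <- list(c("condition", "trt", "untrt"))
```

The character-vector form only makes sense if the names it uses match the design columns, which is why the design is built first here.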

methods

A character vector. Use either one method or three or more (EGSEA currently errors if exactly two are given). Run EGSEA::egsea.base() to see the supported GSE methods.
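
The one-or-at-least-three constraint can be checked before launching a long run. The helper below is a minimal sketch; `check_methods_count()` is a hypothetical name and is not part of tidybulk or EGSEA.

```r
# Hypothetical helper (not part of tidybulk or EGSEA): fail early when the
# number of requested methods is zero or exactly two.
check_methods_count <- function(methods) {
  n <- length(methods)
  if (n == 0L) stop("Provide at least one method.")
  if (n == 2L) stop("Provide one method, or three or more; exactly two is not supported.")
  invisible(methods)
}

check_methods_count("camera")                      # passes
check_methods_count(c("roast", "safe", "gage"))    # passes
two_methods <- try(check_methods_count(c("camera", "roast")), silent = TRUE)
inherits(two_methods, "try-error")                 # the two-method case errors
```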

gene_sets

A character vector or a list. It can take one or more of the following built-in collections as a character vector: c("h", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "kegg_disease", "kegg_metabolism", "kegg_signaling"), to be used with EGSEA buildIdx. c1 is human specific. Alternatively, a list of user-supplied gene sets can be provided, to be used with EGSEA buildCustomIdx. In that case, each gene set is a character vector of Entrez IDs and the names of the list are the gene set names.
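
A user-supplied collection, as described above, is a named list of character vectors of Entrez IDs. The sketch below shows that shape with a basic sanity check; the set names and the grouping of IDs are illustrative assumptions, not curated gene sets.

```r
# Illustrative custom gene sets (assumption: names and groupings are made up).
# Each element is a character vector of Entrez IDs; list names are set names.
custom_gene_sets <- list(
  my_set_a = c("7157", "672", "675"),
  my_set_b = c("1956", "2064", "3845")
)

# Basic shape checks before passing the list as `gene_sets`
stopifnot(
  is.list(custom_gene_sets),
  !is.null(names(custom_gene_sets)),
  all(nzchar(names(custom_gene_sets))),
  all(vapply(custom_gene_sets, is.character, logical(1)))
)
```

Keeping the IDs as character rather than numeric avoids mismatches against rownames, which are always character.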

species

A character. It can be human, mouse or rat.

cores

An integer. The number of cores available
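
Since the default is parallel::detectCores(), which can return NA on some platforms and otherwise claims every core, it may be worth computing an explicit value. The following is one possible approach; leaving one core free is a convention, not a requirement of the function.

```r
# detectCores() may return NA when the core count cannot be determined
detected <- parallel::detectCores()

# Fall back to a single core on NA; otherwise leave one core free,
# never going below one
cores <- if (is.na(detected)) 1L else max(1L, detected - 1L)
```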

method

DEPRECATED. Please use methods.

.contrasts

DEPRECATED. Please use contrasts. This parameter takes the format of the contrast parameter of the method of choice. For edgeR and limma-voom it is a character vector. For DESeq2 it is a list containing a character vector of length three. The first covariate is the one the model is tested against (e.g., ~ factor_of_interest).

Value

An object consistent with the input

Details

Lifecycle: maturing

This wrapper executes ensemble gene enrichment analyses of the dataset using EGSEA (DOI:10.12688/f1000research.12544.1).

dge =
  data |>
  keep_abundant(
    factor_of_interest = !!as.symbol(parse_formula(.formula)[[1]]),
    !!.sample, !!.entrez, !!.abundance
  ) |>

  # Make sure transcript names are adjacent
  [...] |>
  as_matrix(rownames = !!.entrez) |>
  edgeR::DGEList(counts = .)

idx = buildIdx(
  entrezIDs = rownames(dge),
  species = species,
  msigdb.gsets = msigdb.gsets,
  kegg.exclude = kegg.exclude
)

dge |>

  # Calculate weights
  limma::voom(design, plot = FALSE) |>

  # Execute EGSEA
  egsea(
    contrasts = my_contrasts,
    baseGSEAs = methods,
    gs.annots = idx,
    sort.by = "med.rank",
    num.threads = cores,
    report = FALSE
  )

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

Alhamdoosh, M., Ng, M., Wilson, N. J., Sheridan, J. M., Huynh, H., Wilson, M. J., & Ritchie, M. E. (2017). Combining multiple tools outperforms individual methods in gene set enrichment analyses. Bioinformatics, 33(3), 414-424. doi:10.1093/bioinformatics/btw623

Examples

## Load airway dataset for examples

data('airway', package = 'airway')

# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex


library(tidySummarizedExperiment)

library("EGSEA")
#> Loading required package: gage
#> Loading required package: topGO
#> Loading required package: graph
#> Loading required package: GO.db
#> Loading required package: SparseM
#> 
#> groupGOTerms: 	GOBPTerm, GOMFTerm, GOCCTerm environments built.
#> 
#> Attaching package: ‘topGO’
#> The following object is masked from ‘package:gage’:
#> 
#>     geneData
#> The following object is masked from ‘package:IRanges’:
#> 
#>     members
#> Loading required package: pathview

if (FALSE) { # \dontrun{
  # Reuse the feature identifier as the Entrez column and collapse duplicates
  df_entrez <- airway |>
    mutate(entrez = .feature) |>
    aggregate_duplicates(.transcript = entrez)

  # The airway assay is named 'counts'
  test_gene_enrichment(
    df_entrez,
    ~ condition,
    .entrez = entrez,
    .abundance = counts,
    methods = c("roast", "safe", "gage", "padog", "globaltest", "ora"),
    gene_sets = c("h", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "kegg_disease", "kegg_metabolism", "kegg_signaling"),
    species = "human",
    cores = 2
  )
} # }