Impute transcript abundance if missing from sample-transcript pairs

impute_missing_abundance() takes as input a `tbl` (with at least three columns for sample, feature and transcript abundance) or a `SummarizedExperiment` (more convenient if abstracted to a tibble with library(tidySummarizedExperiment)) and returns an object of the same type as the input, with the missing sample-transcript pairs added and their transcript abundance imputed.

impute_missing_abundance(
  .data,
  .formula,
  suffix = "",
  force_scaling = FALSE,
  ...,
  abundance = assayNames(.data)[1],
  .abundance = NULL
)

# S4 method for class 'SummarizedExperiment'
impute_missing_abundance(
  .data,
  .formula,
  suffix = "",
  force_scaling = FALSE,
  ...,
  abundance = assayNames(.data)[1],
  .abundance = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
impute_missing_abundance(
  .data,
  .formula,
  suffix = "",
  force_scaling = FALSE,
  ...,
  abundance = assayNames(.data)[1],
  .abundance = NULL
)

Arguments

.data: A `tbl` (with at least three columns for sample, feature and transcript abundance) or `SummarizedExperiment` (more convenient if abstracted to tibble with library(tidySummarizedExperiment))
.formula: A formula with no response variable, representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation (of the kind ~ factor_of_interest + batch)
suffix: A character string. This is added to the imputed count column names. If empty, the count columns are overwritten.
force_scaling: A boolean. If an abundance-containing column is not scaled (scaled columns carry the _scaled suffix), setting force_scaling = TRUE scales it by library size, to compensate for a possible difference in sequencing depth.
...: Further arguments.
abundance: The name of the transcript/gene abundance column (character, preferred)
.abundance: DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility)
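
The library-size scaling mentioned under force_scaling can be illustrated with a minimal base R sketch. This is not tidybulk's internal code: the toy matrix, the variable names, and the choice of the mean library size as the common target are all assumptions made for illustration.

```r
# Toy count matrix: rows are features, columns are samples (made-up values).
counts <- matrix(
  c(10, 20, 30,
    100, 200, 300),
  nrow = 3,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), c("s1", "s2"))
)

# Library size is the total count per sample.
library_size <- colSums(counts)

# Rescale every sample to the mean library size (an assumed target,
# not necessarily the one tidybulk uses).
scaling_factor <- mean(library_size) / library_size
scaled_counts <- sweep(counts, 2, scaling_factor, `*`)

# After scaling, all samples have the same total.
colSums(scaled_counts)
```

Without a step of this kind, a median taken across samples with very different sequencing depths may mostly reflect depth rather than expression, which is the concern the function's message raises.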

Value

An object of the same type as the input, with non-sparse abundance (every sample-transcript pair is present)

A `SummarizedExperiment` object
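
One way to read "non-sparse" for a long-format table is that every sample-feature pair has a row. The sketch below checks that property in base R. The helper `is_complete()` and the column names `sample` and `feature` are hypothetical and are not part of tidybulk.

```r
# Hypothetical helper: TRUE when every sample-feature pair has exactly one row.
is_complete <- function(tbl, sample_col = "sample", feature_col = "feature") {
  pair_counts <- table(tbl[[sample_col]], tbl[[feature_col]])
  all(pair_counts == 1)
}

# Made-up data: gene_b has no row for sample s2.
sparse <- data.frame(
  sample = c("s1", "s2", "s1"),
  feature = c("gene_a", "gene_a", "gene_b"),
  count = c(5, 7, 10)
)

# Adding the missing pair makes the table complete.
dense <- rbind(sparse, data.frame(sample = "s2", feature = "gene_b", count = 8))

is_complete(sparse)
is_complete(dense)
```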

Details

Lifecycle: maturing

This function imputes the abundance of missing sample-transcript pairs using the median of the sample group defined by the formula.
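
The idea of filling a missing pair with the median of its sample group can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy data, the column names, and the grouping variable `dex` (standing in for the factor in `.formula`) are made up.

```r
# Toy long-format table with one missing sample-transcript pair:
# gene_b has no row for sample s3. All values are made up.
abundance <- data.frame(
  sample = c("s1", "s2", "s3", "s4", "s1", "s2", "s4"),
  feature = c(rep("gene_a", 4), rep("gene_b", 3)),
  count = c(5, 7, 9, 11, 10, 30, 50)
)
groups <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  dex = c("trt", "trt", "trt", "untrt")
)

# Build every sample-feature combination, then attach observed counts and groups.
full <- expand.grid(
  sample = unique(abundance$sample),
  feature = unique(abundance$feature),
  stringsAsFactors = FALSE
)
full <- merge(full, abundance, all.x = TRUE)
full <- merge(full, groups)

# Median of the observed counts for each feature within each group.
group_median <- ave(
  full$count, full$feature, full$dex,
  FUN = function(x) median(x, na.rm = TRUE)
)

# Flag the pairs that were missing, then fill them with the group median.
full$imputed <- is.na(full$count)
full$count[full$imputed] <- group_median[full$imputed]

full[full$imputed, ]
```

Here the only imputed row is gene_b in sample s3, which receives the median of the other `trt` samples for that gene (10 and 30).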

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

Examples

## Load airway dataset for examples
library(tidybulk)
library(airway)
data(airway)

# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex

# Keep a small subset (100 features, 5 samples) so the example runs quickly
airway <- airway[1:100, 1:5]

airway |>
  impute_missing_abundance(.formula = ~ dex)
#> tidybulk says: counts appears not to be scaled for sequencing depth (missing _scaled suffix; if you think this column is idependent of sequencing depth ignore this message), therefore the imputation can produce non meaningful results if sequencing depth for samples are highly variable. If you use force_scaling = TRUE library size will be used for eliminatig some sequencig depth effect before imputation
#> # A SummarizedExperiment-tibble abstraction: 500 × 25
#> # Features=100 | Samples=5 | Assays=counts, .imputed
#>    .feature .sample counts .imputed SampleName cell  dex   albut Run   avgLength
#>    <chr>    <chr>    <dbl>    <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG000… SRR103…    679        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG000… SRR103…      0        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG000… SRR103…    467        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG000… SRR103…    260        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG000… SRR103…     60        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG000… SRR103…      0        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG000… SRR103…   3251        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG000… SRR103…   1433        0 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG000… SRR103…    519        0 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG000… SRR103…    394        0 GSM1275862 N613… untrt untrt SRR1…       126
#> # ℹ 40 more rows
#> # ℹ 15 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
#> #   condition <fct>, gene_id <chr>, gene_name <chr>, entrezid <int>,
#> #   gene_biotype <chr>, gene_seq_start <int>, gene_seq_end <int>,
#> #   seq_name <chr>, seq_strand <int>, seq_coord_system <int>, symbol <chr>,
#> #   GRangesList <list>