Resolve Complete Confounders of Non-Interest

This function identifies and resolves complete confounders among specified factors of non-interest within a `SummarizedExperiment` object. Complete confounders occur when the levels of one factor are entirely predictable based on the levels of another factor. Such relationships can interfere with downstream analyses by introducing redundancy or collinearity.
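To make the collinearity concern concrete, here is a minimal base R sketch. It does not call tidybulk; it only checks whether an additive design matrix built from two toy factors is rank deficient. The data frame `annot` and the variable names are assumptions made for illustration.

```r
# Toy sample annotations with two factors of non-interest.
# Level a3 of A and level b3 of B both occur only in the last sample.
annot <- data.frame(
  A = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3")
)

# Build the additive design matrix and compare its rank to its column count.
design <- model.matrix(~ A + B, data = annot)
design_rank <- qr(design)$rank

# TRUE when at least one coefficient cannot be estimated.
is_rank_deficient <- design_rank < ncol(design)
is_rank_deficient
```

In this toy case the indicator columns for `a3` and `b3` are identical, so a linear model cannot separate their effects.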

resolve_complete_confounders_of_non_interest(se, ...)

Arguments

se: A `SummarizedExperiment` object. This object contains assay data, row data (e.g., gene annotations), and column data (e.g., sample annotations).
...: Factors of non-interest (column names from `colData(se)`) to examine for complete confounders.

Value

A `SummarizedExperiment` object with resolved confounders. The object retains its structure, including assays and metadata, but the column data (`colData`) is updated with new "___altered" columns containing the resolved factors.

Details

The function systematically examines pairs of specified factors and determines whether they are completely confounded. If a pair of factors is found to be confounded, one of the factors is adjusted or removed to resolve the issue. The adjusted `SummarizedExperiment` object is returned with its assays and metadata preserved; the changes are confined to the column data, where the resolved factors are stored.

Complete confounders of non-interest can create dependencies between variables that may bias statistical models or violate their assumptions. This function systematically addresses this by:

1. Creating new columns with the suffix "___altered" for each specified factor to preserve original values
2. Identifying pairs of factors in the specified columns that are fully confounded
3. Resolving confounding by adjusting one of the factors in the "___altered" columns
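The three steps above can be sketched in base R on a plain data frame. This is an illustration under stated assumptions, not the tidybulk implementation: the helper `find_exclusive_pairs()`, the criterion it uses (two levels that only ever occur together), and the rule of recoding to the most frequent level are all hypothetical.

```r
annot <- data.frame(
  A = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  stringsAsFactors = FALSE
)

# Step 1: copy each factor into a "___altered" column; originals stay untouched.
for (f in c("A", "B")) {
  annot[[paste0(f, "___altered")]] <- annot[[f]]
}

# Step 2 (hypothetical criterion): find level pairs that only ever occur
# together, i.e. the level of x is seen with a single level of y and vice versa.
find_exclusive_pairs <- function(x, y) {
  tab <- table(x, y) > 0
  hits <- which(tab, arr.ind = TRUE)
  keep <- rowSums(tab)[hits[, 1]] == 1 & colSums(tab)[hits[, 2]] == 1
  data.frame(
    x = rownames(tab)[hits[keep, 1]],
    y = colnames(tab)[hits[keep, 2]],
    stringsAsFactors = FALSE
  )
}
confounded <- find_exclusive_pairs(annot[["A___altered"]], annot[["B___altered"]])

# Step 3 (hypothetical rule): recode the confounded level of the second factor
# to that factor's most frequent level.
most_frequent <- names(which.max(table(annot[["B___altered"]])))
is_confounded <- annot[["B___altered"]] %in% confounded$y
annot[["B___altered"]][is_confounded] <- most_frequent
```

Working on copies keeps `A` and `B` available for inspection, while the adjusted copies can serve as nuisance covariates.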

The function creates new columns with the "___altered" suffix to store the modified values while preserving the original data. This allows users to compare the original and adjusted values if needed.
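One way to compare original and adjusted values is to filter for rows where the two columns differ. The sketch below uses a plain data frame whose values are copied from the Examples section of this page; with a real result, something like `as.data.frame(colData(se_resolved))` should give a comparable data frame. The name `col_data` is an assumption.

```r
# Column data as a plain data.frame, with values copied from the example output.
col_data <- data.frame(
  sample_id = paste0("Sample", 1:9),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  B___altered = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b1"),
  stringsAsFactors = FALSE
)

# Rows where the adjusted value differs from the original one.
changed <- col_data[col_data$B != col_data$B___altered, ]
changed
```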

The resolution strategy depends on the analysis context and can be modified in the helper function `resolve_complete_confounders_of_non_interest_pair_SE()`. By default, the function adjusts one of the confounded factors in the "___altered" columns.

Examples

# Load necessary libraries
library(SummarizedExperiment)
library(dplyr)

# Sample annotations
sample_annotations <- data.frame(
  sample_id = paste0("Sample", seq(1, 9)),
  factor_of_interest = c(rep("treated", 4), rep("untreated", 5)),
  A = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  C = c("c1", "c1", "c1", "c1", "c1", "c1", "c1", "c1", "c3"),
  stringsAsFactors = FALSE
)

# Simulated assay data
assay_data <- matrix(rnorm(100 * 9), nrow = 100, ncol = 9)

# Row data (e.g., gene annotations)
row_data <- data.frame(gene_id = paste0("Gene", seq_len(100)))

# Create SummarizedExperiment object
se <- SummarizedExperiment(
  assays = list(counts = assay_data),
  rowData = row_data,
  colData = DataFrame(sample_annotations)
)

# Apply the function to resolve confounders
se_resolved <- resolve_complete_confounders_of_non_interest(se, A, B, C)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> ℹ The deprecated feature was likely used in the tidybulk package.
#>   Please report the issue at <https://github.com/stemangiola/tidybulk/issues>.
#> tidybulk says: New columns created with resolved confounders: A___altered, B___altered, C___altered
#> tidybulk says: IMPORTANT! the columns A___altered and B___altered, have been corrected for complete confounders and now are NOT interpretable. 
#>       They cannot be used in hypothesis testing. However they can be used in the model to capture the unwanted variability in the data.
#> tidybulk says: The value(s) b3 in column B___altered from sample(s) rowid, has been changed to b1.
#> tidybulk says: IMPORTANT! the columns A___altered and C___altered, have been corrected for complete confounders and now are NOT interpretable. 
#>       They cannot be used in hypothesis testing. However they can be used in the model to capture the unwanted variability in the data.
#> tidybulk says: The value(s) c3 in column C___altered from sample(s) rowid, has been changed to c1.
#> Warning: tidybulk says: The following columns have only one unique value and cannot be estimated by a linear model: C___altered

# View the updated column data
colData(se_resolved)
#> DataFrame with 9 rows and 8 columns
#>     sample_id factor_of_interest           A           B           C
#>   <character>        <character> <character> <character> <character>
#> 1     Sample1            treated          a1          b1          c1
#> 2     Sample2            treated          a2          b1          c1
#> 3     Sample3            treated          a1          b2          c1
#> 4     Sample4            treated          a2          b1          c1
#> 5     Sample5          untreated          a1          b1          c1
#> 6     Sample6          untreated          a2          b1          c1
#> 7     Sample7          untreated          a1          b2          c1
#> 8     Sample8          untreated          a2          b1          c1
#> 9     Sample9          untreated          a3          b3          c3
#>   A___altered B___altered C___altered
#>   <character> <character> <character>
#> 1          a1          b1          c1
#> 2          a2          b1          c1
#> 3          a1          b2          c1
#> 4          a2          b1          c1
#> 5          a1          b1          c1
#> 6          a2          b1          c1
#> 7          a1          b2          c1
#> 8          a2          b1          c1
#> 9          a3          b1          c1
