R/methods.R
resolve_complete_confounders_of_non_interest.Rd
This function identifies and resolves complete confounders among specified factors of non-interest within a `SummarizedExperiment` object. Complete confounders occur when the levels of one factor are entirely predictable based on the levels of another factor. Such relationships can interfere with downstream analyses by introducing redundancy or collinearity.
resolve_complete_confounders_of_non_interest(se, ...)
A `SummarizedExperiment` object with resolved confounders. The object retains its structure, including assays and metadata, but the column data (`colData`) is updated with new "___altered" columns containing the resolved factors.
The function examines each pair of the specified factors and determines whether the two are completely confounded. If they are, one factor of the pair is adjusted in its "___altered" column to break the confounding. The returned `SummarizedExperiment` object preserves all assays, metadata, and original `colData` columns; only the "___altered" columns carry the adjusted values.
Complete confounders of non-interest can create dependencies between variables that may bias statistical models or violate their assumptions. This function systematically addresses this by:
1. Creating new columns with the suffix "___altered" for each specified factor to preserve original values
2. Identifying pairs of factors in the specified columns that are fully confounded
3. Resolving confounding by adjusting one of the factors in the "___altered" columns
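The three steps above can be sketched in base R for a single pair of factors. This is a minimal illustration, not the tidybulk implementation: `resolve_pair_sketch()` is a hypothetical helper, the detection rule (a sample whose level is unique in both factors) is a deliberately simplified criterion, and recoding to the most frequent level is an assumed resolution strategy chosen because it agrees with the messages in the example output.

```r
# Hypothetical helper for illustration only; not part of tidybulk.
resolve_pair_sketch <- function(df, f1, f2) {
  alt1 <- paste0(f1, "___altered")
  alt2 <- paste0(f2, "___altered")

  # Step 1: copy each factor into an "___altered" column, keeping the original
  if (!alt1 %in% names(df)) df[[alt1]] <- as.character(df[[f1]])
  if (!alt2 %in% names(df)) df[[alt2]] <- as.character(df[[f2]])

  # Step 2 (simplified criterion, an assumption): flag samples whose level
  # occurs exactly once in both factors, so the two levels always co-occur
  n1 <- as.vector(table(df[[alt1]])[df[[alt1]]])
  n2 <- as.vector(table(df[[alt2]])[df[[alt2]]])
  confounded <- n1 == 1 & n2 == 1

  # Step 3 (assumed strategy): recode the second factor of the flagged
  # samples to that factor's most frequent level
  most_frequent <- names(which.max(table(df[[alt2]])))
  df[[alt2]][confounded] <- most_frequent
  df
}

annotations <- data.frame(
  A = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  stringsAsFactors = FALSE
)

resolved <- resolve_pair_sketch(annotations, "A", "B")
resolved[, c("A", "B", "A___altered", "B___altered")]
```

In this sketch only the last sample is flagged, because `a3` and `b3` each occur once and always together; its `B___altered` value becomes `b1`, while columns `A` and `B` are left untouched.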
The function creates new columns with the "___altered" suffix to store the modified values while preserving the original data. This allows users to compare the original and adjusted values if needed.
The resolution strategy depends on the analysis context and can be modified in the helper function `resolve_complete_confounders_of_non_interest_pair_SE()`. By default, the function adjusts one of the confounded factors in the "___altered" columns.
SummarizedExperiment for creating and handling `SummarizedExperiment` objects.
# Load necessary libraries
library(SummarizedExperiment)
library(dplyr)
# Sample annotations
sample_annotations <- data.frame(
  sample_id = paste0("Sample", seq(1, 9)),
  factor_of_interest = c(rep("treated", 4), rep("untreated", 5)),
  A = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  C = c("c1", "c1", "c1", "c1", "c1", "c1", "c1", "c1", "c3"),
  stringsAsFactors = FALSE
)
# Simulated assay data (100 genes x 9 samples); the seed makes the values reproducible
set.seed(42)
assay_data <- matrix(rnorm(100 * 9), nrow = 100, ncol = 9)
# Row data (e.g., gene annotations)
row_data <- data.frame(gene_id = paste0("Gene", seq_len(100)))
# Create SummarizedExperiment object
se <- SummarizedExperiment(
  assays = list(counts = assay_data),
  rowData = row_data,
  colData = DataFrame(sample_annotations)
)
# Apply the function to resolve confounders
se_resolved <- resolve_complete_confounders_of_non_interest(se, A, B, C)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> ℹ The deprecated feature was likely used in the tidybulk package.
#> Please report the issue at <https://github.com/stemangiola/tidybulk/issues>.
#> tidybulk says: New columns created with resolved confounders: A___altered, B___altered, C___altered
#> tidybulk says: IMPORTANT! the columns A___altered and B___altered, have been corrected for complete confounders and now are NOT interpretable.
#> They cannot be used in hypothesis testing. However they can be used in the model to capture the unwanted variability in the data.
#> tidybulk says: The value(s) b3 in column B___altered from sample(s) rowid, has been changed to b1.
#> tidybulk says: IMPORTANT! the columns A___altered and C___altered, have been corrected for complete confounders and now are NOT interpretable.
#> They cannot be used in hypothesis testing. However they can be used in the model to capture the unwanted variability in the data.
#> tidybulk says: The value(s) c3 in column C___altered from sample(s) rowid, has been changed to c1.
#> Warning: tidybulk says: The following columns have only one unique value and cannot be estimated by a linear model: C___altered
# View the updated column data
colData(se_resolved)
#> DataFrame with 9 rows and 8 columns
#>     sample_id factor_of_interest           A           B           C
#>   <character>        <character> <character> <character> <character>
#> 1     Sample1            treated          a1          b1          c1
#> 2     Sample2            treated          a2          b1          c1
#> 3     Sample3            treated          a1          b2          c1
#> 4     Sample4            treated          a2          b1          c1
#> 5     Sample5          untreated          a1          b1          c1
#> 6     Sample6          untreated          a2          b1          c1
#> 7     Sample7          untreated          a1          b2          c1
#> 8     Sample8          untreated          a2          b1          c1
#> 9     Sample9          untreated          a3          b3          c3
#>   A___altered B___altered C___altered
#>   <character> <character> <character>
#> 1          a1          b1          c1
#> 2          a2          b1          c1
#> 3          a1          b2          c1
#> 4          a2          b1          c1
#> 5          a1          b1          c1
#> 6          a2          b1          c1
#> 7          a1          b2          c1
#> 8          a2          b1          c1
#> 9          a3          b1          c1
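One way to see why the adjusted columns matter is to compare the rank of a design matrix built from the original factors with one built from the "___altered" factors. The sketch below uses only base R and re-types the `A`, `B`, and `B___altered` values shown in the output above into a plain data frame; `is_full_rank()` is a hypothetical helper introduced for illustration, not a tidybulk function.

```r
# Values copied from the colData shown in the example output
annotations <- data.frame(
  A           = c("a1", "a2", "a1", "a2", "a1", "a2", "a1", "a2", "a3"),
  B           = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b3"),
  B___altered = c("b1", "b1", "b2", "b1", "b1", "b1", "b2", "b1", "b1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a design is full rank when its QR rank
# equals its number of columns
is_full_rank <- function(design) qr(design)$rank == ncol(design)

# Original factors: the indicator columns for a3 and b3 are identical,
# so the design is rank deficient
design_original <- model.matrix(~ A + B, data = annotations)
is_full_rank(design_original)
#> [1] FALSE

# Adjusted factor: b3 no longer exists, so every coefficient is estimable
design_altered <- model.matrix(~ A + B___altered, data = annotations)
is_full_rank(design_altered)
#> [1] TRUE
```

As the function's own messages state, the adjusted columns are suited to absorbing unwanted variability in a model rather than to hypothesis testing.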