Last updated: 2025-09-12
Checks: 7 passed, 0 failed
Knit directory: stability_selection/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20250906) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 017d47b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/sparse_overlap_setting.Rmd) and HTML (docs/sparse_overlap_setting.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 017d47b | Annie Xie | 2025-09-12 | Add exploration of sparse overlap setting |
In this analysis, we are interested in testing stability selection approaches in the sparse, overlapping setting.
At a high level, stability selection involves 1) splitting the data into two subsets, 2) applying the method to each subset, and 3) keeping the components that have high correspondence across the two sets of results. We will test two different approaches to step 1). The first approach is to split the data by columns. This approach feels intuitive since we are interested in the loadings matrix, which says something about the samples in the dataset. In a population genetics application, one could argue that all of the chromosomes are undergoing evolution independently, and so you could split the data by even vs. odd chromosomes to get two different datasets. However, in a single-cell RNA-seq application, it feels more natural to split the data by cells – this feels more like creating sub-datasets compared to splitting by genes (unless you want to assume that the genes are pulled from a “population”, but I think that feels less natural). This motivates the second approach: splitting the data by rows.
I imagine that splitting the columns will be better than splitting the rows in this setting because each sample is only active in a couple of factors. Therefore, splitting by samples may lead some factors to have weaker signal and thus be harder to find.
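For concreteness, here is a minimal sketch of what a split along either dimension might look like. This is only an illustrative stand-in; the analysis below uses the sourced stability_selection_split_data function, whose internals may differ.
# Illustrative sketch only (assumption): randomly partition the columns
# (or rows) of Y into two disjoint halves.
split_data_sketch <- function(Y, dim = c("columns", "rows")) {
  dim <- match.arg(dim)
  if (dim == "columns") {
    idx <- sample(seq_len(ncol(Y)), floor(ncol(Y) / 2))
    list(Y[, idx, drop = FALSE], Y[, -idx, drop = FALSE])
  } else {
    idx <- sample(seq_len(nrow(Y)), floor(nrow(Y) / 2))
    list(Y[idx, , drop = FALSE], Y[-idx, , drop = FALSE])
  }
}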
library(dplyr)
library(ggplot2)
library(pheatmap)
source('code/visualization_functions.R')
source('code/stability_selection_functions.R')
permute_L <- function(est, truth){
  K_est <- ncol(est)
  K_truth <- ncol(truth)
  n <- nrow(est)
  # If the estimate and truth have different numbers of columns, pad the
  # narrower matrix with zero columns (which have zero cosine similarity
  # with every column) so that the assignment problem is square
  if (K_est < K_truth){
    est <- cbind(est, matrix(0, nrow = n, ncol = K_truth - K_est))
  }
  if (K_est > K_truth){
    truth <- cbind(truth, matrix(0, nrow = n, ncol = K_est - K_truth))
  }
  # Normalize the columns of est and truth; zero columns are given norm Inf
  # so that they normalize to zero rather than NaN
  norms_est <- apply(est, 2, function(x){sqrt(sum(x^2))})
  norms_est[norms_est == 0] <- Inf
  norms_truth <- apply(truth, 2, function(x){sqrt(sum(x^2))})
  norms_truth[norms_truth == 0] <- Inf
  est_normalized <- t(t(est)/norms_est)
  truth_normalized <- t(t(truth)/norms_truth)
  # Compute the matrix of absolute cosine similarities and solve the
  # assignment problem to match estimated columns to true columns
  cosine_sim_matrix <- abs(crossprod(est_normalized, truth_normalized))
  assignment_problem <- lpSolve::lp.assign(t(cosine_sim_matrix), direction = "max")
  perm <- apply(assignment_problem$solution, 1, which.max)
  return(est[,perm])
}
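As a quick sanity check of permute_L (an illustrative example I added, not part of the original analysis), a column-permuted copy of a loadings matrix should be mapped back to the original column ordering:
# A permuted copy of L_toy should be restored to L_toy's column order
L_toy <- matrix(rnorm(20), nrow = 10, ncol = 2)
all.equal(permute_L(L_toy[, c(2, 1)], L_toy), L_toy)  # TRUE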
In this analysis, we will focus on the sparse overlapping setting.
sim_binary_loadings_data <- function(args) {
  set.seed(args$seed)
  # Factor matrix F (p x k) with iid normal entries
  FF <- matrix(rnorm(args$k * args$p, sd = args$group_sd), ncol = args$k)
  if (args$constrain_F) {
    # Orthogonalize F and rescale its columns to the intended magnitude
    FF_svd <- svd(FF)
    FF <- FF_svd$u
    FF <- t(t(FF) * rep(args$group_sd, args$k) * sqrt(args$p))
  }
  # Binary loadings: each entry is active independently with probability pi1
  LL <- matrix(rbinom(args$n*args$k, 1, args$pi1), nrow = args$n, ncol = args$k)
  E <- matrix(rnorm(args$n * args$p, sd = args$indiv_sd), nrow = args$n)
  Y <- LL %*% t(FF) + E
  YYt <- (1/args$p)*tcrossprod(Y)
  return(list(Y = Y, YYt = YYt, LL = LL, FF = FF, K = ncol(LL)))
}
n <- 100
p <- 1000
k <- 10
pi1 <- 0.1
indiv_sd <- 1
group_sd <- 1
seed <- 1
sim_args = list(n = n, p = p, k = k, pi1 = pi1, indiv_sd = indiv_sd, group_sd = group_sd, seed = seed, constrain_F = FALSE)
sim_data <- sim_binary_loadings_data(sim_args)
This is a heatmap of the true loadings matrix:
plot_heatmap(sim_data$LL)
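As a quick check that this matches the intended sparse overlapping setting, each sample should be active in only a handful of factors:
# With k = 10 and pi1 = 0.1, each sample is expected to be active in
# about k * pi1 = 1 factor on average
mean(rowSums(sim_data$LL))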
In this section, I try stability selection with the GBCD method. In my experiments, I’ve found that GBCD tends to return extra factors (partially because the point-Laplace initialization will yield extra factors).
First, I try splitting the data by splitting the columns.
set.seed(1)
X_split_by_col <- stability_selection_split_data(sim_data$Y, dim = 'columns')
gbcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  gbcd_fits_by_col[[i]] <- gbcd::fit_gbcd(X_split_by_col[[i]],
                                          Kmax = 10,
                                          prior = ebnm::ebnm_generalized_binary,
                                          verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
user system elapsed
0.004 0.000 0.004
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
4.068 0.133 4.210
[1] "Estimate GEP membership matrix L..."
--Estimate of factor 10 is numerically zero!
--Estimate of factor 20 is numerically zero!
--Estimate of factor 9 is numerically zero!
--Estimate of factor 6 is numerically zero!
(... repeated "numerically zero" messages for factors 9 and 10 trimmed ...)
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
--Estimate of factor 8 is numerically zero!
--Estimate of factor 8 is numerically zero!
--Estimate of factor 2 is numerically zero!
user system elapsed
13.633 0.323 14.071
[1] "Estimate GEP signature matrix F..."
Backfitting 16 factors (tolerance: 7.45e-04)...
--Estimate of factor 13 is numerically zero!
--Estimate of factor 14 is numerically zero!
--Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 5.888e-04.
Difference between iterations is within 1.0e+01...
--Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 4.755e-04.
An update to factor 12 decreased the objective by 2.141e-04.
--Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 2.006e-04.
--Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 4.800e-05.
--Estimate of factor 12 is numerically zero!
--Estimate of factor 9 is numerically zero!
Difference between iterations is within 1.0e+00...
--Maximum number of iterations reached!
Wrapping up...
Done.
user system elapsed
18.873 0.443 19.663
[1] "Form cell by cell covariance matrix..."
user system elapsed
0.001 0.000 0.001
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
3.503 0.131 3.649
[1] "Estimate GEP membership matrix L..."
--Estimate of factor 17 is numerically zero!
--Estimate of factor 20 is numerically zero!
--Estimate of factor 10 is numerically zero!
--Estimate of factor 14 is numerically zero!
--Estimate of factor 9 is numerically zero!
--Estimate of factor 16 is numerically zero!
--Estimate of factor 6 is numerically zero!
--Estimate of factor 2 is numerically zero!
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
8.842 0.182 9.111
[1] "Estimate GEP signature matrix F..."
Backfitting 11 factors (tolerance: 7.45e-04)...
Difference between iterations is within 1.0e+01...
--Estimate of factor 11 is numerically zero!
Difference between iterations is within 1.0e+00...
Difference between iterations is within 1.0e-01...
Wrapping up...
Done.
user system elapsed
0.831 0.021 0.852
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(gbcd_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(gbcd_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_col <- stability_selection_post_processing(gbcd_fits_by_col[[1]], gbcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
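For reference, here is a minimal sketch of what I understand the post-processing step to do in the column-split case, where both fits share the same samples. This is only illustrative; the actual implementation is the sourced stability_selection_post_processing function, and the sketch assumes both fits return the same number of columns.
# Illustrative sketch only (assumption): normalize the columns of the two
# loadings estimates, match them via the assignment problem, and keep
# pairs whose absolute cosine similarity exceeds the threshold.
post_process_sketch <- function(L1, L2, threshold = 0.99) {
  normalize <- function(M) sweep(M, 2, sqrt(colSums(M^2)), "/")
  sims <- abs(crossprod(normalize(L1), normalize(L2)))
  match <- lpSolve::lp.assign(sims, direction = "max")$solution
  pairs <- which(match > 0.5, arr.ind = TRUE)
  keep <- pairs[sims[pairs] > threshold, , drop = FALSE]
  list(L = L1[, keep[, 1], drop = FALSE])
}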
This is a heatmap of the final loadings estimate:
L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))
In this case, the method recovered seven of the ten components.
Now, we try splitting the rows:
set.seed(1)
X_split_by_row <- stability_selection_split_data(sim_data$Y, dim = 'rows')
gbcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  gbcd_fits_by_row[[i]] <- gbcd::fit_gbcd(X_split_by_row[[i]],
                                          Kmax = 10,
                                          prior = ebnm::ebnm_generalized_binary,
                                          verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
user system elapsed
0.000 0.000 0.001
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
5.785 0.190 5.974
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
35.470 0.957 37.058
[1] "Estimate GEP signature matrix F..."
Backfitting 16 factors (tolerance: 7.45e-05)...
Difference between iterations is within 1.0e+00...
Difference between iterations is within 1.0e-01...
Difference between iterations is within 1.0e-02...
Difference between iterations is within 1.0e-03...
Difference between iterations is within 1.0e-04...
Wrapping up...
Done.
user system elapsed
0.538 0.009 0.547
[1] "Form cell by cell covariance matrix..."
user system elapsed
0.003 0.000 0.002
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
user system elapsed
2.743 0.095 2.842
[1] "Estimate GEP membership matrix L..."
--Estimate of factor 18 is numerically zero!
--Estimate of factor 19 is numerically zero!
--Estimate of factor 8 is numerically zero!
--Estimate of factor 2 is numerically zero!
--Estimate of factor 6 is numerically zero!
(... repeated "numerically zero" messages for factor 2 trimmed ...)
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
--Estimate of factor 5 is numerically zero!
user system elapsed
13.273 0.278 13.761
[1] "Estimate GEP signature matrix F..."
Backfitting 12 factors (tolerance: 1.42e-03)...
Difference between iterations is within 1.0e+02...
--Estimate of factor 12 is numerically zero!
--Estimate of factor 4 is numerically zero!
Difference between iterations is within 1.0e+01...
Difference between iterations is within 1.0e+00...
Difference between iterations is within 1.0e-01...
Wrapping up...
Done.
user system elapsed
1.665 0.038 1.705
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(gbcd_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(gbcd_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_row <- stability_selection_post_processing(gbcd_fits_by_row[[1]], gbcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L
This is a heatmap of the final loadings estimate:
L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))
In this case, the method recovered two of the ten factors.
In this section, I try stability selection with the EBCD method. When given a Kmax value that is larger than the true number of components, I’ve found that EBCD usually returns extra factors. So in this section, when I run EBCD, I give the method double the true number of components.
set.seed(1)
ebcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  ebcd_fits_by_col[[i]] <- ebcd::ebcd(t(X_split_by_col[[i]]),
                                      Kmax = 20,
                                      ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(ebcd_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(ebcd_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_col <- stability_selection_post_processing(ebcd_fits_by_col[[1]], ebcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
This is a heatmap of the final loadings estimate:
L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))
These are the correlations between the estimates and the true factors (paired via permute_L to maximize cosine similarity):
diag(cor(L_est_by_col_permuted, sim_data$LL))
[1] 0.9975203 0.9974854 0.9983737 0.9988182 0.9992998 0.9991493 0.9990425
[8] 0.9982839 0.9981471 0.9978499
The method recovers all ten components nearly perfectly.
set.seed(1)
ebcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  ebcd_fits_by_row[[i]] <- ebcd::ebcd(t(X_split_by_row[[i]]),
                                      Kmax = 20,
                                      ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(ebcd_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(ebcd_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_row <- stability_selection_post_processing(ebcd_fits_by_row[[1]], ebcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L
This is a heatmap of the final loadings estimate:
L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))
In this case, the method returns only seven of the ten factors.
In this section, I try stability selection with the CoDesymNMF method. Similar to EBCD, when given a Kmax value that is larger than the true number of components, the method usually returns extra factors. Note that in this section, when I run CoDesymNMF, I give the method double the true number of components.
codesymnmf_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  cov_mat <- tcrossprod(X_split_by_col[[i]])/ncol(X_split_by_col[[i]])
  codesymnmf_fits_by_col[[i]] <- codesymnmf::codesymnmf(cov_mat, 20)$H
}
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(codesymnmf_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(codesymnmf_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_col <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]], codesymnmf_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
In this case, the method does not return any factors at the 0.99 threshold. Reducing the similarity threshold to 0.98 allows the method to return four of the ten factors, so it seems like the estimates do contain the signal. However, perhaps because the method does not encourage sparse or binary loadings, the matched components are not exactly the same across the two estimates.
results_by_col <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]], codesymnmf_fits_by_col[[2]], threshold = 0.98)
L_est_by_col <- results_by_col$L
This is a heatmap of the final loadings estimate with a decreased threshold:
L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))
codesymnmf_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  cov_mat <- tcrossprod(X_split_by_row[[i]])/ncol(X_split_by_row[[i]])
  codesymnmf_fits_by_row[[i]] <- codesymnmf::codesymnmf(cov_mat, 20)$H
}
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(codesymnmf_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(codesymnmf_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_row <- stability_selection_post_processing(codesymnmf_fits_by_row[[1]], codesymnmf_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L
In this setting, the method again does not return any factors. The method only starts returning factors when the threshold is reduced to 0.92, so this suggests the estimates are more dissimilar in this case.
results_by_row <- stability_selection_post_processing(codesymnmf_fits_by_row[[1]], codesymnmf_fits_by_row[[2]], threshold = 0.92)
L_est_by_row <- results_by_row$L
This is a heatmap of the final loadings estimate with a reduced threshold:
L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))
In this section, I try stability selection with the EBMFcov method. In my experiments, I’ve found that when given a larger Kmax, EBMFcov will recover roughly the correct number of factors and then add essentially trivial extra factors. I wonder if the stability selection procedure will help get rid of these extra factors. The hope is that it can. However, if both estimates contain factors that look like this, then maybe these factors will still get returned. Note that in this section, when I run EBMFcov, I give the method double the true number of components.
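For reference, the cov_fit function below models the scaled sample covariance as a low-rank EBMF fit plus an isotropic diagonal term,

$$\frac{1}{p} Y Y^\top \approx L L^\top + s^2 I_n,$$

alternating between fitting the EBMF model to the covariance with the current diagonal estimate subtracted and re-estimating $s^2$ from the residual diagonal.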
cov_fit <- function(covmat, ebnm_fn = ebnm::ebnm_point_laplace, Kmax = 1000, verbose.lvl = 0, backfit = TRUE) {
  # Initial EBMF fit to the covariance matrix
  fl <- flash_init(covmat, var_type = 0) %>%
    flash_set_verbose(verbose.lvl) %>%
    flash_greedy(ebnm_fn = ebnm_fn, Kmax = Kmax)
  if (backfit == TRUE){
    fl <- flash_backfit(fl)
  }
  # Estimate the residual variance s2 from the diagonal, then iterate:
  # subtract s2 from the diagonal, refit, and re-estimate s2 until the
  # ratio of successive estimates is within 1e-4 of 1
  s2 <- max(0, mean(diag(covmat) - diag(fitted(fl))))
  s2_diff <- Inf
  while(s2 > 0 && abs(s2_diff - 1) > 1e-4) {
    covmat_minuss2 <- covmat - diag(rep(s2, ncol(covmat)))
    fl <- flash_init(covmat_minuss2, var_type = 0) %>%
      flash_set_verbose(verbose.lvl) %>%
      flash_greedy(ebnm_fn = ebnm_fn, Kmax = Kmax)
    if (backfit == TRUE){
      fl <- flash_backfit(fl)
    }
    old_s2 <- s2
    s2 <- max(0, mean(diag(covmat) - diag(fitted(fl))))
    s2_diff <- s2 / old_s2
  }
  return(list(fl=fl, s2 = s2))
}
library(flashier)
Loading required package: ebnm
ebmfcov_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  cov_mat <- tcrossprod(X_split_by_col[[i]])/ncol(X_split_by_col[[i]])
  fl <- cov_fit(cov_mat, ebnm_fn = ebnm::ebnm_generalized_binary, Kmax = 20)$fl
  fl_scaled <- ldf(fl)
  L_scaled <- fl_scaled$L %*% diag(sqrt(fl_scaled$D))
  ebmfcov_fits_by_col[[i]] <- L_scaled
}
Just a note: this took a really long time to run. I didn’t expect it to take so long.
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(ebmfcov_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(ebmfcov_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_col <- stability_selection_post_processing(ebmfcov_fits_by_col[[1]], ebmfcov_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
This is a heatmap of the final loadings estimate:
L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))
This is a heatmap of the factors in the final loadings estimate that appear to be trivial:
plot_heatmap(L_est_by_col_permuted[, c(3, 6, 11:19)], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted[, c(3, 6, 11:19)])), max(abs(L_est_by_col_permuted[, c(3, 6, 11:19)])), length.out = 50))
So it appears that the method still returns these extra factors, since both estimates contain factors of this type. I think this is something specific to EBMFcov and the generalized binary prior when Kmax is large, so perhaps another type of ad-hoc procedure is needed to get rid of these extra factors.
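As one illustrative possibility (untested, and purely an assumption on my part), such a procedure could filter out factors whose loadings are nearly constant across samples, since the extra factors here look dense and flat rather than sparse:
# Hypothetical ad-hoc filter: drop loadings columns whose coefficient of
# variation is small, i.e., factors that are close to constant across
# samples. The cutoff cv_min is an arbitrary illustrative choice.
drop_flat_factors <- function(L, cv_min = 0.5) {
  cv <- apply(L, 2, function(x) sd(x) / (mean(abs(x)) + 1e-10))
  L[, cv > cv_min, drop = FALSE]
}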
ebmfcov_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  cov_mat <- tcrossprod(X_split_by_row[[i]])/ncol(X_split_by_row[[i]])
  fl <- cov_fit(cov_mat, ebnm_fn = ebnm::ebnm_generalized_binary, Kmax = 20)$fl
  fl_scaled <- ldf(fl)
  L_scaled <- fl_scaled$L %*% diag(sqrt(fl_scaled$D))
  ebmfcov_fits_by_row[[i]] <- L_scaled
}
This is a heatmap of the loadings estimate from the first subset:
L1_permuted <- permute_L(ebmfcov_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))
This is a heatmap of the loadings estimate from the second subset:
L2_permuted <- permute_L(ebmfcov_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))
results_by_row <- stability_selection_post_processing(ebmfcov_fits_by_row[[1]], ebmfcov_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L
This is a heatmap of the final loadings estimate:
L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))
The method returns three of the ten factors. Again, the method also returns some of the extra factors.
As expected, the variant of stability selection that splits the data by rows struggled more than the variant that splits by columns. I thought it was interesting that EBCD with column-splitting stability selection was able to return all ten factors – this suggests that when given a higher Kmax, EBCD returns extra factors on top of the correct factors (as opposed to representing the matrix with a characteristically different representation that utilizes its full factor allotment). CoDesymNMF did not return any factors with either splitting approach at the 0.99 threshold; this may be a result of the lack of a sparsity penalty in CoDesymNMF.
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] flashier_1.0.56 ebnm_1.1-34 pheatmap_1.0.12 ggplot2_3.5.2
[5] dplyr_1.1.4 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2
[4] fastmap_1.2.0 lazyeval_0.2.2 codesymnmf_0.0.0.9000
[7] promises_1.3.3 digest_0.6.37 lifecycle_1.0.4
[10] processx_3.8.4 invgamma_1.1 magrittr_2.0.3
[13] compiler_4.3.2 rlang_1.1.6 sass_0.4.10
[16] progress_1.2.3 tools_4.3.2 yaml_2.3.10
[19] data.table_1.17.6 knitr_1.50 prettyunits_1.2.0
[22] htmlwidgets_1.6.4 scatterplot3d_0.3-44 RColorBrewer_1.1-3
[25] Rtsne_0.17 withr_3.0.2 purrr_1.0.4
[28] grid_4.3.2 git2r_0.33.0 fastTopics_0.6-192
[31] colorspace_2.1-1 scales_1.4.0 gtools_3.9.5
[34] cli_3.6.5 rmarkdown_2.29 crayon_1.5.3
[37] generics_0.1.4 RcppParallel_5.1.10 rstudioapi_0.16.0
[40] httr_1.4.7 pbapply_1.7-2 cachem_1.1.0
[43] stringr_1.5.1 splines_4.3.2 parallel_4.3.2
[46] softImpute_1.4-3 vctrs_0.6.5 Matrix_1.6-5
[49] jsonlite_2.0.0 callr_3.7.6 hms_1.1.3
[52] mixsqp_0.3-54 ggrepel_0.9.6 irlba_2.3.5.1
[55] horseshoe_0.2.0 trust_0.1-8 plotly_4.11.0
[58] jquerylib_0.1.4 tidyr_1.3.1 ebcd_0.0.0.9000
[61] glue_1.8.0 ps_1.7.7 cowplot_1.1.3
[64] gbcd_0.2-17 uwot_0.2.3 stringi_1.8.7
[67] Polychrome_1.5.1 gtable_0.3.6 later_1.4.2
[70] quadprog_1.5-8 tibble_3.3.0 pillar_1.10.2
[73] htmltools_0.5.8.1 truncnorm_1.0-9 R6_2.6.1
[76] rprojroot_2.0.4 lpSolve_5.6.20 evaluate_1.0.4
[79] lattice_0.22-6 RhpcBLASctl_0.23-42 SQUAREM_2021.1
[82] ashr_2.2-66 httpuv_1.6.15 bslib_0.9.0
[85] Rcpp_1.0.14 deconvolveR_1.2-1 whisker_0.4.1
[88] xfun_0.52 fs_1.6.6 getPass_0.2-4
[91] pkgconfig_2.0.3