Last updated: 2025-09-12

Checks: 7 passed, 0 failed

Knit directory: stability_selection/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20250906) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 017d47b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/sparse_overlap_setting.Rmd) and HTML (docs/sparse_overlap_setting.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 017d47b Annie Xie 2025-09-12 Add exploration of sparse overlap setting

Introduction

In this analysis, we are interested in testing stability selection approaches in the sparse, overlapping setting.

At a high level, stability selection involves 1) splitting the data into two subsets, 2) applying the method to each subset, and 3) keeping the components that have high correspondence across the two sets of results. We will test two different approaches to step 1). The first approach splits the data by splitting the columns. This approach feels intuitive since we are interested in the loadings matrix, which says something about the samples in the dataset. In a population-genetics application, one could argue that all of the chromosomes are evolving independently, and so you could split the data by even vs. odd chromosomes to get two different datasets. However, in a single-cell RNA-seq application, it feels more natural to split the data by cells; this feels more like creating sub-datasets compared to splitting by genes (unless you want to assume the genes are drawn from a “population”, which I think feels less natural). This motivates the second approach: splitting the data by splitting the rows.
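The splitting step itself is implemented in code/stability_selection_functions.R (sourced below). As a rough illustration of the idea, here is a minimal sketch; the even 50/50 random split and the dim argument mirroring stability_selection_split_data are my assumptions, not the sourced implementation:

split_data_sketch <- function(Y, dim = c('columns', 'rows')){
  # Randomly partition either the columns or the rows of Y into two halves.
  dim <- match.arg(dim)
  if (dim == 'columns'){
    idx <- sample(ncol(Y), floor(ncol(Y)/2))
    list(Y[, idx, drop = FALSE], Y[, -idx, drop = FALSE])
  } else {
    idx <- sample(nrow(Y), floor(nrow(Y)/2))
    list(Y[idx, , drop = FALSE], Y[-idx, , drop = FALSE])
  }
}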

I imagine that splitting the columns will be better than splitting the rows in this setting because each sample is only active in a couple of factors. Therefore, splitting by samples may lead some factors to have weaker signal and thus be harder to find.

library(dplyr)
library(ggplot2)
library(pheatmap)
source('code/visualization_functions.R')
source('code/stability_selection_functions.R')
permute_L <- function(est, truth){
  K_est <- ncol(est)
  K_truth <- ncol(truth)
  n <- nrow(est)

  #if est and truth don't have the same number of columns, pad the narrower matrix with zero columns (zero columns have zero cosine similarity with everything)
  if (K_est < K_truth){
    est <- cbind(est, matrix(rep(0, n*(K_truth-K_est)), nrow = n))
  }

  if (K_est > K_truth){
    truth <- cbind(truth, matrix(rep(0, n*(K_est - K_truth)), nrow = n))
  }

  #normalize est and truth
  norms_est <- apply(est, 2, function(x){sqrt(sum(x^2))})
  norms_est[norms_est == 0] <- Inf

  norms_truth <- apply(truth, 2, function(x){sqrt(sum(x^2))})
  norms_truth[norms_truth == 0] <- Inf

  est_normalized <- t(t(est)/norms_est)
  truth_normalized <- t(t(truth)/norms_truth)

  #compute matrix of cosine similarities
  cosine_sim_matrix <- abs(crossprod(est_normalized, truth_normalized))
  assignment_problem <- lpSolve::lp.assign(t(cosine_sim_matrix), direction = "max")

  perm <- apply(assignment_problem$solution, 1, which.max)
  return(est[,perm])
}
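
As a quick sanity check of permute_L (not part of the original analysis), it should undo a known column permutation:

set.seed(1)
toy_truth <- matrix(rnorm(20), nrow = 5)      # 5 x 4 toy loadings matrix
toy_est <- toy_truth[, c(3, 1, 4, 2)]         # same columns, scrambled
all.equal(permute_L(toy_est, toy_truth), toy_truth)  # should be TRUE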

Data Generation

In this analysis, we will focus on the sparse overlapping setting.

sim_binary_loadings_data <- function(args) {
  set.seed(args$seed)
  
  FF <- matrix(rnorm(args$k * args$p, sd = args$group_sd), ncol = args$k)
  if (args$constrain_F) {
    FF_svd <- svd(FF)
    FF <- FF_svd$u
    FF <- t(t(FF) * rep(args$group_sd, args$k) * sqrt(args$p))
  }
  LL <- matrix(rbinom(args$n*args$k, 1, args$pi1), nrow = args$n, ncol = args$k)
  E <- matrix(rnorm(args$n * args$p, sd = args$indiv_sd), nrow = args$n)
  
  Y <- LL %*% t(FF) + E
  YYt <- (1/args$p)*tcrossprod(Y)
  
  return(list(Y = Y, YYt = YYt, LL = LL, FF = FF, K = ncol(LL)))
}
n <- 100
p <- 1000
k <- 10
pi1 <- 0.1
indiv_sd <- 1
group_sd <- 1
seed <- 1
sim_args = list(n = n, p = p, k = k, pi1 = pi1, indiv_sd = indiv_sd, group_sd = group_sd, seed = seed, constrain_F = FALSE)
sim_data <- sim_binary_loadings_data(sim_args)
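
As a quick check on the simulated data (added here for context), each loading is active with probability pi1 = 0.1, so each sample should be active in about one of the ten factors on average:

dim(sim_data$Y)              # 100 x 1000
mean(sim_data$LL)            # should be close to pi1 = 0.1
table(rowSums(sim_data$LL))  # number of active factors per sample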

This is a heatmap of the true loadings matrix:

plot_heatmap(sim_data$LL)

GBCD

In this section, I try stability selection with the GBCD method. In my experiments, I’ve found that GBCD tends to return extra factors (partially because the point-Laplace initialization will yield extra factors).

Stability Selection via Splitting Columns

First, I try splitting the data by splitting the columns.

set.seed(1)
X_split_by_col <- stability_selection_split_data(sim_data$Y, dim = 'columns')
gbcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  gbcd_fits_by_col[[i]] <- gbcd::fit_gbcd(X_split_by_col[[i]], 
                                   Kmax = 10, 
                                   prior = ebnm::ebnm_generalized_binary,
                                   verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.004   0.000   0.004 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  4.068   0.133   4.210 
[1] "Estimate GEP membership matrix L..."
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 20 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 6 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 10 is numerically zero!
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
  --Estimate of factor 8 is numerically zero!
  --Estimate of factor 8 is numerically zero!
  --Estimate of factor 2 is numerically zero!
   user  system elapsed 
 13.633   0.323  14.071 
[1] "Estimate GEP signature matrix F..."
Backfitting 16 factors (tolerance: 7.45e-04)...
  --Estimate of factor 13 is numerically zero!
  --Estimate of factor 14 is numerically zero!
  --Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 5.888e-04.
  Difference between iterations is within 1.0e+01...
  --Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 4.755e-04.
An update to factor 12 decreased the objective by 2.141e-04.
  --Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 2.006e-04.
  --Estimate of factor 12 is numerically zero!
An update to factor 12 decreased the objective by 4.800e-05.
  --Estimate of factor 12 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  Difference between iterations is within 1.0e+00...
  --Maximum number of iterations reached!
Wrapping up...
Done.
   user  system elapsed 
 18.873   0.443  19.663 
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.001   0.000   0.001 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  3.503   0.131   3.649 
[1] "Estimate GEP membership matrix L..."
  --Estimate of factor 17 is numerically zero!
  --Estimate of factor 20 is numerically zero!
  --Estimate of factor 10 is numerically zero!
  --Estimate of factor 14 is numerically zero!
  --Estimate of factor 9 is numerically zero!
  --Estimate of factor 16 is numerically zero!
  --Estimate of factor 6 is numerically zero!
  --Estimate of factor 2 is numerically zero!
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  8.842   0.182   9.111 
[1] "Estimate GEP signature matrix F..."
Backfitting 11 factors (tolerance: 7.45e-04)...
  Difference between iterations is within 1.0e+01...
  --Estimate of factor 11 is numerically zero!
  Difference between iterations is within 1.0e+00...
  Difference between iterations is within 1.0e-01...
Wrapping up...
Done.
   user  system elapsed 
  0.831   0.021   0.852 

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(gbcd_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(gbcd_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_col <- stability_selection_post_processing(gbcd_fits_by_col[[1]], gbcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
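
The function stability_selection_post_processing is sourced from code/stability_selection_functions.R and is not shown in this page. Below is my rough sketch of the idea; the exact matching rule and how retained columns are combined are assumptions on my part, so the sourced implementation may differ:

post_process_sketch <- function(L1, L2, threshold = 0.99){
  # Align the columns of L2 to those of L1 using the same assignment as permute_L,
  # then keep column pairs whose absolute cosine similarity exceeds the threshold.
  L2_aligned <- permute_L(L2, L1)
  cos_sim <- sapply(seq_len(ncol(L1)), function(k){
    abs(sum(L1[, k] * L2_aligned[, k])) /
      max(sqrt(sum(L1[, k]^2)) * sqrt(sum(L2_aligned[, k]^2)), 1e-10)
  })
  keep <- which(cos_sim > threshold)
  # For column splits the two fits share the same samples (rows), so the
  # matched columns can be averaged; other choices are possible.
  list(L = (L1[, keep, drop = FALSE] + L2_aligned[, keep, drop = FALSE]) / 2)
}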

This is a heatmap of the final loadings estimate:

L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))

In this case, the method recovered seven of the ten components.

Stability Selection via Splitting Rows

Now, we try splitting the rows:

set.seed(1)
X_split_by_row <- stability_selection_split_data(sim_data$Y, dim = 'rows')
gbcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  gbcd_fits_by_row[[i]] <- gbcd::fit_gbcd(X_split_by_row[[i]], 
                                   Kmax = 10, 
                                   prior = ebnm::ebnm_generalized_binary,
                                   verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.000   0.000   0.001 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  5.785   0.190   5.974 
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 35.470   0.957  37.058 
[1] "Estimate GEP signature matrix F..."
Backfitting 16 factors (tolerance: 7.45e-05)...
  Difference between iterations is within 1.0e+00...
  Difference between iterations is within 1.0e-01...
  Difference between iterations is within 1.0e-02...
  Difference between iterations is within 1.0e-03...
  Difference between iterations is within 1.0e-04...
Wrapping up...
Done.
   user  system elapsed 
  0.538   0.009   0.547 
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.003   0.000   0.002 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Adding factor 6 to flash object...
Adding factor 7 to flash object...
Adding factor 8 to flash object...
Adding factor 9 to flash object...
Adding factor 10 to flash object...
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  2.743   0.095   2.842 
[1] "Estimate GEP membership matrix L..."
  --Estimate of factor 18 is numerically zero!
  --Estimate of factor 19 is numerically zero!
  --Estimate of factor 8 is numerically zero!
  --Estimate of factor 2 is numerically zero!
  --Estimate of factor 6 is numerically zero!
  --Estimate of factor 2 is numerically zero!
  --Estimate of factor 2 is numerically zero!
  --Estimate of factor 2 is numerically zero!
  --Estimate of factor 2 is numerically zero!
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
  --Estimate of factor 5 is numerically zero!
   user  system elapsed 
 13.273   0.278  13.761 
[1] "Estimate GEP signature matrix F..."
Backfitting 12 factors (tolerance: 1.42e-03)...
  Difference between iterations is within 1.0e+02...
  --Estimate of factor 12 is numerically zero!
  --Estimate of factor 4 is numerically zero!
  Difference between iterations is within 1.0e+01...
  Difference between iterations is within 1.0e+00...
  Difference between iterations is within 1.0e-01...
Wrapping up...
Done.
   user  system elapsed 
  1.665   0.038   1.705 

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(gbcd_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(gbcd_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_row <- stability_selection_post_processing(gbcd_fits_by_row[[1]], gbcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate:

L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))

In this case, the method recovered two of the ten factors.

EBCD

In this section, I try stability selection with the EBCD method. When given a Kmax value that is larger than the true number of components, I’ve found that EBCD usually returns extra factors. So in this section, when I run EBCD, I give the method double the true number of components.

Stability Selection via Splitting Columns

set.seed(1)
ebcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  ebcd_fits_by_col[[i]] <- ebcd::ebcd(t(X_split_by_col[[i]]), 
                                   Kmax = 20, 
                                   ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(ebcd_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(ebcd_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_col <- stability_selection_post_processing(ebcd_fits_by_col[[1]], ebcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L

This is a heatmap of the final loadings estimate:

L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))

These are the correlations between the estimated and the true factors (paired via the assignment that maximizes cosine similarity, as in permute_L):

diag(cor(L_est_by_col_permuted, sim_data$LL))
 [1] 0.9975203 0.9974854 0.9983737 0.9988182 0.9992998 0.9991493 0.9990425
 [8] 0.9982839 0.9981471 0.9978499

The method recovers all ten components nearly perfectly.

Stability Selection via Splitting Rows

set.seed(1)
ebcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  ebcd_fits_by_row[[i]] <- ebcd::ebcd(t(X_split_by_row[[i]]), 
                                   Kmax = 20, 
                                   ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(ebcd_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(ebcd_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_row <- stability_selection_post_processing(ebcd_fits_by_row[[1]], ebcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate:

L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))

In this case, the method returns only seven of the ten factors.

CoDesymNMF

In this section, I try stability selection with the CoDesymNMF method. Similar to EBCD, when given a Kmax value that is larger than the true number of components, the method usually returns extra factors. Note that in this section, when I run CoDesymNMF, I give the method double the true number of components.

Stability Selection via Splitting Columns

codesymnmf_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  cov_mat <- tcrossprod(X_split_by_col[[i]])/ncol(X_split_by_col[[i]])
  codesymnmf_fits_by_col[[i]] <- codesymnmf::codesymnmf(cov_mat, 20)$H
}

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(codesymnmf_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(codesymnmf_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_col <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]], codesymnmf_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L

In this case, the method does not return any factors. Reducing the similarity threshold to 0.98 allows the method to return four of the ten factors, so the estimates do seem to contain the signal. However, perhaps because the method encourages neither sparsity nor binary loadings, the estimates do not match exactly across the two subsets.
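To see where matched factors start passing, I add a quick diagnostic here that scans a grid of thresholds and counts the retained factors (this assumes the returned $L has one column per retained factor):

for (thresh in seq(0.95, 0.99, by = 0.01)){
  res <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]],
                                             codesymnmf_fits_by_col[[2]],
                                             threshold = thresh)
  cat('threshold:', thresh, '-> factors retained:', ncol(res$L), '\n')
}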

results_by_col <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]], codesymnmf_fits_by_col[[2]], threshold = 0.98)
L_est_by_col <- results_by_col$L

This is a heatmap of the final loadings estimate with a decreased threshold:

L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))

Stability Selection via Splitting Rows

codesymnmf_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  cov_mat <- tcrossprod(X_split_by_row[[i]])/ncol(X_split_by_row[[i]])
  codesymnmf_fits_by_row[[i]] <- codesymnmf::codesymnmf(cov_mat, 20)$H
}

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(codesymnmf_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(codesymnmf_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_row <- stability_selection_post_processing(codesymnmf_fits_by_row[[1]], codesymnmf_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

In this setting, the method again does not return any factors. It only starts returning factors when the threshold is reduced to 0.92, which suggests the two estimates are even more dissimilar in this case.

results_by_row <- stability_selection_post_processing(codesymnmf_fits_by_row[[1]], codesymnmf_fits_by_row[[2]], threshold = 0.92)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate with a reduced threshold:

L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))

EBMFcov

In this section, I try stability selection with the EBMFcov method. In my experiments, I’ve found that when given a larger Kmax, EBMFcov will recover roughly the correct number of factors and then add essentially trivial extra factors. I wonder if the stability selection procedure will help get rid of these extra factors; the hope is that it can. However, if both estimates contain factors that look like this, then these factors may still get returned. Note that in this section, when I run EBMFcov, I give the method double the true number of components.

# Fit EBMF to a covariance matrix, estimating an extra diagonal variance s2.
# Requires the flashier package (loaded below).
cov_fit <- function(covmat, ebnm_fn = ebnm::ebnm_point_laplace, Kmax = 1000, verbose.lvl = 0, backfit = TRUE) {
  fl <- flash_init(covmat, var_type = 0) %>%
    flash_set_verbose(verbose.lvl) %>%
    flash_greedy(ebnm_fn = ebnm_fn, Kmax = Kmax)
  if (backfit == TRUE){
    fl <- flash_backfit(fl)
  }
  # Estimate the residual diagonal variance from the unexplained diagonal.
  s2 <- max(0, mean(diag(covmat) - diag(fitted(fl))))
  s2_diff <- Inf
  # Alternate between subtracting s2*I and refitting until s2 stabilizes.
  while(s2 > 0 && abs(s2_diff - 1) > 1e-4) {
    covmat_minuss2 <- covmat - diag(rep(s2, ncol(covmat)))
    fl <- flash_init(covmat_minuss2, var_type = 0) %>%
      flash_set_verbose(verbose.lvl) %>%
      flash_greedy(ebnm_fn = ebnm_fn, Kmax = Kmax)
    if (backfit == TRUE){
      fl <- flash_backfit(fl)
    }
    old_s2 <- s2
    s2 <- max(0, mean(diag(covmat) - diag(fitted(fl))))
    s2_diff <- s2 / old_s2
  }
  
  return(list(fl=fl, s2 = s2))
}

Stability Selection via Splitting Columns

library(flashier)
Loading required package: ebnm
ebmfcov_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  cov_mat <- tcrossprod(X_split_by_col[[i]])/ncol(X_split_by_col[[i]])
  fl <- cov_fit(cov_mat, ebnm_fn = ebnm::ebnm_generalized_binary, Kmax = 20)$fl
  fl_scaled <- ldf(fl)
  L_scaled <- fl_scaled$L %*% diag(sqrt(fl_scaled$D))
  ebmfcov_fits_by_col[[i]] <- L_scaled
}

Just a note: this took a really long time to run. I didn’t expect it to take so long.

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(ebmfcov_fits_by_col[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(ebmfcov_fits_by_col[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_col <- stability_selection_post_processing(ebmfcov_fits_by_col[[1]], ebmfcov_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L

This is a heatmap of the final loadings estimate:

L_est_by_col_permuted <- permute_L(L_est_by_col, sim_data$LL)
L_est_by_col_permuted <- L_est_by_col_permuted[, apply(L_est_by_col_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_col_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted)), max(abs(L_est_by_col_permuted)), length.out = 50))

This is a heatmap of the factors in the final loadings estimate that appear to be trivial:

plot_heatmap(L_est_by_col_permuted[, c(3, 6, 11:19)], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col_permuted[, c(3, 6, 11:19)])), max(abs(L_est_by_col_permuted[, c(3, 6, 11:19)])), length.out = 50))

So it appears that the method still returns these trivial factors, since both estimates contain factors of this type. I think this is something specific to EBMFcov and the generalized binary prior when Kmax is large, so perhaps another type of ad-hoc procedure is needed to get rid of these extra factors.
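As one possible ad-hoc filter (an untested illustration, not a procedure used in this analysis), one could drop factors whose loadings are nearly constant across samples, since the trivial factors here look dense and flat; cv_min is an arbitrary cutoff:

drop_flat_factors <- function(L, cv_min = 0.5){
  # Keep columns whose loadings vary appreciably across samples
  # relative to their overall scale (coefficient-of-variation filter).
  cv <- apply(L, 2, function(x){ sd(x) / max(mean(abs(x)), 1e-10) })
  L[, cv > cv_min, drop = FALSE]
}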

Stability Selection via Splitting Rows

ebmfcov_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  cov_mat <- tcrossprod(X_split_by_row[[i]])/ncol(X_split_by_row[[i]])
  fl <- cov_fit(cov_mat, ebnm_fn = ebnm::ebnm_generalized_binary, Kmax = 20)$fl
  fl_scaled <- ldf(fl)
  L_scaled <- fl_scaled$L %*% diag(sqrt(fl_scaled$D))
  ebmfcov_fits_by_row[[i]] <- L_scaled
}

This is a heatmap of the loadings estimate from the first subset:

L1_permuted <- permute_L(ebmfcov_fits_by_row[[1]], sim_data$LL)
plot_heatmap(L1_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L1_permuted)), max(abs(L1_permuted)), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

L2_permuted <- permute_L(ebmfcov_fits_by_row[[2]], sim_data$LL)
plot_heatmap(L2_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L2_permuted)), max(abs(L2_permuted)), length.out = 50))

results_by_row <- stability_selection_post_processing(ebmfcov_fits_by_row[[1]], ebmfcov_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate:

L_est_by_row_permuted <- permute_L(L_est_by_row, sim_data$LL)
L_est_by_row_permuted <- L_est_by_row_permuted[, apply(L_est_by_row_permuted, 2, function(x){sum(x^2)}) > 10^(-10)]
plot_heatmap(L_est_by_row_permuted, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_permuted)), max(abs(L_est_by_row_permuted)), length.out = 50))

The method returns three of the ten factors. Again, it also returns some of the extra trivial factors.

Observations

As expected, the row-splitting variant of stability selection struggled more than the column-splitting variant. I thought it was interesting that EBCD with column-splitting stability selection was able to return all ten factors; this suggests that when given a higher Kmax, EBCD returns extra factors on top of the correct ones (as opposed to representing the matrix with a characteristically different representation that uses its full factor allotment). CoDesymNMF did not return any factors with either splitting approach at the 0.99 threshold. This may be a result of the lack of a sparsity penalty in CoDesymNMF.


sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] flashier_1.0.56 ebnm_1.1-34     pheatmap_1.0.12 ggplot2_3.5.2  
[5] dplyr_1.1.4     workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1      viridisLite_0.4.2     farver_2.1.2         
 [4] fastmap_1.2.0         lazyeval_0.2.2        codesymnmf_0.0.0.9000
 [7] promises_1.3.3        digest_0.6.37         lifecycle_1.0.4      
[10] processx_3.8.4        invgamma_1.1          magrittr_2.0.3       
[13] compiler_4.3.2        rlang_1.1.6           sass_0.4.10          
[16] progress_1.2.3        tools_4.3.2           yaml_2.3.10          
[19] data.table_1.17.6     knitr_1.50            prettyunits_1.2.0    
[22] htmlwidgets_1.6.4     scatterplot3d_0.3-44  RColorBrewer_1.1-3   
[25] Rtsne_0.17            withr_3.0.2           purrr_1.0.4          
[28] grid_4.3.2            git2r_0.33.0          fastTopics_0.6-192   
[31] colorspace_2.1-1      scales_1.4.0          gtools_3.9.5         
[34] cli_3.6.5             rmarkdown_2.29        crayon_1.5.3         
[37] generics_0.1.4        RcppParallel_5.1.10   rstudioapi_0.16.0    
[40] httr_1.4.7            pbapply_1.7-2         cachem_1.1.0         
[43] stringr_1.5.1         splines_4.3.2         parallel_4.3.2       
[46] softImpute_1.4-3      vctrs_0.6.5           Matrix_1.6-5         
[49] jsonlite_2.0.0        callr_3.7.6           hms_1.1.3            
[52] mixsqp_0.3-54         ggrepel_0.9.6         irlba_2.3.5.1        
[55] horseshoe_0.2.0       trust_0.1-8           plotly_4.11.0        
[58] jquerylib_0.1.4       tidyr_1.3.1           ebcd_0.0.0.9000      
[61] glue_1.8.0            ps_1.7.7              cowplot_1.1.3        
[64] gbcd_0.2-17           uwot_0.2.3            stringi_1.8.7        
[67] Polychrome_1.5.1      gtable_0.3.6          later_1.4.2          
[70] quadprog_1.5-8        tibble_3.3.0          pillar_1.10.2        
[73] htmltools_0.5.8.1     truncnorm_1.0-9       R6_2.6.1             
[76] rprojroot_2.0.4       lpSolve_5.6.20        evaluate_1.0.4       
[79] lattice_0.22-6        RhpcBLASctl_0.23-42   SQUAREM_2021.1       
[82] ashr_2.2-66           httpuv_1.6.15         bslib_0.9.0          
[85] Rcpp_1.0.14           deconvolveR_1.2-1     whisker_0.4.1        
[88] xfun_0.52             fs_1.6.6              getPass_0.2-4        
[91] pkgconfig_2.0.3