Last updated: 2025-09-11

Checks: 7 passed, 0 failed

Knit directory: stability_selection/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20250906) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b240d2c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory

Untracked files:
    Untracked:  analysis/sparse_overlap_setting.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/baltree_setting.Rmd) and HTML (docs/baltree_setting.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd b240d2c Annie Xie 2025-09-11 Add exploration of balanced tree setting

Introduction

In this analysis, we are interested in testing stability selection approaches in the balanced tree setting. I am curious to see how stability selection performs in this setting because many methods struggle here due to non-identifiability issues. In particular, I am curious whether subsetting the data leads to different loadings estimates, and whether that, in turn, results in only a couple of factors being returned.

At a high level, stability selection involves 1) splitting the data into two subsets, 2) applying the method to each subset, and 3) keeping the components that correspond closely across the two sets of results. We will test two different approaches to step 1). The first approach is to split the data by splitting the columns. This approach feels intuitive since we are interested in the loadings matrix, which says something about the samples in the dataset. In a population genetics application, one could argue that all of the chromosomes are evolving independently, so one could split the data by even vs. odd chromosomes to get two different datasets. However, in a single-cell RNA-seq application, it feels more natural to split the data by cells – this feels more like creating sub-datasets than splitting by genes does (unless you want to assume that the genes are drawn from a “population”, which I think feels less natural). This motivates the second approach: splitting the data by splitting the rows.
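For concreteness, here is a minimal sketch of what the splitting step could look like. The actual stability_selection_split_data used below lives in code/stability_selection_functions.R; the function here is a simplified stand-in, and the random halving is an assumption about how the split is done.

split_data_sketch <- function(Y, dim = c('columns', 'rows')) {
  # Randomly divide the columns (or rows) of Y into two disjoint halves.
  dim <- match.arg(dim)
  if (dim == 'columns') {
    idx <- sample(ncol(Y), floor(ncol(Y) / 2))
    list(Y[, idx, drop = FALSE], Y[, -idx, drop = FALSE])
  } else {
    idx <- sample(nrow(Y), floor(nrow(Y) / 2))
    list(Y[idx, , drop = FALSE], Y[-idx, , drop = FALSE])
  }
}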

library(dplyr)
library(ggplot2)
library(pheatmap)
source('code/visualization_functions.R')
source('code/stability_selection_functions.R')

Data Generation

In this analysis, we will focus on the balanced tree setting.

sim_4pops <- function(args) {
  set.seed(args$seed)
  
  n <- sum(args$pop_sizes)
  p <- args$n_genes
  
  # Factor matrix: one column per branch of the balanced tree (root, two
  # internal branches, and four population-specific branches).
  FF <- matrix(rnorm(7 * p, sd = rep(args$branch_sds, each = p)), ncol = 7)
  # if (args$constrain_F) {
  #   FF_svd <- svd(FF)
  #   FF <- FF_svd$u
  #   FF <- t(t(FF) * branch_sds * sqrt(p))
  # }
  
  # Loadings matrix encoding the balanced tree: a shared root factor, two
  # internal-branch factors, and one factor per population.
  LL <- matrix(0, nrow = n, ncol = 7)
  LL[, 1] <- 1
  LL[, 2] <- rep(c(1, 1, 0, 0), times = args$pop_sizes)
  LL[, 3] <- rep(c(0, 0, 1, 1), times = args$pop_sizes)
  LL[, 4] <- rep(c(1, 0, 0, 0), times = args$pop_sizes)
  LL[, 5] <- rep(c(0, 1, 0, 0), times = args$pop_sizes)
  LL[, 6] <- rep(c(0, 0, 1, 0), times = args$pop_sizes)
  LL[, 7] <- rep(c(0, 0, 0, 1), times = args$pop_sizes)
  
  # Add independent Gaussian noise, then form the data matrix and its
  # scaled sample covariance.
  E <- matrix(rnorm(n * p, sd = args$indiv_sd), nrow = n)
  Y <- LL %*% t(FF) + E
  YYt <- (1/p)*tcrossprod(Y)
  return(list(Y = Y, YYt = YYt, LL = LL, FF = FF, K = ncol(LL)))
}
sim_args = list(pop_sizes = rep(40, 4), n_genes = 1000, branch_sds = rep(2,7), indiv_sd = 1, seed = 1)
sim_data <- sim_4pops(sim_args)

This is a heatmap of the true loadings matrix:

plot_heatmap(sim_data$LL)

GBCD

In this section, I try stability selection with the GBCD method. In my experiments, I’ve found that GBCD tends to return extra factors (partly because the point-Laplace initialization will yield extra factors). GBCD usually does a good job of recovering the correct number of components in the tree setting, so I expect the method will still return most, if not all, of the factors.

Stability Selection via Splitting Columns

First, I try splitting the data by splitting the columns.

set.seed(1)
X_split_by_col <- stability_selection_split_data(sim_data$Y, dim = 'columns')
gbcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  gbcd_fits_by_col[[i]] <- gbcd::fit_gbcd(X_split_by_col[[i]], 
                                   Kmax = 7, 
                                   prior = ebnm::ebnm_generalized_binary,
                                   verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.005   0.000   0.006 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Factor doesn't significantly increase objective and won't be added.
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  1.100   0.027   1.129 
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 11.486   0.115  11.603 
[1] "Estimate GEP signature matrix F..."
Backfitting 7 factors (tolerance: 1.19e-03)...
  Difference between iterations is within 1.0e+00...
  Difference between iterations is within 1.0e-01...
  --Maximum number of iterations reached!
Wrapping up...
Done.
   user  system elapsed 
 11.387   0.227  11.615 
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.003   0.000   0.003 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Factor doesn't significantly increase objective and won't be added.
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  3.551   0.080   3.632 
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 11.081   0.111  11.193 
[1] "Estimate GEP signature matrix F..."
Backfitting 7 factors (tolerance: 1.19e-03)...
  Difference between iterations is within 1.0e+00...
  Difference between iterations is within 1.0e-01...
  --Maximum number of iterations reached!
Wrapping up...
Done.
   user  system elapsed 
 11.320   0.218  11.539 

This is a heatmap of the loadings estimate from the first subset:

plot_heatmap(gbcd_fits_by_col[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(gbcd_fits_by_col[[1]])), max(abs(gbcd_fits_by_col[[1]])), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

plot_heatmap(gbcd_fits_by_col[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(gbcd_fits_by_col[[2]])), max(abs(gbcd_fits_by_col[[2]])), length.out = 50))

results_by_col <- stability_selection_post_processing(gbcd_fits_by_col[[1]], gbcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
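As a rough illustration of the matching step, the sketch below pairs components across the two fits by the cosine similarity of their loading columns and keeps those with a close counterpart. This is only my guess at what stability_selection_post_processing does; the actual implementation in code/stability_selection_functions.R may use a different similarity measure and a different way of combining matched components, and the sketch assumes the two loadings estimates share the same rows (as they do for the column split).

match_components_sketch <- function(L1, L2, threshold = 0.99) {
  # Cosine similarity between every pair of components across the two fits.
  unit_cols <- function(A) sweep(A, 2, sqrt(colSums(A^2)) + 1e-12, "/")
  sims <- abs(crossprod(unit_cols(L1), unit_cols(L2)))
  # Keep the components of the first fit that have a close counterpart.
  keep <- which(apply(sims, 1, max) >= threshold)
  list(L = L1[, keep, drop = FALSE], n_matched = length(keep))
}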

This is a heatmap of the final loadings estimate:

plot_heatmap(L_est_by_col, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col)), max(abs(L_est_by_col)), length.out = 50))

In this case, the method was able to recover six of the seven factors. However, by reducing the similarity threshold from 0.99 to 0.98, we are able to recover the last component.

results_by_col <- stability_selection_post_processing(gbcd_fits_by_col[[1]], gbcd_fits_by_col[[2]], threshold = 0.98)
L_est_by_col <- results_by_col$L

This is a heatmap of the final loadings estimate:

plot_heatmap(L_est_by_col, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col)), max(abs(L_est_by_col)), length.out = 50))

Stability Selection via Splitting Rows

Now, we try splitting the rows:

set.seed(1)
X_split_by_row <- stability_selection_split_data(sim_data$Y, dim = 'rows')
gbcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  gbcd_fits_by_row[[i]] <- gbcd::fit_gbcd(X_split_by_row[[i]], 
                                   Kmax = 7, 
                                   prior = ebnm::ebnm_generalized_binary,
                                   verbose = 0)$L
}
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.001   0.000   0.001 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Factor doesn't significantly increase objective and won't be added.
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 12.332   0.209  12.541 
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 24.830   0.477  25.307 
[1] "Estimate GEP signature matrix F..."
Backfitting 6 factors (tolerance: 1.91e-04)...
  Difference between iterations is within 1.0e-01...
  Difference between iterations is within 1.0e-02...
  Difference between iterations is within 1.0e-03...
Wrapping up...
Done.
   user  system elapsed 
  1.621   0.041   1.661 
[1] "Form cell by cell covariance matrix..."
   user  system elapsed 
  0.005   0.000   0.005 
[1] "Initialize GEP membership matrix L..."
Adding factor 1 to flash object...
Wrapping up...
Done.
Adding factor 2 to flash object...
Adding factor 3 to flash object...
Adding factor 4 to flash object...
Adding factor 5 to flash object...
Factor doesn't significantly increase objective and won't be added.
Wrapping up...
Done.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
  1.807   0.034   1.841 
[1] "Estimate GEP membership matrix L..."
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
Warning in report.maxiter.reached(verbose.lvl): Maximum number of iterations
reached.
   user  system elapsed 
 11.469   0.162  11.657 
[1] "Estimate GEP signature matrix F..."
Backfitting 7 factors (tolerance: 2.19e-03)...
  Difference between iterations is within 1.0e+00...
  --Maximum number of iterations reached!
Wrapping up...
Done.
   user  system elapsed 
 17.220   0.377  17.600 

This is a heatmap of the loadings estimate from the first subset:

plot_heatmap(gbcd_fits_by_row[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(gbcd_fits_by_row[[1]])), max(abs(gbcd_fits_by_row[[1]])), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

plot_heatmap(gbcd_fits_by_row[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(gbcd_fits_by_row[[2]])), max(abs(gbcd_fits_by_row[[2]])), length.out = 50))

results_by_row <- stability_selection_post_processing(gbcd_fits_by_row[[1]], gbcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate:

plot_heatmap(L_est_by_row, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row)), max(abs(L_est_by_row)), length.out = 50))

In this case, the method only recovers two of the seven factors. Visually, the second estimate seems to capture all of the components, while the first estimate seems to capture only six (it looks like the first and fourth group effects were partially merged into one component).

Note that if I reduce the threshold from 0.99 to 0.9, the method finds six factors.

results_by_row_threshold0.9 <- stability_selection_post_processing(gbcd_fits_by_row[[1]], gbcd_fits_by_row[[2]], threshold = 0.9)
L_est_by_row_threshold0.9 <- results_by_row_threshold0.9$L
plot_heatmap(L_est_by_row_threshold0.9, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row_threshold0.9)), max(abs(L_est_by_row_threshold0.9)), length.out = 50))

EBCD

In this section, I try stability selection with the EBCD method. When given a Kmax value that is larger than the true number of components, I’ve found that EBCD usually returns extra factors. So in this section, when I run EBCD, I give the method double the true number of components.

Stability Selection via Splitting Columns

set.seed(1)
ebcd_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  ebcd_fits_by_col[[i]] <- ebcd::ebcd(t(X_split_by_col[[i]]), 
                                   Kmax = 14, 
                                   ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}

This is a heatmap of the estimated loadings from the first subset:

plot_heatmap(ebcd_fits_by_col[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(ebcd_fits_by_col[[1]])), max(abs(ebcd_fits_by_col[[1]])), length.out = 50))

This is a heatmap of the estimated loadings from the second subset:

plot_heatmap(ebcd_fits_by_col[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(ebcd_fits_by_col[[2]])), max(abs(ebcd_fits_by_col[[2]])), length.out = 50))

results_by_col <- stability_selection_post_processing(ebcd_fits_by_col[[1]], ebcd_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L

In this example, the method does not return any factors.

Stability Selection via Splitting Rows

set.seed(1)
ebcd_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  ebcd_fits_by_row[[i]] <- ebcd::ebcd(t(X_split_by_row[[i]]), 
                                   Kmax = 14, 
                                   ebnm_fn = ebnm::ebnm_generalized_binary)$EL
}

This is a heatmap of the loadings estimate from the first subset:

plot_heatmap(ebcd_fits_by_row[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(ebcd_fits_by_row[[1]])), max(abs(ebcd_fits_by_row[[1]])), length.out = 50))

This is a heatmap of the loadings estimate from the second subset:

plot_heatmap(ebcd_fits_by_row[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(ebcd_fits_by_row[[2]])), max(abs(ebcd_fits_by_row[[2]])), length.out = 50))

results_by_row <- stability_selection_post_processing(ebcd_fits_by_row[[1]], ebcd_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

In this case, the stability selection method also does not return any factors.

CoDesymNMF

In this section, I try stability selection with the CoDesymNMF method. Similar to EBCD, when given a Kmax value that is larger than the true number of components, the method usually returns extra factors. Note that in this section, when I run CoDesymNMF, I give the method double the true number of components.

Stability Selection via Splitting Columns

codesymnmf_fits_by_col <- list()
for (i in 1:length(X_split_by_col)){
  cov_mat <- tcrossprod(X_split_by_col[[i]])/ncol(X_split_by_col[[i]])
  codesymnmf_fits_by_col[[i]] <- codesymnmf::codesymnmf(cov_mat, 14)$H
}

This is a heatmap of the estimated loadings from the first subset:

plot_heatmap(codesymnmf_fits_by_col[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(codesymnmf_fits_by_col[[1]])), max(abs(codesymnmf_fits_by_col[[1]])), length.out = 50))

This is a heatmap of the estimated loadings from the second subset:

plot_heatmap(codesymnmf_fits_by_col[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(codesymnmf_fits_by_col[[2]])), max(abs(codesymnmf_fits_by_col[[2]])), length.out = 50))

results_by_col <- stability_selection_post_processing(codesymnmf_fits_by_col[[1]], codesymnmf_fits_by_col[[2]], threshold = 0.99)
L_est_by_col <- results_by_col$L
plot_heatmap(L_est_by_col, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_col)), max(abs(L_est_by_col)), length.out = 50))

The method recovers five factors. These factors roughly resemble some of the tree factors, but they are not as sparse (as expected, since CoDesymNMF does not explicitly encourage sparsity).

Stability Selection via Splitting Rows

codesymnmf_fits_by_row <- list()
for (i in 1:length(X_split_by_row)){
  cov_mat <- tcrossprod(X_split_by_row[[i]])/ncol(X_split_by_row[[i]])
  codesymnmf_fits_by_row[[i]] <- codesymnmf::codesymnmf(cov_mat, 14)$H
}

This is a heatmap of the estimated loadings from the first subset:

plot_heatmap(codesymnmf_fits_by_row[[1]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(codesymnmf_fits_by_row[[1]])), max(abs(codesymnmf_fits_by_row[[1]])), length.out = 50))

This is a heatmap of the estimated loadings from the second subset:

plot_heatmap(codesymnmf_fits_by_row[[2]], colors_range = c('blue','gray96','red'), brks = seq(-max(abs(codesymnmf_fits_by_row[[2]])), max(abs(codesymnmf_fits_by_row[[2]])), length.out = 50))

results_by_row <- stability_selection_post_processing(codesymnmf_fits_by_row[[1]], codesymnmf_fits_by_row[[2]], threshold = 0.99)
L_est_by_row <- results_by_row$L

This is a heatmap of the final loadings estimate:

plot_heatmap(L_est_by_row, colors_range = c('blue','gray96','red'), brks = seq(-max(abs(L_est_by_row)), max(abs(L_est_by_row)), length.out = 50))

In this case, the method only returns two components.

Observations

In this setting, a couple of the methods return only a few of the components. This is not entirely surprising to me given the non-identifiability issues in the tree case. Interestingly, EBCD with stability selection did not return any factors. This may suggest that EBCD is more sensitive to the data input, or perhaps the data subsets are similar enough but the method is simply finding different representations of them.
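One quick way to probe this (a sketch, not something run here) would be to reuse the column-split EBCD fits from above, whose loadings are over the same samples, and look at the best cross-subset similarity each component achieves; values well below the 0.99 threshold would be consistent with the two subsets yielding different representations.

# Best cross-subset match for each EBCD component from the first column-split
# fit, measured by cosine similarity of the loading columns.
unit_cols <- function(A) sweep(A, 2, sqrt(colSums(A^2)) + 1e-12, "/")
sims <- abs(crossprod(unit_cols(ebcd_fits_by_col[[1]]),
                      unit_cols(ebcd_fits_by_col[[2]])))
round(apply(sims, 1, max), 3)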


sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pheatmap_1.0.12 ggplot2_3.5.2   dplyr_1.1.4     workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1      viridisLite_0.4.2     farver_2.1.2         
 [4] fastmap_1.2.0         lazyeval_0.2.2        codesymnmf_0.0.0.9000
 [7] promises_1.3.3        digest_0.6.37         lifecycle_1.0.4      
[10] processx_3.8.4        invgamma_1.1          magrittr_2.0.3       
[13] compiler_4.3.2        rlang_1.1.6           sass_0.4.10          
[16] progress_1.2.3        tools_4.3.2           yaml_2.3.10          
[19] data.table_1.17.6     knitr_1.50            prettyunits_1.2.0    
[22] htmlwidgets_1.6.4     scatterplot3d_0.3-44  RColorBrewer_1.1-3   
[25] Rtsne_0.17            withr_3.0.2           purrr_1.0.4          
[28] flashier_1.0.56       grid_4.3.2            git2r_0.33.0         
[31] fastTopics_0.6-192    colorspace_2.1-1      scales_1.4.0         
[34] gtools_3.9.5          cli_3.6.5             rmarkdown_2.29       
[37] crayon_1.5.3          generics_0.1.4        RcppParallel_5.1.10  
[40] rstudioapi_0.16.0     httr_1.4.7            pbapply_1.7-2        
[43] cachem_1.1.0          stringr_1.5.1         splines_4.3.2        
[46] parallel_4.3.2        softImpute_1.4-3      vctrs_0.6.5          
[49] Matrix_1.6-5          jsonlite_2.0.0        callr_3.7.6          
[52] hms_1.1.3             mixsqp_0.3-54         ggrepel_0.9.6        
[55] irlba_2.3.5.1         horseshoe_0.2.0       trust_0.1-8          
[58] plotly_4.11.0         jquerylib_0.1.4       tidyr_1.3.1          
[61] ebcd_0.0.0.9000       glue_1.8.0            ebnm_1.1-34          
[64] ps_1.7.7              cowplot_1.1.3         gbcd_0.2-17          
[67] uwot_0.2.3            stringi_1.8.7         Polychrome_1.5.1     
[70] gtable_0.3.6          later_1.4.2           quadprog_1.5-8       
[73] tibble_3.3.0          pillar_1.10.2         htmltools_0.5.8.1    
[76] truncnorm_1.0-9       R6_2.6.1              rprojroot_2.0.4      
[79] evaluate_1.0.4        lattice_0.22-6        RhpcBLASctl_0.23-42  
[82] SQUAREM_2021.1        ashr_2.2-66           httpuv_1.6.15        
[85] bslib_0.9.0           Rcpp_1.0.14           deconvolveR_1.2-1    
[88] whisker_0.4.1         xfun_0.52             fs_1.6.6             
[91] getPass_0.2-4         pkgconfig_2.0.3