Inconsistent results between seeds #135

michaelplynch · 2024-06-12T10:28:33Z

Hi Stefano,
Flagging some interesting behaviour I've come across with the package.

Problem: The 1d credible intervals plot is showing different results in terms of rank, credible interval and FDR significance when run with different seeds. I see the "EXPERIMENTAL ALGORITHM" message, is this a byproduct? I'm working off sccomp v1.8 from the most recent Bioconductor release.
I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot. Is this also a bug or have I misinterpreted the plot?

Dataset: Working with a SingleCellExperiment object, happy to share privately if you want to replicate.
Code:
`set.seed(3)
sccomp_result =
sce_pm |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample_id,
.cell_group = celltype_num,
bimodal_mean_variability_association = TRUE,
cores = 4,
max_sampling_iterations = 200000
) |>
sccomp_remove_outliers(cores = 1) |> # Optional
sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
plots$boxplot
plots$credible_intervals_1D

set.seed(1)
sccomp_result =
sce_pm |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample_id,
.cell_group = celltype_num,
bimodal_mean_variability_association = TRUE,
cores = 4,
max_sampling_iterations = 200000
) |>
sccomp_remove_outliers(cores = 1) |> # Optional
sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
plots$boxplot
plots$credible_intervals_1D`
Output:

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Red Hat Enterprise Linux 9.1 (Plow)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Dublin
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base

other attached packages:
[1] sccomp_1.8.0
[2] ggplotify_0.1.2
[3] dittoSeq_1.16.0
[4] Seurat_5.1.0
[5] SeuratObject_5.0.2
[6] sp_2.1-4
[7] SCpubr_2.0.2
[8] ggpubr_0.6.0
[9] ggplot2_3.5.1
[10] dplyr_1.1.4
[11] edgeR_4.2.0
[12] limma_3.60.2
[13] SingleCellExperiment_1.26.0
[14] SummarizedExperiment_1.34.0
[15] Biobase_2.64.0
[16] GenomicRanges_1.56.0
[17] GenomeInfoDb_1.40.1
[18] IRanges_2.38.0
[19] S4Vectors_0.42.0
[20] BiocGenerics_0.50.0
[21] MatrixGenerics_1.16.0
[22] matrixStats_1.3.0
[23] ccrccprimarymetsanalysis_0.0.0.9000
[24] devtools_2.4.5
[25] usethis_2.2.3

loaded via a namespace (and not attached):
[1] fs_1.6.4 spatstat.sparse_3.0-3
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] profvis_0.3.8 tools_4.4.0
[7] sctransform_0.4.1 backports_1.5.0
[9] utf8_1.2.4 R6_2.5.1
[11] ResidualMatrix_1.14.0 lazyeval_0.2.2
[13] uwot_0.2.2 urlchecker_1.0.1
[15] withr_3.0.0 gridExtra_2.3
[17] progressr_0.14.0 cli_3.6.2
[19] spatstat.explore_3.2-7 fastDummies_1.7.3
[21] labeling_0.4.3 spatstat.data_3.0-4
[23] readr_2.1.5 ggridges_0.5.6
[25] pbapply_1.7-2 QuickJSR_1.2.2
[27] yulab.utils_0.1.4 StanHeaders_2.32.9
[29] parallelly_1.37.1 sessioninfo_1.2.2
[31] rstudioapi_0.16.0 generics_0.1.3
[33] gridGraphics_0.5-1 ica_1.0-3
[35] spatstat.random_3.2-3 car_3.1-2
[37] inline_0.3.19 loo_2.7.0
[39] Matrix_1.7-0 fansi_1.0.6
[41] abind_1.4-5 lifecycle_1.0.4
[43] yaml_2.3.8 carData_3.0-5
[45] SparseArray_1.4.8 Rtsne_0.17
[47] grid_4.4.0 promises_1.3.0
[49] dqrng_0.4.1 crayon_1.5.2
[51] miniUI_0.1.1.1 lattice_0.22-6
[53] beachmat_2.20.0 cowplot_1.1.3
[55] demuxmix_1.6.0 pillar_1.9.0
[57] knitr_1.47 metapod_1.12.0
[59] boot_1.3-30 future.apply_1.11.2
[61] codetools_0.2-20 leiden_0.4.3.1
[63] glue_1.7.0 data.table_1.15.4
[65] remotes_2.5.0 vctrs_0.6.5
[67] png_0.1-8 spam_2.10-0
[69] gtable_0.3.5 assertthat_0.2.1
[71] cachem_1.1.0 xfun_0.44
[73] S4Arrays_1.4.1 mime_0.12
[75] survival_3.5-8 pheatmap_1.0.12
[77] statmod_1.5.0 bluster_1.14.0
[79] ellipsis_0.3.2 fitdistrplus_1.1-11
[81] ROCR_1.0-11 nlme_3.1-164
[83] RcppAnnoy_0.0.22 rstan_2.32.6
[85] irlba_2.3.5.1 KernSmooth_2.23-22
[87] colorspace_2.1-0 tidyselect_1.2.1
[89] compiler_4.4.0 BiocNeighbors_1.22.0
[91] DelayedArray_0.30.1 plotly_4.10.4
[93] scales_1.3.0 lmtest_0.9-40
[95] stringr_1.5.1 digest_0.6.35
[97] goftest_1.2-3 spatstat.utils_3.0-4
[99] rmarkdown_2.27 XVector_0.44.0
[101] htmltools_0.5.8.1 pkgconfig_2.0.3
[103] sparseMatrixStats_1.16.0 fastmap_1.2.0
[105] rlang_1.1.4 htmlwidgets_1.6.4
[107] UCSC.utils_1.0.0 shiny_1.8.1.1
[109] DelayedMatrixStats_1.26.0 farver_2.1.2
[111] zoo_1.8-12 jsonlite_1.8.8
[113] BiocParallel_1.38.0 BiocSingular_1.20.0
[115] magrittr_2.0.3 scuttle_1.14.0
[117] GenomeInfoDbData_1.2.12 dotCall64_1.1-1
[119] patchwork_1.2.0 munsell_0.5.1
[121] Rcpp_1.0.12 viridis_0.6.5
[123] reticulate_1.37.0 stringi_1.8.4
[125] zlibbioc_1.50.0 MASS_7.3-60.2
[127] plyr_1.8.9 pkgbuild_1.4.4
[129] parallel_4.4.0 listenv_0.9.1
[131] ggrepel_0.9.5 forcats_1.0.0
[133] deldir_2.0-4 splines_4.4.0
[135] tensor_1.5 hms_1.1.3
[137] locfit_1.5-9.9 igraph_2.0.3
[139] spatstat.geom_3.2-9 ggsignif_0.6.4
[141] RcppHNSW_0.6.0 reshape2_1.4.4
[143] ScaledMatrix_1.12.0 pkgload_1.3.4
[145] rstantools_2.4.0 evaluate_0.24.0
[147] BiocManager_1.30.23 RcppParallel_5.1.7
[149] scran_1.32.0 tzdb_0.4.0
[151] httpuv_1.6.15 batchelor_1.20.0
[153] RANN_2.6.1 tidyr_1.3.1
[155] purrr_1.0.2 polyclip_1.10-6
[157] future_1.33.2 scattermore_1.2
[159] rsvd_1.0.5 broom_1.0.6
[161] xtable_1.8-4 RSpectra_0.16-1
[163] rstatix_0.7.2 later_1.3.2
[165] viridisLite_0.4.2 tibble_3.2.1
[167] memoise_2.0.1 cluster_2.1.6
[169] globals_0.16.3

Thanks!

stemangiola · 2024-06-12T12:08:43Z

Yes please share sce_pm at [email protected]

Thanks.

stemangiola · 2024-06-12T12:09:59Z

I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot

I'll have a look tomorrow morning.

stemangiola · 2024-06-17T13:02:22Z

Hello, please try the branch https://github.com/MangiolaLaboratory/sccomp/tree/allow-vb-output-samples

I have allowed sampling iterations > 1000 for vb

Also I implemented the more modern backend with the fast and very reliable pathfinder, please try it out as well

https://github.com/MangiolaLaboratory/sccomp/tree/cmdstanr

stemangiola · 2024-06-18T02:18:15Z

VB with custom number of draws

full-bayes HMC

set.seed(3)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot


plots$credible_intervals_1D + plots2$credible_intervals_1D

Now the new cmdstanr backend (updated now, please reinstall), using the fast pathfinder in this case (HMC is always available)

set.seed(3)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot


plots$credible_intervals_1D + plots2$credible_intervals_1D

stemangiola · 2024-06-18T04:37:32Z

Having tested this a little, I would say, use

variational_bayes = FALSE,

it is the gold standard and gives the most consistent results.

I will investigate further the variational strategy.

michaelplynch · 2024-06-18T15:44:57Z

Hi Stefano,
Can you confirm that it's the cmdstanr branch I should be reinstalling on? On Windows I get an install error similar to your GHA. On Linux oddly enough it will install but errors with "Further attempt with Variational Bayes: Error: Model not compiled. Try running the compile() method first."

stemangiola · 2024-06-19T07:16:22Z

You can

use the master branch but using variational_bayes = FALSE. A great option, that you should always use if you can (a bit slower).
install cmdstan branch and use inference_method = "pathfinder" (very fast and accurate), and do clear_stan_model_cache() before executing, if you have compile() method first problem. (not sure how you can have it if you install it from scratch, some feedback is helpful on your experience, so to improve this branch). With cmdstanr branch of course you can still use inference_method = "hmc", the gold standard.

michaelplynch · 2024-06-21T15:09:23Z

Thanks Stefano.

Assuming you mean variational_inference=F (rather than variational_bayes=F). variational_bayes=F returns an unused argument error.

For master branch, variational_inference=F looks more consistent.
For cmdstanr branch, inference_method="pathfinder" also looks more consistent.

Other feedback:
I'm not sure was the pathfinder implementation much faster than HMC on the master branch in this case but I haven't formally tested this.
For cmdstanr branch with HMC e.g.:
sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC" ) |> sccomp_remove_outliers() |> # Optional sccomp_test()
and
sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC", variational_inference = FALSE ) |> sccomp_remove_outliers() |> # Optional sccomp_test()
return errors Error in vb_iterative(mod, output_samples = output_samples, iter = 10000, : sccomp says: variational Bayes did not converge after 5 attempts. Please use variational_inference = FALSE for a HMC fitting. and Error in fit2$num_chains() : attempt to apply non-function, at least with this dataset.

stemangiola linked a pull request Jun 17, 2024 that will close this issue

allow custom output samples for vb #137

Merged

stemangiola closed this as completed in #137 Jun 18, 2024

stemangiola reopened this Jun 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent results between seeds #135

Inconsistent results between seeds #135

michaelplynch commented Jun 12, 2024 •

edited

Loading

stemangiola commented Jun 12, 2024

stemangiola commented Jun 12, 2024

stemangiola commented Jun 17, 2024

stemangiola commented Jun 18, 2024

stemangiola commented Jun 18, 2024

michaelplynch commented Jun 18, 2024

stemangiola commented Jun 19, 2024 •

edited

Loading

michaelplynch commented Jun 21, 2024

Inconsistent results between seeds #135

Inconsistent results between seeds #135

Comments

michaelplynch commented Jun 12, 2024 • edited Loading

stemangiola commented Jun 12, 2024

stemangiola commented Jun 12, 2024

stemangiola commented Jun 17, 2024

stemangiola commented Jun 18, 2024

stemangiola commented Jun 18, 2024

michaelplynch commented Jun 18, 2024

stemangiola commented Jun 19, 2024 • edited Loading

michaelplynch commented Jun 21, 2024

michaelplynch commented Jun 12, 2024 •

edited

Loading

stemangiola commented Jun 19, 2024 •

edited

Loading