Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results between seeds #135

Open
michaelplynch opened this issue Jun 12, 2024 · 8 comments · Fixed by #137
Open

Inconsistent results between seeds #135

michaelplynch opened this issue Jun 12, 2024 · 8 comments · Fixed by #137

Comments

@michaelplynch
Copy link

michaelplynch commented Jun 12, 2024

Hi Stefano,
Flagging some interesting behaviour I've come across with the package.

Problem: The 1d credible intervals plot is showing different results in terms of rank, credible interval and FDR significance when run with different seeds. I see the "EXPERIMENTAL ALGORITHM" message, is this a byproduct? I'm working off sccomp v1.8 from the most recent Bioconductor release.
I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot. Is this also a bug or have I misinterpreted the plot?

Dataset: Working with a SingleCellExperiment object, happy to share privately if you want to replicate.
Code:
`set.seed(3)
sccomp_result =
sce_pm |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample_id,
.cell_group = celltype_num,
bimodal_mean_variability_association = TRUE,
cores = 4,
max_sampling_iterations = 200000
) |>
sccomp_remove_outliers(cores = 1) |> # Optional
sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
plots$boxplot
plots$credible_intervals_1D

set.seed(1)
sccomp_result =
sce_pm |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample_id,
.cell_group = celltype_num,
bimodal_mean_variability_association = TRUE,
cores = 4,
max_sampling_iterations = 200000
) |>
sccomp_remove_outliers(cores = 1) |> # Optional
sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
plots$boxplot
plots$credible_intervals_1D`
Output:
image
image

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Red Hat Enterprise Linux 9.1 (Plow)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Dublin
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base

other attached packages:
[1] sccomp_1.8.0
[2] ggplotify_0.1.2
[3] dittoSeq_1.16.0
[4] Seurat_5.1.0
[5] SeuratObject_5.0.2
[6] sp_2.1-4
[7] SCpubr_2.0.2
[8] ggpubr_0.6.0
[9] ggplot2_3.5.1
[10] dplyr_1.1.4
[11] edgeR_4.2.0
[12] limma_3.60.2
[13] SingleCellExperiment_1.26.0
[14] SummarizedExperiment_1.34.0
[15] Biobase_2.64.0
[16] GenomicRanges_1.56.0
[17] GenomeInfoDb_1.40.1
[18] IRanges_2.38.0
[19] S4Vectors_0.42.0
[20] BiocGenerics_0.50.0
[21] MatrixGenerics_1.16.0
[22] matrixStats_1.3.0
[23] ccrccprimarymetsanalysis_0.0.0.9000
[24] devtools_2.4.5
[25] usethis_2.2.3

loaded via a namespace (and not attached):
[1] fs_1.6.4 spatstat.sparse_3.0-3
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] profvis_0.3.8 tools_4.4.0
[7] sctransform_0.4.1 backports_1.5.0
[9] utf8_1.2.4 R6_2.5.1
[11] ResidualMatrix_1.14.0 lazyeval_0.2.2
[13] uwot_0.2.2 urlchecker_1.0.1
[15] withr_3.0.0 gridExtra_2.3
[17] progressr_0.14.0 cli_3.6.2
[19] spatstat.explore_3.2-7 fastDummies_1.7.3
[21] labeling_0.4.3 spatstat.data_3.0-4
[23] readr_2.1.5 ggridges_0.5.6
[25] pbapply_1.7-2 QuickJSR_1.2.2
[27] yulab.utils_0.1.4 StanHeaders_2.32.9
[29] parallelly_1.37.1 sessioninfo_1.2.2
[31] rstudioapi_0.16.0 generics_0.1.3
[33] gridGraphics_0.5-1 ica_1.0-3
[35] spatstat.random_3.2-3 car_3.1-2
[37] inline_0.3.19 loo_2.7.0
[39] Matrix_1.7-0 fansi_1.0.6
[41] abind_1.4-5 lifecycle_1.0.4
[43] yaml_2.3.8 carData_3.0-5
[45] SparseArray_1.4.8 Rtsne_0.17
[47] grid_4.4.0 promises_1.3.0
[49] dqrng_0.4.1 crayon_1.5.2
[51] miniUI_0.1.1.1 lattice_0.22-6
[53] beachmat_2.20.0 cowplot_1.1.3
[55] demuxmix_1.6.0 pillar_1.9.0
[57] knitr_1.47 metapod_1.12.0
[59] boot_1.3-30 future.apply_1.11.2
[61] codetools_0.2-20 leiden_0.4.3.1
[63] glue_1.7.0 data.table_1.15.4
[65] remotes_2.5.0 vctrs_0.6.5
[67] png_0.1-8 spam_2.10-0
[69] gtable_0.3.5 assertthat_0.2.1
[71] cachem_1.1.0 xfun_0.44
[73] S4Arrays_1.4.1 mime_0.12
[75] survival_3.5-8 pheatmap_1.0.12
[77] statmod_1.5.0 bluster_1.14.0
[79] ellipsis_0.3.2 fitdistrplus_1.1-11
[81] ROCR_1.0-11 nlme_3.1-164
[83] RcppAnnoy_0.0.22 rstan_2.32.6
[85] irlba_2.3.5.1 KernSmooth_2.23-22
[87] colorspace_2.1-0 tidyselect_1.2.1
[89] compiler_4.4.0 BiocNeighbors_1.22.0
[91] DelayedArray_0.30.1 plotly_4.10.4
[93] scales_1.3.0 lmtest_0.9-40
[95] stringr_1.5.1 digest_0.6.35
[97] goftest_1.2-3 spatstat.utils_3.0-4
[99] rmarkdown_2.27 XVector_0.44.0
[101] htmltools_0.5.8.1 pkgconfig_2.0.3
[103] sparseMatrixStats_1.16.0 fastmap_1.2.0
[105] rlang_1.1.4 htmlwidgets_1.6.4
[107] UCSC.utils_1.0.0 shiny_1.8.1.1
[109] DelayedMatrixStats_1.26.0 farver_2.1.2
[111] zoo_1.8-12 jsonlite_1.8.8
[113] BiocParallel_1.38.0 BiocSingular_1.20.0
[115] magrittr_2.0.3 scuttle_1.14.0
[117] GenomeInfoDbData_1.2.12 dotCall64_1.1-1
[119] patchwork_1.2.0 munsell_0.5.1
[121] Rcpp_1.0.12 viridis_0.6.5
[123] reticulate_1.37.0 stringi_1.8.4
[125] zlibbioc_1.50.0 MASS_7.3-60.2
[127] plyr_1.8.9 pkgbuild_1.4.4
[129] parallel_4.4.0 listenv_0.9.1
[131] ggrepel_0.9.5 forcats_1.0.0
[133] deldir_2.0-4 splines_4.4.0
[135] tensor_1.5 hms_1.1.3
[137] locfit_1.5-9.9 igraph_2.0.3
[139] spatstat.geom_3.2-9 ggsignif_0.6.4
[141] RcppHNSW_0.6.0 reshape2_1.4.4
[143] ScaledMatrix_1.12.0 pkgload_1.3.4
[145] rstantools_2.4.0 evaluate_0.24.0
[147] BiocManager_1.30.23 RcppParallel_5.1.7
[149] scran_1.32.0 tzdb_0.4.0
[151] httpuv_1.6.15 batchelor_1.20.0
[153] RANN_2.6.1 tidyr_1.3.1
[155] purrr_1.0.2 polyclip_1.10-6
[157] future_1.33.2 scattermore_1.2
[159] rsvd_1.0.5 broom_1.0.6
[161] xtable_1.8-4 RSpectra_0.16-1
[163] rstatix_0.7.2 later_1.3.2
[165] viridisLite_0.4.2 tibble_3.2.1
[167] memoise_2.0.1 cluster_2.1.6
[169] globals_0.16.3

Thanks!

@stemangiola
Copy link
Collaborator

Yes please share sce_pm at [email protected]

Thanks.

@stemangiola
Copy link
Collaborator

I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot

I'll have a look tomorrow morning.

@stemangiola
Copy link
Collaborator

Hello, please try the branch https://github.com/MangiolaLaboratory/sccomp/tree/allow-vb-output-samples

I have allowed sampling iterations > 1000 for vb

Also I implemented the more modern backend with the fast and very reliable pathfinder, please try it out as well

https://github.com/MangiolaLaboratory/sccomp/tree/cmdstanr

@stemangiola stemangiola linked a pull request Jun 17, 2024 that will close this issue
@stemangiola
Copy link
Collaborator

VB with custom number of draws

image

full-bayes HMC

set.seed(3)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot


plots$credible_intervals_1D + plots2$credible_intervals_1D
image

Now the new cmdstanr backend (updated now, please reinstall), using the fast pathfinder in this case (HMC is always available)

set.seed(3)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot


plots$credible_intervals_1D + plots2$credible_intervals_1D
image

@stemangiola
Copy link
Collaborator

Having tested this a little, I would say, use

variational_bayes = FALSE,

it is the gold standard and gives the most consistent results.

I will investigate further the variational strategy.

@michaelplynch
Copy link
Author

Hi Stefano,
Can you confirm that it's the cmdstanr branch I should be reinstalling on? On Windows I get an install error similar to your GHA. On Linux oddly enough it will install but errors with "Further attempt with Variational Bayes: Error: Model not compiled. Try running the compile() method first."

@stemangiola
Copy link
Collaborator

stemangiola commented Jun 19, 2024

You can

  • use the master branch but using variational_bayes = FALSE. A great option, that you should always use if you can (a bit slower).

  • install cmdstan branch and use inference_method = "pathfinder" (very fast and accurate), and do clear_stan_model_cache() before executing, if you have compile() method first problem. (not sure how you can have it if you install it from scratch, some feedback is helpful on your experience, so to improve this branch). With cmdstanr branch of course you can still use inference_method = "hmc", the gold standard.

@michaelplynch
Copy link
Author

Thanks Stefano.

Assuming you mean variational_inference=F (rather than variational_bayes=F). variational_bayes=F returns an unused argument error.

For master branch, variational_inference=F looks more consistent.
For cmdstanr branch, inference_method="pathfinder" also looks more consistent.

Other feedback:
I'm not sure was the pathfinder implementation much faster than HMC on the master branch in this case but I haven't formally tested this.
For cmdstanr branch with HMC e.g.:
sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC" ) |> sccomp_remove_outliers() |> # Optional sccomp_test()
and
sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC", variational_inference = FALSE ) |> sccomp_remove_outliers() |> # Optional sccomp_test()
return errors Error in vb_iterative(mod, output_samples = output_samples, iter = 10000, : sccomp says: variational Bayes did not converge after 5 attempts. Please use variational_inference = FALSE for a HMC fitting. and Error in fit2$num_chains() : attempt to apply non-function, at least with this dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants