Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genes missing after SCTransform and integration - how best to set # genes retained #9018

Open
beazors opened this issue Jun 14, 2024 · 1 comment

Comments

@beazors
Copy link

beazors commented Jun 14, 2024

Hi,

I am trying to analyze a dataset that includes 10 single cell experiments and about 66,000 cells. The actual set of cells I'm interested in is a small group of those, of about 7,000 cells. With the old method (log normalization) I had created a global UMAP, pulled out the clusters of interest with known marker genes, then reclustered that smaller dataset.

However, I am finding that SCTransform and the downstream integration steps cause me to lose some genes of interest in my smaller dataset, presumably because they aren't in the top 2000/3000 variable genes in the global dataset (whatever the default is, I know it varies by command).

After using the following code, altering the number and then checking for my genes of interest in the features list, I know that I need to increase the # to 3500 for three of my most important missing genes to come back.

# select integration features and prep step
    features<- SelectIntegrationFeatures(pbmc.list,nfeatures = 3500,
                                         assay = NULL,
                                         verbose = TRUE,
                                         fvf.nfeatures = 3500)
    #checking presence of genes of interest in features
    c('Gene1','Gene2','Gene3') %in% features

Questions:

(1) Would you recommend that users take the above approach to ensure that the info about all their genes of interest are captured at the initial, global steps, if they want these genes retained for downstream analyses?

(2) Would it be better to take a mixed approach that might be less computationally expensive, such as the list of 2000 most variable + some genes that are of interest for the smaller subset? In this latter case, I would fear that genes which are important for my subset of cells (but not in my list of known genes of interest) would still be eliminated at the global analysis step.

(3) As a crude final alternative, would you recommend filtering each dataset first using some kind of rule like expression of Gene1>0, then running SCTransform on the filtered individual datasets and proceeding from there?

@joaldersey
Copy link

I am commenting with a similar issue. I used SCTransform and identified markers using FindAllMarkers, but not all genes are transformed and don't appear in the heatmap. There is not a method for transforming specific genes like there is for ScaleData. How can we get around this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants