Zeroshot Topic Modeling With no Embedding Model #2011

Open
amirarsalan90 opened this issue May 24, 2024 · 1 comment

amirarsalan90 commented May 24, 2024

Hello @MaartenGr, and thanks for the awesome BERTopic library! I want to perform zero-shot topic modeling without an embedding model. I used an external model to compute embeddings for both the documents and the zero-shot topic list, but I no longer have access to that model.

Is it possible to run something like this without an embedding model?

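import numpy as np
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

# `docs` and `zeroshot_topic_list` are lists of strings defined elsewhere.
# Placeholders for the precomputed 1024-dimensional embeddings.
zeroshot_topic_list_embeddings = np.random.rand(len(zeroshot_topic_list), 1024).astype(np.float32)
document_embeddings = np.random.rand(len(docs), 1024).astype(np.float32)

sim = 0.8
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = KeyBERTInspired(top_n_words=200)
topic_model = BERTopic(
    top_n_words=20,
    ctfidf_model=ctfidf_model,
    verbose=True,
    calculate_probabilities=True,
    embedding_model=None,  # no embedding model is available
    min_topic_size=200,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=sim,
    representation_model=representation_model,
)
# Pass the precomputed document embeddings instead of letting BERTopic compute them.
topics, probs = topic_model.fit_transform(docs, document_embeddings)
topics, probs = topic_model.transform(docs, document_embeddings)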

freq = topic_model.get_topic_info()

I think somewhere in the code BERTopic is still trying to use the embedding model.

MaartenGr (Owner) commented

> I think somewhere in the code BERTopic is still trying to use the embedding model.

That's correct! However, this is not caused by zero-shot topic modeling but by your use of KeyBERTInspired. That representation model embeds candidate words and compares them with the embeddings of a topic's representative documents to find the words that are semantically closest. As such, an embedding model is still needed for that particular representation model.
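
To make the dependency concrete, here is a minimal sketch of the kind of comparison described above. It is a simplified illustration in plain Python, not BERTopic's actual implementation: the function names `cosine` and `rank_words`, the averaging of document vectors into a single topic vector, and the toy 3-dimensional vectors are all assumptions made for the example. The point is that the word vectors must come from the same embedding model as the document vectors, so precomputed document embeddings alone are not enough for this step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_words(word_vectors, doc_vectors, top_n=2):
    """Rank candidate words by similarity to the mean representative-document vector."""
    dim = len(doc_vectors[0])
    # Average the representative documents into one topic vector (an assumption
    # of this sketch).
    topic_vector = [sum(d[i] for d in doc_vectors) / len(doc_vectors) for i in range(dim)]
    scored = sorted(
        word_vectors.items(),
        key=lambda item: cosine(item[1], topic_vector),
        reverse=True,
    )
    return [word for word, _ in scored[:top_n]]


# Toy vectors, made up for illustration. In practice the word vectors would have
# to be produced by the same embedding model that produced the document vectors.
doc_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
word_vectors = {
    "football": [1.0, 0.0, 0.0],
    "goal": [0.7, 0.3, 0.0],
    "recipe": [0.0, 0.1, 1.0],
}
top_words = rank_words(word_vectors, doc_vectors)
print(top_words)
```

Without a model to produce `word_vectors`, this ranking step cannot run, which is why a representation model of this kind cannot work with `embedding_model=None`.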
