EmbedQA

EmbedQA is a semantic question-answering system for PDFs. It extracts sentences from PDF files, generates an embedding for each sentence, and uses those embeddings for semantic search and question answering. The system uses sentence-transformers to generate embeddings and transformers for question answering. Currently, embeddings are generated with allenai/scibert_scivocab_uncased.

Overview

PDF Extraction: Load and extract text from PDF files.
Text Cleaning and Splitting: Clean and split the extracted text into semantically useful sentences.
Embedding Generation: Generate embeddings for the cleaned sentences using a pre-trained model.
Similarity Search: Find the top k most similar sentences to a given question.
Contextual Response Generation: Combine the top k similar sentences and their neighboring sentences to form a context for generating a response using a question-answering model.
Graphical User Interface: Provide a GUI for user interaction, allowing users to ask questions and get responses based on the processed PDF content.

Current Status

The project has been rebuilt and will remain a work in progress (WIP) for some time.

Installation

Clone the repository:

```
git clone https://github.com/your-username/EmbedQA.git
cd EmbedQA
```

Install the required packages:
```
pip install -r requirements.txt
```

Usage

Running the Application

Scrape PDF Content:
- Run the script to scrape PDF content.
```
python src/scrape.py
```
Clean the Extracted Text:
- Run the script to clean the extracted text.
```
python src/extract_clean_text.py
```
Generate Embeddings:
- Run the script to generate embeddings for the cleaned text.
```
python src/create_embeddings.py
```

Visualize the Embeddings:

- Run the scripts to visualize the embeddings, either with dimensionality reduction or with parallel coordinates.

```
python src/visualize_embeddings_in_DR.py  # For dimensionality reduction visualization
python src/visualize_embeddings_in_PC.py  # For parallel coordinates visualization
```

Chat Interface:
- The chat interface is currently unimplemented in the new version.

Explanation of the Process

PDF Extraction:
- Load PDF files from the specified directory and extract text from each page using PyMuPDF.
Text Cleaning and Splitting:
- Clean the extracted text by removing HTML tags, non-ASCII characters, and multiple spaces.
- Split the cleaned text into sentences and filter out sentences that are not semantically useful (e.g., those that are too short or consist mostly of numbers or symbols).
Embedding Generation:
- Generate sentence embeddings using the sentence-transformers model nli-roberta-large.
Saving and Loading Embeddings:
- Save the generated embeddings and their corresponding sentences to a CSV file.
- Load the embeddings from the CSV file for similarity search.
Similarity Search:
- Use the sentence-transformers model to find the top k most similar sentences to the user query based on cosine similarity.
Contextual Response Generation:
- Extract the neighboring sentences for each of the top k similar sentences.
- Combine these sentences to form a context.
- Use the transformers model deepset/roberta-large-squad2 to generate a response based on the combined context.
Graphical User Interface:
- Provide a Tkinter-based GUI for users to input their questions and receive responses.
- Display the top k most similar sentences and the generated response.
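
As a rough illustration of the cleaning, similarity search, and context steps above, here is a minimal standard-library sketch. It is not the project's code: the function names, the thresholds (`min_words`, `min_letter_ratio`), and the `window` parameter are assumptions. The regex-based tag stripping and sentence splitting stand in for beautifulsoup4 and nltk/spacy, and plain Python lists stand in for real model embeddings.

```python
import html
import math
import re


def clean_text(raw: str) -> str:
    """Remove HTML tags, non-ASCII characters, and repeated whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag removal (assumption)
    text = re.sub(r"\s+", " ", text)            # also normalizes non-breaking spaces
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # drop non-ASCII characters
    return re.sub(r" {2,}", " ", text).strip()


def split_sentences(text: str, min_words: int = 4,
                    min_letter_ratio: float = 0.6) -> list[str]:
    """Split on sentence punctuation and keep sentences that look useful."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        letters = sum(c.isalpha() for c in sentence)
        visible = sum(not c.isspace() for c in sentence)
        if (len(sentence.split()) >= min_words and visible
                and letters / visible >= min_letter_ratio):
            kept.append(sentence)
    return kept


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], sentence_vecs: list[list[float]],
          k: int = 3) -> list[int]:
    """Indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(sentence_vecs)),
                    key=lambda i: cosine(query_vec, sentence_vecs[i]),
                    reverse=True)
    return ranked[:k]


def build_context(sentences: list[str], hits: list[int], window: int = 1) -> str:
    """Join each hit with its neighbors, in document order, without duplicates."""
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))
```

In the actual pipeline, the vectors would come from the embedding model, and the string returned by `build_context` would be passed to the question-answering model together with the user's question.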

Embeddings Generation for Non-NVIDIA Systems

To support systems without NVIDIA GPUs, we convert the PyTorch model to ONNX format and use ONNX Runtime with DirectML. This enables the use of hardware-accelerated inference on a broader range of devices, including those with AMD or Intel GPUs.
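
The following is a minimal sketch of how an already exported model might be loaded so that DirectML is used when it is available, with a CPU fallback. The helper names and the `model_path` argument are hypothetical, and the sketch assumes the `onnxruntime-directml` package is installed; it does not cover the PyTorch-to-ONNX conversion itself.

```python
# Execution providers in order of preference; DirectML first, CPU as fallback.
PREFERRED_PROVIDERS = ("DmlExecutionProvider", "CPUExecutionProvider")


def choose_providers(available):
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path):
    """Open an ONNX Runtime session for the exported model (hypothetical path)."""
    # Imported lazily so choose_providers can be used without ONNX Runtime.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing the CPU provider last means the same code should still run, more slowly, on machines where DirectML is not available.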

Subprograms for Clustering and Visualization

Clustering and Dimensionality Reduction

We use several techniques to explore clusters of semantic similarity within the embeddings:

PCA (Principal Component Analysis): Reduces the dimensionality of embeddings to three principal components for 3D visualization.
t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces the dimensionality to three components, focusing on preserving local similarities.
UMAP (Uniform Manifold Approximation and Projection): A dimensionality reduction technique that tends to preserve more of the global structure than t-SNE.
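
The three techniques share the same `fit_transform` interface, so switching between them can be sketched as a small factory. This is an illustrative sketch, not the code in `src/`: the function name `make_reducer`, the method keys, and the fixed `random_state` are assumptions, and it relies on scikit-learn and umap-learn being installed.

```python
SUPPORTED_METHODS = ("pca", "tsne", "umap")


def make_reducer(method: str, n_components: int = 3, random_state: int = 42):
    """Return an unfitted reducer that projects embeddings to n_components."""
    method = method.lower()
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {SUPPORTED_METHODS}")
    # Libraries are imported lazily so that only the chosen one is required.
    if method == "pca":
        from sklearn.decomposition import PCA
        return PCA(n_components=n_components, random_state=random_state)
    if method == "tsne":
        from sklearn.manifold import TSNE
        return TSNE(n_components=n_components, random_state=random_state)
    import umap
    return umap.UMAP(n_components=n_components, random_state=random_state)
```

Given an array `embeddings` of shape `(n_sentences, dim)`, `make_reducer("umap").fit_transform(embeddings)` would return one 3D point per sentence, ready for plotting.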

Visualization

3D Plots: We create 3D scatter plots of the reduced embeddings using Plotly. These plots allow interactive exploration of the clusters, with additional features like hovering over points to see the associated sentences.
JavaScript Integration: Custom JavaScript is added to enable copying the text of a sentence to the clipboard when clicked.

Tuning Hyperparameters with Dash

We provide a Dash application for tuning UMAP hyperparameters. This application allows users to adjust parameters such as the number of neighbors, minimum distance, and metric used for computing the UMAP embedding. The resulting embeddings are visualized in a 3D scatter plot, facilitating the exploration of different clustering behaviors.
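
A tuning app along these lines could look roughly like the sketch below. It assumes Dash 2.x, plotly, and umap-learn are installed; the component ids, slider ranges, the short `METRICS` list, and the `build_app` function are all hypothetical and are not taken from the project's source.

```python
METRICS = ["cosine", "euclidean", "manhattan"]  # assumed subset of UMAP metrics


def build_app(embeddings, sentences):
    """Return a Dash app that re-runs UMAP whenever a control changes."""
    # Imported lazily so this module loads without the plotting stack.
    import plotly.express as px
    import umap
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.Label("n_neighbors"),
        dcc.Slider(id="n-neighbors", min=2, max=100, step=1, value=15),
        html.Label("min_dist"),
        dcc.Slider(id="min-dist", min=0.0, max=0.99, step=0.01, value=0.1),
        html.Label("metric"),
        dcc.Dropdown(
            id="metric",
            options=[{"label": m, "value": m} for m in METRICS],
            value="cosine",
            clearable=False,
        ),
        dcc.Graph(id="scatter"),
    ])

    @app.callback(
        Output("scatter", "figure"),
        Input("n-neighbors", "value"),
        Input("min-dist", "value"),
        Input("metric", "value"),
    )
    def update(n_neighbors, min_dist, metric):
        # Recompute the 3D projection with the selected hyperparameters.
        reducer = umap.UMAP(n_components=3, n_neighbors=n_neighbors,
                            min_dist=min_dist, metric=metric)
        xyz = reducer.fit_transform(embeddings)
        # hover_name shows the sentence text when hovering over a point.
        return px.scatter_3d(x=xyz[:, 0], y=xyz[:, 1], z=xyz[:, 2],
                             hover_name=sentences)

    return app
```

On recent Dash releases, `build_app(embeddings, sentences).run(debug=True)` would start the app locally. Recomputing UMAP on every change keeps the sketch simple but can be slow for large sets of embeddings.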

Screenshots

UMAP Visualization 1

UMAP Visualization 2: Subcluster of Sentences with German Words

Dependencies

PyMuPDF: For extracting text from PDF files.
sentence-transformers: For generating sentence embeddings.
transformers: For question answering.
nltk: For natural language processing tasks such as tokenization and stopword removal.
spacy: For advanced natural language processing tasks.
scikit-learn: For dimensionality reduction (PCA, t-SNE).
umap-learn: For UMAP dimensionality reduction.
matplotlib: For plotting embeddings.
plotly: For interactive plotting.
tkinter: For creating the graphical user interface (included with most Python installations rather than installed via pip).
beautifulsoup4: For cleaning HTML tags from text.
dash: For building the hyperparameter tuning web app.

License

This project is licensed under the MIT License.
