Semantic embedding-based system for question answering from PDFs with visual analysis tools.

AvaAvarai/EmbedQA

EmbedQA

EmbedQA is a semantic question-answering system for PDFs. It extracts text from PDF files, generates an embedding for each sentence, and uses those embeddings for semantic search and question answering. Embeddings are generated with sentence-transformers and answers with transformers. The current embedding model is allenai/scibert_scivocab_uncased.

Overview

  1. PDF Extraction: Load and extract text from PDF files.
  2. Text Cleaning and Splitting: Clean and split the extracted text into semantically useful sentences.
  3. Embedding Generation: Generate embeddings for the cleaned sentences using a pre-trained model.
  4. Similarity Search: Find the top k most similar sentences to a given question.
  5. Contextual Response Generation: Combine the top k similar sentences and their neighboring sentences to form a context for generating a response using a question-answering model.
  6. Graphical User Interface: Provide a GUI for user interaction, allowing users to ask questions and get responses based on the processed PDF content.
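Steps 4 and 5 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function names (`cosine`, `top_k`, `build_context`), the `window` parameter, and the hand-made 3-dimensional vectors are all assumptions made for the example. Real embeddings would come from the embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, sentence_vecs, k=3):
    """Indices of the k sentence vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(sentence_vecs)),
        key=lambda i: cosine(query_vec, sentence_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

def build_context(sentences, hit_indices, window=1):
    """Join each hit with `window` neighbors on each side, in document order, without repeats."""
    keep = set()
    for i in hit_indices:
        keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))

# Toy data: hand-made vectors stand in for real sentence embeddings (hypothetical values).
sentences = ["Cats purr.", "Dogs bark.", "UMAP reduces dimensions.", "PCA is linear.", "Fish swim."]
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2], [0, 0, 1]]
query = [0, 1, 0.1]

hits = top_k(query, vectors, k=2)
context = build_context(sentences, hits, window=1)
print(hits)
print(context)
```

The combined context string is what would be handed to the question-answering model together with the user's question.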

Current Status

The project has been rebuilt and remains a work in progress (WIP). Some features described below, such as the chat interface, are not yet implemented in the new version.

Installation

  1. Clone the repository:

    git clone https://github.com/AvaAvarai/EmbedQA.git
    cd EmbedQA
  2. Install the required packages:

    pip install -r requirements.txt

Usage

Running the Application

  1. Scrape PDF Content:

    • Run the script to scrape PDF content.
    python src/scrape.py
  2. Clean the Extracted Text:

    • Run the script to clean the extracted text.
    python src/extract_clean_text.py
  3. Generate Embeddings:

    • Run the script to generate embeddings for the cleaned text.
    python src/create_embeddings.py
  4. Visualize the Embeddings:

    • Run the scripts to visualize the embeddings, either with dimensionality reduction or with parallel coordinates.
    python src/visualize_embeddings_in_DR.py  # For dimensionality reduction visualization
    python src/visualize_embeddings_in_PC.py  # For parallel coordinates visualization
  5. Chat Interface:

    • The chat interface is currently unimplemented in the new version.

Explanation of the Process

  1. PDF Extraction:

    • Load PDF files from the specified directory and extract text from each page using PyMuPDF.
  2. Text Cleaning and Splitting:

    • Clean the extracted text by removing HTML tags, non-ASCII characters, and multiple spaces.
    • Split the cleaned text into sentences and filter out sentences that are not semantically useful (e.g., too short, contain mostly numbers/symbols).
  3. Embedding Generation:

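    • Generate sentence embeddings using a sentence-transformers model. This step was written for nli-roberta-large; the introduction lists allenai/scibert_scivocab_uncased as the model currently in use.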
  4. Saving and Loading Embeddings:

    • Save the generated embeddings and their corresponding sentences to a CSV file.
    • Load the embeddings from the CSV file for similarity search.
  5. Similarity Search:

    • Embed the user query with the sentence-transformers model, then rank the stored sentences by cosine similarity to the query and keep the top k.
  6. Contextual Response Generation:

    • Extract the neighboring sentences for each of the top k similar sentences.
    • Combine these sentences to form a context.
    • Use the transformers model deepset/roberta-large-squad2 to generate a response based on the combined context.
  7. Graphical User Interface:

    • Provide a Tkinter-based GUI for users to input their questions and receive responses.
    • Display the top k most similar sentences and the generated response.
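Steps 2 and 4 of the process above can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the project's implementation: the regex replaces a real HTML parser such as beautifulsoup4, the sentence splitter is deliberately naive, and the thresholds (`min_words`, `min_alpha_ratio`), the function names, and the CSV column names are hypothetical.

```python
import csv
import io
import json
import re

def clean_text(raw):
    """Strip HTML tags, non-ASCII characters, and repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # regex stand-in for an HTML parser
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def is_useful(sentence, min_words=4, min_alpha_ratio=0.6):
    """Reject sentences that are too short or mostly digits/symbols."""
    if len(sentence.split()) < min_words:
        return False
    non_space = [c for c in sentence if not c.isspace()]
    letters = sum(c.isalpha() for c in non_space)
    return letters / len(non_space) >= min_alpha_ratio

def save_rows(sentences, vectors, handle):
    """Write one CSV row per sentence, with the embedding stored as a JSON list."""
    writer = csv.writer(handle)
    writer.writerow(["sentence", "embedding"])
    for sentence, vector in zip(sentences, vectors):
        writer.writerow([sentence, json.dumps(vector)])

def load_rows(handle):
    """Read back the sentences and embeddings written by save_rows."""
    rows = list(csv.DictReader(handle))
    return [r["sentence"] for r in rows], [json.loads(r["embedding"]) for r in rows]

raw = "<p>Embeddings  map sentences to vectors.</p> 12 34 56 78. Caf\u00e9 menus list several tasty dishes. Ok."
kept = [s for s in split_sentences(clean_text(raw)) if is_useful(s)]

# Round-trip through an in-memory CSV; the vectors are made-up placeholders.
buffer = io.StringIO()
save_rows(kept, [[0.1, 0.2], [0.3, 0.4]], buffer)
buffer.seek(0)
loaded_sentences, loaded_vectors = load_rows(buffer)
```

Storing each embedding as a JSON string in one CSV cell keeps the file readable by any CSV tool. A binary format would be more compact for large collections.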

Embeddings Generation for Non-NVIDIA Systems

To support systems without NVIDIA GPUs, we convert the PyTorch model to ONNX format and use ONNX Runtime with DirectML. This enables the use of hardware-accelerated inference on a broader range of devices, including those with AMD or Intel GPUs.
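As a hedged sketch of the inference side, the snippet below opens an ONNX Runtime session that prefers DirectML and falls back to the CPU. It assumes the model has already been exported to ONNX and that the `onnxruntime-directml` package is installed. The helper names `pick_providers` and `open_session` and the file name `model.onnx` are hypothetical.

```python
# Provider names as used by ONNX Runtime; DirectML first, CPU as fallback.
PREFERRED = ("DmlExecutionProvider", "CPUExecutionProvider")

def pick_providers(available):
    """Keep the preferred providers that this ONNX Runtime build offers."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]

def open_session(model_path):
    """Create an inference session for an already-exported ONNX model."""
    # Imported lazily so pick_providers can be used without ONNX Runtime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)

# Example (hypothetical path, requires onnxruntime to be installed):
# session = open_session("model.onnx")
# print(session.get_providers())
```

Listing the CPU provider last means the same script still runs on machines where DirectML is unavailable, only more slowly.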

Subprograms for Clustering and Visualization

Clustering and Dimensionality Reduction

We use several techniques to explore clusters of semantic similarity within the embeddings:

  • PCA (Principal Component Analysis): Reduces the dimensionality of embeddings to three principal components for 3D visualization.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces the dimensionality to three components, focusing on preserving local similarities.
  • UMAP (Uniform Manifold Approximation and Projection): A dimensionality reduction technique that tends to preserve more of the global structure than t-SNE.
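To make the PCA step concrete, here is a minimal sketch that projects embeddings onto three principal components. It computes PCA directly with a NumPy SVD rather than with scikit-learn, so it is a stand-in for whatever the project's scripts do. The function name `pca_3d` and the random input are assumptions for illustration.

```python
import numpy as np

def pca_3d(embeddings):
    """Project rows of `embeddings` onto their first three principal components."""
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:3].T
    # Share of total variance captured by each of the three components.
    explained = (singular_values[:3] ** 2) / np.sum(singular_values ** 2)
    return coords, explained

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 16))  # random stand-in for sentence embeddings
coords, explained = pca_3d(fake_embeddings)
print(coords.shape, explained)
```

The explained-variance shares give a quick check on how much structure the 3D view discards. Low values suggest that clusters seen in the plot should be read with caution.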

Visualization

  • 3D Plots: We create 3D scatter plots of the reduced embeddings using Plotly. These plots allow interactive exploration of the clusters, with additional features like hovering over points to see the associated sentences.
  • JavaScript Integration: Custom JavaScript is added to enable copying the text of a sentence to the clipboard when clicked.
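The hover behavior described above can be sketched with Plotly's `Scatter3d` trace. This is an illustrative outline under stated assumptions, not the project's plotting script: `wrap_hover` and `make_figure` are hypothetical names, and the clipboard JavaScript is not shown.

```python
import textwrap

def wrap_hover(sentence, width=60):
    """Break a long sentence into lines joined by <br>, which Plotly renders as line breaks."""
    return "<br>".join(textwrap.wrap(sentence, width))

def make_figure(coords, sentences):
    """Build a 3D scatter plot where hovering a point shows its sentence."""
    # Imported lazily so wrap_hover can be used without Plotly installed.
    import plotly.graph_objects as go

    trace = go.Scatter3d(
        x=[c[0] for c in coords],
        y=[c[1] for c in coords],
        z=[c[2] for c in coords],
        mode="markers",
        marker=dict(size=3),
        text=[wrap_hover(s) for s in sentences],
        hoverinfo="text",  # show only the sentence, not the raw coordinates
    )
    return go.Figure(data=[trace])

# Example (requires plotly; coords would be the reduced 3D embeddings):
# make_figure([[0, 0, 0], [1, 1, 1]], ["First sentence.", "Second sentence."]).show()
```

Wrapping the hover text keeps long sentences from stretching the tooltip across the whole plot.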

Tuning Hyperparameters with Dash

We provide a Dash application for tuning UMAP hyperparameters. This application allows users to adjust parameters such as the number of neighbors, minimum distance, and metric used for computing the UMAP embedding. The resulting embeddings are visualized in a 3D scatter plot, facilitating the exploration of different clustering behaviors.
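A tuning app of this kind could look roughly like the sketch below. It is a simplified outline, not the application shipped in this repository. The component ids, slider ranges, metric choices, and the helper `valid_umap_params` are assumptions. It expects `embeddings` as a NumPy array and requires `dash`, `plotly`, and `umap-learn`.

```python
def valid_umap_params(n_neighbors, min_dist, n_samples):
    """Clamp slider values to conservative ranges for a dataset of n_samples points."""
    n_neighbors = max(2, min(int(n_neighbors), n_samples - 1))
    min_dist = max(0.0, min(float(min_dist), 0.99))
    return n_neighbors, min_dist

def build_app(embeddings, sentences):
    """Return a Dash app that recomputes a 3D UMAP projection when a control changes."""
    # Imported lazily so valid_umap_params can be used without these packages.
    import plotly.graph_objects as go
    import umap
    from dash import Dash, Input, Output, dcc, html

    metrics = ("cosine", "euclidean", "manhattan")
    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Slider(id="n-neighbors", min=2, max=100, step=1, value=15),
        dcc.Slider(id="min-dist", min=0.0, max=0.99, step=0.01, value=0.1),
        dcc.Dropdown(
            id="metric",
            value="cosine",
            options=[{"label": m, "value": m} for m in metrics],
        ),
        dcc.Graph(id="plot"),
    ])

    @app.callback(
        Output("plot", "figure"),
        Input("n-neighbors", "value"),
        Input("min-dist", "value"),
        Input("metric", "value"),
    )
    def update(n_neighbors, min_dist, metric):
        n, d = valid_umap_params(n_neighbors, min_dist, len(embeddings))
        reducer = umap.UMAP(n_neighbors=n, min_dist=d, metric=metric, n_components=3)
        coords = reducer.fit_transform(embeddings)
        trace = go.Scatter3d(
            x=coords[:, 0], y=coords[:, 1], z=coords[:, 2],
            mode="markers", marker=dict(size=3),
            text=sentences, hoverinfo="text",
        )
        return go.Figure(data=[trace])

    return app

# Example: build_app(embeddings, sentences).run(debug=True)
# (older Dash releases use run_server instead of run)
```

Recomputing UMAP on every slider change is simple but slow for large collections. Caching results per parameter combination is one possible refinement.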

Screenshots

UMAP Visualization 1

[Image: UMAP 1]

UMAP Visualization 2: Subcluster of Sentences with German Words

[Image: UMAP 2]

Dependencies

  • PyMuPDF: For extracting text from PDF files.
  • sentence-transformers: For generating sentence embeddings.
  • transformers: For question answering.
  • nltk: For natural language processing tasks such as tokenization and stopword removal.
  • spacy: For advanced natural language processing tasks.
  • scikit-learn: For dimensionality reduction (PCA, t-SNE).
  • umap-learn: For UMAP dimensionality reduction.
  • matplotlib: For plotting embeddings.
  • plotly: For interactive plotting.
  • tkinter: For creating the graphical user interface (part of the Python standard library, so it is not installed through pip).
  • beautifulsoup4: For cleaning HTML tags from text.
  • dash: For building the hyperparameter tuning web app.

License

This project is licensed under the MIT License.
