Thread Match

Project Overview

ThreadMatch uses multi-modal semantic similarity search to find similar posts within a given subreddit. By combining the text content of posts with the text extracted from URLs mentioned in them, ThreadMatch produces enriched, more accurate search results. Video and image search is a work in progress.
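As a rough illustration of the idea, the sketch below combines a post-text embedding and a URL-text embedding into one composite vector. The `embed` function is a toy hashing stand-in for a real embedding model, and the weights are hypothetical, not taken from the project.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for a real text-embedding model (hashing trick).
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def composite_embedding(post_text: str, url_text: str,
                        w_post: float = 0.7, w_url: float = 0.3) -> np.ndarray:
    # Weighted combination of the two modalities, renormalized to unit length.
    # The weights here are illustrative assumptions.
    v = w_post * embed(post_text) + w_url * embed(url_text)
    n = np.linalg.norm(v)
    return v / n if n else v
```

Vectors built this way can then be indexed and queried with any nearest-neighbor index.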

Features

  • Data Pipeline: A Dockerized Apache Airflow instance triggers Apache Kafka jobs that stream Reddit data from specified subreddits.
  • Indexing: Processes the streamed data with PySpark to build a composite HNSW32 FAISS index, which is serialized and stored in MongoDB alongside its metadata.
  • Backend: A Dockerized Django application exposes RESTful endpoints for multi-modal semantic search over the FAISS index.
  • Frontend: A single-page React application styled with Tailwind CSS and served by an Nginx/Node container offers a responsive user interface.
  • Containerization: All components run in Docker containers, ensuring consistent and scalable deployments.

Technologies Used

  • Apache Airflow: Orchestrates the data pipeline.
  • Apache Kafka: Streams Reddit data.
  • PySpark: Processes data and builds the FAISS index.
  • FAISS: Provides efficient similarity search indexing.
  • MongoDB: Stores the FAISS index and metadata.
  • Django: Backend framework for RESTful endpoints.
  • React: Frontend framework for building the user interface.
  • Tailwind CSS: Utility-first CSS framework for styling.
  • Docker: Containerizes all components for consistent deployments.
  • Nginx and Node.js: Serve the frontend application.

Setup Instructions

Prerequisites

  • Docker
  • Docker Compose

Installation

  1. Clone the repository:

    git clone https://github.com/adenletchworth/ThreadMatch.git
    cd ThreadMatch
  2. Create a .env file in the root directory and configure your environment variables:

    AIRFLOW_UID=50000
  3. Create a .env.dev file in the root directory and configure your development environment variables:

    DEBUG=1
    SECRET_KEY=...
    DJANGO_ALLOWED_HOSTS=localhost 127.0.0.1 [::1]
  4. Configure dags/configs/kafka.py with the following settings:

    bootstrap_servers = 'kafka:9092'
    topic_name = 'reddit-posts'
    mongo_uri = 'mongodb://host.docker.internal:27017/'
    mongo_db_name = 'Reddit'
    mongo_collection_name = 'posts'
    mongo_collection_name_posts = 'posts'
    mongo_collection_name_index = 'faiss_index'
  5. Configure reddit.py with the following settings:

    client_id=...
    client_secret=...
    user_agent=...
    subreddits_list=...
  6. Start the services using Docker Compose:

    docker-compose up --build -d
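As a hedged sketch of how the streaming step might serialize posts for the 'reddit-posts' topic configured above (the field names and JSON message format are assumptions, not taken from the repository):

```python
import json

def serialize_post(post: dict) -> bytes:
    """Value serializer for the Kafka producer (assumption: JSON messages)."""
    return json.dumps(post).encode("utf-8")

# With kafka-python this would be wired up roughly as follows
# (requires the broker from docker-compose to be running):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="kafka:9092",
#                            value_serializer=serialize_post)
#   producer.send("reddit-posts",
#                 {"id": "abc123", "title": "...", "selftext": "..."})
```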

Usage

Access the Apache Airflow web interface at http://localhost:8080 to manage the data pipeline.
The Django backend can be accessed at http://localhost:8000.
The React frontend can be accessed at http://localhost:3001.
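For example, a search request against the Django backend might be built like this; the endpoint path and query-parameter names are assumptions, so check the project's URL configuration before relying on them:

```python
from urllib.parse import urlencode

def search_url(query: str, k: int = 5,
               base: str = "http://localhost:8000") -> str:
    # Hypothetical endpoint path and parameter names; verify against
    # the project's Django urls.py before use.
    return f"{base}/api/search/?" + urlencode({"q": query, "k": k})
```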

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.