ThreadMatch uses multi-modal semantic similarity search to find similar posts within a given subreddit. By combining the text content of each post with the text from URLs mentioned in it, ThreadMatch returns richer and more accurate search results. Video and image search are a work in progress.
Architecture:
- Data Pipeline: Uses a Dockerized Apache Airflow instance to trigger Apache Kafka, which streams Reddit data from the specified subreddits.
- Indexing: Processes the streamed data using PySpark to create a composite HNSW32 FAISS index, serialized and stored in MongoDB with associated metadata.
- Backend: Provides RESTful endpoints with a Dockerized Django application for multi-modal semantic search on the FAISS index.
- Frontend: Implements a single-page React application using Tailwind CSS, served by an Nginx/Node container, offering a seamless and responsive user interface.
- Containerization: Leverages Docker to containerize all components, ensuring consistent and scalable deployments.
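
To make the search idea in the components above concrete, here is a minimal sketch under stated assumptions. It is not ThreadMatch's implementation: the vectors are toy hand-written values rather than model embeddings, concatenation is only one plausible way to build a "composite" vector from post text and URL text, and a brute-force cosine scan stands in for the FAISS HNSW index. The names `composite_vector` and `search` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def composite_vector(post_vec: Sequence[float], url_vec: Sequence[float]) -> List[float]:
    """Assumed composition: concatenate the normalized parts, then re-normalize."""
    return normalize(normalize(post_vec) + normalize(url_vec))


def search(query: Sequence[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force stand-in for an ANN index: top-k posts by cosine similarity."""
    scored = [
        (post_id, sum(q * v for q, v in zip(query, vec)))
        for post_id, vec in index.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data: each post has a 2-d "post text" vector and a 2-d "URL text" vector.
index = {
    "post_a": composite_vector([1.0, 0.0], [1.0, 0.0]),
    "post_b": composite_vector([0.0, 1.0], [0.0, 1.0]),
    "post_c": composite_vector([1.0, 0.0], [0.0, 1.0]),
}

query = composite_vector([0.9, 0.1], [1.0, 0.0])
results = search(query, index, k=2)
for post_id, score in results:
    print(post_id, round(score, 3))
```

In this toy data, post_a ranks first because both its post text and its URL text point in the same direction as the query, while post_c matches only on the post text. An approximate index such as HNSW trades a little recall for much faster lookups than this linear scan.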
Technologies used:
- Apache Airflow: Orchestrates the data pipeline.
- Apache Kafka: Streams Reddit data.
- PySpark: Processes the data and creates the FAISS index.
- FAISS: Provides efficient similarity search indexing.
- MongoDB: Stores the FAISS index and metadata.
- Django: Backend framework for RESTful endpoints.
- React: Frontend framework for building the user interface.
- Tailwind CSS: Utility-first CSS framework for styling.
- Docker: Containerizes all components for consistent deployments.
- Nginx and Node.js: Serve the frontend application.
Prerequisites:
- Docker
- Docker Compose
Setup:
- Clone the repository and change into its directory:
    git clone https://github.com/adenletchworth/ThreadMatch.git
    cd ThreadMatch
- Create a .env file in the root directory and configure your environment variables:
    AIRFLOW_UID=50000
- Create a .env.dev file in the root directory and configure your development environment variables:
    DEBUG=1
    SECRET_KEY=...
    DJANGO_ALLOWED_HOSTS=localhost 127.0.0.1 [::1]
- Configure dags/configs/kafka.py with the following settings:
    bootstrap_servers = 'kafka:9092'
    topic_name = 'reddit-posts'
    mongo_uri = 'mongodb://host.docker.internal:27017/'
    mongo_db_name = 'Reddit'
    mongo_collection_name = 'posts'
    mongo_collection_name_posts = 'posts'
    mongo_collection_name_index = 'faiss_index'
- Configure reddit.py with the following settings:
    client_id=...
    client_secret=...
    user_agent=...
    subreddits_list=...
- Start the services using Docker Compose:
    docker-compose up --build -d
Access the Apache Airflow web interface at http://localhost:8080 to manage the data pipeline.
The Django backend can be accessed at http://localhost:8000.
The React frontend can be accessed at http://localhost:3001.
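
Once the containers are running, a small standard-library script can confirm that the three endpoints above respond. This is only a sketch: `SERVICES` and `is_up` are hypothetical names that are not part of the repository, and the ports simply mirror the URLs listed above.

```python
import http.client
import urllib.error
import urllib.request

# Assumed endpoints, copied from the URLs listed above.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Django backend": "http://localhost:8000",
    "React frontend": "http://localhost:3001",
}


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example 403 or 404), so it is running.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or no usable HTTP response.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_up(url) else "down"
        print(f"{name:15} {url:25} {status}")
```

A service reported as "down" right after startup may still be initializing, so it can be worth re-running the check after a short wait before inspecting the container logs.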
Contributions are welcome! Please fork the repository and submit a pull request.