This project demonstrates an end-to-end ETL (Extract, Transform, Load) pipeline that extracts data from an API (I used DummyJSON.com/products), transforms the data, and exposes it through a FastAPI endpoint. The entire solution is containerized using Docker for easy deployment.
If you intend to run the final solution, skip to the last two steps in the Table of Contents & follow the steps for "Pull Docker Container" & "Run Docker Container".
- Create a virtual environment:
  `python -m venv myenv`
- Activate the virtual environment:
  - Windows: `myenv\Scripts\activate`
- Create a project folder called `my_project` and navigate into it:
  `mkdir my_project`
  `cd my_project`
- Install Meltano:
  `pip install meltano`
- Initialize the Meltano project and navigate into it:
  `meltano init my_project`
  `cd my_project`
- Install the `tap-rest-api-msdk` extractor:
  `meltano add extractor tap-rest-api-msdk`
- Install the `target-csv` loader:
  `meltano add loader target-csv`
Configure the `tap-rest-api-msdk` extractor to pull data from the API by editing the `meltano.yml` file. This involves specifying the API URL and the necessary configuration to extract the data.
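As an illustration, the extractor section of `meltano.yml` might look roughly like the excerpt below. This is a hedged sketch, not the project's actual file: the stream name, `records_path`, and the exact option names are assumptions based on the tap-rest-api-msdk documentation and may differ by tap version.

```yaml
# Illustrative excerpt of meltano.yml (a sketch, not the project's exact configuration)
plugins:
  extractors:
    - name: tap-rest-api-msdk
      config:
        api_url: https://dummyjson.com        # base URL of the API
        streams:
          - name: products                    # stream name written out by the loader
            path: /products                   # endpoint appended to api_url
            records_path: "$.products[*]"     # JSONPath to the records in the response
```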
Create a transformation script (`transform_data.py`) to clean and process the data extracted from the API. This involves handling missing values, calculating new fields, and normalizing data. The following are the basic transformations I applied to the fetched data (a short sketch of the script follows the list):
- Extract and Flatten Tags:
  - Converted the nested JSON format in the `tags` column into a comma-separated string for easy readability and analysis.
- Calculate Discounted Price:
  - Created a new column `discounted_price` by applying the discount percentage to the original price. This gives the actual cost after discount.
- Aggregate Reviews:
  - Calculated the average rating from the `reviews` column by parsing the JSON data and computing the mean of all ratings, rounded to one decimal place.
  - Additionally, counted the number of reviews for each product to understand its popularity and feedback volume.
- Normalize Dimensions:
  - Combined the individual dimension columns (`dimensions_width`, `dimensions_height`, `dimensions_depth`) into a single `dimensions` column in the format `widthxheightxdepth`. This provides a compact and standardized view of the product dimensions.
- Determine Stock Status:
  - Added a `stock_status` column that categorizes products as 'Low Stock' if the stock is less than 10, and 'In Stock' otherwise. This helps in quick identification of products that are running low.
- Handle Missing Values:
  - Filled missing values in the `price` column with the mean price. This ensures that no product is left without a price, which is crucial for sales and analysis.
  - For numeric columns, replaced NaN values with 0, and for non-numeric columns, filled NaN values with the mode (most frequent value) of the respective column. This maintains the integrity of the dataset by avoiding gaps.
- Convert Dates to Datetime Format:
  - Converted the `meta_createdAt` and `meta_updatedAt` columns to datetime format, allowing for more accurate time-based analysis and operations.
- Categorize Products:
  - Introduced a new column `category_group` to group products into broader categories like 'Cosmetics' for beauty and makeup products, and 'Other' for the rest. This simplifies category-based filtering and analysis.
- Impute NaN Values:
  - To ensure the dataset is free from any missing values, replaced NaN values in numeric columns with 0 and in non-numeric columns with the most frequent value (mode). This ensures consistency and completeness in the data.

The transformed data is saved into a CSV file (`transformed_data.csv`).
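To make the steps above concrete, here is a partial sketch of what `transform_data.py` could look like. It covers only a few of the listed transformations, and the column names, file paths, and exact logic are assumptions inferred from this README rather than the project's actual code.

```python
# transform_data.py -- partial sketch of the transformations described above
# (column names and file paths are assumptions based on this README)
import pandas as pd

df = pd.read_csv("loaded_data/products.csv")

# Discounted price: apply the discount percentage to the original price
df["discounted_price"] = (df["price"] * (1 - df["discountPercentage"] / 100)).round(2)

# Stock status: 'Low Stock' below 10 units, otherwise 'In Stock'
df["stock_status"] = df["stock"].apply(lambda s: "Low Stock" if s < 10 else "In Stock")

# Normalize dimensions into a single "widthxheightxdepth" string
df["dimensions"] = (
    df["dimensions_width"].astype(str)
    + "x" + df["dimensions_height"].astype(str)
    + "x" + df["dimensions_depth"].astype(str)
)

# Fill missing prices with the mean price
df["price"] = df["price"].fillna(df["price"].mean())

# Convert metadata timestamps to datetime for time-based analysis
df["meta_createdAt"] = pd.to_datetime(df["meta_createdAt"])
df["meta_updatedAt"] = pd.to_datetime(df["meta_updatedAt"])

df.to_csv("loaded_data/transformed_data.csv", index=False)
```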
Create an API using FastAPI to serve the transformed data. The API reads the transformed CSV file and returns the data in JSON format.
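A minimal version of that API could look like the sketch below. The module name (`main.py`), the CSV path, and the use of pandas are assumptions; only the `/data` route and the JSON output are taken from this README.

```python
# main.py -- minimal sketch of the FastAPI serving layer
# (file name and CSV path are assumptions)
import json

import pandas as pd
from fastapi import FastAPI

app = FastAPI()

CSV_PATH = "loaded_data/transformed_data.csv"  # assumed location of the transformed file


@app.get("/data")
def get_data():
    # Read the transformed CSV and return the rows as JSON records.
    # Round-tripping through to_json keeps numpy types JSON-serializable.
    df = pd.read_csv(CSV_PATH)
    return json.loads(df.to_json(orient="records"))
```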
To streamline the execution of the entire ETL pipeline and the FastAPI application, a `run_etl.sh` script is created. This script performs the following steps (a sketch of the script follows the list):
- Create the loaded_data directory if it doesn't exist.
- Run the Meltano ETL pipeline to extract and load the data.
- Transform the data using the transformation script.
- Start the FastAPI application to serve the data.
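Such a script might look roughly like the sketch below. The exact Meltano invocation, the transformation entry point, and the uvicorn module name (`main:app`) are assumptions; adjust them to the actual project layout.

```bash
#!/bin/bash
set -e

# Create the loaded_data directory if it doesn't exist
mkdir -p loaded_data

# Run the Meltano pipeline to extract from the API and load to CSV
meltano run tap-rest-api-msdk target-csv

# Transform the extracted data
python transform_data.py

# Start the FastAPI application (module name is an assumption)
uvicorn main:app --host 0.0.0.0 --port 8000
```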
- Create a `Dockerfile` in your project directory (a minimal sketch is included after this list).
- Build the Docker image:
  `docker build -t swaviman/meltano_etl_project:latest .`
- Login to Docker Hub:
  `docker login`
- Tag your Docker image (only needed if it was built under a different local name):
  `docker tag <local_image_name> swaviman/meltano_etl_project:latest`
- Push the image:
  `docker push swaviman/meltano_etl_project:latest`
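For reference, a minimal `Dockerfile` for a setup like this might look like the sketch below. The base image, the presence of a `requirements.txt`, and the use of `run_etl.sh` as the entry point are all assumptions, not the project's actual Dockerfile.

```dockerfile
# Minimal Dockerfile sketch (assumptions: requirements.txt lists meltano, pandas,
# fastapi and uvicorn; run_etl.sh is the container entry point)
FROM python:3.10-slim

WORKDIR /app
COPY . .

# Install project dependencies
RUN pip install --no-cache-dir -r requirements.txt

# The FastAPI app is served on port 8000
EXPOSE 8000

CMD ["bash", "run_etl.sh"]
```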
Make sure Docker Desktop is up and running. To pull the Docker image from Docker Hub, run in the terminal:
`docker pull swaviman/meltano_etl_project:latest`
To run the Docker container:
`docker run -p 8000:8000 swaviman/meltano_etl_project:latest`
Visit http://127.0.0.1:8000/data in a browser window to view the result.
As part of possible future enhancements to this project, a few things can be done:
- Storing extracted files on the file system is neither ideal nor scalable, so a structured database could be used instead.
- Postgres or SQL Server could be used.
- Postgres is supported by dbt, which makes it a good fit for Meltano's dbt transformations.
- The database and the ETL module could be put in separate Docker containers using Docker Compose.
- Unit tests should be added to the project.