02-airflow-gcs-bigquery-pipeline

This is a data pipeline built with Airflow DAGs. It ingests data from the endpoint https://datausa.io/api/data?drilldowns=Nation&measures=Population, saves the response as a JSON file, converts the JSON into CSV, uploads it to GCS, and finally loads it into the data warehouse as an external table. The flow of the DAG is shown below:

Workflow
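
To make the flow above concrete, below is a rough sketch of what such a DAG could look like. This is not the actual code in this repo: the DAG id, task names, file paths, bucket, and BigQuery dataset/table names are placeholder assumptions, and the operators require the Google provider package plus requests and pandas in the image.

    import json
    from datetime import datetime

    import pandas as pd
    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryCreateExternalTableOperator,
    )
    from airflow.providers.google.cloud.transfers.local_to_gcs import (
        LocalFilesystemToGCSOperator,
    )

    API_URL = "https://datausa.io/api/data?drilldowns=Nation&measures=Population"
    JSON_PATH = "/opt/airflow/data/population.json"   # placeholder path
    CSV_PATH = "/opt/airflow/data/population.csv"     # placeholder path
    BUCKET = "your-gcs-bucket"                        # placeholder bucket
    PROJECT_ID = "your-gcp-project-id"                # placeholder project


    def download_json() -> None:
        """Fetch the API response and save it as a JSON file."""
        response = requests.get(API_URL, timeout=60)
        response.raise_for_status()
        with open(JSON_PATH, "w") as f:
            json.dump(response.json(), f)


    def json_to_csv() -> None:
        """Flatten the 'data' array of the API response into a CSV file."""
        with open(JSON_PATH) as f:
            payload = json.load(f)
        pd.DataFrame(payload["data"]).to_csv(CSV_PATH, index=False)


    with DAG(
        dag_id="gcs_bigquery_pipeline",       # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        download = PythonOperator(task_id="download_json", python_callable=download_json)
        convert = PythonOperator(task_id="json_to_csv", python_callable=json_to_csv)
        upload = LocalFilesystemToGCSOperator(
            task_id="upload_to_gcs",
            src=CSV_PATH,
            dst="raw/population.csv",
            bucket=BUCKET,
        )
        create_table = BigQueryCreateExternalTableOperator(
            task_id="create_external_table",
            table_resource={
                "tableReference": {
                    "projectId": PROJECT_ID,
                    "datasetId": "population_data",     # hypothetical dataset
                    "tableId": "population_external",   # hypothetical table
                },
                "externalDataConfiguration": {
                    "sourceFormat": "CSV",
                    "sourceUris": [f"gs://{BUCKET}/raw/population.csv"],
                    "autodetect": True,
                },
            },
        )

        download >> convert >> upload >> create_table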

End Result

The end result of this pipeline is an external table created from the ingested data, which reads the metadata of the Parquet file and can be queried as well. Below is a screenshot of the result:
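
For illustration, the external table can then be queried like any other table, e.g. with the BigQuery Python client. The project, dataset, table, and column names below are placeholders/assumptions matching the sketch above, not values taken from this repo.

    from google.cloud import bigquery

    # Uses Application Default Credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS).
    client = bigquery.Client(project="your-gcp-project-id")  # placeholder project

    query = """
        SELECT Nation, Year, Population
        FROM `your-gcp-project-id.population_data.population_external`
        ORDER BY Year DESC
    """

    for row in client.query(query).result():
        print(row["Nation"], row["Year"], row["Population"])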

Concepts

Airflow Concepts and Architecture

Setup - Official Version

(For the section on the Custom/Lightweight setup, scroll down)

Setup

Airflow Setup with Docker, through official guidelines

Execution

  1. Build the image (only the first time, or when there's any change in the Dockerfile; takes ~15 mins for the first build):

    docker-compose build

    or (for legacy versions)

    docker build .
  2. Initialize the Airflow scheduler, DB, and other config:

    docker-compose up airflow-init
  3. Bring up all the services from the containers:

    docker-compose up
  4. In another terminal, run docker-compose ps to see which containers are up & running (there should be 7, matching the services in your docker-compose file).

  5. Log in to the Airflow web UI at localhost:8080 with the default creds: airflow/airflow

  6. Run your DAG on the Web Console.

  7. When you finish your run, or to shut down the containers:

    docker-compose down

    To stop and delete containers, delete volumes with database data, and remove downloaded images, run:

    docker-compose down --volumes --rmi all
    

    or

    docker-compose down --volumes --remove-orphans
    

Setup - Custom No-Frills Version (Lightweight)

This is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.

Setup

Airflow Setup with Docker, customized

Execution

  1. Stop and delete containers, delete volumes with database data, and remove downloaded images (from the previous setup):

    docker-compose down --volumes --rmi all

    or

    docker-compose down --volumes --remove-orphans

    Or, if you need to clear your system of any pre-cached Docker issues:

    docker system prune

    Also, empty the airflow logs directory.

  2. Build the image (only the first time, or when there's any change in the Dockerfile; takes ~5-10 mins for the first build):

    docker-compose build

    or (for legacy versions)

    docker build .

  3. Bring up all the services from the containers (no need to specially initialize):

    docker-compose -f docker-compose-nofrills.yml up

  4. In another terminal, run docker ps to see which containers are up & running (there should be 3, matching the services in your docker-compose file).

  5. Log in to the Airflow web UI at localhost:8080 with creds: admin/admin (explicit creation of an admin user was required).

  6. Run your DAG on the Web Console.

  7. When you finish your run, or to shut down the containers:

    docker-compose down

Setup - Taken from DE Zoomcamp 2.3.4 - Optional: Lightweight Local Setup for Airflow

Use the docker-compose_2.3.4.yaml file (and rename it to docker-compose.yaml). Don't forget to replace the variables GCP_PROJECT_ID and GCP_GCS_BUCKET.
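
Inside the DAG code, these values are then typically picked up from the environment, roughly like the sketch below (the fallback defaults are only placeholders):

    import os

    # Injected by docker-compose via the GCP_PROJECT_ID / GCP_GCS_BUCKET variables;
    # the fallbacks here are placeholders, not real values.
    PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-gcp-project-id")
    BUCKET = os.environ.get("GCP_GCS_BUCKET", "your-gcs-bucket")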

Future Enhancements

  • Deploy a self-hosted Airflow setup on a Kubernetes cluster, or use GCP's managed Airflow service (Cloud Composer)

References

For more info, check out the official docs.
