Generalist E-Commerce model for testing data pipelines and project best practices
Updated Jun 28, 2024 · Jupyter Notebook
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework, created by Matei Zaharia and released on May 26, 2014. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
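A minimal PySpark sketch of that programming model (the app name and data are illustrative): work expressed as transformations on a distributed collection is parallelized across executors automatically, and lost partitions are recomputed from lineage on failure.

```python
from pyspark.sql import SparkSession

# Local session for experimenting; on a cluster this would target YARN or Kubernetes.
spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()

# A distributed collection: transformations on it are implicitly
# parallelized across executors, and failed partitions are recomputed.
numbers = spark.sparkContext.parallelize(range(1_000_000))
sum_of_squares = numbers.map(lambda x: x * x).sum()
print(sum_of_squares)

spark.stop()
```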
Big data with Apache Spark and Python.
Welcome to my self-learning journey!
Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for big-data analytics.
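To give a flavor of those practice tasks, here is a minimal MRJob word-count job, the canonical MapReduce starter exercise (the class name and file names are illustrative):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count word occurrences across the input files."""

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run it locally with `python word_count.py input.txt`, or against a Hadoop cluster by adding the `-r hadoop` runner flag.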
Related projects and tools that use Apache Spark:
📘 FIWARE 306: Real-time Processing of Context Data using Apache Spark
Open-source platform for the machine learning lifecycle
REST API for Apache Spark on K8S or YARN
This repo contains examples of high-throughput ingestion using Apache Spark and Apache Iceberg, covering IoT and CDC scenarios with best practices. The code can be deployed to any Spark-compatible engine, such as Amazon EMR Serverless or AWS Glue, and a fully local developer environment is also provided (a minimal ingestion sketch appears after this list).
👷🌇 Set up and build a big-data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), and Terraform to provision the infrastructure 🥊
Python package for working with demand-side grid projects, datasets and queries
Simple and Distributed Machine Learning
lakeFS - Data version control for your data lake | Git for data
The Proxima platform.
Experiment tracking server focused on speed and scalability
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
This repository contains the projects I have done in the past.
A dockerized application for performing sentiment analysis, tag recognition, and text summarization on YouTube videos.
Dataproc templates and pipelines for solving simple in-cloud data tasks
OAuth2/OIDC authentication filter for Apache Spark application and History Server UIs
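As referenced in the Spark-and-Iceberg item above, a minimal batch append into an Iceberg table might look like the following sketch. The catalog name, warehouse path, table name, and input path are all hypothetical, and the Iceberg Spark runtime JAR is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# "demo" is a hypothetical Hadoop-type Iceberg catalog backed by a local warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-ingest-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Read a batch of raw IoT events (input path is illustrative).
events = spark.read.json("/data/iot/events/")

# Append to the Iceberg table via the DataFrameWriterV2 API;
# use .create() instead of .append() the first time the table is written.
events.writeTo("demo.iot.events").append()

spark.stop()
```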