Generalist E-Commerce model for testing data pipelines and project best practices
Updated Jun 28, 2024 · Jupyter Notebook
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework, created by Matei Zaharia and released on May 26, 2014. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
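A minimal PySpark sketch of that programming model (the app name and data are illustrative): work expressed as transformations on a distributed collection is parallelized across executors automatically, and lost partitions are recomputed from lineage on failure.

```python
from pyspark.sql import SparkSession

# Local session for experimenting; on a cluster this would target YARN or Kubernetes.
spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()

# A distributed collection: transformations on it are implicitly
# parallelized across executors, and failed partitions are recomputed.
numbers = spark.sparkContext.parallelize(range(1_000_000))
sum_of_squares = numbers.map(lambda x: x * x).sum()
print(sum_of_squares)

spark.stop()
```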
Big data with Apache Spark and Python.
Welcome to my self-learning journey!
Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for big-data analytics.
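To give a flavor of those practice tasks, here is a minimal MRJob word-count job, the canonical MapReduce starter exercise (the class name and file names are illustrative):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count word occurrences across the input files."""

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run it locally with `python word_count.py input.txt`, or against a Hadoop cluster by adding the `-r hadoop` runner flag.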
Related projects and tools that use Apache Spark:
📘 FIWARE 306: Real-time Processing of Context Data using Apache Spark
Open-source platform for the machine learning lifecycle
REST API for Apache Spark on K8S or YARN
This repo contains examples of high-throughput ingestion using Apache Spark and Apache Iceberg, covering IoT and CDC scenarios with best practices. The code can be deployed to any Spark-compatible engine, such as Amazon EMR Serverless or AWS Glue, and a fully local developer environment is also provided (a minimal ingestion sketch appears after this list).
👷🌇 Set up and build a big-data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), and Terraform to provision the infrastructure 🥊
Python package for working with demand-side grid projects, datasets and queries
Simple and Distributed Machine Learning
lakeFS - Data version control for your data lake | Git for data
The Proxima platform.
Experiment tracking server focused on speed and scalability
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
This repository contains the projects I have done in the past.
A dockerized application for performing sentiment analysis, tag recognition, and text summarization on YouTube videos.
Dataproc templates and pipelines for solving simple in-cloud data tasks
OAuth2/OIDC authentication filter for Apache Spark application and History Server UIs
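As referenced in the Spark-and-Iceberg item above, a minimal batch append into an Iceberg table might look like the following sketch. The catalog name, warehouse path, table name, and input path are all hypothetical, and the Iceberg Spark runtime JAR is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# "demo" is a hypothetical Hadoop-type Iceberg catalog backed by a local warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-ingest-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Read a batch of raw IoT events (input path is illustrative).
events = spark.read.json("/data/iot/events/")

# Append to the Iceberg table via the DataFrameWriterV2 API;
# use .create() instead of .append() the first time the table is written.
events.writeTo("demo.iot.events").append()

spark.stop()
```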