This project aims to perform feature extraction followed by PCA on large-scale data using Spark in the cloud.
Updated Mar 14, 2024 - Jupyter Notebook
With online sales gaining popularity, tech companies are exploring ways to improve their sales by analyzing customer behavior and gaining insights into product trends. In addition, such analysis helps websites make it easier for customers to find the products they need without much searching.
AWS Lambda function to send EMR events to Slack via SNS
A robust data pipeline leveraging Amazon EMR and PySpark, orchestrated seamlessly with Apache Airflow for efficient batch processing
Preventing churn is key to improving revenue for Sparkify, a subscription-based company (fictitious). This project is to analyze data from Sparkify to build a model to predict user churn. First, a sample dataset (128MB) was used on a local machine to explore relevant features and develop a working model. Then similar steps were used to develop a…
Parsing the common crawl database using Scala and Spark
Elastic Data Factory
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS using a Spotinst AWS MrScaler resource
Repo for playing around with an AWS Elastic MapReduce (EMR) cluster
Loads data from S3, processes it into analytics tables using Spark, and loads the tables back into S3. The Spark process is deployed on a cluster using AWS EMR.
In this project, we attempt to predict customer churn of a popular (not real) music service. We perform data analysis and machine learning model building on a large amount of data using Spark.
Creates a data lake by moving data held in an AWS S3 bucket to another S3 bucket after transforming it into tables based on a star schema.
Full code for UDACITY's Data Engineer Nano Degree project. Implementing a Data Lake in Amazon's cloud with AWS S3, AWS EMR and Spark.
Built a recommender system using the Apache Mahout machine learning library and carried out data analysis using Hadoop, Apache Hive, and Pig on the Amazon Customer Reviews dataset (130M+ reviews).
Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.