
Evaluation and Implementation of various Machine Learning models for creating a "Banking/Financial Transaction Fraud Prevention System"


khaamosh/Fraud-Detection-Machine-Learning-Model


AIM

  • Implement a ‘Transaction Fraud Prevention System’ that leverages machine learning models to predict whether a given financial transaction is ‘Fraudulent’ or ‘Valid’.

REPOSITORY FILE GUIDE

| File | Type | Description |
| --- | --- | --- |
| comp9417_final.py | Python script | Final script submitted for assessment. |
| comp9417-unsw.ipynb | Jupyter notebook | Initial exploratory data analysis + data preprocessing + Decision Tree model. |
| corr_pairs_sorted.csv | CSV | Correlation matrix of the transaction dataset features (sorted), used to select an optimal feature subset for learning and processing efficiency. |
| corr_pairs.csv | CSV | Correlation matrix of the transaction dataset features, used to select an optimal feature subset for learning and processing efficiency. |
| EDA.ipynb | Jupyter notebook | All exploratory data analysis of the transaction and identity train and test datasets. |
| mohitkhanna-comp9417.ipynb | Jupyter notebook | Data preprocessing + exploratory data analysis + Light Gradient Boosting Machine (LGBM), Random Forest, Decision Tree, Bernoulli Naive Bayes, and Extreme Gradient Boosting (XGBoost) models. |
| submission.csv | CSV | One of the submission files generated for the Kaggle competition. |

DATASET

TRANSACTION TABLE

  • TransactionDT: timedelta from a given reference datetime (not an actual timestamp).
  • TransactionAMT: transaction payment amount in USD.
  • (*) ProductCD: product code -> the product for each transaction. (categorical feature)
  • (*) [card1, card2, card3, card4, card5, card6]: payment card information, for example card type, card category, issuing bank, country, etc. (categorical feature)
  • (*) addr1: address. (categorical feature)
  • (*) addr2: address. (categorical feature)
  • dist: distance.
  • (*) P_emaildomain: Purchaser email domain. (categorical feature)
  • (*) R_emaildomain: Recipient email domain. (categorical feature)
  • [C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14]: the actual meaning is masked, but these can be thought of as counts, such as how many addresses are found to be associated with the payment card.
  • [D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15]: timedelta in days between the current and a previous transaction.
  • (*) [M1,M2,M3,M4,M5,M6,M7,M8,M9]: match such as names on card and address etc. (categorical feature)
  • Vxxx: Vesta engineered rich features such as ranking, counting and other entity relations.

IDENTITY TABLE

  • The field names are masked for privacy protection and contract agreement as part of Vesta's policies.

  • Most fields are related to identity information, such as network connection details.

CATEGORICAL FEATURES

  • DeviceType.
  • DeviceInfo.
  • id_12 - id_38.

Note: Credit to Vesta (Competition Host) for providing the above data description and details. Link: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203
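Before modelling, the transaction and identity tables are typically joined into a single training frame. A minimal loading sketch follows; the train_transaction.csv / train_identity.csv file names and the TransactionID join key follow the Kaggle competition data and are assumptions here, not something this README prescribes.

```python
import pandas as pd

# Assumed competition file names; adjust paths to the actual data location.
train_transaction = pd.read_csv("train_transaction.csv", index_col="TransactionID")
train_identity = pd.read_csv("train_identity.csv", index_col="TransactionID")

# Left join on TransactionID: not every transaction has identity information.
train = train_transaction.join(train_identity, how="left")
print(train.shape)
```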

IMPLEMENTATION

SUMMARY

  • After solving the class imbalance, leveraging feature selection and exploratory data analysis, we trained and tested the following models on the given data:

    1. Decision Tree: This was our baseline model.
    2. Bernoulli Naive Bayes.
    3. K-Nearest Neighbour.
    4. SVM: We could not obtain a conclusive result with the SVM.
    5. Random Forest.
    6. Light Gradient Boosting Machine (LGBM).
    7. Integrated Stacked Model.
  • The final model is an LGBM model with hyperparameter tuning, giving a Kaggle score of 0.92.

EDA

For the exploratory data analysis, please refer to Final_Report_COMP9417_Project.pdf and the EDA.ipynb file in this repository.

FEATURE ENGINEERING

SOLVING CLASS VARIABLE IMBALANCE USING SYNTHETIC MINORITY OVER-SAMPLING (SMOTE):

[Figure: class distribution before SMOTE]

The above image shows the original distribution of fraud vs. valid transactions.

The approaches considered to solve the class imbalance were minority over-sampling and majority under-sampling.

The majority under-sampling approach was rejected, since there is a possibility of losing important information.

We used the Synthetic Minority Over-sampling Technique (SMOTE); the details are described in the report and the notebook.
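A minimal SMOTE sketch using imbalanced-learn, assuming X_train / y_train are the already-encoded training features and the isFraud target (the random_state is illustrative, not taken from the notebook):

```python
from imblearn.over_sampling import SMOTE

# Over-sample the minority (fraud) class to match the majority class.
smote = SMOTE(random_state=0)  # random_state chosen here for illustration
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Compare class counts before and after resampling.
print(y_train.value_counts())
print(y_resampled.value_counts())
```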

RESULT AFTER SMOTE

[Figure: class distribution after SMOTE]

Since the dataset has over 400 features, we used the correlation matrix and the graphs created during exploratory data analysis to narrow the dataset down to its most relevant features. As part of this process, we also used sklearn's RFECV for recursive feature elimination to obtain an optimal feature subset.

| Feature Selection | Parameters |
| --- | --- |
| RFECV | BernoulliNB(), step = 15, scoring = 'roc_auc', cv = 5, verbose = 1, n_jobs = 3 |
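A sketch of the recursive feature elimination step with the parameters listed above, assuming X_resampled / y_resampled come from the SMOTE step. Note that very recent scikit-learn versions removed coef_ from the naive Bayes classes, so an explicit importance_getter may be needed there; with the versions used at the time of this project the call below works as-is.

```python
from sklearn.feature_selection import RFECV
from sklearn.naive_bayes import BernoulliNB

selector = RFECV(
    estimator=BernoulliNB(),
    step=15,            # drop 15 features per elimination round
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_jobs=3,
)
selector.fit(X_resampled, y_resampled)

# Boolean mask of retained features.
print(selector.support_.sum(), "features selected")
```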

MACHINE LEARNING MODELS

Decision Tree:

| Model/Scenario | Parameters | Kaggle Score |
| --- | --- | --- |
| Label encoding; features/columns with 50% or more null values removed; class variable balanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.69 |
| One-hot encoding; features/columns with 90% or more null values removed; class variable balanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.70 |
| Label encoding; features/columns with 90% or more null values removed; class variable imbalanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.72 |
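A minimal sketch of the baseline decision tree with the parameters from the table above, assuming preprocessed feature matrices X_train / X_test and target y_train (balanced or not, depending on the scenario):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    random_state=0,
    criterion="entropy",
    max_depth=8,
    splitter="best",
    min_samples_split=30,
)
clf.fit(X_train, y_train)

# Kaggle scores the predicted probability of fraud (ROC AUC).
fraud_proba = clf.predict_proba(X_test)[:, 1]
```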

Bernoulli Naive Bayes (see the notebook for details):

The class variable is balanced and grid search is used to fine-tune the hyperparameters.

| Model/Scenario | Kaggle Score |
| --- | --- |
| Class variable imbalanced | 0.50 |
| Class variable balanced, no parameter tuning | 0.63 |
| Grid search and feature selection | 0.75 |

Grid search parameters:

| Parameter | Value(s) |
| --- | --- |
| alpha | [0.001, 0.01, 0.1, 1] |
| fit_prior | [True] |
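A sketch of the grid search over the Bernoulli Naive Bayes parameters listed above; the cross-validation settings (cv = 5, scoring = 'roc_auc') are illustrative assumptions, and X_resampled / y_resampled are the SMOTE-balanced training data.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1],
    "fit_prior": [True],
}
search = GridSearchCV(BernoulliNB(), param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_resampled, y_resampled)

print(search.best_params_, search.best_score_)
```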

K Nearest Neighbour

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_neighbors = 3, metric = 'minkowski' with p = 2 | 0.50 |
| n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.50 |
| n_neighbors = 7, metric = 'minkowski' with p = 2 | 0.50 |

SelectKBest (Sklearn)

| Parameter | Value(s) |
| --- | --- |
| score_func | f_classif |
| k | 20 |

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.67 |
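A sketch of univariate feature selection followed by KNN, matching the SelectKBest settings above; wrapping both steps in a Pipeline is an illustrative choice, and X_resampled / y_resampled / X_test are assumed names for the prepared data.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

knn_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)),
])
knn_pipeline.fit(X_resampled, y_resampled)

fraud_proba = knn_pipeline.predict_proba(X_test)[:, 1]
```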

SUPPORT VECTOR MACHINE

We could not get a conclusive answer for the SVM.

RANDOM FOREST

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_estimators = 100 | 0.85 |
| n_estimators = 500, random_state = 10, max_depth = 20 | 0.82 |
| n_estimators = 1000, random_state = 200, bootstrap = False, max_depth = 5 | 0.86 |
| n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.88 |
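A sketch of the best-scoring random forest configuration from the table above; n_jobs is an illustrative addition, and the training/test variable names are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=1000,
    random_state=121,
    min_samples_split=2,
    bootstrap=False,
    max_depth=5,
    n_jobs=-1,  # parallelism added here for illustration
)
rf.fit(X_resampled, y_resampled)

fraud_proba = rf.predict_proba(X_test)[:, 1]
```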

LIGHT GRADIENT BOOSTING MACHINE (LGBM)

| Hyperparameters | Kaggle Score |
| --- | --- |
| objective = 'binary', n_estimators = 300, learning_rate = 0.1, subsample = 0.8 | 0.84 |
| objective = 'binary', n_estimators = 200, learning_rate = 0.1 | 0.83 |
| objective = 'binary', n_estimators = 500, learning_rate = 0.1 | 0.87 |
| objective = 'binary', n_estimators = 500, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.89 |
| objective = 'binary', n_estimators = 600, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.90 |
| objective = 'binary', n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92 |
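A sketch of the final LGBM configuration (last row of the table above), assuming the same X_resampled / y_resampled / X_test variables as in the earlier sketches:

```python
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    objective="binary",
    n_estimators=700,
    learning_rate=0.1,
    num_leaves=50,
    max_depth=7,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=108,
)
lgbm.fit(X_resampled, y_resampled)

# Probability of the positive (fraud) class for the Kaggle submission.
fraud_proba = lgbm.predict_proba(X_test)[:, 1]
```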

INTEGRATED STACKED MODEL

| Base Models | Kaggle Score |
| --- | --- |
| Decision Tree + K-Nearest Neighbour + Light Gradient Boosting Machine + Random Forest + Bernoulli Naive Bayes | 0.78 |
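A minimal sketch of how the five base models could be combined. The repository does not specify the stacking implementation here, so sklearn's StackingClassifier and the logistic-regression meta-learner below are assumptions, and the base-model hyperparameters are abbreviated for illustration.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=8, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lgbm", LGBMClassifier(objective="binary", n_estimators=700, random_state=108)),
        ("rf", RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=121)),
        ("bnb", BernoulliNB(alpha=0.01)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
    n_jobs=-1,
)
stack.fit(X_resampled, y_resampled)
```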

RESULTS

| Model | Parameters | Kaggle Score |
| --- | --- | --- |
| Decision Tree | random_state = 0, criterion = 'entropy', max_depth = 30, splitter = 'best', min_samples_split = 30 | 0.70 |
| Bernoulli Naive Bayes | alpha = 0.01, fit_prior = True | 0.75 |
| K-Nearest Neighbour | n_neighbors = 5, metric = 'minkowski' with p = 2 (with SelectKBest, k = 20) | 0.67 |
| Random Forest | n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.87 |
| Light Gradient Boosting Machine | objective = 'binary', n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92 |
| Integrated Stacked Model | Decision Tree + Naive Bayes + K-Nearest Neighbour + Random Forest + Light Gradient Boosting Machine | 0.77 |

  • The Light Gradient Boosting Machine was chosen as the final model, with a final prediction score of 0.92.

CONTRIBUTORS