
Evaluation and Implementation of various Machine Learning models for creating a "Banking/Financial Transaction Fraud Prevention System"


khaamosh/Fraud-Detection-Machine-Learning-Model


AIM

  • Implement a ‘Transaction Fraud Prevention System’ that leverages machine learning models to predict whether a given financial transaction is ‘Fraudulent’ or ‘Valid’.

REPOSITORY FILE GUIDE

| File | Type | Description |
| --- | --- | --- |
| comp9417_final.py | Python script | Final script submitted for assessment. |
| comp9417-unsw.ipynb | Jupyter notebook | Initial exploratory data analysis + data preprocessing + Decision Tree model. |
| corr_pairs_sorted.csv | CSV | Correlation matrix of the transaction dataset features (sorted), used to select an optimal feature subset for learning and processing efficiency. |
| corr_pairs.csv | CSV | Correlation matrix of the transaction dataset features, used to select an optimal feature subset for learning and processing efficiency. |
| EDA.ipynb | Jupyter notebook | All exploratory data analysis of the transaction and identity train and test datasets. |
| mohitkhanna-comp9417.ipynb | Jupyter notebook | Data preprocessing + exploratory data analysis + Light Gradient Boosting Machine (LGBM), Random Forest, Decision Tree, Bernoulli Naive Bayes, and Extreme Gradient Boosting (XGBoost) models. |
| submission.csv | CSV | One of the submission files generated for the Kaggle competition. |

DATASET

TRANSACTION TABLE

  • TransactionDT: timedelta from a given reference datetime (not an actual timestamp).
  • TransactionAMT: transaction payment amount in USD.
  • (*) ProductCD: product code -> the product for each transaction. (categorical feature)
  • (*) [card1, card2, card3, card4, card5, card6]: payment card information, for example card type, card category, issuing bank, country, etc. (categorical feature)
  • (*) addr1: address. (categorical feature)
  • (*) addr2: address. (categorical feature)
  • dist: distance.
  • (*) P_emaildomain: Purchaser email domain. (categorical feature)
  • (*) R_emaildomain: Recipient email domain. (categorical feature)
  • [C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14]: the actual meaning is masked, but these can be thought of as counts, such as how many addresses are found to be associated with the payment card.
  • [D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15]: timedelta in days between the current and a previous transaction.
  • (*) [M1,M2,M3,M4,M5,M6,M7,M8,M9]: match such as names on card and address etc. (categorical feature)
  • Vxxx: Vesta engineered rich features such as ranking, counting and other entity relations.

IDENTITY TABLE

  • The field names are masked for privacy protection and contract agreement as part of Vesta's policies.

  • Most fields are related to identity information, such as network connection details.

CATEGORICAL FEATURES

  • DeviceType.
  • DeviceInfo.
  • id_12 - id_38.

Note: Credit to Vesta (Competition Host) for providing the above data description and details. Link: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203
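Before modelling, the transaction and identity tables are typically joined into a single training frame. A minimal loading sketch follows; the train_transaction.csv / train_identity.csv file names and the TransactionID join key follow the Kaggle competition data and are assumptions here, not something this README prescribes.

```python
import pandas as pd

# Assumed competition file names; adjust paths to the actual data location.
train_transaction = pd.read_csv("train_transaction.csv", index_col="TransactionID")
train_identity = pd.read_csv("train_identity.csv", index_col="TransactionID")

# Left join on TransactionID: not every transaction has identity information.
train = train_transaction.join(train_identity, how="left")
print(train.shape)
```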

IMPLEMENTATION

SUMMARY

  • After solving the class imbalance, leveraging feature selection and exploratory data analysis, we trained and tested the following models on the given data:

    1. Decision Tree: This was our baseline model.
    2. Bernoulli Naive Bayes.
    3. K-Nearest Neighbour.
    4. SVM: We could not obtain a conclusive result with the SVM.
    5. Random Forest.
    6. Light Gradient Boosting Machine (LGBM).
    7. Integrated Stacked Model.
  • The final model is an LGBM model with hyperparameter tuning, giving a Kaggle score of 0.92.

EDA

For the exploratory data analysis, please refer to Final_Report_COMP9417_Project.pdf and the EDA.ipynb file in this repository.

FEATURE ENGINEERING

SOLVING CLASS VARIABLE IMBALANCE USING SYNTHETIC MINORITY OVER-SAMPLING (SMOTE):

[Figure: class distribution before SMOTE]

The above image shows the original distribution of fraud vs. valid transactions.

The approaches considered to solve the class imbalance were minority over-sampling and majority under-sampling.

The majority under-sampling approach was rejected, since there is a possibility of losing important information.

We used the Synthetic Minority Over-sampling Technique (SMOTE); the details are described in the report and the notebook.
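A minimal SMOTE sketch using imbalanced-learn, assuming X_train / y_train are the already-encoded training features and the isFraud target (the random_state is illustrative, not taken from the notebook):

```python
from imblearn.over_sampling import SMOTE

# Over-sample the minority (fraud) class to match the majority class.
smote = SMOTE(random_state=0)  # random_state chosen here for illustration
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Compare class counts before and after resampling.
print(y_train.value_counts())
print(y_resampled.value_counts())
```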

RESULT AFTER SMOTE

[Figure: class distribution after SMOTE]

Since the dataset has over 400 features, we used the correlation matrix and the graphs created during exploratory data analysis to narrow the dataset down to its most relevant features. As part of this process, we also used sklearn's RFECV for recursive feature elimination to obtain an optimal feature subset.

| Feature Selection | Parameters |
| --- | --- |
| RFECV | BernoulliNB(), step = 15, scoring = 'roc_auc', cv = 5, verbose = 1, n_jobs = 3 |
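A sketch of the recursive feature elimination step with the parameters listed above, assuming X_resampled / y_resampled come from the SMOTE step. Note that very recent scikit-learn versions removed coef_ from the naive Bayes classes, so an explicit importance_getter may be needed there; with the versions used at the time of this project the call below works as-is.

```python
from sklearn.feature_selection import RFECV
from sklearn.naive_bayes import BernoulliNB

selector = RFECV(
    estimator=BernoulliNB(),
    step=15,            # drop 15 features per elimination round
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_jobs=3,
)
selector.fit(X_resampled, y_resampled)

# Boolean mask of retained features.
print(selector.support_.sum(), "features selected")
```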

MACHINE LEARNING MODELS

Decision Tree:

| Model/Scenario | Parameters | Kaggle Score |
| --- | --- | --- |
| Label encoding; features/columns with 50% or more null values removed; class variable balanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.69 |
| One-hot encoding; features/columns with 90% or more null values removed; class variable balanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.70 |
| Label encoding; features/columns with 90% or more null values removed; class variable imbalanced | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.72 |
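A minimal sketch of the baseline decision tree with the parameters from the table above, assuming preprocessed feature matrices X_train / X_test and target y_train (balanced or not, depending on the scenario):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    random_state=0,
    criterion="entropy",
    max_depth=8,
    splitter="best",
    min_samples_split=30,
)
clf.fit(X_train, y_train)

# Kaggle scores the predicted probability of fraud (ROC AUC).
fraud_proba = clf.predict_proba(X_test)[:, 1]
```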

Bernoulli Naive Bayes (see the notebook for details):

The class variable is balanced and grid search is used to fine-tune the hyperparameters.

| Model/Scenario | Kaggle Score |
| --- | --- |
| Class variable imbalanced | 0.50 |
| Class variable balanced, no parameter tuning | 0.63 |
| Grid search and feature selection | 0.75 |

Grid search parameters:

| Parameter | Value(s) |
| --- | --- |
| alpha | [0.001, 0.01, 0.1, 1] |
| fit_prior | [True] |
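A sketch of the grid search over the Bernoulli Naive Bayes parameters listed above; the cross-validation settings (cv = 5, scoring = 'roc_auc') are illustrative assumptions, and X_resampled / y_resampled are the SMOTE-balanced training data.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1],
    "fit_prior": [True],
}
search = GridSearchCV(BernoulliNB(), param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_resampled, y_resampled)

print(search.best_params_, search.best_score_)
```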

K Nearest Neighbour

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_neighbors = 3, metric = 'minkowski' with p = 2 | 0.50 |
| n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.50 |
| n_neighbors = 7, metric = 'minkowski' with p = 2 | 0.50 |

SelectKBest (Sklearn)

| Parameter | Value(s) |
| --- | --- |
| score_func | f_classif |
| k | 20 |

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.67 |
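A sketch of univariate feature selection followed by KNN, matching the SelectKBest settings above; wrapping both steps in a Pipeline is an illustrative choice, and X_resampled / y_resampled / X_test are assumed names for the prepared data.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

knn_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)),
])
knn_pipeline.fit(X_resampled, y_resampled)

fraud_proba = knn_pipeline.predict_proba(X_test)[:, 1]
```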

SUPPORT VECTOR MACHINE

We could not get a conclusive answer for the SVM.

RANDOM FOREST

| Hyperparameters | Kaggle Score |
| --- | --- |
| n_estimators = 100 | 0.85 |
| n_estimators = 500, random_state = 10, max_depth = 20 | 0.82 |
| n_estimators = 1000, random_state = 200, bootstrap = False, max_depth = 5 | 0.86 |
| n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.88 |
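A sketch of the best-scoring random forest configuration from the table above; n_jobs is an illustrative addition, and the training/test variable names are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=1000,
    random_state=121,
    min_samples_split=2,
    bootstrap=False,
    max_depth=5,
    n_jobs=-1,  # parallelism added here for illustration
)
rf.fit(X_resampled, y_resampled)

fraud_proba = rf.predict_proba(X_test)[:, 1]
```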

LIGHT GRADIENT BOOSTING MACHINE (LGBM)

| Hyperparameters | Kaggle Score |
| --- | --- |
| objective = 'binary', n_estimators = 300, learning_rate = 0.1, subsample = 0.8 | 0.84 |
| objective = 'binary', n_estimators = 200, learning_rate = 0.1 | 0.83 |
| objective = 'binary', n_estimators = 500, learning_rate = 0.1 | 0.87 |
| objective = 'binary', n_estimators = 500, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.89 |
| objective = 'binary', n_estimators = 600, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.90 |
| objective = 'binary', n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92 |
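A sketch of the final LGBM configuration (last row of the table above), assuming the same X_resampled / y_resampled / X_test variables as in the earlier sketches:

```python
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    objective="binary",
    n_estimators=700,
    learning_rate=0.1,
    num_leaves=50,
    max_depth=7,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=108,
)
lgbm.fit(X_resampled, y_resampled)

# Probability of the positive (fraud) class for the Kaggle submission.
fraud_proba = lgbm.predict_proba(X_test)[:, 1]
```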

INTEGRATED STACKED MODEL

| Base Models | Kaggle Score |
| --- | --- |
| Decision Tree + K-Nearest Neighbour + Light Gradient Boosting Machine + Random Forest + Bernoulli Naive Bayes | 0.78 |
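A minimal sketch of how the five base models could be combined. The repository does not specify the stacking implementation here, so sklearn's StackingClassifier and the logistic-regression meta-learner below are assumptions, and the base-model hyperparameters are abbreviated for illustration.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=8, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lgbm", LGBMClassifier(objective="binary", n_estimators=700, random_state=108)),
        ("rf", RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=121)),
        ("bnb", BernoulliNB(alpha=0.01)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
    n_jobs=-1,
)
stack.fit(X_resampled, y_resampled)
```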

RESULTS

| Model | Parameters | Kaggle Score |
| --- | --- | --- |
| Decision Tree | random_state = 0, criterion = 'entropy', max_depth = 30, splitter = 'best', min_samples_split = 30 | 0.70 |
| Bernoulli Naive Bayes | alpha = 0.01, fit_prior = True | 0.75 |
| K-Nearest Neighbour | n_neighbors = 5, metric = 'minkowski' with p = 2 (with SelectKBest, k = 20) | 0.67 |
| Random Forest | n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.87 |
| Light Gradient Boosting Machine | objective = 'binary', n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92 |
| Integrated Stacked Model | Decision Tree + Naive Bayes + K-Nearest Neighbour + Random Forest + Light Gradient Boosting Machine | 0.77 |

  • The Light Gradient Boosting Machine was chosen as the final model, with a final prediction score of 0.92.

CONTRIBUTORS