Detecting fraudulent consumer transactions through machine learning
When I turned 14, what I was most excited about was having this piece of plastic with my name on it that gave me the freedom to get stuff.
Yes, I’m talking about a credit card.
Going into the store with my parents, I always wondered how this piece of plastic could stand in for actual money (a.k.a. cash); to nine-year-old me, a credit card was essentially magic.
The average Canadian carries about 2 credit cards and transacts with them over 220 times a year. People love their plastic so much that they rack up roughly $22,000 of consumer debt every year.
With that level of usage, it's crucial to have systems in place to counter malicious use.
Credit card fraud, defined as the unauthorized use of consumer credit or the obtaining of goods without paying for them, costs consumers and banks over $180 billion.
Not to mention, conventional fraud-prevention services require large amounts of human, financial, and computing capital.
Using Machine Learning
Machine learning is a fast-growing subfield of artificial intelligence that focuses on recognizing (sometimes subtle) patterns in large sets of data.
Detecting credit card fraud, interestingly, can be described in the same way.
In order to mitigate malicious usage of consumer credit, a system needs to detect payment data that indicate fraudulent use among the ocean of all payment data.
Putting a human up to this task is wildly inefficient, but a computer, especially one equipped with a properly trained machine learning model, could prove to be an asset to merchants and lenders.
In this article, I walk you through the machine learning model I replicated, which can recognize fraudulent credit card transactions.
Our Data
The dataset is built from European credit card transactions made in September 2015; it contains 496 fraud-flagged items out of 281,906 total transactions. The dataset is highly unbalanced, with fraudulent transactions representing about 0.176% of all transactions in our data.
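To see what an imbalance like this looks like in code, here is a minimal sketch that measures the fraud ratio. It uses a synthetic label array with the counts described above standing in for the real dataset, which would normally be loaded from its CSV file:

```python
import numpy as np

# Synthetic stand-in for the real labels: 1 = fraud, 0 = legitimate.
# Counts mirror the imbalance described above.
rng = np.random.default_rng(seed=0)
labels = np.zeros(281_906, dtype=int)
fraud_idx = rng.choice(len(labels), size=496, replace=False)
labels[fraud_idx] = 1

fraud_ratio = labels.mean()
print(f"Fraudulent share of transactions: {fraud_ratio:.3%}")  # 0.176%
```

A ratio this small means a model that predicts "legitimate" for everything is already 99.8% accurate, which is why accuracy alone is a poor metric here.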
How do we correct an imbalance in the data?
- Oversampling: via SMOTE
- Undersampling: via the RandomUnderSampler
Approach 1: Oversampling
To oversample means to synthetically create data points of the class that is under-represented in our data set.
One technique is SMOTE: Synthetic Minority Over-sampling Technique.
At a high level, SMOTE creates synthetic observations of the minority class (in this case, our fraudulent transactions).
SMOTE conducts these steps:
- Find the k nearest neighbors of a minority-class observation (essentially, look for similar observations)
- Randomly choose one of those k nearest neighbors and use it to create a similar new observation (still modified randomly)
To learn more about the SMOTE technique, check this out.
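The two steps above can be sketched with plain NumPy. This is a toy interpolation to illustrate the idea, not the full SMOTE implementation from the imbalanced-learn library:

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate synthetic minority-class points by interpolating
    between a random minority point and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        # Step 1: find the k nearest neighbors of x (skip x itself at index 0).
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        # Step 2: pick one neighbor at random and interpolate between them.
        x_nn = minority[rng.choice(neighbors)]
        gap = rng.random()  # random position along the segment x -> x_nn
        new_points.append(x + gap * (x_nn - x))
    return np.array(new_points)

# Toy minority class: 20 points in 2-D.
rng = np.random.default_rng(1)
minority = rng.normal(size=(20, 2))
synthetic = smote_sample(minority)
print(synthetic.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new observations stay inside the region the minority class already occupies.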
Approach 2: Undersampling
Undersampling works by reducing the number of samples in the majority class.
One simple way of undersampling is to randomly select samples from the overrepresented class and omit them.
The RandomUnderSampler class from the imbalanced-learn library does exactly this: it randomly selects a subset of our data in the targeted classes to keep and discards the rest.
(A related method, ClusterCentroids, instead performs k-means clustering on the majority class and keeps only the cluster centroids.)
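Random undersampling can be sketched in a few lines of NumPy; this is the same idea that imbalanced-learn's RandomUnderSampler wraps in a scikit-learn-style API:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Drop random majority-class rows until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Keep a random subset of size n_min from each class.
        keep.extend(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(keep)
    return X[keep], y[keep]

# Toy data: 1000 "legitimate" rows, 10 "fraudulent" rows.
X = np.arange(1010).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 10)
X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # [10 10]
```

The trade-off is clear here: balancing 10 fraud cases against 1000 legitimate ones throws away 990 rows of real data, which is why undersampling is often combined with, or compared against, oversampling.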
First results
Our model uses a random forest classifier to predict fraudulent transactions.
Without doing anything to tackle the issue of imbalanced data, our model was able to achieve 100% precision for the negative (legitimate) class label.
With so few fraud cases, near-perfect precision on the majority class is almost guaranteed, so precision on the fraudulent class is the more telling number; considering both classes, the precision results are promising.
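As a sketch of the modeling step, here is a random forest trained and scored per class with scikit-learn. Synthetic data from `make_classification` stands in for the real transactions, so the exact numbers below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real transactions:
# ~2% of samples belong to the positive ("fraud") class.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Report precision separately for each class: with heavy imbalance,
# majority-class precision is near-perfect almost by default.
for cls in (0, 1):
    p = precision_score(y_te, pred, pos_label=cls)
    print(f"precision (class {cls}): {p:.3f}")
```

Reporting per-class precision (rather than a single averaged score) is what makes the imbalance problem visible in the results.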
Check out the GitHub repo with my code for this project: https://github.com/swaritd/mlfrauddetection/blob/master/main
And special thanks to Rafael for inspiration!