Optimizing fraud machine learning model performance in financial services

Mar 23, 2023

Fraud detection and prevention is a critical challenge that financial services institutions face. Traditional methods of detecting fraud tend to be reactive and can ultimately result in significant financial losses. With the rise of machine learning (ML) and artificial intelligence (AI), financial institutions are now able to proactively identify and prevent fraudulent activities. However, to train these ML models, financial institutions require large volumes of privacy-compliant data. In this blog post, we will explore how synthetic data can be used to improve machine learning model performance for detecting fraud cases.

Improving ML model performance for fraud detection & prevention

Machine learning models need a large amount of data to detect fraudulent activities. In general, the more representative data a model is trained on, the more accurately it can detect fraud. However, collecting such data can be a challenge for a few reasons, for example:

Incomplete data

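Fraud transaction data often contains sensitive information about customers, such as personally identifiable information (PII) and credit card details. These fields are often masked or removed before the data can be used for modeling, which leaves the training data incomplete.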

Data sharing

While fraud data must be stored in an organization’s database, data privacy regulations within the financial services industry make it very difficult to share that data between divisions.

Rare occurrences

Fraudulent transactions make up a very small percentage of real-life datasets (usually well under 1% of the data), so collecting enough examples to train fraud ML models can be difficult.

This is where synthetic data comes in.

Synthetic data generation can produce records that represent real-world fraud cases. It can be used to create datasets that resemble different types of fraud scenarios, which can help train machine learning models more accurately. By using synthetic data, financial institutions can generate large volumes of training data quickly and efficiently.

Creating complex scenarios for fraud detection

One significant advantage of synthetic data is its ability to generate new variations of fraud scenarios that do not appear in the real-world data. It can augment a wide range of fraud scenarios, and the augmented data can be used to train machine learning models to detect fraud more accurately. Because fraud cases are uncommon, fraudulent transactions represent only a small proportion of the overall dataset, and real-world data is often heavily class-imbalanced. One way of addressing this problem is to create a more balanced dataset, either by collecting more data (which is expensive) or by generating more data. Two ways of generating new data are Synthesized and the Synthetic Minority Over-sampling Technique (SMOTE).
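To make the balancing idea concrete, here is a minimal sketch of the arithmetic: how many synthetic fraud rows would be needed to reach a chosen fraud share. The function name `synthetic_rows_needed`, its parameters, and the dataset sizes are hypothetical and chosen only for illustration. The formula solves (n_fraud + x) / (n_fraud + n_legit + x) = target share for x.

```python
import math


def synthetic_rows_needed(n_fraud: int, n_legit: int, target_fraud_share: float) -> int:
    """Synthetic fraud rows needed so fraud makes up `target_fraud_share` of the data.

    Solves (n_fraud + x) / (n_fraud + n_legit + x) = target_fraud_share for x.
    """
    if not 0.0 < target_fraud_share < 1.0:
        raise ValueError("target_fraud_share must be strictly between 0 and 1")
    total = n_fraud + n_legit
    x = (target_fraud_share * total - n_fraud) / (1.0 - target_fraud_share)
    # Never return a negative count if the data already meets the target.
    return max(0, math.ceil(x))


# Hypothetical dataset: 500 fraud rows among 100,000 transactions (0.5% fraud).
rows_for_10_percent = synthetic_rows_needed(n_fraud=500, n_legit=99_500, target_fraud_share=0.10)
rows_for_balanced = synthetic_rows_needed(n_fraud=500, n_legit=99_500, target_fraud_share=0.50)
print(rows_for_10_percent, rows_for_balanced)
```

Under these assumed numbers, a fully balanced dataset needs roughly as many synthetic fraud rows as there are real non-fraud rows, which is why the quality of the generated rows matters so much.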

SMOTE is a common technique for generating synthetic tabular data that can be used to improve machine learning model performance for fraud detection in financial services institutions. The technique works by selecting a minority instance in the dataset (here, a fraudulent transaction), finding its nearest neighbors among the other minority instances, and generating synthetic data points by interpolating between the instance and those neighbors. The new synthetic data points are therefore similar to the original minority instances (see figures 1 & 2).
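The interpolation step can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than a production implementation (libraries such as imbalanced-learn provide one): it assumes purely numeric features that are already on comparable scales, and the function name `smote`, the example rows, and the two feature columns are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic points from `minority`, a list of numeric tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a minority (fraud) instance at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # 2. Find its k nearest minority neighbors by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # 3. Interpolate: base + gap * (neighbor - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical fraud rows: (transaction amount, hour of day).
fraud = [(120.0, 2.0), (135.0, 3.0), (980.0, 23.0), (1010.0, 22.0), (150.0, 1.0)]
new_rows = smote(fraud, n_new=10, k=2)
for row in new_rows:
    print(row)
```

Every generated row lies on a straight line segment between two existing fraud rows, so the output never leaves the region already covered by the real fraud examples.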

Academic studies have shown that SMOTE can be highly effective in improving the performance of fraud detection models in financial services institutions. For example, in their groundbreaking paper, Chawla et al. applied SMOTE to a credit card fraud dataset and compared the results to a baseline model that did not use any oversampling techniques. They found that the SMOTE-based model achieved significantly higher performance, with an AUC of 0.98 compared to the baseline model’s AUC of 0.88.
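For readers less familiar with the metric, AUC can be read as the probability that a model scores a randomly chosen fraud case higher than a randomly chosen non-fraud case. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the labels and scores are hypothetical and are not taken from the study cited above.

```python
def roc_auc(labels, scores):
    """AUC as the share of (fraud, non-fraud) pairs where fraud gets the higher score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one fraud and one non-fraud example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraud) and model scores, for illustration only.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.70, 0.60, 0.90]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of rows, so it suits small illustrations only. Unlike accuracy, this metric does not reward a model for simply predicting "not fraud" on every row, which is why it is commonly reported for imbalanced problems.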

SMOTE is now a tested and trusted technique that performs well in many situations. However, because SMOTE interpolates directly between existing fraud data points, it is limited in the new data it can create (leaving ‘holes’ where it has not sampled data). Synthesized learns the patterns and behaviors in the entire dataset, fraud and non-fraud alike. Learning what non-fraud data looks like allows Synthesized to generate more realistic fraud cases that are more than just direct interpolations between fraud points. Real-life fraud datasets typically have <1% fraud examples. Put simply, SMOTE uses only the information contained in that <1% of fraud data, whereas Synthesized can also use the information in the remaining 99% of non-fraud data to produce more realistic fraud examples.

With Synthesized, users can also specify the columns, categories, and ratios that are required for their models. By using synthetic data to create different fraud scenarios, financial institutions can ensure that their machine-learning models are trained on a diverse set of data.

The benefits of using synthetic data to train fraud machine learning models

The use of synthetic data to train fraud ML models has several benefits for financial services companies, including:

Cost savings

Rapid fraud detection is a multi-million dollar problem for certain banks. Using synthetic data can help reduce the costs associated with collecting and managing large volumes of real-world data. The process of collecting and preparing original data can be time-consuming and expensive. Synthetic data, on the other hand, can be generated quickly and efficiently. This can help financial institutions save time and money while still training their machine-learning models effectively. With the Synthesized SDK, a tier-1 U.S. bank saw improvements in its fraud ML model performance, with an estimated value impact of $5–6 million per annum.

Saving developer time

Using synthetic data to improve fraud ML models can save developers a significant amount of time. Typically, developers are required to collect a significant amount of real-world data to train their fraud ML models, which is a costly and time-consuming process. Synthetic data generation addresses this problem by creating high-quality, privacy-compliant, and balanced datasets that can be used to train fraud ML models. Developers can generate data to quickly iterate on and test their models, without relying on real-world data that may take months to access. Not only do developers save time, but fraud ML models can also be deployed faster, which could potentially save millions of dollars through fast and effective fraud detection.

Reducing the risk of costly data breaches

One of the issues associated with using real-world data to train fraud ML models is that the real data often contains sensitive information about customers. If this data were leaked, or shared in breach of data privacy laws, it could have significant ramifications for an organization and could potentially result in multi-million dollar fines. For example, in 2020 the Information Commissioner’s Office (ICO) fined British Airways £20m for a significant data incident that occurred over several months in 2018, resulting in the loss of personal data of over 400,000 staff and customers, including banking/payment information, names, and addresses. Synthesized mitigates this risk by adding privacy-preserving definitions to its synthetic data, which omit any PII from customer data, making it safer to use than real-world data.

But there can be limitations…

Current scalability

While financial services companies have large amounts of data stored, they may not be able to feed all of it into a synthetic data generator and so extract value from all of their datasets. Some original datasets are simply too large, as generators may currently be unable to handle big data.

Requiring real-world samples

Many different types of fraud scenarios can hurt financial services companies, and training fraud ML models to detect all of them is an important capability. However, a synthetic data generator needs real-world data that contains those scenarios before it can upsample instances of a particular type of fraud for ML model training. Without such examples, synthetic versions of that fraud scenario cannot be generated.

Regulation

Synthetic data is an evolving field, and as a result regulation in the space is continuously developing. Synthesized has been at the forefront of helping banks to ensure that the data practices they implement are safe and secure. Privacy regulations such as GDPR and CCPA can result in huge fines for any companies that breach their users’ data privacy, which can cause companies’ data practices to stagnate. The ICO can issue penalties of up to £17.5 million or 4% of the total annual worldwide turnover in the preceding financial year, whichever is higher. EU regulators can issue fines of up to €20 million or 4% of the firm’s worldwide annual revenue from the preceding financial year, whichever is higher. Thus, utilizing synthetic data can help companies comply with data privacy regulations and still extract value from their data.
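The "whichever is higher" rule is simple arithmetic, sketched below using only the caps quoted above. The function name `max_fine`, the constant names, and the turnover figures are hypothetical, and the sketch shows the upper bound only, not how a regulator would set an actual penalty.

```python
def max_fine(annual_worldwide_turnover: float, fixed_cap: float,
             turnover_share: float = 0.04) -> float:
    """Upper bound on a fine: the higher of a fixed cap and a share of turnover."""
    return max(fixed_cap, turnover_share * annual_worldwide_turnover)


UK_FIXED_CAP_GBP = 17_500_000  # fixed cap quoted above for the ICO
EU_FIXED_CAP_EUR = 20_000_000  # fixed cap quoted above for EU regulators

# Hypothetical firms: one with 1 billion and one with 100 million in turnover.
large_firm_cap = max_fine(1_000_000_000, UK_FIXED_CAP_GBP)
small_firm_cap = max_fine(100_000_000, UK_FIXED_CAP_GBP)
print(large_firm_cap, small_firm_cap)
```

For the larger hypothetical firm the 4% share exceeds the fixed cap, while for the smaller one the fixed cap applies.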

Conclusion

Synthetic data is a valuable tool for financial institutions looking to improve their machine-learning model performance for detecting fraud cases. It can help financial institutions generate large volumes of data quickly and efficiently, without compromising customer privacy. Synthetic data can generate representations of complex fraud scenarios, which can help train machine learning models more accurately. Moreover, it can help financial institutions reduce the costs associated with collecting and managing large volumes of real-world data. For senior data leaders, it is essential to consider the tools capable of improving machine learning model performance to detect fraudulent activities proactively.

Originally published at https://www.synthesized.io on March 23, 2023.