An introduction to synthetic data

Mar 16, 2023

In our previous blog post, we discussed generative AI, the challenges of making it actionable in the enterprise, and how enterprises can derive value from it going forward. As the world becomes increasingly digitized, the amount of data generated every day is growing exponentially. This data is a valuable resource for businesses, as it allows them to make informed decisions and improve their operations. However, collecting and managing large amounts of data for artificial intelligence (AI) and machine learning (ML) applications remains a significant challenge. This is where synthetic data comes in.

What is synthetic data?

Synthetic data is artificially generated data that algorithms create to mimic real-world data. It is a valuable tool for building large datasets for use cases such as machine learning model training. Synthetic data is generated according to a set of rules that govern the data’s distribution, structure, and relationships.
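As a rough illustration of what such rules can look like, the sketch below learns category frequencies and per-category amount statistics from a handful of made-up transactions, then samples new rows from them. The column layout, the toy data, and the helper names `fit_rules` and `sample_rows` are assumptions made for this example; real generators learn far richer rules.

```python
import random
import statistics
from collections import Counter, defaultdict


def fit_rules(rows):
    """Learn simple generation rules from (category, amount) rows."""
    counts = Counter(cat for cat, _ in rows)
    amounts = defaultdict(list)
    for cat, amount in rows:
        amounts[cat].append(amount)
    return {
        # Distribution: how often each category occurs.
        "categories": list(counts),
        "weights": [counts[c] for c in counts],
        # Relationship: amount statistics depend on the category.
        "amount_stats": {
            c: (statistics.mean(v), statistics.pstdev(v))
            for c, v in amounts.items()
        },
    }


def sample_rows(rules, n, seed=0):
    """Draw n synthetic (category, amount) rows from the learned rules."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cat = rng.choices(rules["categories"], weights=rules["weights"])[0]
        mean, std = rules["amount_stats"][cat]
        # Structure: amounts are never negative.
        rows.append((cat, max(0.0, rng.gauss(mean, std))))
    return rows


# Made-up "real" transactions, used only for illustration.
real = [
    ("grocery", 42.0), ("grocery", 55.5), ("grocery", 38.2),
    ("travel", 410.0), ("travel", 365.0),
]
rules = fit_rules(real)
synthetic = sample_rows(rules, 1000)
print(synthetic[:3])
```

The synthetic rows follow the category mix and amount ranges of the input without copying any particular row.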

An overview of the Synthesized Scientific Data Kit

At Synthesized, we developed the Synthesized Scientific Data Kit (SDK), a Python package that can be used to generate high-quality synthetic data for machine learning and data science tasks. After raw data is loaded into the SDK from a data source, the SDK trains its highly optimized generative models on this raw data. Once trained, these models can synthesize a new dataset with the same statistical properties and correlations present in the original data. While this synthetic data retains the same statistical information present in the original, it is free of sensitive personally identifiable information (PII). This enables users to obtain access to representative data for machine learning tasks in a matter of minutes.
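The load, train, and synthesize workflow described above can be sketched in miniature. This is not the SDK's API: the functions `fit` and `synthesize` are hypothetical, the "raw" data is simulated inside the example, and a two-column Gaussian stands in for the SDK's generative models. The point is only to show what "same statistical properties and correlations" means in practice.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(xs, ys):
    """'Train' a toy model: means, spreads, and the correlation."""
    return {
        "mean_x": statistics.mean(xs), "std_x": statistics.pstdev(xs),
        "mean_y": statistics.mean(ys), "std_y": statistics.pstdev(ys),
        "rho": pearson(xs, ys),
    }


def synthesize(model, n, seed=0):
    """Sample n new (x, y) rows that follow the fitted model."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        # Mixing z1 into y reproduces the correlation between columns.
        y = model["mean_y"] + model["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Stand-in for raw data (simulated here): income and correlated spend.
raw_rng = random.Random(42)
income = [raw_rng.gauss(50_000, 12_000) for _ in range(500)]
spend = [0.3 * i + raw_rng.gauss(0, 3_000) for i in income]

model = fit(income, spend)
synthetic = synthesize(model, 5_000)
synth_income = [x for x, _ in synthetic]
synth_spend = [y for _, y in synthetic]

raw_rho = pearson(income, spend)
synth_rho = pearson(synth_income, synth_spend)
print(f"raw correlation:       {raw_rho:.3f}")
print(f"synthetic correlation: {synth_rho:.3f}")
```

None of the synthetic rows is taken from the raw data, yet the two columns keep roughly the same means and correlation.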

The value of synthetic data for AI & ML

Engineers and researchers often experience long delays when trying to access sensitive data, if they are granted access at all. Particularly in the financial services industry, where compliance and security are paramount, obtaining access to data that contains sensitive information or PII is both expensive and time-consuming, which ultimately reduces productivity and hinders development. With the robust privacy measures built into the SDK and its highly optimized generative models, data engineers can easily access synthetic copies of real-world data that enable rapid experimentation at scale.

On another front, data scientists often have access only to low-quality data for machine learning model training. Real data can be highly imbalanced (where certain categories are present in tiny proportions in a dataset) or contain missing fields, which limits its value for training machine learning models. The SDK can be used to improve machine learning model performance by generating a high-quality synthetic dataset that can be better suited to model training than production data, using techniques such as rebalancing, upsampling, and data imputation. With the ability to generate data, it is possible to selectively generate more data of a specific category to augment a dataset. In particular, machine learning models trained with rebalanced synthetic data can show a marked improvement in tasks such as fraud detection.
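To make rebalancing and imputation concrete, here is a minimal sketch that uses two simple baselines: mean imputation for a missing field and random oversampling (with replacement) of the minority class. These are stand-ins chosen for brevity, not the generative techniques the SDK uses, and the field names `amount` and `is_fraud` and the toy rows are assumptions.

```python
import random
import statistics


def impute_mean(rows, key):
    """Fill missing values of `key` with the mean of the known values."""
    known = [r[key] for r in rows if r[key] is not None]
    fill = statistics.mean(known)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def upsample_minority(rows, label_key, seed=0):
    """Randomly oversample every class up to the size of the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for r in rows:
        by_label.setdefault(r[label_key], []).append(r)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement until the class reaches `target`.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy dataset: one missing amount and a 5:1 class imbalance.
rows = [
    {"amount": 12.5, "is_fraud": 0},
    {"amount": None, "is_fraud": 0},
    {"amount": 30.0, "is_fraud": 0},
    {"amount": 18.0, "is_fraud": 0},
    {"amount": 22.0, "is_fraud": 0},
    {"amount": 95.0, "is_fraud": 1},
]

balanced = upsample_minority(impute_mean(rows, "amount"), "is_fraud")
fraud_count = sum(r["is_fraud"] for r in balanced)
print(len(balanced), fraud_count)
```

Random oversampling only repeats existing rows; a generative model can instead produce new, plausible minority rows, which is the advantage the paragraph above describes.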

Use cases of synthetic data in financial services and insurance

The financial services and insurance industries are increasingly leveraging AI and ML to improve their operations and better serve their customers. Synthetic data is becoming an essential tool in these industries for various use cases, a few of which are:

Training fraud ML models for accurate fraud detection & prevention

Fraud detection is a critical use case for AI and ML in finance and insurance. Generative models can be used to create synthetic fraud events that possess the same statistical properties as real fraud events. The data generated can be used to train algorithms to detect and prevent fraud more effectively. This is especially useful in cases where real-world data may not provide enough information to train the algorithm accurately.

An original dataset can be intelligently augmented with synthetic data to rebalance the distributions of highly imbalanced features. With the SDK, the use of original data can even be avoided altogether by generating an entirely synthetic dataset that has been rebalanced. By training machine learning models on the rebalanced dataset, the models can learn to detect fraudulent behavior with higher accuracy. This approach can help address the imbalanced nature of fraud datasets and improve the overall performance of fraud detection systems.
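One simple way to create new minority-class records, rather than repeating existing ones, is to interpolate between pairs of real fraud records. The sketch below is a simplified, SMOTE-like interpolation in which pairs are chosen at random instead of by nearest neighbour; it is an illustrative assumption, not the method used by the SDK, and the feature layout `(amount, hour)` and the toy records are made up.

```python
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new points, each on the line between two random samples."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)
        t = rng.random()  # position between a (t=0) and b (t=1)
        new_points.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return new_points


# Toy records as (amount, hour of day); fraud is the rare class.
legit = [
    (25.0, 9.0), (40.0, 12.0), (18.0, 14.0), (60.0, 18.0),
    (33.0, 11.0), (12.0, 8.0), (75.0, 20.0), (48.0, 16.0),
]
fraud = [(900.0, 3.0), (1200.0, 1.0), (750.0, 2.0)]

# Generate just enough synthetic fraud to match the legitimate class.
synthetic_fraud = interpolate_minority(fraud, len(legit) - len(fraud))

# Augmented training set as (features, label) pairs.
augmented = (
    [(row, 0) for row in legit]
    + [(row, 1) for row in fraud]
    + [(row, 1) for row in synthetic_fraud]
)
print(len(augmented), sum(label for _, label in augmented))
```

Because every synthetic record lies between two real fraud records, it stays within the observed range of each feature while adding variety that plain duplication lacks.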

Training ML models to optimize underwriting risk assessment

With synthetic data, insurance companies can improve their underwriting systems by generating new data that enhances existing datasets. Synthetic data can be used to train machine learning models that assess up-to-date risk factors, such as climate change or emerging diseases, that are not well represented in historic data. With the SDK, insurance companies can build models faster on datasets that were not accessible before, because data scientists can use synthetic data to run experiments at speed without worrying about data access and quality. In addition, the SDK enhances existing data by upsampling minority classes, and the resulting data can be used to train ML models that provide more accurate risk assessments.

By leveraging synthetic data, insurance firms can improve their risk assessment capabilities and make more informed underwriting decisions.

Third-party data sharing

Using synthetic data, financial services firms can share their data with third-party partners, with researchers, or across an organization while maintaining data privacy and confidentiality. Furthermore, firms can monetize their data by generating synthetic datasets with the same statistical properties as the original, but with none of the raw data points. The data can subsequently be used to facilitate the creation of new products and services, such as custom analytics and risk models that can be sold to other organizations. By leveraging synthetic data, financial services firms can potentially unlock the value of their data assets and create new revenue streams.
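Before sharing a synthetic dataset, one basic sanity check is to confirm that it contains none of the raw data points. The helper below, `exact_copies`, is a hypothetical name for a minimal version of that check on made-up rows. It catches only verbatim copies, so it is a necessary first test rather than a full privacy assessment.

```python
def exact_copies(real_rows, synthetic_rows):
    """Return the synthetic rows that also appear verbatim in the real data."""
    real_set = set(real_rows)
    return [row for row in synthetic_rows if row in real_set]


# Made-up rows as (customer_age, balance) tuples.
real_rows = [(34, 1200.50), (51, 88000.00), (27, 310.75)]
synthetic_rows = [(36, 1180.20), (51, 88000.00), (29, 295.10)]

leaked = exact_copies(real_rows, synthetic_rows)
if leaked:
    print("Do not share yet; verbatim copies found:", leaked)
else:
    print("No verbatim copies of raw rows found.")
```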

Conclusion

Synthetic data is a valuable tool for organizations that want to leverage AI and ML to improve their operational capabilities, productivity, and revenue. It gives organizations access to large amounts of data that may not be available otherwise, and it helps them overcome issues with data access, privacy, and security. In finance and insurance, synthetic data is becoming increasingly important for use cases such as fraud detection and prevention, underwriting, and monetization, to name a few. As the use of AI and ML continues to grow in these industries, synthetic data will become an essential tool for businesses looking to stay competitive and provide the best possible service to their customers.

Originally published at https://www.synthesized.io on March 16, 2023.