Organizational and third-party data sharing
Introduction to data sharing
Data sharing has become even more important in recent years, especially with the exponential growth of data generated and stored within organizations. According to Statista, between 2020 and 2025, global data creation is projected to grow to more than 180 zettabytes. As organizations gather large amounts of data, the value of that data grows, and sharing it can yield many benefits. However, sharing data also carries risks, such as potential privacy and security breaches. Today we will discuss the importance of data sharing, the traditional techniques used to share data, and the drawbacks of each of these techniques. We will also discuss the potential of synthetic data as a key enabler for fast, safe, and secure data sharing in the future.
Why is data sharing an important topic?
Data sharing is a pivotal topic because it enables businesses to gain new insights and expand their knowledge in ways that may not be possible using their own data alone. In financial services, data sharing can result in more effective risk management, fraud detection, and decision-making to improve customer experience. For example, credit card companies can share their transaction data with fraud teams across existing silos to understand consumer spending patterns and identify potential fraud. Additionally, insurance companies can share their claims data to assess risk more effectively and optimize their policy pricing.
Sharing data with third-party vendors can provide opportunities for companies to gain access to new markets, identify growing trends and opportunities, and develop new products and services. Nevertheless, organizations must ensure that they are sharing data in a compliant and ethical manner, to protect the security of their customer data.
Traditional data-sharing techniques (and common pitfalls)
The key techniques for organizational data sharing can be broken down into three main categories:
How data is shared
- Sharing database keys: Data can be shared by sharing database keys, which are effectively passwords for a database. Sharing keys with third-party recipients gives them access to the corresponding database, which helps organizations share data for analysis and cooperation, but carries the risk that recipients copy or transfer the data onward.
However, the management of permissions and access must be controlled carefully, to ensure that highly sensitive data is not leaked to those who are not authorized to access it.
- Emailing CSV files: An organization’s data can be shared by emailing Comma-Separated Values (CSV) files as an attachment. This method is common when sharing data within an organization due to its ease of use. The data can be exported from a data warehouse or database into CSV format, and then attached to an email that is sent to the recipient.
Nevertheless, there are many potential risks, such as errors in data entry and formatting, as well as security and privacy risks. A CSV file loses valuable information such as the schema and is not under version control. There are often no checks to see whether the data has been corrupted. Most importantly from a security standpoint, once a CSV file has been emailed, it is incredibly easy for that file to be forwarded without the original organization’s knowledge or consent.
- Cloud-based storage: The concept of cloud computing has evolved considerably since its early inception in the 1950s and 1960s. Cloud-based storage allows organizations to share data by providing a centralized repository that can be accessed by authorized users from anywhere. This enables data and engineering teams, as well as other departments, to access data without complex network configurations. Cloud storage is highly scalable, as organizations can increase storage capacity to fit their needs. Cloud providers have strong security measures, such as access controls and encryption, to prevent unauthorized data access.
However, there are a few drawbacks. Firstly, many organizations still have not moved to the cloud, or are in the middle of large and expensive cloud migration projects. Secondly, not all organizations are comfortable with putting their data in the cloud yet. Finally, great care must be taken when placing data into cloud-based storage to understand exactly which jurisdiction the cloud infrastructure sits in. If a user accidentally saves data to a cloud storage system in another geographic region or jurisdiction, this can be classed as a movement of data outside of a jurisdiction and can be deemed a data breach.
- APIs: The use of Application Programming Interfaces (APIs) allows organizations to share data cross-divisionally or with third parties. APIs enable systems to communicate with each other, which can make data sharing more secure and efficient than other techniques. In financial services, APIs can enable the sharing of information such as transaction history or credit scores with fintechs or other financial institutions. This can help recipients of shared data to increase innovation, but there are concerns about data privacy and security compliance when sharing data through APIs. If API settings are configured incorrectly, data could be leaked during transit.
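One of the gaps noted above for emailed CSV files, the lack of any check that the data arrived uncorrupted, can be narrowed with a checksum that the sender publishes through a separate channel. The following is a minimal sketch using only the Python standard library; the CSV contents and the function names are made up for illustration and are not part of any tool mentioned in this article.

```python
import hashlib
import hmac


def sha256_of_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(payload).hexdigest()


def is_unmodified(payload: bytes, expected_digest: str) -> bool:
    """Compare the recomputed digest with the one published by the sender."""
    return hmac.compare_digest(sha256_of_bytes(payload), expected_digest)


# Hypothetical export; in practice these would be the bytes of the CSV file.
csv_bytes = b"customer_id,amount\n1001,25.50\n1002,13.20\n"
published_digest = sha256_of_bytes(csv_bytes)  # shared via a separate channel

# The recipient recomputes the digest on what actually arrived.
corrupted_bytes = csv_bytes.replace(b"25.50", b"2550")
print(is_unmodified(csv_bytes, published_digest))        # True
print(is_unmodified(corrupted_bytes, published_digest))  # False
```

A checksum only detects accidental or deliberate modification. It does nothing to stop the file from being forwarded, so the security concerns described above still apply.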
Techniques used for security improvement
- Pseudonymization: This is a technique used to enhance the privacy of sensitive data so it can be shared with third parties without revealing the identity of the data subject. Pseudonymization involves substituting personally identifiable information (PII), such as names and addresses, with a token or pseudonym. Once these identifiers have been replaced, data can be shared across organizations or with third parties with a reduced risk of data privacy breaches.
However, pseudonymization is susceptible to linkage attacks, in which the attributes that remain in the data (for example a postcode or date of birth) are matched against another dataset to re-identify individuals.
- Information redaction: Information redaction techniques are applied to mask or remove sensitive information in a dataset. Examples include masking all but the last four digits of a credit card number, or removing sensitive columns altogether. This allows organizations to share data cross-divisionally or with third parties without disclosing sensitive information. Redaction can be carried out manually, but it can also be automated.
While redaction enables data sharing by masking sensitive data in compliance with regulations, it is not always effective, because the fields that remain can still be susceptible to linkage attacks.
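The two techniques above, and the linkage weakness they share, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production approach: the record, the secret key, the "public register", and the function names are all hypothetical, and keyed hashing (HMAC) is just one possible way to generate pseudonyms.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret-key"  # assumption: stored securely, never shared


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed, deterministic token (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def mask_card_number(card_number: str) -> str:
    """Mask all but the last four characters of a card number."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


# Made-up record for illustration only.
record = {"name": "Alice Example", "card": "4111111111111111",
          "zip": "90210", "birth_year": 1985}

shared = {
    "name": pseudonymize(record["name"]),
    "card": mask_card_number(record["card"]),
    "zip": record["zip"],                # quasi-identifier left intact
    "birth_year": record["birth_year"],  # quasi-identifier left intact
}

# A linkage attack joins the shared row to an external dataset
# on the quasi-identifiers that were not protected.
public_register = [
    {"name": "Alice Example", "zip": "90210", "birth_year": 1985},
    {"name": "Bob Sample", "zip": "10001", "birth_year": 1990},
]
matches = [person["name"] for person in public_register
           if (person["zip"], person["birth_year"])
           == (shared["zip"], shared["birth_year"])]

print(shared["card"])  # ************1111
print(matches)         # ['Alice Example']
```

Even though the name is tokenized and the card number is masked, the combination of postcode and birth year is enough to single out the individual in this toy example.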
Enabling data sharing
- Data sharing agreements: These are agreements or contracts defining the terms and conditions under which data can be shared. The contracts define the usage rights of the data, restrictions, and responsibilities of the parties involved. Usually, the agreements address data privacy, compliance, and intellectual property rights transparently, and work in tandem with privacy and security enhancement techniques.
However, the comprehensiveness and enforcement of data-sharing agreements often depend on a company’s size, industry regulations, and geography.
In addition, legal regulations and data movement laws apply strongly to data sharing. Some datasets are not allowed to move across geographies (e.g. healthcare datasets often can’t leave the country they were generated in), and some of the existing techniques could breach these regulations if they are not applied properly.
Synthetic data for data sharing
Synthetic data has emerged as a valuable tool to enable data sharing between silos. Tools such as the Synthesized SDK are safer than pseudonymization and information redaction techniques, yet provide the same quality of data. Synthetic data produced by the SDK has no direct 1-to-1 mapping with the original data points and provides enhanced data protection features such as tunable differential privacy. This allows organizations to control how much information is learned from any single row in their datasets by putting strict mathematical constraints on the models during training.
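To give a feel for what "tunable" means here, the sketch below applies the classic Laplace mechanism to a simple count query. It is not how the SDK enforces differential privacy during model training; it is a generic textbook illustration in which the privacy parameter epsilon, the function names, and the row count are all assumptions. A smaller epsilon means stronger privacy and noisier answers.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, rng: random.Random, trials: int = 2000) -> float:
    """Average distance between the noisy and the true count over many releases."""
    true_count = 1000  # made-up number of matching rows
    errors = [abs(private_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: mean absolute error ~ {mean_abs_error(epsilon, rng):.2f}")
```

The average error shrinks roughly in proportion to 1 / epsilon, which is the same privacy-versus-quality trade-off that applies when differential privacy is strengthened in a synthetic data generator.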
Synthesized’s Governor provides a platform for role-based access control to raw and synthetic data, enabling easier and more transparent control of raw and synthetic data products shared internally.
Synthetic data for cross-divisional data sharing (Internal sharing)
Synthetic data is a valuable solution to enable cross-divisional data sharing within financial services. For example, a large bank with multiple business lines, such as commercial, investment, and retail banking, is often siloed into its divisions with little data sharing between them.
There are several reasons why it is difficult to share data between divisions within a financial services organization. Primarily, there can be concerns about data privacy and security risks, particularly when it comes to sharing sensitive customer information. This ties into the rapidly changing landscape of regulatory and legal constraints, which restrict the sharing of particular types of data between divisions. Another reason is that different divisions may use different technologies or data systems, which makes it difficult to integrate data from various sources.
The use of synthetic data can enable a division to generate a synthetic dataset that is representative of an original dataset. The synthetic datasets can be safely shared between divisions without risking the privacy of sensitive customer information. For example, the retail banking division of a large financial institution can generate synthetic data that is representative of customer behaviors, and this data can be shared with the institution’s investment banking division to help inform investment decisions. This provides a level of data access and data utility that is not achievable with the original data.
Synthetic data for data monetization (Third-party sharing)
Data possesses tremendous value, and for financial services companies with vast amounts of data at their disposal, there is great potential for that data to be monetized. In recent years, there has been a growing discussion around the use of synthetic data for data monetization. The privacy-preserving features of synthetic data enable faster and safer sharing of information between silos, and with other data controllers such as third parties, without worrying about the restrictions that apply to sharing data containing personally identifiable information (PII).
Increased efficiency from the speed and improved safety of sharing synthetic data could allow for the creation of new revenue streams through the sale of synthetic versions of banking and insurance data to third parties. As the data monetization use case has become a more popular topic in the context of synthetic data, financial services companies have begun exploring its potential going forward.
Using synthetic data for data monetization could provide several business benefits such as:
- Increased revenue: Synthetic data can be used as a representation of real and valuable datasets, as it maintains the statistical properties of the original data. The synthetic datasets can subsequently be repackaged into assets that a financial services company can sell to a third party.
- Decreased risk compared to selling raw data: Use of synthetic data mitigates the risk of sensitive data leaks, helping companies comply with regulatory requirements on sensitive customer data.
However, there are challenges associated with using synthetic data for data monetization, including:
- Data quality: Data must be representative of original data sets (to provide value and insights similar to the original data), but should not risk the security of sensitive customer information. The quality of some synthetic datasets may be reduced as certain privacy-preserving features are activated and strengthened.
- Regulatory standards: Regulations on the PII classifications of synthetic data are not yet concrete, meaning it is not yet clear how and when the sharing of synthetic data is compliant. Equally, as regulation continues to evolve in financial services, the future of data monetization in the context of regulatory standards is not yet set in stone.
While synthetic data has potential benefits for data monetization, financial services firms must carefully consider the potential challenges and limitations before deciding to use it for this purpose.
Sharing synthetic data for academic research papers (Third-party sharing/research)
Synthetic data’s use in academic research is increasing in importance due to its potential to overcome the problems associated with traditional data access and sharing. The use of synthetic data generation techniques enables researchers to generate data that is statistically representative and has the same correlations as the original data, without jeopardizing privacy or confidentiality. As a result, researchers have the opportunity to explore complex research questions that may have been difficult to address using original data alone.
For instance, in healthcare, it has been suggested that synthetic training data could help train machine learning models to detect early-stage lung cancer from CT scans, without compromising patient privacy and confidentiality when hospitals or medical researchers share the data.
Moreover, synthetic data can be used to address the issue of reproducibility in academic research, by allowing researchers to share data that can be used by other researchers without compromising the original data source. Overall, the use of synthetic data in academic research provides a promising avenue for addressing the challenges of data access and sharing while advancing research in many fields.
To conclude, data sharing is pivotal for organizations that want to harness the potential of data to enhance decision-making, revenue, and innovation. However, data sharing carries significant risks, such as data privacy and security breaches, which can result in legal issues and significant costs. How data is shared must therefore be considered carefully and thoroughly. Synthetic data is a promising enabler for secure data sharing, as it provides organizations with an opportunity to stay within the parameters of regulations. As synthetic data generation continues to evolve, organizations are likely to consider its use to enable fast, secure, and compliant data sharing.