Synthetic Data Generation
What is Synthetic Data Generation?
Synthetic data generation is the creation of artificial datasets that closely replicate the statistical properties and patterns of real-world data without using actual individual records. This data is crafted through specialized algorithms and mathematical models, providing a valuable resource for training machine learning models, testing software, and simulating scenarios while protecting privacy and addressing data scarcity. By preserving essential relationships within data, synthetic data generation enables robust analysis and innovation across fields like healthcare, finance, and technology without risking sensitive information.
The Basic Idea
Imagine you wanted to conduct a study testing the effectiveness of a new drug. To accurately measure this, you would need to hold a clinical trial and gain access to in-depth medical records, including patient demographic information, medical history, health conditions, and treatment outcomes. Unfortunately, this is sensitive information, and using it could pose a significant risk to privacy. Additionally, it would be difficult and time-consuming to collect enough valid data to make sure your conclusions reflect the entire population.
This is a scenario where synthetic data generation could come in handy. Synthetic data is artificially generated data that mimics real-world information.1 Starting from data collected from reliable sources and databases, a computer generates synthetic data that mirrors the statistical patterns of the original. The generated data will not be identical to the real data, and can even be generated from aggregate data so that it reflects similar patterns while maintaining privacy.
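To make the idea of generating data from aggregates concrete, here is a minimal sketch in Python. It is not the method of any particular tool. It assumes the real data has two numeric columns that are roughly normally distributed, and the ages and blood pressure readings below are invented purely for illustration. Only the means, standard deviations, and correlation of the real columns are kept, and synthetic rows are then sampled from those summary numbers.

```python
import math
import random
import statistics

def fit_aggregates(xs, ys):
    """Summarize two columns by their means, standard deviations, and correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(agg, n, seed=0):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the given aggregates."""
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - agg["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = agg["mx"] + agg["sx"] * z1
        # Mixing z1 into y is what preserves the correlation between columns.
        y = agg["my"] + agg["sy"] * (agg["rho"] * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Hypothetical "real" records: patient age and systolic blood pressure.
real_ages = [34, 45, 52, 61, 29, 70, 48, 56]
real_pressures = [118, 125, 131, 140, 115, 150, 128, 135]

aggregates = fit_aggregates(real_ages, real_pressures)
synthetic_rows = sample_synthetic(aggregates, n=1000, seed=42)
print(synthetic_rows[:3])
```

None of the synthetic rows is copied from a real patient, yet older synthetic patients still tend to have higher blood pressure, because the correlation was carried over. Real-world generators handle many more columns and non-normal distributions, so treat this only as an illustration of the principle.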
A computer can analyze synthetic data to identify relationships and patterns, enabling the development of algorithms and statistical models that can later be applied to real-world data. In the case of a clinical trial, the computer would use existing data on patient characteristics, medical history, and treatment outcomes to create synthetic data, learn how this data relates to patient outcomes, and build algorithms that will be able to predict similar outcomes from other data sources.
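The workflow described above, building a model on synthetic data and then applying it to real data, can be sketched as follows. This is an illustrative example under stated assumptions: `draw_records` is a hypothetical stand-in for both the synthetic generator and the real data source, and the dose-outcome relationship it simulates is invented. The point is only the order of operations: fit on synthetic records, then check the fit against real records.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def mean_abs_error(model, xs, ys):
    """Average absolute gap between the model's predictions and observed values."""
    slope, intercept = model
    return statistics.fmean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))

def draw_records(n, noise_sd, seed):
    """Hypothetical data source: outcome rises linearly with dose, plus noise."""
    rng = random.Random(seed)
    doses = [rng.uniform(10, 100) for _ in range(n)]
    outcomes = [0.5 * d + 20 + rng.gauss(0, noise_sd) for d in doses]
    return doses, outcomes

# Stand-in for a large synthetic dataset that mimics the real one.
synthetic_doses, synthetic_outcomes = draw_records(500, noise_sd=6.0, seed=1)
# Stand-in for a small set of real records held back for validation.
real_doses, real_outcomes = draw_records(50, noise_sd=5.0, seed=2)

model = fit_line(synthetic_doses, synthetic_outcomes)      # train on synthetic data only
error = mean_abs_error(model, real_doses, real_outcomes)   # evaluate on real data
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, error on real data={error:.2f}")
```

If the error on the real records is much larger than the error on the synthetic ones, that is a sign the synthetic data does not capture the real patterns well enough to rely on.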
There are three main methods for generating synthetic data:
- Generative Adversarial Networks (GANs): A deep learning architecture in which two neural networks compete against one another to produce more realistic data. One network, the generator, creates synthetic data, while the other, the discriminator, is trained to identify whether a data point is real or synthetic. Whenever the discriminator can recognize data points as synthetic, the generator adjusts, and this continues until the discriminator can no longer differentiate between real and synthetic data.3
- Variational Autoencoders (VAEs): A deep learning model that learns the patterns of real data to create synthetic data with similar characteristics. VAEs use variational inference to compress the original input into a compact representation and learn to reconstruct it. Sampling from that representation then generates additional data that retains key features of the original.4
- Data Augmentation: A technique for expanding small or incomplete datasets. Data augmentation uses the original data to create additional data points, for example by applying small variations to existing records, and so produces a more complete dataset that can be better analyzed.5
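Of the three methods, data augmentation is the simplest to illustrate in a few lines. The sketch below shows one basic form of it, adding small random variations ("jitter") to existing numeric records. It is an assumption-laden example rather than a standard recipe: the function name `augment_with_jitter`, its parameters, and the height and weight records are all hypothetical. GANs and VAEs require a deep learning framework and are not shown here.

```python
import random

def augment_with_jitter(records, copies=3, scale=0.02, seed=0):
    """Return the original records plus `copies` noisy variants of each one.

    Each numeric field is multiplied by a factor drawn from a normal
    distribution with mean 1 and standard deviation `scale`, so every
    variant stays close to the record it was derived from.
    """
    rng = random.Random(seed)
    augmented = list(records)
    for record in records:
        for _ in range(copies):
            augmented.append(tuple(value * rng.gauss(1.0, scale) for value in record))
    return augmented

# Hypothetical records: (height in cm, weight in kg).
original = [(170.0, 65.0), (182.0, 80.0), (158.0, 54.0)]
expanded = augment_with_jitter(original, copies=3, scale=0.02, seed=7)
print(len(original), "records expanded to", len(expanded))
```

A small `scale` keeps the new points plausible. Too much jitter can produce records that no longer resemble anything in the real data, so the setting should be chosen with the domain in mind.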
Synthetic data generation is used in various fields outside of medicine as well. Examples include autonomous vehicle testing, where real driving conditions are too dangerous to test in directly; finance, where it helps build fraud detection algorithms without using private and personal financial records; and robotics, where machines are trained in simulated environments. In short, synthetic data generation is helpful in any field where data is private or difficult to obtain.
Although technically “fake,” synthetic data replicates the properties and characteristics of real data. It is an effective way to overcome the difficulty of obtaining real data, or to work around cases where the real data is sensitive. It is often cheaper and easier to produce than real data, and since it is artificially generated, it can be easier to align with the needs of your specific study. While it can be a pain to sift through medical histories to find a particular data point, such as the outcomes of a drug intervention, synthetic data generation lets you create domain-specific data points directly.2
About the Author
Emilie Rose Jones
Emilie currently works in Marketing & Communications for a non-profit organization based in Toronto, Ontario. She completed her Masters of English Literature at UBC in 2021, where she focused on Indigenous and Canadian Literature. Emilie has a passion for writing and behavioural psychology and is always looking for opportunities to make knowledge more accessible.