Synthetic Data Generation

The Basic Idea

Imagine you wanted to conduct a study testing the effectiveness of a new drug. To measure this accurately, you would need to hold a clinical trial and gain access to in-depth medical records, including patient demographic information, medical history, health conditions, and treatment outcomes. Unfortunately, this is sensitive information, and using it could pose a significant risk to patient privacy. Additionally, it would be difficult and time-consuming to collect enough valid data to make sure your conclusions reflect the entire population.

This is one scenario where synthetic data generation could come in handy. Synthetic data is artificially generated data that mimics real-world information.¹ Starting from data collected from reliable sources and databases, a computer generates synthetic data that mirrors the statistical properties of the original. The generated data will not be identical to the real data, and it can even be generated from aggregate data so that it reflects similar patterns while maintaining privacy.
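To make the idea of generating records from aggregate data concrete, here is a minimal sketch in Python. It assumes only a mean and standard deviation are known for each attribute, and that attributes are independent and roughly normally distributed. The attribute names, the numbers, and the function `generate_synthetic_records` are hypothetical and chosen purely for illustration.

```python
import random
import statistics

# Hypothetical aggregate statistics (mean, standard deviation) for two patient
# attributes. These numbers are illustrative, not taken from any real study.
AGGREGATES = {
    "age_years": (54.0, 12.0),
    "systolic_bp": (128.0, 15.0),
}


def generate_synthetic_records(aggregates, n_records, seed=0):
    """Draw each attribute independently from a normal distribution.

    Simplifying assumption: attributes are independent and roughly normal,
    so correlations between columns are not preserved.
    """
    rng = random.Random(seed)
    return [
        {name: rng.gauss(mean, std) for name, (mean, std) in aggregates.items()}
        for _ in range(n_records)
    ]


records = generate_synthetic_records(AGGREGATES, n_records=5000)

# The synthetic sample should land close to the aggregate it was built from.
synthetic_mean_age = statistics.mean(r["age_years"] for r in records)
```

No individual record here corresponds to a real person, yet the sample as a whole follows the same summary statistics. A realistic generator would also need to preserve relationships between attributes, which this sketch deliberately ignores.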

A computer can analyze synthetic data to identify relationships and patterns, enabling the development of algorithms and statistical models that can later be applied to real-world data. In the case of a clinical trial, the computer would use existing data on patient characteristics, medical history, and treatment outcomes to create synthetic data, learn how this data relates to patient outcomes, and build algorithms that will be able to predict similar outcomes from other data sources.
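The workflow described above (fit a model on synthetic data, then apply it to real data) can be sketched roughly as follows. This is a toy illustration, not a clinical method: both datasets come from the same made-up linear relationship, the second one merely stands in for real data, and the helper names `fit_line` and `make_toy_data` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_toy_data(n, rng, noise_std):
    """Toy stand-in for dose/outcome data; the relationship is made up."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


rng = random.Random(42)

# A "synthetic" training set and a separate set standing in for real data.
synthetic_x, synthetic_y = make_toy_data(500, rng, noise_std=1.0)
real_x, real_y = make_toy_data(100, rng, noise_std=1.0)

# Learn the relationship from the synthetic data only.
slope, intercept = fit_line(synthetic_x, synthetic_y)

# Mean absolute error of the synthetic-trained model on the "real" set.
mae_on_real = sum(
    abs((slope * x + intercept) - y) for x, y in zip(real_x, real_y)
) / len(real_x)
```

The check that matters in practice is the last step: a model trained on synthetic data is only useful if its error on real data stays acceptably low.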

There are three main methods for generating synthetic data:

Generative Adversarial Networks (GANs): A deep learning architecture in which two neural networks compete to produce more realistic data. One network (the generator) creates synthetic data, while the other (the discriminator) is trained to tell whether a given sample is real or synthetic. Whenever the discriminator can spot the synthetic samples, the generator adjusts, and training continues until the discriminator can no longer reliably tell the two apart.³
Variational Autoencoders (VAEs): A deep learning model that learns the patterns of real data to create synthetic data with similar characteristics. A VAE learns to compress each input into a compact latent representation and reconstruct it, using variational inference to shape that representation; sampling from it then generates new data that retains key features of the original.⁴
Data Augmentation: A technique for enlarging small or incomplete datasets. Data augmentation derives new data points from the original data, for example by applying small transformations to existing records, to produce a more complete dataset that can be analyzed more reliably.⁵
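Of the three methods, data augmentation is the simplest to sketch without a deep learning framework. The example below shows one common form, assuming numeric records: each row is copied several times with a small amount of Gaussian noise added. The function `augment_with_noise`, its parameters, and the sample measurements are hypothetical and for illustration only.

```python
import random


def augment_with_noise(rows, copies_per_row, noise_fraction=0.05, seed=0):
    """Return the original rows plus jittered copies of each row.

    Each copy perturbs every value with Gaussian noise whose standard
    deviation is noise_fraction times the absolute value being perturbed.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in rows]
    for row in rows:
        for _ in range(copies_per_row):
            augmented.append(
                [value + rng.gauss(0.0, noise_fraction * abs(value)) for value in row]
            )
    return augmented


# Hypothetical (age, systolic blood pressure) measurements.
original = [[34.0, 118.0], [51.0, 131.0], [67.0, 142.0]]
augmented = augment_with_noise(original, copies_per_row=3)
```

The noise level is a judgment call: too little and the copies add nothing new, too much and they stop resembling plausible records. GANs and VAEs involve training neural networks and are usually built with a dedicated deep learning library, so they are not sketched here.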

Synthetic data generation is also used in many fields outside of medicine. Examples include autonomous vehicle testing, where it is too dangerous to test in real driving conditions; finance, where it helps build fraud-detection algorithms without using private financial records; and robotics, where machines are trained in simulated environments. In short, synthetic data generation is helpful in any field where data is private or difficult to obtain.

Although technically “fake,” synthetic data replicates the properties and characteristics of real data. It is an effective way to overcome the difficulty of obtaining real data, or to work in cases where the real data is sensitive. It is often cheaper and easier to produce than real data and, since it is artificially generated, it can be easier to align with the needs of your specific study. While it can be a pain to sift through medical histories to find a particular data point, such as the outcomes of a drug intervention, synthetic data generation lets you create domain-specific data points.²

About the Author

Emilie Rose Jones

Emilie currently works in Marketing & Communications for a non-profit organization based in Toronto, Ontario. She completed her Master's in English Literature at UBC in 2021, where she focused on Indigenous and Canadian Literature. Emilie has a passion for writing and behavioural psychology and is always looking for opportunities to make knowledge more accessible.
