Synthetic Population

What is a Synthetic Population?

A synthetic population is an artificially created model, developed using data and statistical methods, that statistically resembles a real population. Because it represents that population accurately, it can be used for various simulations, analyses, and decision-making processes across fields such as public health, urban planning, and the social sciences.1

The Basic Idea

Imagine you’re part of a team tasked with creating a new, improved public transport system for your city. Unless you live in a very small city, efficiently collecting data from every person who lives there might sound impossible due to time limitations, cost, privacy concerns, and people's willingness to participate. Also, you should probably consider people who live on the outskirts of such a city, but travel inward for work. Sounds hard, right? This is where creating a synthetic population comes in. 

The idea is that this synthetic population statistically simulates the real population you need to inform and create your new transport system. Taking this approach helps researchers and planners model, analyze, and predict outcomes and patterns without surveying actual people.

So, you might be asking yourself: how do you create this population so it accurately reflects real people? 

After identifying the purpose of their synthetic population—meaning what they aim to uncover with the data—researchers begin to collect data from reliable sources, such as national censuses, surveys, and other relevant databases. This data is then used to generate synthetic households or groups of individuals that mirror the attributes found in real-world distributions, such as income and education.

Next, these households or groups are assigned to specific geographical locations to reflect the actual distribution of people within a given area. The model is then validated and calibrated statistically to ensure it accurately represents the real population. Finally, researchers can use this synthetic population to simulate scenarios and analyze outcomes.1

The idea is that by following these steps—or something similar, as each institution typically has its own procedure—we can better understand and deal with complex societal issues. This can be quite useful for policymakers, researchers, and analysts who seek to study and solve issues in a regulated and systematic way.
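The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular institution: the microdata records, their survey weights, and the zone totals are all made-up values, and weighted sampling with replacement stands in for whatever selection method a real project would use.

```python
import random
from collections import Counter

# Hypothetical microdata: each record is one surveyed household with a survey weight.
MICRODATA = [
    {"size": 1, "income": "low", "weight": 120.0},
    {"size": 2, "income": "mid", "weight": 200.0},
    {"size": 4, "income": "mid", "weight": 150.0},
    {"size": 3, "income": "high", "weight": 80.0},
]

# Hypothetical number of households per zone, e.g. taken from a census table.
ZONE_HOUSEHOLDS = {"centre": 300, "suburb": 500, "outskirts": 200}


def generate_population(microdata, zone_households, seed=0):
    """Draw synthetic households for each zone by weighted sampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    weights = [record["weight"] for record in microdata]
    population = []
    for zone, n_households in zone_households.items():
        # Households with larger survey weights are drawn more often.
        draws = rng.choices(microdata, weights=weights, k=n_households)
        for record in draws:
            population.append(
                {"zone": zone, "size": record["size"], "income": record["income"]}
            )
    return population


synthetic = generate_population(MICRODATA, ZONE_HOUSEHOLDS)
print(len(synthetic))                              # total synthetic households
print(Counter(hh["zone"] for hh in synthetic))     # households placed in each zone
print(Counter(hh["income"] for hh in synthetic))   # resulting income mix
```

Each synthetic household copies its attributes from a real survey record but carries no identity of its own, and every zone receives exactly the number of households it was assigned. A real pipeline would add a validation and calibration step before the population is used.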

Key Terms

Agent-Based Model: A computational model used to simulate interactions of individuals within an environment.

Microdata: Individual-level data, often used as the basis for creating synthetic populations.

Census data: Data collected through surveys on all individuals (e.g., people or entire households) of a certain population. Your country probably has a couple of national censuses.2

Survey: Data collected from a subset of a population, chosen so that the results are statistically meaningful for the whole. All censuses are surveys, but not all surveys are censuses.2

Statistical sampling: A method based on probability and statistics used to select a representative subset of individuals or households from a larger population.

Population dynamics: The study of how and why populations change over time, including factors such as birth rates, death rates, immigration, etc. It is used to understand and predict changes in population size, structure, and distribution.

Demographic attributes: Characteristics of individuals or households, such as age, gender, income, education, employment status, and household size. 

History

The concept of synthetic populations originated in the fields of statistics and computer science, where researchers needed realistic but manageable datasets to model complex systems.

One of the first applications was actually for traffic and transportation planning. In 1996, Richard J. Beckman and his team published their approach to generating synthetic populations, which has now been widely adopted. One of the important steps in their methodology was to incorporate iterative proportional fitting, a statistical tool used to adjust data so that it aligns with known margins or constraints. Using this technique helps create households in specific areas that match certain demographic characteristics.1,3
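To make iterative proportional fitting more concrete, here is a minimal two-dimensional sketch in plain Python. It is not Beckman's implementation: the seed table, the row and column targets, and the function name ipf are assumptions chosen for illustration. The seed stands for a household-size by income cross-table from a survey sample, and the targets stand for the known totals of one area.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy, leave the seed untouched
    for _ in range(max_iter):
        # Row step: rescale every row so it adds up to its target.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                factor = target / row_sum
                table[i] = [cell * factor for cell in table[i]]
        # Column step: rescale every column so it adds up to its target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                factor = target / col_sum
                for row in table:
                    row[j] *= factor
        # Columns now match; stop once the rows still match as well.
        max_error = max(
            abs(sum(table[i]) - row_targets[i]) for i in range(len(row_targets))
        )
        if max_error < tol:
            break
    return table


# Hypothetical seed: rows are small/large households, columns are low/high income.
seed_table = [[10.0, 20.0], [30.0, 40.0]]
row_targets = [400.0, 600.0]   # assumed known totals by household size
col_targets = [450.0, 550.0]   # assumed known totals by income group

fitted = ipf(seed_table, row_targets, col_targets)
for row in fitted:
    print([round(cell, 1) for cell in row])
```

The fitted table keeps the association between household size and income found in the seed (its cross-product ratio is unchanged) while its margins line up with the area totals. Both sets of targets must add up to the same grand total for the procedure to settle.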

Beckman’s model was first put into practice in the United States during the Transportation Analysis and Simulation System (TRANSIMS) project—which aimed to revolutionize transport and air quality. 

TRANSIMS was a super-powered computer simulation tool that modeled roads and public transport systems, and, with the help of synthetic populations, also took into account the people living in those areas.

By simulating the daily lives of people with a variety of demographic attributes (age, income, employment, etc.), planners were able to understand and improve transportation systems. The simulator was also used to observe the effect transportation systems could have on air quality, energy consumption, and carbon dioxide emissions. The success of Beckman’s methodology helped the synthetic population technique reach other projects and fields, including epidemiology, urban planning, and economics.1

By the 2010s, synthetic populations were used in many fields to inform decision-making, research, and even public policy. Additionally, advances in computer science and data availability allowed for even more complex and accurate simulations. Today, the integration of machine learning and AI has significantly improved the capabilities of synthetic populations. Both of these technologies help models learn from enormous amounts of data, improving the accuracy and precision of these simulations. For example, synthetic populations have been able to mimic real patient data during the COVID-19 pandemic.4

People

Joshua Epstein: An American pioneer in the field of agent-based modeling. Epstein's Sugarscape model has been used to explore seasonal migrations, pollution, sexual reproduction, and transmission of disease. He is now a Professor of Epidemiology in the NYU School of Global Public Health.5

Herbert A. Simon: American Nobel laureate in economics, recognized for his research on the multiple factors that drive and affect decision-making processes. He is also known for his contributions to artificial intelligence.6

Ilya Rahkovsky: An expert in computational social science, Rahkovsky contributed to the development of synthetic populations used in public health research, especially on food access and consumer demand.7

Nigel Gilbert: British sociologist known for his work on social simulation, Gilbert has been instrumental in advancing the use of synthetic populations and simulations in social science research. He is now director of the Centre for Research in Social Simulation at the University of Surrey.8

Consequences

As mentioned, synthetic populations can be used in various fields to enhance decision-making and research. Some of the most popular fields are policy-making (impact testing of new policies before launch), healthcare (prevention and intervention strategies), and urban planning (infrastructure projects). Additionally, using this methodology can bring a few lesser-known benefits.

Time & Cost Reduction

Collecting data through surveys and/or censuses can be both time-consuming and expensive. While synthetic populations can help reduce the need for new data collection by creating realistic datasets that mirror real-world data, they do come with their own costs. 

The creation and validation of synthetic populations require a substantial investment in data processing and model development. However, once these models are established they can be particularly useful when the project has a tight deadline that cannot be reached with surveys or censuses. 

They also reduce costs by minimizing data gathering and storage, which could be a great perk for smaller businesses or startups. Reducing both of these constraints allows researchers to focus on the actual data, analysis, and solution development rather than data collection and management.9

Privacy Protection

Using synthetic populations reduces the direct use of medical records or personal data, which can raise privacy concerns about access and potential data breaches.

Synthetic populations aim to preserve the accuracy of real data while showing only its patterns and characteristics. However, it's important to highlight that synthetic populations are still based on real data, meaning that the security of such information depends on the safeguards implemented by the organizations that create and manage these datasets.

While synthetic populations reduce some of the “higher risks” associated with handling sensitive data, they are only as secure and compliant with data protection regulations as the systems and protocols used.9

Diversity and Representation

Some censuses, especially surveys from a population sample, might not portray the actual diversity of all inhabitants. Using synthetic populations can increase representation (as long as the input data itself is not biased) by including a greater diversity than the original data sources. Doing so helps researchers and developers reach solutions or models that solve the main issue accurately and consider as many people and patterns as possible.9

Machine Learning Enhancement

Looking ahead, a potential advantage lies in feeding a system or computer with enough real-world data to uncover unique patterns or track how different characteristics or behaviors change over time or across various scenarios.

Providing large amounts of diverse data improves the performance of machine learning algorithms. Using these better-informed technologies in projects could lead to even more accurate and reliable outcomes. Once again, it’s important to highlight that this can only happen if the data used to feed the computer or system isn’t biased. If it is, the system will only perpetuate these biases.9

Controversies

While synthetic populations offer some benefits, they also come with challenges and limitations that must be addressed.

Lack of Accuracy and Reality

For certain projects, replicating patterns and correlations alone doesn’t capture the intricate details of real data (e.g., contextual data and relationship dynamics). This lack of accuracy can affect the validity of analyses and of the solutions built on synthetic populations.

Another important factor to consider is that ensuring the accuracy of synthetic populations can be quite challenging. Even if they seem realistic, validating that the model actually mirrors real-world trends and patterns is difficult. There’s no guarantee that models trained on synthetic data will perform as well in the real world.9

Complex Data Generation

Some surveys include open-ended questions, or even multiple-choice questions, whose wording may be relevant to understanding people’s needs and patterns. This nuance can be lost when using synthetic populations.

Dependence on Real Data

It may sound odd that, to avoid conducting a new census or survey, you have to combine previous ones. Synthetic populations are highly dependent on real data, which means that if the original data is incomplete or lacks statistical value, your synthetic population will have the same flaws.

Additionally, the original data could contain biases that are then reflected in the synthetic population and in any analysis built on it. Although synthetic populations sound great, you do need to ensure you have enough valuable data to create them. Regular updates and validation are necessary to maintain the reliability of synthetic datasets.9

Mitigation Practices

This is not to say that synthetic populations are unusable. They are actually quite helpful when done right and applied to the correct project. To mitigate limitations, organizations should follow best practices such as ensuring diversity in generated data, using appropriate metrics for validation, thoroughly testing synthetic datasets, and regularly monitoring changes in real-world data to keep synthetic populations up-to-date and accurate.
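As one example of what a validation metric could look like, the sketch below compares the income mix of a synthetic population with a reference distribution using the total variation distance. The choice of metric, the reference shares, and the synthetic values are all assumptions made for illustration; real projects typically check many attributes and pick metrics suited to their data.

```python
from collections import Counter


def category_shares(values):
    """Share of each category in a list of attribute values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(synthetic_shares, reference_shares):
    """Half the sum of absolute share differences: 0 means identical distributions."""
    categories = set(synthetic_shares) | set(reference_shares)
    return 0.5 * sum(
        abs(synthetic_shares.get(c, 0.0) - reference_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical reference shares, e.g. from a published census table.
reference = {"low": 0.30, "mid": 0.50, "high": 0.20}

# Hypothetical income attribute of 100 synthetic households.
synthetic_incomes = ["low"] * 28 + ["mid"] * 53 + ["high"] * 19

distance = total_variation_distance(category_shares(synthetic_incomes), reference)
print(round(distance, 3))  # 0.03
```

A small distance suggests the synthetic population reproduces this one attribute well. It says nothing about other attributes or about combinations of attributes, which need their own checks.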

Related TDL Content

Social Network Analysis

This article talks about Social Network Analysis (SNA), a method used to investigate social structures through networks and graph theory. It explores how individuals and their relationships can be mapped and analyzed to understand social dynamics, influence, and information flow within communities.

Data Science

This article talks about the multidisciplinary field of data science, which combines advanced analytical techniques, statistical methods, and machine learning to analyze large datasets. It explains how data science helps organizations uncover hidden patterns, make data-driven decisions, and predict future trends.

Sources

  1. Moeckel, R., Spiekermann, K., & Wegener, M. (2003). Creating a Synthetic Population. 8th International Conference on Computers in Urban Planning and Urban Management (CUPUM). Retrieved July 22, 2024 from: http://moeckel.github.io/rm/doc/2003_moeckel_etal_synpop_cupum.pdf
  2. Eurostat (2024). Beginners:Statistical concept - Survey, census and register. Retrieved July 22, 2024 from: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Beginners:Statistical_concept_-_Survey,_census_and_register#:~:text=In%20a%20census%2C%20data%20about,characteristics%20of%20the%20whole%20population.
  3. Beckman, R. J., Baggerly, K. A., & McKay, M. D. (1996). Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice, 30(6), 415–429. doi:10.1016/0965-8564(96)00004-3
  4. Strait, J. E. (2022). Synthetic data mimics real patient data, accurately models COVID-19 pandemic. Washington University School of Medicine in St. Louis. Retrieved July 22, 2024 from: https://medicine.wustl.edu/news/synthetic-data-mimics-real-patient-data-accurately-models-covid-19-pandemic/
  5. NYU (2024). Joshua Epstein. Retrieved July 22, 2024 from: https://publichealth.nyu.edu/faculty/joshua-epstein
  6. Britannica, The Editors of Encyclopaedia. (2024). Herbert A. Simon. Encyclopedia Britannica. https://www.britannica.com/biography/Herbert-A-Simon
  7. USDA (2019). Ilya Rahkovsky. Retrieved July 22, 2024 from: https://www.ers.usda.gov/authors/ers-staff-directory/ilya-rahkovsky/
  8. University of Surrey. (2024). Professor Nigel Gilbert CBE. Retrieved July 22, 2024 from: https://www.surrey.ac.uk/people/nigel-gilbert
  9. Lamberti, A. (2023). The benefits and limitations of generating synthetic data. Syntheticus. Retrieved July 23, 2024 from: https://syntheticus.ai/blog/the-benefits-and-limitations-of-generating-synthetic-data
  10. Elessa Etuman, A., Benoussaïd, T., Charreire, H., & Coll, I. (2024). OLYMPUS-POPGEN: A synthetic population generation model to represent urban populations for assessing exposure to air quality. PloS one, 19(3), e0299383. https://doi.org/10.1371/journal.pone.0299383

About the Author

Mariana Ontañón

Mariana holds a BSc in Pharmaceutical Biological Chemistry and an MSc in Women’s Health. She’s passionate about understanding human behavior in a holistic way. Mariana combines her knowledge of health sciences with a keen interest in how societal factors influence individual behaviors. Her writing bridges the gap between intricate scientific information and everyday understanding, aiming to foster informed decisions.
