A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research

Providing PCOR Researchers with a Low Risk, Readily Available Synthetic Data Source
  • Office of the National Coordinator for Health Information Technology (ONC)
Start Date
  • 4/1/2019
  • Use of Enhanced Publicly-Funded Systems for Research


STATUS: Completed Project


High-quality health and health care data are often difficult to access because of cost, patient privacy, or other legal and intellectual property restrictions. To protect patient privacy, researchers and developers often depend on anonymized data to test theories, data models, algorithms, or prototype innovations. However, the risk of re-identifying anonymized data is high and has been impossible to eliminate completely, especially for rare conditions. Further, because of a variety of interoperability issues, it is often difficult to bring data together from different sources to robustly test analysis models and algorithms or to assist in the development of software applications. Synthetic data can be used to initiate, refine, or test innovative research approaches more quickly. This project addressed the need for research-quality synthetic data by increasing the amount and type of realistic synthetic data that the Synthea™ software program can generate. Synthea™ is an open source software program that creates high-quality, clinically realistic synthetic patient health records in large volumes.

ONC leveraged its expertise as a coordinator by bringing together a technical expert panel (TEP) to assist in the development of five use cases for new module development from three categories: opioids, pediatrics, and complex care needs. To ensure that the modules generate data fitting the needs of patient-centered outcomes researchers, the TEP included representation from relevant researchers so that they could provide input on the design of a given module. Some initial applications of the generated synthetic data include identification of effective prevention methods, treatments, or interventions, such as assessments related to controlled substance prescriptions that reduce the impact of mental and substance use disorders. Similarly, patient-centered outcomes research (PCOR) researchers may use the synthetically generated data from the other module categories to simulate care interventions, analyze longitudinal patient progress, and potentially incorporate patient-reported outcomes.


A synthetic data engine is a potentially important piece of the greater PCOR data infrastructure because it provides PCOR researchers with a low-risk, readily available synthetic data source that complements their use of real clinical data and enhances their ability to conduct rigorous analyses and generate relevant findings that can inform health and treatment decisions.

Project Objectives:

  • Enhance Synthea™ by developing or updating five to seven data generation modules for opioid, pediatric, and complex care use cases to increase the number and diversity of synthetic patient health records.

  • Administer a prize competition (“challenge”) to encourage researchers and developers to validate the realism of the generated synthetic health records.

  • Support awareness and use of Synthea™, including its updated modules, module builder, and the generated synthetic data, through various dissemination mechanisms.


  • The project team convened a Technical Expert Panel (TEP) composed of diverse stakeholders representing viewpoints external to the federal government to provide their respective insights and subject matter expertise.

  • The team developed Synthea™ clinical modules for five use cases: cerebral palsy, prescribing opioids for chronic pain and treatment of opioid use disorder, sepsis, spina bifida, and acute myeloid leukemia.

  • The project team hosted the Synthetic Data Validation Challenge to engage innovators, providers, researchers, and technology developers to develop solutions that enhance Synthea™ or demonstrate novel uses of Synthea™-generated synthetic health data. The challenge recognized six winners for their novel solutions.

  • The team conducted a demonstration study using data from the Acute Myeloid Leukemia Synthea™ module to evaluate the utility of Synthea™ for use in simulation studies.





Below is a list of ASPE-funded PCORTF projects that are related to this project:

Building Infrastructure and Evidence for COVID-19 Related Research Using Integrated Data – Conducted by the Centers for Disease Control and Prevention (CDC) National Center for Health Statistics (NCHS), the National Hospital Care Survey (NHCS) provides data on health care delivery in hospital-based settings. Patient-level identifiers collected by the NCHS enable the linkage of data from hospital-based settings to other data sources. To protect confidentiality, these linked data are only available in restricted-use files. This project will reduce barriers to data access by developing publicly available synthetic linked data products. The synthetic data products will expand the utility of data linkages for researchers to investigate a range of PCOR questions and provide timely data about the COVID-19 pandemic.

PCOR: Privacy and Security Blueprint, Legal Analysis and Ethics Framework for Data Use, & Use of Technology for Privacy - Patient-level data are essential to understanding and improving health outcomes. These data must be made available to researchers in a way that protects patient privacy while providing sufficient granularity to allow meaningful research questions to be assessed. However, current laws and policies around the use of patient-level data are nuanced and sometimes conflicting, creating confusion for researchers, providers, and patients. This project was a collaborative effort between the ONC and CDC to conduct research and create resources to improve the privacy of patients and their data.