A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research

Providing PCOR Researchers with a Low Risk, Readily Available Synthetic Data Source
Agency
  • Office of the National Coordinator for Health Information Technology (ONC)
Start Date
  • 4/1/2019
Functionality
  • Use of Enhanced Publicly-Funded Systems for Research

STATUS: Completed Project

BACKGROUND

High-quality health and health care-related data are often difficult to access because of cost, patient privacy, or other legal and intellectual property restrictions. To protect patient privacy, researchers and developers often depend on anonymized data to test theories, data models, algorithms, or prototype innovations. However, the risk of re-identification of anonymized data is high and has been impossible to completely eliminate, especially for rare conditions. Further, due to a variety of interoperability issues, it is often difficult to bring data from different resources together to robustly test analysis models and algorithms or assist in developing software applications. In contrast, synthetic data can be used to initiate, refine, or test innovative research approaches more quickly. This project addressed the need for research-quality synthetic data by increasing the amount and type of realistic, synthetic data that the Synthea™ software program can generate. Synthea™ is an open-source software program that creates high-quality, clinically realistic synthetic patient health records in large volumes.

ONC leveraged its expertise as a coordinator by bringing together a technical expert panel (TEP) to assist in the development of five use cases for new Synthea™ module development from three categories: opioids, pediatrics, and complex care needs. To ensure that the modules generate data fitting the needs of patient-centered outcomes research (PCOR) researchers, the TEP included representation from relevant researchers so they could provide input on the design of a given module. Some initial applications of the generated synthetic data include the identification of effective prevention methods, treatments, and interventions, such as assessments related to controlled substance prescriptions that reduce the impact of mental and substance use disorders. Similarly, PCOR researchers may use the synthetically generated data from the other module categories to simulate care interventions, analyze longitudinal patient progress, and potentially incorporate patient-reported outcomes.

PROJECT PURPOSE & GOALS

A synthetic data engine is a potentially important piece of the greater PCOR data infrastructure because it provides PCOR researchers with a low-risk, readily available synthetic data source complementing their use of real clinical data and enhancing their ability to conduct rigorous analyses and generate relevant findings that can inform health and treatment decisions.

Project Objectives:

  • Enhance Synthea™ by developing or updating five to seven data generation modules for opioid, pediatric, and complex care use cases to increase the number and diversity of synthetic patient health records.
  • Administer a prize competition (“challenge”) to encourage researchers and developers to validate that the generated synthetic health records are realistic.
  • Support awareness and use of Synthea™ including its updated modules, module builder, and the generated synthetic data through various dissemination mechanisms.

PROJECT ACHIEVEMENTS AND HIGHLIGHTS

  • The project team convened a TEP composed of diverse stakeholders representing viewpoints external to the federal government to provide their respective insights and subject matter expertise.
  • The team developed Synthea™ clinical modules for five use cases: cerebral palsy, prescribing opioids for chronic pain and treatment of opioid use disorder, sepsis, spina bifida, and acute myeloid leukemia.
  • The project team hosted the Synthetic Data Validation Challenge to engage innovators, providers, researchers, and technology developers to develop solutions that enhance Synthea™ or demonstrate novel uses of Synthea™-generated synthetic health data. The challenge awarded six winners who presented novel solutions.
  • The team conducted a demonstration study using data from the Acute Myeloid Leukemia Synthea™ module to evaluate the utility of Synthea™ for use in simulation studies.

PUBLICATIONS, PRESENTATIONS, AND OTHER PUBLICLY AVAILABLE RESOURCES

Resources:

Publications:

  • The project team published a case report that examined whether Synthea™ can be used for simulation studies that draw parameters from observational studies and randomized trials. The case report is available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9360775/.

Presentations:

  • The project team held two informational webinars for the Synthetic Data Validation Challenge to introduce participants to Synthea™.

RELATED PROJECTS

Below is a list of ASPE-funded PCORTF projects that are related to this project:

Building Infrastructure and Evidence for COVID-19 Related Research, Using Integrated Data from the National Center for Health Statistics Data Linkage Program – Conducted by the Centers for Disease Control and Prevention (CDC) National Center for Health Statistics (NCHS), the National Hospital Care Survey (NHCS) provides data on health care delivery in hospital-based settings. Patient-level identifiers collected by the NHCS enable the linkage of data from hospital-based settings to other data sources. To protect confidentiality, these linked data are only available in restricted-use files. This project will reduce barriers to data access by developing publicly available synthetic-linked data products. The synthetic data products will expand the utility of data linkages for researchers to investigate a range of PCOR questions and provide timely data about the COVID-19 pandemic.

PCOR: Privacy and Security Blueprint, Legal Analysis and Ethics Framework for Data Use, & Use of Technology for Privacy – Patient-level data are essential to understanding and improving health outcomes. These data must be made available to researchers in a way that ensures the protection of patient privacy while providing sufficient granularity to allow meaningful research questions to be assessed. However, current laws and policies around the use of patient-level data are nuanced and sometimes conflicting, creating confusion for researchers, providers, and patients. This project was a collaborative effort between ONC and CDC to conduct research and create resources to improve the privacy of patients and their data.