A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research

Providing PCOR Researchers with a Low-Risk, Readily Available Synthetic Data Source
Agency
  • Office of the National Coordinator for Health Information Technology (ONC)
Start Date
  • 4/1/2019
Functionality
  • Use of Enhanced Publicly-Funded Systems for Research

 

STATUS: Active Project

BACKGROUND

High-quality health and health care data are often difficult to access because of cost, patient privacy, or other legal and intellectual property restrictions. To protect patient privacy, researchers and developers often depend on anonymized data to test theories, data models, algorithms, or prototype innovations. However, the risk of re-identifying anonymized data is high and has been impossible to eliminate completely, especially for patients with rare conditions. Further, because of a variety of interoperability issues, it is often difficult to bring together data from different sources to robustly test analysis models and algorithms or to assist in the development of software applications. Synthetic data can be used to initiate, refine, or test innovative research approaches more quickly. This project proposes to address the need for research-quality synthetic data by increasing the amount and type of realistic synthetic data that the Synthea software program can generate. Synthea is an open-source software program that creates high-quality, clinically realistic synthetic patient health records in large volumes.
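The re-identification concern for rare conditions can be illustrated with a minimal sketch. It assumes a handful of hand-written toy records (hypothetical, not drawn from any real or Synthea-generated dataset) in which names have already been removed, and it simply counts how many records share each combination of remaining attributes. The function names and the attribute layout are illustrative assumptions, not part of Synthea or of this project.

```python
from collections import Counter

# Hypothetical, hand-written toy records (not real patient data).
# Each tuple is (age_band, zip3, diagnosis); direct identifiers are already removed.
RECORDS = [
    ("30-39", "021", "hypertension"),
    ("30-39", "021", "hypertension"),
    ("30-39", "021", "hypertension"),
    ("40-49", "021", "type 2 diabetes"),
    ("40-49", "021", "type 2 diabetes"),
    ("40-49", "021", "cystinosis"),  # a rare condition
]


def group_sizes(records):
    """Count how many records share each exact combination of attributes."""
    return Counter(records)


def unique_records(records):
    """Return the records whose attribute combination appears exactly once."""
    sizes = group_sizes(records)
    return [record for record in records if sizes[record] == 1]


if __name__ == "__main__":
    for record in unique_records(RECORDS):
        print("Unique combination:", record)
```

In this toy set, only the record with the rare diagnosis stands alone, so anyone who knows a person's age band, area, and condition could single that record out even though no name is present. Fully synthetic records avoid this particular exposure because no record corresponds to a real individual.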

ONC will leverage its expertise as a coordinator by bringing together a technical expert panel (TEP) to assist in the development of five to seven priority use cases for new module development from three categories: opioids, pediatrics, and complex care needs. To ensure that the modules generate data that fit the needs of patient-centered outcomes researchers, the TEP will include relevant researchers who can provide input on the design of a given module. Some initial applications of the generated synthetic data include identifying effective prevention methods, treatments, or interventions, such as assessments related to controlled substance prescriptions, that reduce the impact of mental and substance use disorders. Similarly, patient-centered outcomes research (PCOR) researchers may use the synthetic data generated by the other module categories to simulate care interventions, analyze longitudinal patient progress, and potentially incorporate patient-reported outcomes.

PROJECT PURPOSE & GOALS

A synthetic data engine is a potentially important piece of the greater PCOR data infrastructure because it provides PCOR researchers with a low-risk, readily available synthetic data source. This source complements their use of real clinical data and enhances their ability to conduct rigorous analyses and generate relevant findings that can inform health and treatment decisions.

This project will address the following objectives:

  • Enhance Synthea by developing or updating five to seven data generation modules for opioid, pediatric, and complex care use cases to increase the number and diversity of synthetic patient health records.

  • Administer a prize competition (“challenge”) to encourage researchers and developers to validate the realism of the generated synthetic health records.

  • Support awareness and use of Synthea, including its updated modules, module builder, and generated synthetic data, through various dissemination mechanisms.