Utilizing Data from Various Data Partners in a Distributed Manner

Developing and testing the capability to conduct timely and secure distributed regression analysis in distributed data networks.

Agency

Food and Drug Administration (FDA)

Start Date

7/15/2015

Functionality

Use of Clinical Data for Research
Use of Enhanced Publicly-Funded Data Systems for Research

STATUS: Completed Project

BACKGROUND

Currently, information on a patient’s health care is captured across various data sources. The ability to link data across health care databases would provide more robust cross-sectional or longitudinal patient profiles, enhancing secondary uses of electronic health care information for research purposes and improving access to information that would not be present in claims data, registry data, or electronic health records (EHRs) alone. To address this gap, the Food and Drug Administration (FDA) built upon previous distributed linear regression analysis efforts by developing enhanced analytic capabilities and fully automating distributed linear regression analysis of patient data across organizations.

PROJECT PURPOSE & GOALS

This project, led by the FDA, developed and tested the capability to conduct timely and secure distributed regression analysis in distributed data networks. It also explored the feasibility of creating virtual linkage capabilities to: 1) utilize data from multiple data sources with unique populations (horizontally partitioned data); and 2) utilize data for a specific patient whose information is held at different institutions (vertically partitioned data), linked through a unique key that identifies the patient. This approach allowed research networks to maintain control of patient-level data while generating valid regression estimates within and across networks without transferring protected health information, balancing analytic requirements, patient privacy and confidentiality, and proprietary considerations.
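To make the horizontally partitioned case concrete, the following is a minimal Python sketch of distributed linear regression, assuming each data partner shares only aggregate summary matrices with an analysis center. It is not the project's SAS implementation; the function names (`site_summaries`, `pooled_fit`, `solve`) and the two toy sites are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def site_summaries(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Run at a data partner: return X'X and X'y only.

    These are sums over patients, so no patient-level row leaves the site.
    """
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def pooled_fit(summaries: Sequence[Tuple[Matrix, Vector]]) -> Vector:
    """Run at the analysis center: add the site summaries and solve
    the normal equations (sum of X'X) beta = (sum of X'y)."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)


# Two hypothetical sites with different patients; column 0 is the intercept.
site_a = ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
site_b = ([[1.0, 3.0], [1.0, 4.0]], [7.0, 9.0])

beta = pooled_fit([site_summaries(*site_a), site_summaries(*site_b)])
print(beta)
```

Because X'X and X'y are sums over patients, adding the per-site matrices gives the same normal equations as stacking all rows in one place, so the coefficients match those of a pooled analysis. Whether aggregate summaries are acceptable to share is a governance decision for each network, and this sketch does not address it.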

Project Objectives:

Develop a new open source software application that will use PopMedNet™ (PMN), an open source software application that enables the creation, operation, and governance of distributed health networks, to automate multistep interactive processes and allow stakeholders to conduct distributed regression analyses with data from different people held at different institutions without sharing potentially identifiable information across sites.
Develop this software application so that it is supported by PMN and can be modified and adopted for non-PMN applications.
Test the new distributed regression application in an actual distributed research network.
Provide technical and user documentation to accompany the new software and allow for its widespread adoption.
Explore the feasibility of conducting distributed regression analyses in which data from the same people are held at different institutions.
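The "multistep interactive processes" mentioned in the objectives arise because models such as logistic regression have no one-pass solution: the analysis center and the data partners must exchange information over several rounds. The sketch below illustrates one common way to do this, a Newton-Raphson iteration in which each site returns only its summed gradient and information matrix for the current coefficients. It is an illustrative assumption rather than the project's SAS code, and `site_step`, `fit`, `solve`, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]
Site = Tuple[Matrix, Vector]


def site_step(X: Matrix, y: Vector, beta: Vector) -> Tuple[Vector, Matrix]:
    """Run at a data partner: gradient and information matrix of the
    logistic log-likelihood at beta, summed over that site's patients."""
    p = len(beta)
    grad = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        mu = 1.0 / (1.0 + math.exp(-eta))  # predicted probability
        for i in range(p):
            grad[i] += (yi - mu) * row[i]
            for j in range(p):
                info[i][j] += mu * (1.0 - mu) * row[i] * row[j]
    return grad, info


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit(sites: Sequence[Site], max_iter: int = 25, tol: float = 1e-10) -> Vector:
    """Run at the analysis center: broadcast beta, sum the site replies,
    take one Newton-Raphson step, and repeat until the step is tiny."""
    p = len(sites[0][0][0])
    beta = [0.0] * p
    for _ in range(max_iter):
        replies = [site_step(X, y, beta) for X, y in sites]
        grad = [sum(g[i] for g, _ in replies) for i in range(p)]
        info = [[sum(h[i][j] for _, h in replies) for j in range(p)] for i in range(p)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Two hypothetical sites; column 0 is the intercept, y is a 0/1 outcome.
site_a: Site = ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [0.0, 0.0, 1.0, 0.0])
site_b: Site = ([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]], [1.0, 0.0, 1.0, 1.0])

beta = fit([site_a, site_b])
print(beta)
```

Each pass through the loop in `fit` corresponds to one request-and-response round between the analysis center and the data partners, which is the kind of repeated exchange a query workflow would need to automate. The loop here runs in a single process purely for illustration.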

PROJECT ACHIEVEMENTS & HIGHLIGHTS

The project team successfully implemented a pilot distributed regression analysis (DRA) query workflow with select data partners in the Sentinel System.
The team developed the DRA SAS code, and DRA was fully integrated into the 2017 PopMedNet™ release. The team also created two SAS packages used to run DRA: one for data partners and one for the analysis center. The packages include all algorithms for linear, logistic, and Cox regression. The open source software utilizes PopMedNet™ and allows stakeholders to perform real-world distributed regression on horizontally partitioned data within actual patient-centered outcomes research (PCOR) distributed data networks.
The team developed documentation of the DRA algorithms and set up a SAS-based DRA application for execution in a horizontally partitioned distributed data network.
The project team explored the feasibility of performing linear DRA within vertically partitioned distributed data networks using a publicly available dataset. The team found the PopMedNet™ DRA query workflow could be used to conduct DRA within vertically partitioned data environments. However, additional enhancements were required to integrate vertical DRA algorithms into the PopMedNet™ DRA query workflow, and additional internal and external testing was required to assess operational performance.

PUBLICATIONS, PRESENTATIONS, AND OTHER PUBLICLY AVAILABLE RESOURCES

Resources:

The FDA published a final report, “Utilizing Data from Various Partners in a Distributed Manner,” in October 2018. The report is available here: https://aspe.hhs.gov/sites/default/files/private/pdf/259016/DistributedRegressionAnalysisFinalTechnicalReport.pdf.
The DRA was fully integrated into the 2017 PopMedNet™ release and the new version of the software is available on www.popmednet.org and on Sentinel’s website at https://www.sentinelinitiative.org/sentinel/methods/utilizing-data-various-data-partners-distributed-manner. In addition to source code and documentation for the algorithms, the Sentinel website also provides test data and sample reports for each regression model type.
Sample reports and data files are available here: https://www.sentinelinitiative.org/sentinel/methods/utilizing-data-various-data-partners-distributed-manner.

RELATED PROJECTS

Below is a list of ASPE-funded PCORTF projects that are related to this project:

Harmonization of Various Common Data Models and Open Standards for Evidence Generation - This project was a collaborative effort among the Food and Drug Administration (FDA), National Cancer Institute (NCI), National Institutes of Health/National Center for Advancing Translational Sciences (NIH/NCATS), Office of the National Coordinator for Health Information Technology (ONC), and the National Library of Medicine (NLM). The project built a data infrastructure for conducting patient-centered outcomes research (PCOR) using observational data derived from the delivery of health care in routine clinical settings. The sources of these data include, but are not limited to, insurance billing claims, electronic health records (EHRs), and patient registries. In addition, the project team harmonized several existing common data models, including those of PCORnet and other networks.
