Training Data for Machine Learning to Enhance Patient-Centered Outcomes Research (PCOR) Data Infrastructure — A Case Study in Tuberculosis Drug Resistance

Manohar Karki, Karthik Kantipudi, Babak Haghighi, Vy Bui, Feng Yang, Hang Yu, Michael Harris, Yasmin M. Kassim, Darrell E. Hurt, Alex Rosenthal, Ziv Yaniv, and Stefan Jaeger

Executive summary

This report describes the research plan and results of a patient-centered outcomes research (PCOR) project carried out under a statement of work defined in an intra-agency agreement (IAA) between the Office of the Assistant Secretary for Planning and Evaluation (ASPE) and the National Library of Medicine (NLM). ASPE coordinates efforts to build data capacity for PCOR. As part of these efforts, ASPE and NLM collaborated on a project to detect tuberculosis drug resistance using artificial intelligence (AI) and machine learning. The project goal was to create a foundation to advance the use of AI for PCOR and clinical practice, using existing and to-be-acquired TB Portals data from the National Institute of Allergy and Infectious Diseases (NIAID). The TB Portals program data, provided by the Office of Cyber Infrastructure and Computational Biology (OCICB) at NIAID, is a valuable source of training data for machine classifiers: it allows researchers to train classifiers that discriminate between drug-resistant and drug-sensitive tuberculosis based on socioeconomic, geographic, clinical, laboratory, radiographic, and genomic data.

Machine learning is a type of AI in which a computer uses training data sets composed of large and varied amounts of data to “learn” how to identify patterns with little human intervention. Industry experts have acknowledged that large amounts of high-quality training data are a critical part of the foundation that will support researchers’ use of machine learning to accelerate the discovery of novel disease-outcome correlations and inform the design of prevention and treatment studies. High-quality training data sets are well labeled and structured, use standard data models and common data elements annotated by domain experts, and combine previously unconnected data resources; they can be used to train algorithms to elucidate knowledge and extract relevant data points for research. AI and associated technologies such as machine learning can process large amounts of data in varied, complex formats to identify effective treatments more quickly, potentially accelerating clinical innovation by speeding up the research lifecycle and the application of evidence in clinical settings. This project established a foundation for researchers to use AI to develop scientific approaches so that healthcare providers can match patients to the best treatments based on their specific health conditions, life experiences, and genetic/phenotypic profiles.
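As a rough illustration of what "learning from labeled training data" means for a two-class task such as drug-resistant versus drug-sensitive TB, the following minimal sketch trains a logistic regression classifier by gradient descent. It is not the project's method or code: the data is entirely synthetic, the two numeric features are hypothetical stand-ins for any numerically encoded attributes, and every function name here is invented for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_records(n, seed=0):
    """Generate made-up (features, label) pairs; label 1 stands for 'drug-resistant'.

    ASSUMPTION: the data is artificial and carries no clinical meaning.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Two artificial numeric features whose means differ by class.
        center = 1.0 if label == 1 else -1.0
        features = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        records.append((features, label))
    return records


def train_logistic_regression(records, epochs=200, learning_rate=0.1):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(records[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in records:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        n = len(records)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the predicted probability of class 1 is at least 0.5."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(weights, bias, records):
    """Fraction of records whose predicted label matches the true label."""
    correct = sum(predict(weights, bias, f) == y for f, y in records)
    return correct / len(records)


# Train on one synthetic sample and evaluate on a separately generated one.
train_set = make_synthetic_records(200, seed=0)
test_set = make_synthetic_records(100, seed=1)
weights, bias = train_logistic_regression(train_set)
test_accuracy = accuracy(weights, bias, test_set)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Evaluating on records that were not used for training is the important design choice in this sketch: it estimates how well the learned pattern generalizes instead of how well the model memorized its training set.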

The project was executed as part of a larger FY2019 OS-PCORTF project, in which NLM worked on the drug-resistant tuberculosis use case. The general objective was to enhance the capacity of PCOR researchers to use machine learning by developing and disseminating several resources that present not only training data and methods but also lessons learned from the processes and implementation. The project curated high-quality training sets from clinical research data collected with NIAID. These included clinical images such as frontal chest X-rays (CXR) and computed tomography (CT) scans, clinical and socioeconomic data, and genomic pathogen information from thousands of patients with drug-sensitive and drug-resistant tuberculosis (TB). These training data sets were used to develop, train, and improve machine learning models for detecting TB drug resistance. Building the capacity of researchers to compare the health outcomes of innovative approaches to delivering and managing care for TB supports the tenets outlined in the OS-PCORTF funding priority of value-based care and health outcomes. The overarching goal is to predict drug resistance early in a patient's care and to administer the appropriate patient-specific drugs for more efficient treatment. Successful implementation of this idea would be a significant breakthrough in the fight against drug-resistant TB and could save many lives.

Deliverables, including training data, machine learning algorithms, and trained models, were made available via TB Portals and a GitHub software repository. The tools and knowledge generated from this project will help PCOR researchers produce findings that could impact clinical practice. The results will encourage those building upon this project’s deliverables to develop similar use cases in other areas. Furthermore, as regulatory agencies develop national policies that increasingly consider patient-generated information in the approval of drugs and devices, evidence generated from the application of machine learning to patient-centered outcomes research will be beneficial.
