STATUS: Completed Project
In the U.S., central cancer registries collect, manage, and analyze longitudinal data about cancer cases and cancer deaths. Cancer data are collected from multiple sources such as hospitals, laboratories, physician offices, and independent diagnostic and treatment centers. Hospital reporting of cancer cases has been standardized for over a decade; however, as the provision of cancer care has shifted away from the hospital, registries have had to expand their data collection efforts to include data from nonstandard systems that contain large amounts of unstructured data. Abstracting these crucial cancer data is labor intensive and expensive, and researchers cannot analyze unstructured data without manual review.
Similarly, a considerable amount of clinical information submitted to the FDA Spontaneous Reporting Systems is unstructured. One of the FDA’s major responsibilities is postmarketing safety surveillance, conducted through the review of spontaneous adverse event reports submitted to the Vaccine Adverse Event Reporting System (VAERS) and the FDA Adverse Event Reporting System (FAERS). However, a considerable amount of clinical information in both systems is either not coded (e.g., medical and family history) or is not linked to codes that provide key information, such as the exact time of each symptom. Additionally, there may be duplicate entries for the same event, which complicates surveillance because submitted reports must be reviewed manually to trace the adverse event.
PROJECT PURPOSE & GOALS
The Development of a Natural Language Processing (NLP) Web Service for Public Health Use was a joint project between the CDC and the FDA.
This project developed an NLP web service that is publicly available to researchers to help them convert unstructured clinical information into structured and standardized coded data. The NLP web services environment is available on the Public Health Community Platform (PHCP), a cooperative platform for sharing interoperable technologies to address public health priority areas (e.g., tobacco use) aimed at improving population health outcomes and health equity. The NLP web services environment contains NLP architectures and tools that process spontaneous report narratives, extract clinical and temporal information from the text, format the data for presentation, and map unstructured medical concepts (e.g., cancer data and safety surveillance data) into structured and standardized data, i.e., International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM); Logical Observation Identifiers Names and Codes (LOINC); Systematized Nomenclature of Medicine (SNOMED); and the Medical Dictionary for Regulatory Activities (MedDRA).
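To make "converting unstructured clinical information into structured and standardized coded data" concrete, the following is a minimal sketch of the idea, not the CLEW implementation. It assumes a tiny hand-built phrase lexicon and a single regular expression for relative time expressions; the names `LEXICON`, `CodedConcept`, `extract_concepts`, and `extract_time_offsets` are hypothetical and were chosen only for this illustration. Real clinical NLP pipelines rely on trained models and full terminologies rather than a lookup table.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical toy lexicon (an assumption for illustration only):
# surface phrase -> (coding system, code).
LEXICON = {
    "fever": ("ICD-10-CM", "R50.9"),
    "adenocarcinoma of the prostate": ("ICD-10-CM", "C61"),
}

# Matches simple relative-time phrases such as "2 days after".
TIME_PATTERN = re.compile(r"(\d+)\s+(hour|day|week)s?\s+after", re.IGNORECASE)


@dataclass
class CodedConcept:
    phrase: str   # phrase as listed in the lexicon
    system: str   # coding system the code belongs to
    code: str     # standardized code
    start: int    # character offset of the match in the narrative


def extract_concepts(narrative: str) -> List[CodedConcept]:
    """Return lexicon phrases found in the narrative, ordered by position."""
    text = narrative.lower()
    found = []
    for phrase, (system, code) in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            found.append(CodedConcept(phrase, system, code, match.start()))
    return sorted(found, key=lambda concept: concept.start)


def extract_time_offsets(narrative: str) -> List[Tuple[int, str]]:
    """Return (amount, unit) pairs for phrases like '2 days after'."""
    return [(int(n), unit.lower()) for n, unit in TIME_PATTERN.findall(narrative)]


if __name__ == "__main__":
    report = "Patient developed fever 2 days after vaccination."
    print(extract_concepts(report))
    print(extract_time_offsets(report))
```

A dictionary lookup like this misses negation ("no fever"), misspellings, and synonyms, which is part of why the project surveyed existing NLP tools instead of relying on simple matching.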
The project objectives were to:
Conduct an “as is” environmental scan and literature review of existing NLP algorithms, methods, and tools for possible inclusion in the NLP web service to receive unstructured clinical information and return standardized data needed for CDC cancer surveillance and FDA safety surveillance domains. The assessment took into consideration possible requirements of other federal agencies, public health agencies, and/or PCORnet participant focus areas.
Design the NLP Web Service technical requirements (CDC lead; FDA contributor).
Build structured data sets using CDC and FDA resources to capture data and evaluate the performance of the pilot version of the NLP Web Service (CDC/FDA collaboration).
Evaluate the pilot and release the final NLP Web Service (CDC/FDA collaboration).
Update the NLP Web Service and release the final version on the PHCP (CDC/FDA collaboration).
PROJECT ACHIEVEMENTS & HIGHLIGHTS
The project team completed the environmental scan, which included a literature review and a multi-channel review that identified 54 existing open-source tools potentially useful in building pipelines for clinical NLP domains.
Upon completion of the environmental scan, the project team designed the Clinical Language Engineering Workbench (CLEW) platform environment in order to provide open-source NLP and machine learning tools to develop, experiment with, and refine clinical NLP models.
The CDC team tested use of the CLEW via a pilot on cancer pathology data. As a result of the pilot, the electronic Mapping, Reporting and Coding (eMaRC) Plus, an application used by central cancer registries to receive and process cancer pathology and biomarker data, was modified to interface with CLEW web services to process unstructured pathology data.
The FDA team tested use of the CLEW via a pilot on safety surveillance data. As a result of the pilot, the safety surveillance NLP application was incorporated into CLEW for other NLP experts to use. The FDA team also created an annotated data set for training NLP models and uploaded the solution to GitHub for broader use.
PUBLICATIONS, PRESENTATIONS, AND OTHER PUBLICLY AVAILABLE RESOURCES
CDC and FDA developed an NLP CLEW Final Report, which describes in detail the project’s goals and major accomplishments, including the environmental scan, the CLEW platform design, pilot findings, lessons learned, and dissemination deliverables. The report is available here: https://aspe.hhs.gov/system/files/pdf/259016/NLP-CLEW-FinalReport-508.pdf.
CDC and FDA developed the NLP CLEW Workbench Web Service Technical Report, which presents a detailed technical description of the core NLP approach of the prototype version of the Workbench and two pilot applications developed using the Workbench. The report is available here: https://aspe.hhs.gov/system/files/pdf/259016/NLP-Workbench-Web-Services-Technical-Report-508.pdf.
The project teams compiled a Lessons Learned Report. In this report, the teams summarize the key observations and findings that can inform future tool and system development and testing, as well as NLP and machine learning pipeline and model development. The report is available here: https://aspe.hhs.gov/system/files/pdf/259016/NLP-CLEW-LessonsLearned-508.pdf.
CDC and FDA developed a CLEW User Guidance document, which explains how to install and use the CLEW, as well as the products developed in the CDC and FDA pilots. The report is available here: https://aspe.hhs.gov/system/files/pdf/259016/NLP-CLEW-UserGuidanceDocument-508.pdf.
The NLP Workbench Web Service Project Website is available at: https://www.cdc.gov/cancer/npcr/informatics/nlp-workbench/index.htm
The code and documentation for CLEW, the cloud-based, open-source NLP Workbench Web Service, have been uploaded to the CDC public GitHub at https://github.com/CDCgov/NLPWorkbench and the FDA public GitHub at https://github.com/FDA/.
The FDA project team initiated and completed the generation of an annotated corpus to support training and development efforts of language models. The complete clinical and temporal annotations for the 1,000 Vaccine Adverse Event Reporting System (VAERS) reports are publicly available to the research community on GitHub here: http://github.com/fda/VAERS-Annotations.
The results of the environmental scan and literature review, “Natural Language Processing Systems for Capturing and Standardizing Unstructured Clinical Information: A Systematic Review” were published in the September 2017 issue of the Journal of Biomedical Informatics. The article can be found here: https://www.ncbi.nlm.nih.gov/pubmed/28729030
The team also published the final corpus in a paper describing the methodology used to create it so that researchers can assess the utility of the corpus to their own work. The paper titled “Generation of an annotated reference standard for vaccine adverse event reports” was published in Vaccine in 2019. The paper can be found here: https://www.ncbi.nlm.nih.gov/pubmed/29880244
Below is a list of ASPE-funded PCORTF projects that are related to this project:
Technologies for Donating Medicare Beneficiary Claims Data to Research Studies - This project aims to provide a safe and secure mechanism for Medicare beneficiaries to donate at least three years of their individual Medicare claims data to scientific research studies. This project will allow researchers to collect longitudinal patient information from Medicare and to link data sets with other relevant information for NIH-led research. In addition, this project leverages current investments in federal data infrastructure to inform future infrastructure development, combining advances in Blue Button on FHIR (Blue Button 2.0) and S4S to enhance data collection by the All of Us initiative.