Development of a Natural Language Processing (NLP) Web Service for Public Health Use

Designing a web service through which the public and researchers can share interoperable technologies to address public health issues.
  • Centers for Disease Control and Prevention (CDC) 
  • Food and Drug Administration (FDA)


Start Date
  • 6/1/2016


  • Use of Clinical Data for Research
  • Use of Publicly Funded Data Systems for Research


STATUS: Completed Project


In the U.S., central cancer registries collect, manage, and analyze longitudinal data about cancer cases and cancer deaths. Cancer data are collected from multiple sources such as hospitals, laboratories, physician offices, and independent diagnostic and treatment centers. Hospital reporting of cancer cases has been standardized for over a decade; however, as the provision of cancer care has shifted away from the hospital, registries have had to expand their data collection efforts to include data from nonstandard systems that contain large amounts of unstructured data. Abstracting these crucial cancer data is labor-intensive and expensive, and unstructured data limit the ability of researchers to analyze the information without manual review.

Similarly, a considerable amount of clinical information submitted to the FDA Spontaneous Reporting Systems is unstructured. One of the FDA’s major responsibilities is postmarketing safety surveillance through the review of spontaneous adverse event reports submitted to the Vaccine Adverse Event Reporting System (VAERS) and the FDA Adverse Event Reporting System (FAERS). However, a considerable amount of clinical information in both systems is either not coded (e.g., medical and family history) or is not linked to codes that provide key information, such as the exact time of each symptom. Additionally, there may be duplicate entries for the same event, which complicates the surveillance process and requires manual review of submitted reports to trace the adverse event.


The Development of a Natural Language Processing (NLP) Web Service for Public Health Use was a joint project between the CDC and the FDA.

This project developed an NLP web service that is publicly available to researchers to help them convert unstructured clinical information into structured and standardized coded data. The NLP web services environment is available on the Public Health Community Platform (PHCP), a cooperative platform for sharing interoperable technologies to address public health priority areas (e.g., tobacco use), with the aim of improving population health outcomes and health equity. The NLP web services environment contains NLP architectures and tools that process spontaneous report narratives, extract clinical and temporal information from the text, format the data for presentation, and map unstructured medical concepts (e.g., cancer data and safety surveillance data) into structured and standardized data (i.e., International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM); Logical Observation Identifiers Names and Codes (LOINC); Systematized Nomenclature of Medicine (SNOMED); and the Medical Dictionary for Regulatory Activities (MedDRA)).
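The processing steps described above (read a narrative, extract clinical and temporal information, map concepts to codes, format the result) can be illustrated with a minimal Python sketch. This is not the project's implementation or the CLEW API: the lexicon, the function names, and the codes are hypothetical placeholders chosen for illustration, and a real pipeline would rely on trained NLP models and licensed terminologies rather than a lookup table.

```python
import json
import re

# Hypothetical toy lexicon: phrase -> (coding system, placeholder code).
# The codes are made-up placeholders, NOT real ICD-10-CM or MedDRA entries.
LEXICON = {
    "fever": ("MedDRA", "PLACEHOLDER-001"),
    "headache": ("MedDRA", "PLACEHOLDER-002"),
    "breast carcinoma": ("ICD-10-CM", "PLACEHOLDER-003"),
}

# Matches simple relative-time phrases such as "2 days after".
TEMPORAL = re.compile(
    r"(\d+)\s+(hour|day|week)s?\s+(after|before)", re.IGNORECASE
)


def extract_concepts(narrative):
    """Return lexicon phrases found in the narrative, ordered by position."""
    text = narrative.lower()
    hits = []
    for phrase, (system, code) in LEXICON.items():
        offset = text.find(phrase)
        if offset != -1:
            hits.append(
                {"text": phrase, "system": system, "code": code, "offset": offset}
            )
    return sorted(hits, key=lambda hit: hit["offset"])


def extract_temporal(narrative):
    """Return relative-time expressions as structured records."""
    return [
        {
            "value": int(match.group(1)),
            "unit": match.group(2).lower(),
            "relation": match.group(3).lower(),
        }
        for match in TEMPORAL.finditer(narrative)
    ]


def process(narrative):
    """Convert one unstructured narrative into a structured record."""
    return {
        "concepts": extract_concepts(narrative),
        "temporal": extract_temporal(narrative),
    }


if __name__ == "__main__":
    report = "Patient developed fever and headache 2 days after vaccination."
    # Format the structured result for presentation.
    print(json.dumps(process(report), indent=2))
```

A dictionary lookup is used here only to keep the sketch self-contained; it misses negation, synonyms, and misspellings, which are among the reasons clinical narratives call for dedicated NLP tools.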

Project Objectives:

  • Conduct an “as is” environmental scan and literature review of all existing NLP algorithms, methods, and tools for possible inclusion in the NLP web service to receive unstructured clinical information and return standardized data needed for CDC cancer surveillance and FDA safety surveillance domains. The assessment took into consideration possible requirements of other federal agencies, public health agencies, and/or PCORnet participant focus areas.

  • Design the NLP Web Service technical requirements (CDC lead; FDA contributor).

  • Build structured data sets using CDC and FDA resources to capture data and evaluate the performance of the pilot version of the NLP Web Service (CDC/FDA collaboration).

  • Evaluate the pilot and release the final NLP Web Service (CDC/FDA collaboration).

  • Update the NLP Web Service and release the final version on the PHCP (CDC/FDA collaboration).


  • The project team completed the environmental scan, which included a literature review and a multi-channel review that identified 54 existing open-source tools that are potentially useful in building pipelines for clinical NLP domains.

  • Upon completion of the environmental scan, the project team designed the Clinical Language Engineering Workbench (CLEW) platform environment to provide open-source NLP and machine learning tools for developing, experimenting with, and refining clinical NLP models.

  • The CDC team tested use of the CLEW via a pilot on cancer pathology data. As a result of the pilot, the electronic Mapping, Reporting and Coding (eMaRC) Plus, an application used by central cancer registries to receive and process cancer pathology and biomarker data, was modified to interface with CLEW web services to process unstructured pathology data.

  • The FDA team tested use of the CLEW via a pilot on safety surveillance data. As a result of the pilot, the safety surveillance NLP application was incorporated into CLEW for other NLP experts to use. The FDA team also created an annotated data set for training NLP models and uploaded the solution to GitHub for broader use.




  • The results of the environmental scan and literature review, “Natural Language Processing Systems for Capturing and Standardizing Unstructured Clinical Information: A Systematic Review,” were published in the September 2017 issue of the Journal of Biomedical Informatics. The article can be found here:

  • The team also published the final corpus in a paper describing the methodology used to create it so that researchers can assess the utility of the corpus to their own work. The paper titled “Generation of an annotated reference standard for vaccine adverse event reports” was published in Vaccine in 2019. The paper can be found here:


Below is a list of ASPE-funded PCORTF projects that are related to this project:

Technologies for Donating Medicare Beneficiary Claims Data to Research Studies - This project aims to provide a safe and secure mechanism for Medicare beneficiaries to donate at least three years of their individual Medicare claims data to scientific research studies. This project will allow researchers to collect longitudinal patient information from Medicare and to link data sets with other relevant information for NIH-led research. In addition, this project leverages current investments in federal data infrastructure to inform future infrastructure development, combining advances in Blue Button on FHIR (Blue Button 2.0) and S4S to enhance data collection by the All of Us initiative.