The Feasibility of Using Electronic Health Data for Research on Small Populations. Technical Conditions Required for Research Using EHR and Other Electronic Health Data


In order to use information in EHRs for research, a number of technical conditions must first be in place, such as the ability to extract and format data for research and to address missing data and data quality problems. As with claims data, the information in EHRs was not collected for research purposes. Whereas claims data are collected and entered in ways that help to maximize revenues, information is entered in EHRs to support patient care and to fit into clinical routines and workflows.288 In addition to assisting clinicians and health care organizations in their day-to-day work, the information that goes into EHRs provides documentation that is required by law, that is used for billing, and that informs patient care decisions. For these purposes, there is not necessarily a need to ensure data are entered in a uniform fashion or to create the capacity for selectively pulling certain information from the system, aggregating data, or identifying certain groups of patients. Converting the information contained in EHRs into databases suitable for research purposes is costly and requires specific expertise.

Data extraction

Using data from EHRs for research requires extraction from an organization’s EHR system so that the data can be cleaned, reformatted, and analyzed. These steps require a substantial staff of programmers; their numbers depend on the system and vendor used.289 Some organizations create a data warehouse to store extracted data for secondary use; records in such a warehouse have a different architecture than an EHR, which is designed for clinical transactions.290 An organization may even have multiple data warehouses with the same data but in different forms to support various strategic functions, including strategic planning, resource scheduling, and inventory control. Part of the problem is that various user groups often do not agree on the definition of variables, acceptable reliability rates, and the list of variables to be extracted. In addition, these functions require data in a different format than exists in an EHR.291 For example, to facilitate access to information about any given patient, the design of an EHR may include many linked tables, allowing clinicians to quickly retrieve only certain information on a patient, such as the problem list or prescriptions. For research, however, it is more useful to have all of this information in one large flat file.
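
The contrast between linked clinical tables and a single flat file can be sketched in a few lines of Python. This is a minimal illustration only: the table layouts, the field names (patient_id, birth_year, problem, drug), and the function name flatten are all hypothetical and do not describe any vendor's schema or any organization's actual extraction process.

```python
from collections import defaultdict

# Hypothetical linked tables, keyed by a made-up patient_id.
patients = [
    {"patient_id": 1, "birth_year": 2001},
    {"patient_id": 2, "birth_year": 1975},
]
problems = [
    {"patient_id": 1, "problem": "asthma"},
    {"patient_id": 1, "problem": "eczema"},
]
prescriptions = [
    {"patient_id": 2, "drug": "metformin"},
]


def flatten(patients, problems, prescriptions):
    """Return one row per patient, with linked records collapsed into columns."""
    problems_by_id = defaultdict(list)
    for row in problems:
        problems_by_id[row["patient_id"]].append(row["problem"])
    drugs_by_id = defaultdict(list)
    for row in prescriptions:
        drugs_by_id[row["patient_id"]].append(row["drug"])

    flat = []
    for p in patients:
        pid = p["patient_id"]
        flat.append({
            "patient_id": pid,
            "birth_year": p["birth_year"],
            # Multiple linked values are joined into one delimited cell.
            # An empty string means "nothing recorded", not "condition absent".
            "problems": ";".join(sorted(problems_by_id.get(pid, []))),
            "drugs": ";".join(sorted(drugs_by_id.get(pid, []))),
        })
    return flat


flat_rows = flatten(patients, problems, prescriptions)
```

Each element of flat_rows is one analysis-ready row per patient. A real extraction would also have to settle the variable definitions that user groups disagree on, which this sketch does not attempt.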

Organizations handle extraction and reformatting in various ways. Intermountain Healthcare has developed a central data warehouse where all information from its EHR, billing system, insurance product, registration system, and laboratory and radiology systems is pooled and linked. Data sets for research are then extracted from this warehouse rather than the EHR so that research does not interrupt the clinical care process or slow down the EHR.292 Rather than pooling to and extracting from a central location, Geisinger extracts data from 13 databases (including one EHR database and 12 databases from other clinical and administrative systems) and puts those into a separate database designed for research and quality improvement.293 New York City’s Health and Hospitals Corporation (HHC) has data warehouses for each of its component hospital and community health systems from which aggregate data can be pulled. HHC has compiled several registries, such as a registry of some 60,000 diabetics that contains information that is used to track patients and improve outcomes.294

Intellectual property issues may also be involved. Epic sells a data management product that extracts data from organizations’ internal files. However, because Epic considers these files to be intellectual property, client organizations are not allowed to share the internal variable names without permission from Epic. This restriction has been such an impediment that Kaiser Permanente is replacing variable names it has used for many years because they are Epic names.295 There are concerns that as large vendors such as Epic have gained market power, they are able to charge high prices while providing inflexible products and requiring additional costs for each functionality added to the EHR system.

Some research using EHR data has proceeded by extracting only the subset of data needed for a specific study, either by manually identifying the desired records and/or variables or by querying the system so that it automatically retrieves the desired information. For example, a researcher may want to extract the records of adolescent patients with autism spectrum disorders (ASDs). However, the information needed to select desired records may not be easily available for the computer to identify. While age is likely available to identify adolescents, diagnostic information on ASDs is often not readily available. In addition, not all systems were built to be queried. For example, Montefiore Medical Center in the Bronx, New York, found that its system was not structured to be queried and that it needed to develop software to pull data for analysis from the system.296

Studies comparing the accuracy of automated versus manual extraction of EHR data on quality measures have found that the electronic method resulted in an underestimate of the rate of recommended care. For instance, the number of patients who received a clinical preventive service or who met a recommended treatment goal was undercounted when the automated method was used.297, 298 These findings suggest there are risks along with efficiencies in using automated extraction of EHR data for research purposes.

Part of the challenge is that the information needed to identify selected patient characteristics (e.g., autism spectrum disorder) may be spread across multiple fields but not expressed directly. For example, Kaiser Permanente developed and validated a software algorithm to detect episodes of pregnancy in patients’ EHRs. This algorithm searched for indicators of pregnancy in diagnosis and procedure codes, laboratory tests, pharmacy dispensing, and imaging procedures that are typical of pregnancy. Although using medical records to identify which patients are pregnant seems straightforward, Kaiser Permanente found that it is not easy to automate this synthesis of multiple data points from different sections of a patient chart, a task that is also difficult to do manually.299
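
The general idea of combining indicators from several sections of a chart can be sketched as follows. This is not Kaiser Permanente's algorithm, whose details are not given here. The indicator labels, the record layout, the function names, and the rule of requiring evidence from at least two source types are all assumptions made for illustration.

```python
# Hypothetical indicator labels; real code lists would come from clinical review.
INDICATORS = {
    "diagnosis": {"DX_PRENATAL_VISIT"},
    "lab": {"LAB_HCG_POSITIVE"},
    "pharmacy": {"RX_PRENATAL_VITAMIN"},
    "imaging": {"IMG_OBSTETRIC_ULTRASOUND"},
}


def sources_with_evidence(record):
    """Return the sorted source types in which at least one indicator appears."""
    found = []
    for source, codes in INDICATORS.items():
        if codes & set(record.get(source, [])):
            found.append(source)
    return sorted(found)


def flag_patient(record, min_sources=2):
    """Flag a record only when enough independent source types agree (assumed rule)."""
    return len(sources_with_evidence(record)) >= min_sources


# Made-up example records, one field per chart section.
corroborated = {"diagnosis": ["DX_PRENATAL_VISIT"], "lab": ["LAB_HCG_POSITIVE"]}
single_source = {"pharmacy": ["RX_PRENATAL_VITAMIN"]}
```

Here flag_patient(corroborated) is true and flag_patient(single_source) is false. Raising or lowering min_sources trades missed cases against false positives, which is one reason such algorithms need validation against manual chart review.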

Processing free-text data

Data extracted from EHRs must be converted to an analyzable format. The major difficulty for both data extraction and research is that a large portion of the data in EHRs has not been entered in a coded format. Desired information may be in free text that was entered by clinicians to record their observations and assist with their decision-making. Even diagnoses may be put into free text by physicians because coding them is not needed for their day-to-day work. Some diagnoses (including perhaps ASD) may not be entered because of stigma concerns. Thus, relying on coded fields alone to identify patients with certain diagnoses may result in incomplete and perhaps biased representation.300 As part of an evaluation of its mental health integration program, Intermountain Healthcare looked for use of a depression metric among patients who received care at its organization. Intermountain found that even when mental health services were described in physicians’ notes, the corresponding data elements were often missing from the structured fields in the EHR.301

Free-text data are difficult to use in research because they are highly heterogeneous, describing patients with similar characteristics or conditions in different ways. This variation makes it difficult to identify patients with shared characteristics for data analysis. The text may also not conform to standard grammar, may use acronyms and abbreviations, and may include typing and spelling errors. A clinician’s assessments may also be recorded as tentative, and the information may be context specific from subject to subject. A disease may be mentioned when it has been “ruled out.” Recording the nuances in each case makes the information valuable for clinicians’ work but difficult to use for analysis.302

Active efforts are under way to find methods to overcome the limitations of unstructured data, and there has been great progress in developing algorithms and software for natural language processing with which to create standard categories from free text inserted into EHRs by clinicians. Researchers have been able to identify some populations by searching for certain words or phrases in the free text of EHRs. For example, Dr. Jesse Ehrenfeld from Vanderbilt University developed and validated tools for natural language processing to identify LGBT individuals from their EHR data in order to determine whether such patient characteristics might be affecting diagnosis, treatment, and health outcomes. This process involves searching records for key terms such as “lesbian” or “bisexual,” but also looking for other indicators such as patients listing a same-gender emergency contact with a different last name. He reports that the initial search algorithm resulted in a false positive rate of 22 percent, but that after refining the algorithm to identify negation words for exclusion, only 3 percent of those identified as LGBT using the algorithm had been incorrectly classified as such.303
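
The two steps described above, searching free text for key terms and then excluding mentions that sit next to negation words, can be sketched with a simple word-window check. This is an illustrative sketch and not Dr. Ehrenfeld's tool: the list of negation cues, the window size, and the function name mentions_term are assumptions, and the sketch handles only single-word search terms.

```python
import re

# Assumed negation cues; a validated tool would use a much richer list.
NEGATION_CUES = ("no", "not", "denies", "without", "ruled out", "negative for")


def mentions_term(note, terms, window=4):
    """Return True if a term appears with no negation cue within `window` words.

    `terms` is a set of lowercase single words.
    """
    words = re.findall(r"[a-z']+", note.lower())
    for i, word in enumerate(words):
        if word not in terms:
            continue
        # Look at a few words on either side of the matched term.
        context = " ".join(words[max(0, i - window): i + window + 1])
        padded = f" {context} "  # pad so cues match whole words only
        if not any(f" {cue} " in padded for cue in NEGATION_CUES):
            return True
    return False
```

With terms={"diabetes"}, the note "Patient has diabetes and hypertension." is matched, while "Diabetes ruled out after testing." and "Patient denies diabetes." are excluded. A fixed window is crude: a negation that belongs to a neighboring phrase can wrongly suppress a true mention, so any such rule would need to be validated against manual review.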

One systematic literature review of clinical coding and classification processes to transform natural language into standardized data found these processes had varying degrees of success.304 In general, the reliability of natural language processing programs appears to be better where variables are narrowly and consistently defined.305 Types of coding were found to fall into two primary groups: those that map text to existing classification systems such as International Classification of Diseases (ICD) or Current Procedural Terminology (CPT) codes, and those such as Dr. Ehrenfeld’s that used a coding scheme developed for a specific study to look for the presence or absence of certain terms or phrases.306

Despite the success of some efforts to convert free text into coded data, some experts caution that natural language processing should not be considered a magic bullet. Natural language processing requires computers that are very large and fast in order to process free text in a reasonable amount of time. In many cases, it may be more efficient and accurate to ask patients for the desired information rather than searching for it in the free text.307 Also, billing, lab, pharmacy, or radiology databases may be better sources of diagnostic information than free text and may be worth exploring before turning to natural language processing of the free text in EHRs. These utilization databases tend to be more structured than the problem notes recorded in the EHR.308

Other unstructured data include scanned images, such as radiology images and PDFs of letters or records from other providers that have been scanned or faxed and then uploaded to the EHR. While these images are useful for a clinician to open and view, converting them into something codable takes great effort and computing power. This issue is a whole sub-field of informatics by itself.309

Missing data and data quality

In addition to lack of standardization, the accuracy and completeness of data entered into EHRs are major concerns for research, since high-quality, complete data are needed for drawing valid conclusions. Data quality has often been called into question when EHR data have been used for quality assessments. Compared to paper charts, electronic health records have been found to hold significant errors, in part because during this transitional period many clinicians have not been accustomed to using a computer as part of their daily workflow. In addition to typos and spelling errors, errors of omission and commission have been found in medication lists and in problem lists where chronic and acute conditions are documented.310 Information entered in an EHR may also be affected by billing considerations. For example, some clinicians may not see the need to add secondary diagnoses for complex patients if doing so would not affect the DRG payments. Such omissions may result in researchers’ underreporting certain diagnostic complexities.311

Because EHRs today may not reliably provide a complete picture of a patient’s health, researchers should guard against drawing conclusions as though they were complete, such as assuming that the absence of mention means that a particular characteristic, condition, or treatment is not present. For clinical purposes, a physician may be more likely to record problems than improvement, particularly if there is no need for follow-up, but a researcher would need that information.312 In addition, some research that relies on EHR data may be skewed because the data do not include people who are unable to obtain care because of access barriers resulting from lack of insurance or differences in language or culture.313 This is a particular issue for the transgender population, which is often uninsured or seeks services that insurance does not cover, such as hormonal therapies, which have often been obtained outside the health care system.314 There is also the issue of patients moving in and out of EHR systems, either because they have stopped receiving care or because they have gone to another health care provider. Members of some Asian subpopulations may even travel between countries, receiving care and taking medications they have obtained abroad. The mobility of populations can make it difficult to create cohorts and to make reliable inferences about them.315
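
One way to avoid treating an absent mention as a recorded absence is to keep three states (present, absent, unknown) during analysis. The sketch below is a minimal illustration under assumed names: the record layout, the field name used in the example, and the functions condition_status and prevalence_bounds are hypothetical.

```python
from typing import Optional


def condition_status(record: dict, field: str) -> Optional[bool]:
    """Return True or False only when explicitly recorded; None when the chart is silent."""
    value = record.get(field)
    if value is None:
        return None
    return bool(value)


def prevalence_bounds(records, field):
    """Return (lower, upper) prevalence, treating unknowns as all absent or all present."""
    if not records:
        raise ValueError("records must not be empty")
    statuses = [condition_status(r, field) for r in records]
    known_yes = sum(s is True for s in statuses)
    unknown = sum(s is None for s in statuses)
    n = len(statuses)
    return known_yes / n, (known_yes + unknown) / n


# Made-up records: one recorded yes, one recorded no, two silent charts.
example_records = [{"asd": True}, {"asd": False}, {}, {"asd": None}]
```

For example_records, prevalence_bounds(example_records, "asd") gives a lower bound of 0.25 and an upper bound of 0.75. Reporting the range, rather than a single figure that assumes silence means absence, makes the effect of missing data visible.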

The need for certain types of patients, such as those with ASDs, to see multiple providers (including mental health and medical providers) also makes it challenging to get a complete picture of someone’s health care through an EHR. Children may also receive testing for ASDs through the educational system that may not be shared with the child’s pediatrician. Although this challenge is related to the bigger issue of how the health system is organized, further development of the ability to share information among providers will be important in studying small populations. However, there remains the challenge that a patient may go to providers that do not have electronic data (such as some long-term care facilities), making it more difficult to integrate the information into the patient’s electronic record with his or her primary care provider.316

However, increasingly integrated models of health care delivery should present opportunities to gain more complete pictures of patients’ care for study. In an integrated delivery system, a single organization provides most or all of a patient’s care across multiple settings. Integrated systems tend to be particularly advanced in the functionality and use of the EHR systems as a mechanism by which they can coordinate care across multiple settings. Therefore, a number of those interviewed for this report work in such organizations, and many examples we mention in this report come from integrated delivery systems. Shared EHR systems have permitted an increasing number of health care organizations to operate as virtual systems even though they are not a single organizational entity. This creates new opportunities to study patient care across multiple settings.

With the recent growth of accountable care organizations (ACOs) and the accompanying needed data sharing, researchers may increasingly be able to capture information about patients regardless of where they receive care. For example, because Essentia Health in the upper Midwest is an ACO, it has electronic access to patient information no matter where among the collaborating organizations they receive care, and Essentia can successfully request this information from other providers as a condition of getting paid for services for patients covered by the ACO contract.317

The growth of ambulatory networks connected with hospitals also facilitates this type of data sharing. For example, the Pediatric Research Consortium (PeRC) at Children’s Hospital of Philadelphia (CHOP) is able to match outpatient data from CHOP’s primary care network with hospital data for patients who have received care in both. However, information is not available about care received in other settings, so the EHR system is most useful for the subset of patients who receive sub-specialty care within CHOP as opposed to the whole network.318

Restricted data

At times a portion of the medical record is restricted or separated from the rest of the patient’s information if it is viewed as sensitive in order to protect the patient’s privacy. This may be of particular concern for small populations where there may be an associated stigma, such as ASDs or LGBT populations. Patients with ASDs often receive care from mental health providers, and it is common for some or all of this information to be restricted. Even if it is included in the medical record, researchers may need special permission to be able to use it for a study—particularly as mentally disabled or cognitively impaired persons are considered vulnerable populations and therefore are a protected class of human subjects when research is considered by institutional review boards. This is an issue not only for EHR data, but for claims data as well—where any substance abuse claims must be removed when the data are used for research.319

Legacy systems

Because most EHR systems are relatively new, the number of years of available patient data varies by organization; information needed to look at a patient over time may be in paper charts or legacy electronic systems and not available for EHR-based research. Physicians in organizations that have upgraded their EHR systems may be able to log in to the old system to access critical patient information stored there, but the information might not be readily available for research. The alternative ways to link legacy data into new systems all require time and resources.320

Needed expertise

The skills required to conduct research using EHR data are highly technical and specialized. A team of information systems staff is needed to maintain an EHR data warehouse that supports care delivery, and translation to a research database requires another set of technical experts. This research informatics team must include programmers and analysts who build and maintain a research-focused warehouse.321 Higher education has yet to catch up with programs designed to provide training in these skills, which would require links between business and medical schools.322 The leader of this team must possess both IT skills and clinical expertise, and such individuals are in short supply as well, particularly as the fields of both medicine and technology have been evolving quickly.

It is also crucial that individuals conducting EHR research have knowledge of research methods specific to EHR data because a unique longitudinal data set is being repurposed. The expertise needed includes statistical expertise to format and analyze the data and the ability to interpret findings while considering how the data were collected and formatted, as well as any limitations connected to the patient population and the context. These considerations require individuals with knowledge of the organizational and policy history that may affect how data were recorded. For example, an organization’s decision to train staff on the collection of race/ethnicity data, whether for internal purposes or to comply with policy or accreditation requirements, may explain a perceived growth in the number of patients it serves from a certain Asian subpopulation over time. Changes in the system, personnel, and social history need to be documented and considered when interpreting data. Therefore, it is important that data warehouses and networks collaborate with their participating organizations and providers.323
