The Feasibility of Using Electronic Health Data for Research on Small Populations. The Growing Availability of Electronic Health Data


The Institute of Medicine sees EHRs as an essential part of a “learning health care system,” and many believe they are critical for the success of medical homes, accountable care organizations, and other provider payment and delivery system reforms resulting from the Affordable Care Act. The use of EHR data for research depends first of all on the adoption and use of EHRs by health care providers. Over the past decade or so, early adopters of EHRs have begun to tap their potential for clinical, epidemiological, and health services research. These early adopters have included HMOs, large multispecialty medical groups, and large hospital-owned and operated systems that employ physicians and operate other facilities along the care continuum. Some have now started or participate in EHR-based research networks, often with federal support. Federal stimulus funds under the Health Information Technology for Economic and Clinical Health Act have resulted in a growing number of providers that use EHRs, which increases the size and variety of the populations that can be studied. For example, more federally qualified health centers, small physician practices, and critical access and safety net hospitals are adopting and using EHR technology, resulting in more information about traditionally vulnerable patient populations.

The current level and rate of increase in EHR adoption and use by providers suggest that the health care industry may be approaching a “tipping point,” that is, “the moment of critical mass where ideas, products, and behaviors spread like viruses.”2 The use of EHRs to capture, organize, and use information for quality and efficiency improvement as well as research is not just the expectation or norm among the “innovators” but increasingly the expectation and norm for the entire health care industry.

Information available in EHRs

Information in EHRs comes from both patients and care providers. Information such as demographic and other background information may be collected directly from the patient using a form or questionnaire they fill out at the registration desk, in the waiting room, or through a patient portal. Data entered during the office visit by the clinician may include reason for the visit, height, weight, vital signs, patient-reported symptoms and characteristics (such as behavior and lifestyle), diagnoses, treatments and tests ordered, and medications prescribed. In addition, data from the pharmacy, laboratory, and radiology are often incorporated into the EHR. Claims and billing information may also be integrated with an EHR. There is the potential to identify some small populations using information that is typically recorded in an EHR such as demographics and diagnosis.
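As a rough illustration of how demographic and diagnosis fields might be used to identify a small population, the following Python sketch filters a handful of made-up records. The record layout, the field names, and the function `find_cohort` are assumptions for illustration only and do not describe any particular EHR system; the diagnosis codes follow ICD-10 conventions (E11 for type 2 diabetes, I10 for essential hypertension).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical, simplified view of one patient's structured EHR data."""
    patient_id: str         # hypothetical local identifier
    race_ethnicity: str     # e.g., self-reported at registration
    diagnosis_codes: tuple  # diagnosis codes recorded at visits


def find_cohort(records, race_ethnicity, code_prefix):
    """Return records matching a demographic value and a diagnosis-code prefix."""
    return [
        r for r in records
        if r.race_ethnicity == race_ethnicity
        and any(code.startswith(code_prefix) for code in r.diagnosis_codes)
    ]


# Made-up example records; not real patient data.
records = [
    PatientRecord("p1", "White", ("E11.9",)),
    PatientRecord("p2", "Native Hawaiian or Other Pacific Islander", ("E11.9", "I10")),
    PatientRecord("p3", "Native Hawaiian or Other Pacific Islander", ("I10",)),
]

cohort = find_cohort(records, "Native Hawaiian or Other Pacific Islander", "E11")
print([r.patient_id for r in cohort])  # prints ['p2']
```

A filter like this works only to the extent that the relevant fields are structured and consistently coded.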

Having this information directly entered into the computer can transform the research enterprise, making data available in close to real time, facilitating the identification of patients with characteristics of interest, eliminating the need for data entry, and reducing reliance on patient recall as is required in survey research. EHRs also include a level of clinical detail on the process of care that is not available in federal survey or claims data. Having such detail about all patients in a health system also allows for identification of small populations, such as those with rare conditions.3 EHRs also provide information on patients who may not otherwise be included in research because they would not meet the requirements to participate in a clinical trial.4

Unlike federal survey data, however, EHR data are not collected or structured for research. Repurposing information collected for other purposes always presents challenges. Even though EHRs do include information that can facilitate research on small populations, a number of technical, legal, and multi-institutional conditions must be in place in order for this research to reach its full potential.

Technical conditions required for research using EHR and other electronic health data

To use EHR and other electronic health data for research, the information they contain must be extracted and formatted for research. The information in an EHR is collected to assist clinicians and health care organizations in their day-to-day work: to provide documentation required by law, to support billing, and to inform provider decision-making for the care of individual patients. For these purposes, there is often no need to ensure that information is entered in a uniform fashion, or to plan for the ability to selectively pull certain information from the system, to aggregate data, or to identify certain groups of patients. The cost of converting this information into databases suitable for research purposes is substantial.

A major limiting step in using data from EHRs for research is extracting the data from the EHR system. While an EHR system is where information is entered, it is not the place where the data can be cleaned, reformatted, and analyzed. Extraction can require a large staff of programmers, and the ease of doing so depends on the system and vendor used.5 Some organizations have created a central warehouse where data from the EHR, billing, registration, laboratory, and radiology systems are extracted, pooled together, and linked. Others have developed software to automate extraction or to query their EHR systems for selected records based on patient characteristics needed for analysis.
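The idea of pooling and linking extracts in a central warehouse can be sketched in a few lines of Python. This is a minimal illustration, not a description of any organization's warehouse: the function `build_warehouse`, the row layouts, and above all the assumption that every source system shares a common `patient_id` are hypothetical. In practice, matching patients across systems is often a substantial task in itself.

```python
from collections import defaultdict


def build_warehouse(**source_extracts):
    """Pool rows from several source systems and link them by patient_id.

    Each keyword argument is one source system's extract: a list of dicts
    that all carry a 'patient_id' key (an assumption made for this sketch).
    """
    warehouse = defaultdict(dict)
    for source_name, rows in source_extracts.items():
        for row in rows:
            patient_id = row["patient_id"]
            # Keep every field except the linking key itself.
            fields = {k: v for k, v in row.items() if k != "patient_id"}
            warehouse[patient_id].setdefault(source_name, []).append(fields)
    return dict(warehouse)


# Made-up extracts from three hypothetical source systems.
ehr_rows = [{"patient_id": "p1", "diagnosis": "E11.9"}]
lab_rows = [
    {"patient_id": "p1", "test": "HbA1c", "value": 7.2},
    {"patient_id": "p2", "test": "HbA1c", "value": 5.4},
]
billing_rows = [{"patient_id": "p2", "amount": 120.0}]

warehouse = build_warehouse(ehr=ehr_rows, labs=lab_rows, billing=billing_rows)
for patient_id in sorted(warehouse):
    print(patient_id, sorted(warehouse[patient_id]))
```

Each patient ends up with one linked entry listing whatever each source system held, which is the form in which the data can then be cleaned and analyzed outside the EHR itself.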

The major difficulty for both data extraction and research is that much of the content of EHRs has not been entered in a standard format. Desired information may be in free text that was entered by the clinicians to record their observations and assist with their decision-making. Some estimates say only 20 percent of information in EHRs is coded and put into structured fields, meaning most of the information is in free text. However, there has been great progress in the development of techniques to classify unstructured data. Algorithms and software have been developed for natural language processing (NLP) to take a clinician’s free text and create standard categories. However, some experts caution that NLP is at best a partial solution. In many cases, it may be more efficient and may produce more accurate data to ask the patient for the desired information or to use other data sources rather than trying to find it in the free text.6
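To make the notion of turning free text into standard categories concrete, here is a deliberately simple, rule-based Python sketch that assigns a smoking-status category to a clinical note. It is far cruder than the NLP software referred to above, and the patterns, category labels, and function name `classify_smoking_status` are assumptions chosen for illustration. Its brittleness also illustrates why some experts regard NLP as only a partial solution.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
SMOKING_PATTERNS = [
    ("never", re.compile(
        r"\b(never smoked|non-?smoker|denies (smoking|tobacco))\b", re.I)),
    ("former", re.compile(
        r"\b(former smoker|ex-?smoker|quit smoking)\b", re.I)),
    ("current", re.compile(
        r"\b(current smoker|smokes)\b", re.I)),
]


def classify_smoking_status(note):
    """Map a free-text note to 'never', 'former', 'current', or 'unknown'."""
    for label, pattern in SMOKING_PATTERNS:
        if pattern.search(note):
            return label
    return "unknown"


# Made-up note fragments; not real clinical text.
notes = [
    "Pt is a former smoker, quit smoking in 2009.",
    "Patient denies tobacco use.",
    "Smokes 1 pack per day.",
    "Here for follow-up of hypertension.",
]
for note in notes:
    print(classify_smoking_status(note), "<-", note)
```

Any phrasing the patterns do not anticipate falls into "unknown," and a misspelling or an unusual negation can produce a wrong category, so results from an approach like this would need validation against another source.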

In addition to lack of standardization, there are major concerns regarding the accuracy and completeness of data entered into EHRs. Research requires high quality and complete data for reaching valid conclusions. Compared to paper charts, electronic health records have been found to hold significant errors—in part, because many clinicians have not been accustomed to using a computer as part of their daily workflow during this transitional period from paper to electronic medical records. In addition to typos and spelling errors, errors of omission and commission have been found in medication lists and in problem lists where chronic and acute conditions are documented.7 In addition, cultural or financial barriers to access may prevent certain populations from receiving care, reducing the representativeness of EHR data available for research.8 There is also the issue of patients moving in and out of health care and EHR systems—either because they have stopped receiving care or have gone to another system. Such movement makes it difficult to create cohorts and to make reliable inferences about them.9 However, increasingly integrated models of health care delivery may present opportunities to study a more complete picture of a patient’s care.

Finally, the skills required to conduct research using EHR data are highly technical and specialized. They include the information technology, clinical, and research skills needed to prepare the data, conduct analysis, and interpret findings in light of the context in which the data were collected. Individuals with this combination of expertise are currently in short supply.

Legal conditions required for research using EHR and other electronic health data

In addition to requirements for data extraction and analysis, there are legal requirements that complicate the repurposing of EHR data for research. The traditional regulation of research by Institutional Review Boards under federal law can complicate the reuse of data collected for another purpose, and measures taken to protect privacy and data security may need to be reconsidered when using EHR data for research. Such data may have the potential to address additional research questions as the information accumulates over time. There is ongoing debate about complications created by legal requirements governing privacy and human subjects research.

Governance processes specifying who owns, controls, and regulates the data must also be in place in order to use EHR data for research. While HIPAA, the Common Rule, and state laws currently provide the major guidance regarding how health data can be used for research, each organization must determine how it will remain in compliance and how patient data can be used. Data governance requires major resource investments and cooperation within and across organizations.

Organizational conditions required for research combining multiple data sources

Because of the limitations of data from any single organization, there is great interest in combining data from multiple organizations. Data that is in electronic form can facilitate this. However, there are complexities in using EHR data for multi-institutional research. A mechanism is needed for data sharing. There are two major ways that data can be shared across multiple institutions: through a consolidated warehouse where a copy of the data from each institution is stored, or through some form of “distributed” network in which each organization retains its own data but data from each cooperating organization can be queried and produce research results. Centralizing data in a warehouse may increase efficiency when standardizing and querying the EHR data, but it requires resources to build and maintain and presents a number of privacy and governance concerns.10 The alternative—a virtual data warehouse in which data remain in separate locations—avoids the need for investment to build a separate infrastructure and simplifies the issues of data ownership and may better serve to protect privacy. However, it requires each participating organization to have the infrastructure to store data. Both methods for sharing data require significant infrastructure development, both technically and organizationally.
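The distributed approach can be illustrated with a short Python sketch in which each organization evaluates a query against its own records and returns only an aggregate count. The `Site` class, its method names, and the record fields are hypothetical. Real networks also involve governance agreements, common data models, and secure transport, none of which is shown here.

```python
class Site:
    """One participating organization; its records never leave this object."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # held locally, not shared with the coordinator

    def run_count_query(self, predicate):
        """Evaluate the query locally and return only an aggregate count."""
        return sum(1 for record in self._records if predicate(record))


def distributed_count(sites, predicate):
    """Send the same query to every site and combine the returned counts."""
    per_site = {site.name: site.run_count_query(predicate) for site in sites}
    return per_site, sum(per_site.values())


# Made-up records held by two hypothetical organizations.
site_a = Site("clinic_a", [
    {"age": 67, "diagnosis": "E11.9"},
    {"age": 45, "diagnosis": "I10"},
])
site_b = Site("clinic_b", [
    {"age": 71, "diagnosis": "E11.9"},
    {"age": 80, "diagnosis": "E11.9"},
])


def older_adults_with_diabetes(record):
    return record["age"] >= 65 and record["diagnosis"].startswith("E11")


per_site, total = distributed_count([site_a, site_b], older_adults_with_diabetes)
print(per_site, total)
```

In a centralized warehouse, by contrast, the coordinator would hold a copy of every record and run the query once. The trade-off described above is between that efficiency and the privacy and ownership advantages of leaving the data where they are.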

Ongoing funding for research infrastructure is needed, but most grants and contracts pay for specific, discrete studies. In recent years, however, the availability of this funding has increased. For example, this year the Patient-Centered Outcomes Research Institute is investing $68 million to support the initial development of a National Patient-Centered Clinical Research Network to build the capacity needed to support comparative effectiveness research.11

In addition, for studies that include data from multiple organizations, approval must be obtained from multiple Institutional Review Boards, adding to the time and resources needed to conduct the research. Also, a process is needed to ensure the quality of multisite data for research.12 Research among multiple institutions is facilitated by the interoperability of their EHR systems, which remains underdeveloped. Without interoperability, a large amount of effort is needed to make data comparable and combinable. Major health systems, some EHR vendors, and federal incentives are promoting standardized data fields and formats across different EHR systems. Research agencies also have the opportunity to promote standardization through their funding decisions. Incentives for meeting “meaningful use” standards will also likely have some effect, and in combination with other levers and incentives, the availability of standardized EHR data for research should continue to increase.13

As noted above, a number of research networks have also developed to facilitate research using data from multiple institutions (see Table II.2 in Part II). These include practice-based research networks of primary care practices, as well as other networks such as community health centers, HMOs, or cancer care providers who are collaborating to facilitate research. A major benefit of research networks is the wealth of clinical information available through their EHRs. Often the organizations within a network either already share a common EHR system or have worked to develop some form of centralized or distributed data warehouse for research purposes. Research on small populations is increasingly feasible as networks of EHRs with common structures and formats have developed, including a larger number of patients from multiple health care systems.

Other data sources may be linked with EHR data to provide additional information for research. Commonly linked administrative databases include disease and immunization registries, claims files, survey data, provider files, vital statistics (e.g., birth and death records), and area-level data.14 Additional clinical information, such as genetic, care management, and social network information, also has the potential for linkage with EHR data for research. The use of multiple data sources may serve both to validate electronic health data and to increase the amount of information available on target study populations.
