The Feasibility of Using Electronic Health Data for Research on Small Populations. Introduction to Part II


Patients’ health records and other electronic health information are an essential part of care, documenting critical issues such as their history, preventive care, diagnostic tests, and diagnoses and treatments over time. Health records also facilitate information sharing among physicians, other health professionals, and provider organizations that may be involved in a patient’s care. Containing key information regardless of where and from whom the patient receives care, health records can also be fairly comprehensive as well as longitudinal. Comprehensive integrated health records support the continuity and timeliness of care, which can in turn represent higher quality and less costly care.

Given the rich information contained in health records, much medical and health services research has been based on them, solely or in combination with other types of data (e.g., survey, claims). However, the traditional medium (i.e., paper and pen) in which health records have been created as well as organized and managed (i.e., paper file folders in a filing cabinet) has limited their usefulness for research. The manual process of identifying and obtaining the relevant records from one or more providers, abstracting the information contained in them, and creating a database for analysis is time-consuming, expensive, and fraught with potential errors and problems.206

The increased adoption and use of electronic health records (EHRs) and other forms of electronic health information have the potential to revolutionize research, overcoming many historical constraints. The new medium (electronic) in which health records are created, organized, and managed (computer hardware and software) results in “big data” (a lot of detailed data on a large number of people) and potentially faster and cheaper means of using medical records for research. For example, EHRs and other information technology can facilitate identifying patients with a particular diagnosis or receiving certain services, obtaining their records, extracting information, and creating the database needed for analysis. Additionally, recent developments like EHR certification standards, “Meaningful Use” (MU) criteria, tools like natural language processing (NLP) software, and electronic health information exchange (HIE) infrastructure (e.g., email, Internet, cloud) and standards (e.g., HL7) have the potential to improve the reliability and validity of EHR data as well as their comprehensiveness and longitudinality. As the Institute of Medicine (IOM) notes, EHRs and other electronic health data provide the information infrastructure to support a “learning health care system” that continuously and relatively quickly turns data into information to guide ongoing improvement efforts and research.207

Research on “small n” populations is an important area where EHR and other electronic data have the potential to complement existing data sources and methods, perhaps revolutionizing the research process. By “small n” populations, we mean subpopulations that are much less common than the “average,” “typical,” or “majority” population and may differ from it in important ways (e.g., disease prevalence, treatment). For a variety of reasons, small n populations have been difficult to study with traditional methods and data sources, such as federal surveys and claims data sets.

As described in Part I of this report, there are important limitations to the use of federal surveys for studying the health and health care needs of small n populations. These surveys may include too few people in important demographic or clinical subpopulations (e.g., race/ethnicity, sexual orientation/gender identity, location, or clinical condition) to produce valid and reliable findings. Additionally, the surveys may not contain items or questions specific to the population of interest or on covariates needed as controls (e.g., education, income, years in country, primary language). Finally, surveys may have substantial missing or inaccurate data about sensitive topics that raise privacy concerns (e.g., sexual behavior).
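To make the sample-size point concrete, the following is a minimal sketch of how the margin of error for an estimated proportion widens as a subgroup shrinks. It assumes simple random sampling and the normal approximation; federal surveys generally use complex sample designs, so actual margins would differ. The function name and the subgroup sizes are illustrative assumptions, not figures from this report.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% confidence interval
    for a proportion p estimated from n respondents.

    Assumes simple random sampling (an illustrative simplification).
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical subgroup sizes; p = 0.5 is the worst case for variance.
for n in (10_000, 1_000, 100, 30):
    print(f"n = {n:>6}: +/- {margin_of_error(0.5, n):.3f}")
```

Under these assumptions, an estimate based on a few dozen respondents carries a margin of error many times larger than one based on thousands, which is one reason subgroup findings from general-population surveys can be unreliable.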

Claims data from public or private health insurers or research agencies (e.g., AHRQ HCUP data) provide sources of data for research on some small n populations. However, these data have a number of limitations as well, primarily because they have been generated to obtain payment. Depending on the payment method, providers may be more or less motivated to submit comprehensive and accurate claims. Additionally, many important clinical details, as well as patient-reported information, do not appear in claims, although efforts are currently under way to try to enhance claims data with EHR and other types of data (e.g., laboratory and pharmacy data, death certificates or other vital records) for research purposes.208 Finally, claims data from particular health plans and providers may not provide comprehensive or longitudinal information because patients may change health plans and providers or see providers that are not part of the same organized delivery system.

The purpose of this report is to explore the potential use of EHRs and other electronic information to improve research about small populations, alone or in combination with other data sources. While “research” can take many forms, we define the term broadly in this report, as our primary purpose is to consider how EHR data can potentially be used to study the health and health care needs of small populations as illustrated by the four examples or sub-groups, including making comparisons to the larger population or other sub-groups as needed. As described in Part I, the priority research questions of interest about small n populations are highly varied, including topics traditionally addressed through clinical, pharmaceutical, health services, public health, public policy, and evaluation research. In some cases, even basic descriptive information about certain small populations remains unavailable due to current limitations with data and research methods. The Institute of Medicine has described different approaches to collecting evidence that may be more or less appropriate to address different types of research questions.209 In a similar way, EHR data, alone or in combination with other forms of data, may be better suited for some purposes or types of research than others. Additionally, increasing interest in quality improvement provides opportunities to harness EHR data for research on small n populations but may also present some challenges. We discuss the issue of the “fit” between the purpose and nature of the research on small n populations and the potential use of EHR data further throughout this report.

To explore this potential, we focus on four small n populations that have been difficult to study using conventional methods and sources of data: the LGBT population, Asian-American subpopulations, adolescents with autism spectrum disorders, and residents of rural areas. Each of these groupings has distinctive health or health care needs that have been difficult to study for reasons that include small numbers, sensitivity or validity of some reported information (problems in both survey data and data based on medical records or claims), and concerns about confidentiality when separate data elements could be combined to identify particular individuals in a data set.

Using EHR-based information for research on small n populations shares many challenges with all research that would use such information, but, as we will discuss, some special issues arise with small n populations. The four on which we focus illustrate a range of challenges in using EHR and other electronic health information for research. For example, the race/ethnicity information that is increasingly being collected in structured data fields in EHRs may not necessarily include smaller ethnic categories, and categories may differ across health systems. Information about sexual orientation, gender identity, and sexual behavior, if collected at all, is frequently located in the clinician’s notes or other unstructured data fields because of the potential discomfort and stigma historically associated with LGBT status or certain types of sexual behavior. However, natural language processing (NLP) of such unstructured data could be used to identify lesbian, gay, and bisexual individuals, or patient surveys could be administered through a patient portal or on an iPad in the waiting room and input or streamed into the EHR. A combination of structured (age, diagnoses, medications) and unstructured EHR information could be used to identify adolescents with autism spectrum disorder (ASD) and/or be combined with claims and/or educational records. Finally, providers located in rural areas could be identified and recruited for research on the health and health care needs of rural residents and other issues, but rural providers are less likely to have an EHR and the ability to exchange health information, and privacy concerns arise because individuals in sparsely populated areas could be identified if rural zip codes are included in the data.
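As a purely illustrative sketch of two ideas in the paragraph above, the code below combines structured fields (age, diagnosis codes) with a keyword search of free-text notes to flag adolescents with ASD, and then suppresses zip-code counts that fall below a minimum cell size. Everything here is an assumption made for illustration: the record layout, the age range of 12 to 17, the use of the ICD-10 prefix F84 as a screen, the cell-size threshold, and the toy records are hypothetical and are not drawn from this report. The keyword match is a crude stand-in for NLP and does not handle negation (e.g., "no evidence of autism").

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical, simplified EHR extract for one patient."""
    patient_id: str
    age: int
    zip_code: str
    diagnosis_codes: list = field(default_factory=list)
    note: str = ""


# Assumption: ICD-10 codes beginning with F84 are used as a structured screen.
ASD_CODE_PREFIX = "F84"
# Crude keyword stand-in for NLP; does not detect negation or family history.
ASD_PATTERN = re.compile(r"\bautis(m|tic)\b", re.IGNORECASE)


def is_adolescent_with_asd(rec: Record, min_age: int = 12, max_age: int = 17) -> bool:
    """Flag a record if age is in range and ASD appears in codes or notes."""
    if not (min_age <= rec.age <= max_age):
        return False
    coded = any(code.startswith(ASD_CODE_PREFIX) for code in rec.diagnosis_codes)
    mentioned = bool(ASD_PATTERN.search(rec.note))
    return coded or mentioned


def suppress_small_cells(records, min_cell: int = 5) -> dict:
    """Count records per zip code, replacing counts below min_cell with None."""
    counts = Counter(rec.zip_code for rec in records)
    return {z: (n if n >= min_cell else None) for z, n in counts.items()}


# Toy records with placeholder zip codes (not real data).
records = [
    Record("a", 14, "00001", ["F84.0"], ""),
    Record("b", 15, "00001", [], "History of autism spectrum disorder."),
    Record("c", 40, "00002", ["F84.0"], ""),
    Record("d", 13, "00002", ["J45"], "Asthma follow-up."),
]
cohort = [rec for rec in records if is_adolescent_with_asd(rec)]
print([rec.patient_id for rec in cohort])
print(suppress_small_cells(cohort))
```

In this toy example the two flagged patients share one zip code, and the count for that zip code is suppressed because it falls below the assumed threshold. This is the kind of trade-off between geographic detail and re-identification risk that arises with sparsely populated areas.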

To explore the potential strengths and limits of using EHR data for research on small n populations, alone or in combination with other data, this report covers four general topics. First, we provide a brief description of the methods and data used for the report and briefly discuss the need for research on small n populations. Second, we describe the increasing adoption and use of EHRs among physicians and hospitals, the kinds of data available in them, and the major issues encountered in using them for research within a single health care organization, such as a federally qualified health center, physician group, or large organized delivery system. Third, we describe some additional challenges to conducting research with EHR data from multiple health care organizations and/or in combining EHR and other data sources. Finally, we conclude with a discussion of the implications for HHS, including some potential next steps for exploring and improving the use of EHR and other data for research on these and other small n populations.
