The Feasibility of Using Electronic Health Data for Research on Small Populations. Discussion/Conclusion


This report has focused on the need for health information about small populations and the challenges that meeting that need has posed for researchers. To explore these challenges, we considered populations defined by four types of characteristics (sexual orientation and behavior, geography, race and ethnicity, and a health-related condition) that were selected to illustrate the range of problems that face researchers when using existing federal surveys (see Table I.6). In Part II of this report, we examine the potential of data based on electronic health records and related electronic data sources to complement these surveys and overcome some of the problems researchers have historically faced.

In each of our four illustrative populations, we have presented evidence of distinctive health and health care issues that research could help to clarify. Some of these issues pertain to problems and concerns that may characterize the population itself, as with the high rates of diabetes among Filipino Americans, the distance from specialty care that some rural populations face, or the problems posed by the transition to adulthood for adolescents with autism spectrum disorders. Other issues pertain to possible differences and possible disparities relative to other populations or the population at large regarding health conditions, services, or outcomes of care.

Research to address questions about small populations depends on several things. The most fundamental is the ability to identify the population of interest in the data. The second is having data on the independent and dependent variables of interest, as well as relevant co-variates (e.g., education, income) that need to be controlled for. Third, the value of many data sources can be enhanced if researchers are able to link to other data sources. Such linkage requires availability of a unique identifier or a matching algorithm that uses multiple variables. Fourth, some research questions require longitudinal data in which data about the same people can be linked over time. Finally, given resource realities and constraints, ways are needed to conduct research as efficiently and effectively as possible. Primary data collection strategies for getting sufficient numbers of people from small populations can be very expensive.

Some national health survey data sets (including the National Survey of Family Growth, National Health and Nutrition Examination Survey, National Health Interview Survey, and Behavioral Risk Factor Surveillance System) contain information about the LGBT population or Asian subpopulations. Although such data may be collected, several issues make them difficult to use for research on small populations. Information (e.g., zip codes) that is needed to characterize an individual’s degree of rural-ness is not available in federal public use data sets because of concerns that deductive identification of individual people might be possible. Additionally, validity concerns can be raised about information reported by a parent in household surveys about a condition such as a child’s autism. Survey data may also not include the dependent variables and co-variates needed to answer questions about the health and health care of small populations. Data analysis also requires sufficient numbers, and this can be a problem in survey research and secondary data analysis for people in categories that appear only in small numbers in a large population. This is particularly true when co-variates are considered, because each additional variable divides an already small sample into still smaller groups. The common solutions for this problem all have important drawbacks.

Combining data from surveys conducted in multiple years may yield a sufficiently large analytic sample, but it can produce misleading results if changes are occurring within the population over time. Oversampling a small population in survey research is often feasible, but it can be expensive. Two-stage sampling, which starts with a targeted survey and then conducts a follow-up survey of the target population, can also be expensive, and it can only be used when the target population is stable and easily identified. [201] Web-based surveys are another potential approach, but these are limited by self-selection bias (due to high nonresponse rates), representativeness issues, and concerns about the reliability and validity of the data collected. [202, 203] Finally, focusing the study on a region or setting in which there is a concentration of people who fit the category is an oft-used option for obtaining sufficiently large numbers, but the resulting data may not be representative of the larger population.

Available data sources also have other important limitations. Federal survey research is typically cross-sectional, lending itself poorly to research questions that have a longitudinal dimension. Additionally, survey domains, questions, and response categories may change over time, limiting the ability to use the data longitudinally. Data based on insurance claims may permit analysis that has a longitudinal dimension, but insurance claims do not typically include information that would permit identifying someone as belonging to an LGBT or Asian American subpopulation, and the data are limited to billed services from particular payers.

In sum, policymakers, advocates, or researchers interested in the health and health needs of small populations encounter various barriers to research using existing federal surveys.

A great deal of hope has been placed in the possibility that electronic information generated in the patient care process in organizations that have electronic health records will provide data that can be used for research on small populations, even though the organizations that collect such information at this time are hardly representative. Electronic health records and associated electronic data (e.g., patient-reported health behavior or laboratory or prescription information) have a number of potential benefits, such as the possible inclusion of large numbers of individuals from small populations, the collection of rich information about key process-of-care and outcome variables of interest, the potential for longitudinal study of cohorts of people (e.g., regarding outcomes of care), and the ability to do all of this relatively inexpensively.

In Part II of this report, we explore how electronic health records and other electronic data can be used to strengthen research on these patient populations.
