The Feasibility of Using Electronic Health Data for Research on Small Populations. Availability of Information to Identify Small Populations


Some small populations may be identifiable using information that is now typically recorded in EHRs. Residents of rural areas may be identifiable by the address and zip code information that is collected for billing purposes, although not all providers collect updated address information at each visit, so some of this information may be out of date.252 In addition, the lack of EHRs in rural practices and hospitals limits the availability of electronic health data on rural populations.253 While rural providers are increasingly adopting EHR systems, problems of interconnectivity and interoperability will remain. There is also evidence that critical access and small hospitals are at risk of failing to meet Meaningful Use criteria, which suggests there may continue to be limited data available on rural populations,254 even where EHRs are adopted. Therefore, rural health research using EHR data may remain, for the time being, in the hands of a few integrated health care delivery systems with EHRs and data warehouses that serve large rural populations, and these populations may not be representative of rural populations in general. Some of these organizations have been able to drill down within their rural populations for research or quality improvement purposes. For example, Intermountain Healthcare has looked at rural patients with three or more chronic conditions,255 and Kaiser Permanente Northwest (KP-NW) has looked at rural Hispanic patients with Spanish as their primary language, among whom drug-seeking behavior has been a particular problem. This population mostly receives its care through the Oregon Community Health Information Network (OCHIN) of federally qualified health centers (FQHCs). The KP Foundation Health Plan gave OCHIN $1 million to purchase the Epic electronic health record software, and the network and KP are now collaborating on research.
Since OCHIN hosts the EHR for nearly all the FQHCs in Oregon and the FQHCs are attempting to create a single medical record for each unique individual (rather than a separate record for each clinic visited by a patient), it is possible to identify drug-seeking behavior by patients who attempt to obtain opiate-containing drug products from multiple FQHCs at the same time.256
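To illustrate how a single record per individual makes this kind of check possible, the following is a minimal sketch in Python. It is not OCHIN's method: the record layout (`patient_id`, `clinic_id`, `date`, `is_opioid`), the 30-day window, and the two-clinic threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_multi_clinic_opioid_patients(prescriptions, window_days=30, min_clinics=2):
    """Return the IDs of patients who received opioid prescriptions from at
    least `min_clinics` distinct clinics within any `window_days`-day span.

    `prescriptions` is a list of dicts with hypothetical keys:
    patient_id, clinic_id, date (datetime.date), is_opioid (bool).
    """
    # Group opioid prescriptions by patient; this only works because
    # each individual has one shared identifier across clinics.
    by_patient = defaultdict(list)
    for rx in prescriptions:
        if rx["is_opioid"]:
            by_patient[rx["patient_id"]].append((rx["date"], rx["clinic_id"]))

    window = timedelta(days=window_days)
    flagged = set()
    for patient_id, events in by_patient.items():
        events.sort()  # chronological order
        for i, (start, _) in enumerate(events):
            # Distinct clinics seen in the window that begins at this event.
            clinics = {clinic for day, clinic in events[i:] if day - start <= window}
            if len(clinics) >= min_clinics:
                flagged.add(patient_id)
                break
    return flagged


sample_prescriptions = [
    {"patient_id": "A", "clinic_id": "c1", "date": date(2012, 1, 1), "is_opioid": True},
    {"patient_id": "A", "clinic_id": "c2", "date": date(2012, 1, 10), "is_opioid": True},
    {"patient_id": "B", "clinic_id": "c1", "date": date(2012, 1, 1), "is_opioid": True},
    {"patient_id": "B", "clinic_id": "c2", "date": date(2012, 6, 1), "is_opioid": True},
    {"patient_id": "C", "clinic_id": "c1", "date": date(2012, 1, 1), "is_opioid": False},
    {"patient_id": "C", "clinic_id": "c2", "date": date(2012, 1, 2), "is_opioid": False},
]
flagged_patients = flag_multi_clinic_opioid_patients(sample_prescriptions)
```

If each clinic kept a separate record under its own identifier, the grouping step would fail and the same patient would appear as several unrelated people. A flag from a rule like this would only indicate a pattern worth reviewing, not establish drug-seeking behavior.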

Adolescents with autism spectrum disorders (ASDs) may also be identified using date of birth and diagnostic information in the EHR. However, the autism diagnosis may appear in free text rather than in structured fields in the EHRs.257, 258 Even within structured fields, a number of diagnostic codes can indicate that someone has an ASD. Kaiser Permanente in Northern California has developed a list of valid autism diagnoses based on ICD codes and on who made the diagnosis.259 There is also variability within and across provider organizations regarding who can authoritatively diagnose ASDs, as well as in the tests and benchmarks that are used. Diagnoses of ASD are often made at psychological testing sites that are separate from the patient’s health care organization, particularly for those with higher incomes, and this may affect whether ASD appears in the organization’s EHR. Regardless of a family’s ability to pay, diagnosis of ASDs is also often made by school psychologists, especially at kindergarten intake. Providers of ASD patients’ medical care are not necessarily skilled at diagnosing conditions such as ASDs.260

An additional challenge when studying any adolescent population is that EHRs have generally been designed for adult populations, and pediatric EHRs are not yet as robust. AHRQ and CMS are currently working to strengthen pediatric EHRs with key data elements, but this work is still in the early stages. EHR and other electronic health data may be particularly important in advancing research on pediatric medicine, a field where clinicians and families have typically depended on findings from adult clinical trials. A number of pediatric primary care practice-based research networks have developed and are beginning to explore the use of electronic health data for research.261 For example, Pediatric Research in Office Settings (PROS), the American Academy of Pediatrics’ practice-based research network, has begun an EHR-based sub-network called ePROS. This sub-network was funded through the American Recovery and Reinvestment Act of 2009 and is being built to develop and test the infrastructure needed to conduct pediatric research using EHR systems. It includes providers from diverse practice settings in different states who use a variety of vendors, with plans to expand the sub-network substantially within the next one to two years.262

Using EHR information to identify patients who are members of specific Asian subpopulations or the LGBT population remains challenging at present. The broad OMB race/ethnicity categories are increasingly collected in health care settings, but information about patients’ membership in subpopulations such as Filipino or Vietnamese is rarely recorded in medical records. There are also variations in how “Asians” are recorded, sometimes along with Pacific Islanders (as per the OMB categories) and sometimes under “Other.” More generally, the race/ethnicity information in medical records is of variable quality because standardization requires a degree of staff training that does not always occur.263

Because the Americans with Disabilities Act requires health care providers to make interpreters available where needed, language information that may identify some Asian subpopulations may be in some organizations’ EHRs. KP-NW collects information about primary language spoken at home as well as need for translation services, and has standardized this variable across health plans so that someone could easily look up language sub-groups, such as patients who speak Tagalog.264 At the University of Vermont, refugee and immigrant patients have been identified through billing data where interpreters were used.265 Another approach to identifying racial and ethnic minorities may be the use of last names as proxies.

Sexual orientation is almost never collected or entered into patient records, although a few organizations have begun to do so. Therefore, it is important, for this and other characteristics, not to impute values where the fields are blank. UC Davis Medical Center has started using a form to collect information for entry into EHRs about patients’ sexual orientation as well as their current gender and gender as assigned at birth.266 Some such information may already be available in provider notes based on what patients may have said about behavior, attraction, or sexual identity. But there has been no standard way to collect this information, so it is difficult to create structured fields for it. Some EHR vendors such as Epic do have fields to capture information about sexual partners, and these can be used to run reports based on the sex of partners. Epic has expressed interest in receiving input from users on how to collect sexual and gender identity in its EHRs.267 The HMO Research Network’s virtual data warehouse has also incorporated sexual orientation as a variable, although the network believes there is significant under-reporting of these data across participating health plans. An additional challenge, even if this information is being collected, is that sexual orientation may change over time, so the information in an EHR may or may not be up to date. This challenge also makes it difficult to identify transgender populations because gender is typically collected only once.
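The caution about blank fields can be made concrete with a short Python sketch. It assumes a hypothetical export in which each patient record is a dictionary and the field name `sexual_orientation` is invented for illustration. Blank, missing, and null entries are counted in their own "not recorded" category instead of being folded into any recorded value.

```python
from collections import Counter

NOT_RECORDED = "not recorded"


def tabulate_field(records, field):
    """Count the values of `field` across patient records, keeping blank,
    missing, and null entries in a separate 'not recorded' category."""
    counts = Counter()
    for record in records:
        value = record.get(field)  # None when the field is absent
        if value is None or str(value).strip() == "":
            counts[NOT_RECORDED] += 1  # do not guess a value
        else:
            counts[str(value).strip().lower()] += 1
    return counts


sample_records = [
    {"sexual_orientation": "Gay"},
    {"sexual_orientation": ""},       # blank field
    {},                               # field never collected
    {"sexual_orientation": None},     # explicit null
    {"sexual_orientation": "heterosexual"},
]
orientation_counts = tabulate_field(sample_records, "sexual_orientation")
```

Reporting the "not recorded" count alongside the recorded values also gives a rough sense of how incomplete the field is before it is used to define a study population.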

The availability of different types of information in an EHR provides multiple possible approaches that can be used to identify a population, and the potential to improve accuracy when these approaches are used in combination. For example, while there are limitations to using diagnosis to identify patients with ASDs, looking also at the ICD-9 codes and medications may provide information to supplement or validate the diagnostic information. However, some of these types of information may be more accessible and more valid in an EHR than others.268

For example, while ICD-9 codes tend to be readily available, how well they reflect the patient’s actual diagnosis varies. Information on family and social history is generally incomplete and of low quality. However, information such as vital signs (blood pressure, weight, etc.) tends to be collected relatively frequently and recorded accurately. Lab results are not always available in an EHR, but when they are, they provide highly reliable information and may also be a better indication of what the clinician was thinking than the diagnostic code. EHRs also keep fairly accurate records of what was prescribed, which may also serve to validate the diagnosis (for example, if prescribed insulin, the patient likely has diabetes). However, prescriptions may be less useful for studying utilization, considering that up to 40 percent of prescriptions are never filled.269
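The idea of using several types of EHR information in combination can be sketched in Python using the insulin and diabetes example. The patient record layout (`icd9_codes`, `prescriptions`, `notes`) is a hypothetical one, and the defaults are illustrative: ICD-9 codes beginning with 250 denote diabetes mellitus, while the drug and note keywords are simple assumptions, not a validated case definition.

```python
def evidence_sources(patient, code_prefix="250", drug_keyword="insulin",
                     note_keyword="diabetes"):
    """Return the set of EHR information types that support a condition
    for one patient (hypothetical record layout)."""
    sources = set()
    # Structured diagnosis codes, e.g. ICD-9 250.xx for diabetes mellitus.
    if any(code.startswith(code_prefix) for code in patient.get("icd9_codes", [])):
        sources.add("diagnosis_code")
    # Prescriptions written (which is not the same as prescriptions filled).
    if any(drug_keyword in drug.lower() for drug in patient.get("prescriptions", [])):
        sources.add("prescription")
    # Naive keyword match in free-text notes; does not handle negation.
    if note_keyword in patient.get("notes", "").lower():
        sources.add("free_text")
    return sources


coded_and_treated = {
    "icd9_codes": ["250.00"],
    "prescriptions": ["Insulin glargine"],
    "notes": "Follow-up visit for diabetes.",
}
no_evidence = {"icd9_codes": ["401.9"], "prescriptions": ["lisinopril"], "notes": ""}
prescription_only = {"prescriptions": ["insulin lispro"]}
```

A study might require agreement between at least two sources before including a patient, trading some sensitivity for accuracy. The free-text check here is deliberately crude; a note reading "no history of diabetes" would still match, which is one reason free-text evidence usually needs more careful processing.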
