The Feasibility of Using Electronic Health Data for Research on Small Populations. Limitations of available data sources


Recognizing the health needs of, and health-related differences among, Asian-American subpopulations, various researchers, policy makers, and advocates for Asian Americans have called for more consistent and standardized collection of data on Asian subpopulations. The challenges in obtaining adequate data to study the health and health care of Asian-American subpopulations include language barriers, small numbers, and differences from project to project in how groupings are defined and combined. The first two of these problems interact with each other. Although costly, it is possible to collect data in multiple languages, and some surveys have done so. But the problem of small numbers adds complications. The Asian-American population is itself small, and its subpopulations and language groups are of course even smaller.

Under the Paperwork Reduction Act, the Office of Management and Budget (OMB) uses race and ethnicity standards in its review of federal agency requests to collect data through surveys and forms. For the most part, surveys conform to the standard categories. Additional granularity is encouraged when feasible, but it must always permit aggregation to the appropriate categories prescribed in the standard. Because administrative data are not always reported by individuals themselves, but are instead collected by providers or other parties, their level of consistency may not match that of surveys. The aim, however, is to meet the standard when possible. Determinations about the level of granularity are made in light of whether a particular data collection activity is expected to generate a sufficient response.
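
The aggregation requirement can be illustrated with a minimal sketch. The roll-up table below is an assumption made for illustration, not an official crosswalk: it maps a handful of granular responses to the 1997 OMB categories "Asian" and "Native Hawaiian or Other Pacific Islander", and the names `GRANULAR_TO_OMB`, `roll_up`, and `aggregate` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Illustrative roll-up table (an assumption for this sketch, not an official
# crosswalk): each granular response maps to exactly one OMB standard category.
GRANULAR_TO_OMB: Dict[str, str] = {
    "Asian Indian": "Asian",
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Vietnamese": "Asian",
    "Other Asian": "Asian",
    "Native Hawaiian": "Native Hawaiian or Other Pacific Islander",
    "Guamanian or Chamorro": "Native Hawaiian or Other Pacific Islander",
    "Samoan": "Native Hawaiian or Other Pacific Islander",
}


def roll_up(response: str) -> str:
    """Return the standard category for a granular response.

    Responses that are not in the table (for example, free-text write-ins)
    are labeled "Unmapped" so they can be reviewed rather than silently lost.
    """
    return GRANULAR_TO_OMB.get(response.strip(), "Unmapped")


def aggregate(responses: Iterable[str]) -> Dict[str, int]:
    """Count responses by standard category after rolling them up."""
    return dict(Counter(roll_up(r) for r in responses))


if __name__ == "__main__":
    # Made-up responses, for demonstration only.
    sample = ["Korean", "Chinese", "Samoan", "Hmong"]
    print(aggregate(sample))
```

Keeping the granular value alongside the rolled-up category lets a data collector report at either level. Unmapped write-ins show why a fixed table alone is not enough when respondents describe themselves in their own words.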

Standards continue to evolve. In 1997, OMB revised federal data collection standards to separate Asians and Native Hawaiians. More recently the ACA directed HHS to establish standards for the collection of race, ethnicity, sex, primary language, and disability status. An effort led by the HHS Data Council produced a set of guidelines for surveys that expands the standards.68 As new and existing surveys are presented for review and approval, these standards are now being implemented. A similar effort is under way to recommend guidelines for administrative data.

In addition to efforts spurred by the ACA, other federal, state, and private initiatives could generate improved data. Federal Meaningful Use requirements specify the collection of race and ethnicity categories, with the categories required in a given geographic area depending on its population makeup.69 Thus, medical records-based information about Asian subpopulations is likely to be collected only in locales where concentrations of those populations exist.

By the mid-2000s, nearly 80 percent of hospitals were collecting race/ethnicity data from their patients. Teaching hospitals, urban hospitals, and hospitals in states with mandates to collect racial/ethnic data (such as state requirements that patient demographic information be included in hospital discharge data) were more likely to collect and report the data.70 There is less information about the collection of such data by other providers, and there has been doubt and confusion about how best to collect it. The Institute of Medicine (IOM) has advised that such data should be collected from patients themselves, rather than by clerical observation, and most hospitals reported doing so. Most hospitals were using the OMB categories, but up to 10 percent were using finer categories based in part on local circumstances. Of the hospitals that collected race/ethnicity data, 78 percent used the category “Asian,” 25 percent used “Pacific Islander,” and fewer collected more granular Asian categories.71 A 2009 IOM committee report highlighted several efforts to improve hospital collection of race and ethnicity data, including a Robert Wood Johnson Foundation initiative that required participating hospitals to systematically collect such data and use them to stratify quality measures. The IOM report notes that other hospitals have successfully collected race and ethnicity data for the purpose of linking them to quality measures. In 2007, Massachusetts required all hospitals in the state to collect race and ethnicity data on patients with an inpatient stay, an observation unit stay, or an emergency department visit.72

There have been many efforts to improve Medicare race and ethnicity data collection. CMS has supported various efforts, such as annual updates from Social Security data, quarterly updates on American Indians and Alaska Natives from the Indian Health Service, and mailings requesting self-reported race.73 Researchers have also used Census surname lists to impute race/ethnicity codes more accurately.74

The categories used to characterize racial/ethnic groups present additional problems. Groups like the Association of Asian Pacific Community Health Organizations have worked to standardize definitions for collecting data on Asians across organizations to better understand their health service use.75 The problem of categories has distinctive features among Asian-American subpopulations. The U.S. Census reports data for six Asian-American subcategories as well as “Other Asian” with a write-in box (see Figure I.2), but the use of so many categories may not be practical for many data collection purposes. In addition, Asians from the same subpopulation may describe themselves differently when given the opportunity to fill in the open-ended box for “Other Asian.” The federal Office of Management and Budget has adopted standard racial/ethnic categories for federal data collection, but they have not been uniformly adopted by the many different entities that collect survey or administrative data.76 Moreover, OMB’s five racial categories and one ethnic category (Hispanic/Latino or not) are considered by some researchers and advocacy organizations to be insufficient for understanding disparities and targeting quality improvement (QI) efforts. In considering the collection of race, ethnicity, and language data, a 2009 Institute of Medicine committee recommended adding questions about (a) English language proficiency, (b) preferred spoken language for health care, and (c) “granular ethnicity,” defined as “a person’s ethnic origin or descent, ‘roots’ or heritage, or place of birth of the person or the person’s parents or ancestors.”77

Figure I.2. Reproduction of the Question on Race from the 2010 Census

Source: U.S. Census Bureau, 2010 Census questionnaire.

Changes in the categories used in data collection create difficulties in documenting trends. In 1997, the OMB revised federal data collection standards to create separate categories for (a) Asians and (b) Native Hawaiians and Other Pacific Islanders (NHPI). However, race and ethnicity data collection is not mandatory across government programs, and where it has been implemented it often uses inconsistent categories. A study in the early 2000s compared Medicare enrollee data with self-reported race and ethnicity in Medicare’s Consumer Assessment of Health Plans (CAHPS) survey. The enrollment data matched only 55 percent of the people who self-reported as Asian, in part because many Asians were coded as “other” in the enrollment data.78 Other studies have also found that Asians are commonly misclassified or classified as “unknown” race.79 Some researchers have used the preferred language selected for Medicare mailings and surname data from the Census Bureau to impute missing data for Asians,80 although the Hispanic surnames common among Filipinos make this problematic, as do some last names that are not specific to one group (e.g., Lee and Park among Koreans). Birthplace or a parent’s country of birth has also been used as a proxy for ethnicity, as in the national SEER cancer registry, but nativity and ethnic identification are not always synonymous.
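
The kind of comparison described above, between enrollment codes and self-reported race, can be sketched as a simple match rate. This is a minimal illustration under stated assumptions and not the method of the cited study: the record layout, the field names `self_report` and `enrollment`, and the sample values are all hypothetical.

```python
from typing import Dict, List, Optional


def asian_match_rate(records: List[Dict[str, str]]) -> Optional[float]:
    """Share of people who self-report as Asian whose enrollment code is also Asian.

    Each record is a dict with two hypothetical fields:
      "self_report" - race reported by the person (treated as the reference)
      "enrollment"  - race code stored in the administrative enrollment file
    Returns None when nobody in the input self-reports as Asian.
    """
    self_reported_asian = [r for r in records if r["self_report"] == "Asian"]
    if not self_reported_asian:
        return None
    matched = sum(1 for r in self_reported_asian if r["enrollment"] == "Asian")
    return matched / len(self_reported_asian)


if __name__ == "__main__":
    # Made-up records for demonstration only; they are not data from any study.
    sample = [
        {"self_report": "Asian", "enrollment": "Asian"},
        {"self_report": "Asian", "enrollment": "Other"},
        {"self_report": "Asian", "enrollment": "Unknown"},
        {"self_report": "Asian", "enrollment": "Asian"},
        {"self_report": "White", "enrollment": "White"},
    ]
    print(asian_match_rate(sample))
```

Treating self-report as the reference follows the recommendation that race and ethnicity be collected from individuals themselves. Tabulating the non-matching enrollment codes (here "Other" and "Unknown") would show where the misclassified records end up.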

In sum, various cultural, socioeconomic, and historical factors mean that there are variations in many aspects of the health of people from the various Asian subpopulations, but the research on their health needs and the care that they receive has been limited. Survey research has been limited by the small size of the subpopulations and by language barriers, as well as by other general limitations (e.g., reliance on self-report and the lack of the clinical detail needed for certain studies). Research from administrative and medical records data has faced practical issues in the collection of recommended data on race/ethnicity and related characteristics (e.g., country of origin, months in country, and language). The geographic concentration of some subpopulations may facilitate survey data collection at the state or local level and enhance the feasibility of medical record-based research from health plans and providers that serve that population, but only if data collection goes beyond the standard racial/ethnic categories and data are collected as recommended (e.g., self-reported rather than assumed by clerks or clinicians). Generalization from certain geographic locations is hazardous, since the Asian communities on the West Coast, East Coast, and elsewhere differ in terms of their immigration histories and various social, economic, political, and even health-related characteristics.81
