Understanding Disparities in Persons with Multiple Chronic Conditions: Research Approaches and Datasets

6.1.2 Methods for Imputing Race/Ethnicity


The RAND Corporation developed an algorithm that combines the U.S. Census Bureau's latest surname list with a Bayesian method for integrating surname and geocoded residence information, in order to better estimate self-reported race/ethnicity. The new approach greatly improved the accuracy of race/ethnicity coding for Blacks and Asians, but imputing race/ethnicity for Native American and multiracial individuals from surname and residence remains difficult (Elliott et al., 2009).
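The Bayesian step can be illustrated with a minimal sketch. It assumes the surname list supplies a prior probability for each group and the geocode supplies the share of each group's population living in the person's area. The function name, group labels, and all numbers below are hypothetical and are not taken from the RAND algorithm.

```python
def bayes_update(surname_prior, geo_likelihood):
    """Combine P(group | surname) with P(area | group) using Bayes' rule.

    Both arguments are dicts keyed by group label. Returns the normalized
    posterior P(group | surname, area).
    """
    unnormalized = {
        group: prior * geo_likelihood.get(group, 0.0)
        for group, prior in surname_prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        # No geographic evidence for any group: fall back to the surname prior.
        return dict(surname_prior)
    return {group: value / total for group, value in unnormalized.items()}


# Hypothetical inputs, for illustration only.
surname_prior = {"white": 0.05, "black": 0.02, "hispanic": 0.90, "asian": 0.03}
geo_likelihood = {"white": 1e-5, "black": 4e-5, "hispanic": 2e-5, "asian": 1e-5}

posterior = bayes_update(surname_prior, geo_likelihood)
most_likely = max(posterior, key=posterior.get)
```

In this sketch the residence information shifts probability toward groups that are over-represented in the area, while a strongly predictive surname still dominates the result.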

Eicheldinger and colleagues (2008) developed a methodology that relies primarily on U.S. Census Bureau surname lists to impute race/ethnicity codes more accurately for beneficiaries of Hispanic and Pacific Islander origin; the method increased the number of identified Hispanics threefold.

Using census (geocode) data to impute race and SES information is more accurate for the largest populations (white and black) than for smaller minority groups. Using census-level information to determine individual-level characteristics is possible, but it is subject to ecological bias (Kwok & Yankaskas, 2001).

Roblin and colleagues (2010) developed an algorithm to electronically abstract race/ethnicity information from electronic health record notes. The algorithm was found to be highly reliable in identifying white, black, and Asian/Pacific Islander race based on specific strings of characters. However, the algorithm requires exact string matches and cannot overcome misspellings or abbreviations.
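The exact-match limitation can be seen in a small sketch. The search strings, group labels, and the function name `abstract_race` are hypothetical and are not the strings used by Roblin and colleagues. Matching here is case-insensitive and restricted to whole words, which is also an assumption.

```python
import re

# Hypothetical search strings, for illustration only.
RACE_STRINGS = {
    "white": ["white male", "white female", "caucasian"],
    "black": ["black male", "black female", "african american"],
    "asian_pacific_islander": ["asian male", "asian female", "pacific islander"],
}


def abstract_race(note):
    """Return the single group whose search strings appear in the note, else None."""
    text = note.lower()
    matches = set()
    for race, strings in RACE_STRINGS.items():
        for s in strings:
            # Word boundaries keep "asian male" from matching inside "caucasian male".
            if re.search(r"\b" + re.escape(s) + r"\b", text):
                matches.add(race)
                break
    # Ambiguous notes (several groups matched) and notes with no match yield None.
    return matches.pop() if len(matches) == 1 else None
```

A note containing "Caucasian male" is coded as white, while the misspelling "Caucasion" or the abbreviation "Afr. Amer." returns no result, which is the weakness described above.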

Research Triangle Inc. developed an algorithm to improve the imputation of race and ethnicity in the Medicare Enrollment Database (EDB), along with a method to calculate an SES index for each Medicare beneficiary. The race/ethnicity algorithm is a SAS program that imputes race/ethnicity for Hispanics and Asians/Pacific Islanders based on the beneficiary's preferred language for receiving materials, residence in Puerto Rico or Hawaii, and first and last names. It was validated using HCAHPS survey data as the gold standard. Compared with the raw EDB data, the algorithm significantly improved the accuracy of race/ethnicity coding. The SES index is a composite of neighborhood characteristics drawn from Census data, based on work by Krieger (2003). It was validated against income data from the Social Security Administration; HCAHPS survey data on insurance coverage, health status, and educational attainment; and dual eligibility status.
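A rule-based imputation of this kind might look like the sketch below. It is written in Python rather than SAS, and it is not the actual program. The field names, surname lists, rule order, and the choice to let Hawaii residence reclassify only records coded "Other" or "Unknown" are all assumptions made for illustration; first names are omitted for brevity.

```python
# Hypothetical surname lists, for illustration only.
HISPANIC_SURNAMES = {"GARCIA", "RODRIGUEZ", "MARTINEZ"}
API_SURNAMES = {"NGUYEN", "KIM", "TANAKA"}


def impute_race(record):
    """Return an imputed code, or the original enrollment code if no rule fires."""
    surname = record.get("last_name", "").upper()
    state = record.get("state", "")
    language = record.get("preferred_language", "")
    original = record.get("edb_race", "Unknown")

    # Assumed rule: Spanish-language materials, Puerto Rico residence,
    # or a listed surname indicates Hispanic origin.
    if language == "Spanish" or state == "PR" or surname in HISPANIC_SURNAMES:
        return "Hispanic"

    # Assumed rule: a listed surname indicates Asian/Pacific Islander origin;
    # Hawaii residence is used only when the original code is uninformative.
    if surname in API_SURNAMES:
        return "Asian/Pacific Islander"
    if state == "HI" and original in {"Other", "Unknown"}:
        return "Asian/Pacific Islander"

    return original
```

Validation against a gold standard such as self-reported survey data would then compare the output of `impute_race` with the self-reported value for each linked record.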

HCUP is linked with a 20% sample of the NIS database, which contains information from healthcare organizations that have high-quality demographic data. Cases with suspect or missing information are not included in the subsample. Dropping this “bad” information improves validity and reliability.
