Understanding the High Prevalence of Low-Prevalence Chronic Disease Combinations: Databases and Methods for Research. Study Designs and Analytic Methods


As discussed above, most studies examine chronic conditions with the highest prevalence, costs, utilization, hospitalizations, and adverse events. For example, to study chronic disease prevalence in male Medicare patients, Black and colleagues limited their analyses to the “top ten” most prevalent diseases (Black et al., 2007). Other researchers have examined a somewhat larger number of conditions but have purposely excluded less prevalent diseases (Schafer et al., 2010). The number of chronic conditions under investigation must be taken into account because multimorbidity prevalence estimates depend on how many diseases are examined. This limitation was recently discussed by Salive, who found a prevalence estimate of 17.1% for primary care patients aged 25–44 when considering a list of seven conditions, and 73.9% when considering all possible conditions (Salive, 2013). Similarly, Fortin and colleagues found prevalence estimates of 47.3% among primary care patients aged 45–64 when considering seven conditions, and 93.1% when considering an open list (Fortin et al., 2010). Schneider and colleagues found that over 20% of Medicare beneficiaries had two or more chronic conditions when using the CMS Chronic Conditions Warehouse (CCW) and a list of nine potential diseases (Schneider et al., 2009). A considerably larger figure (52%) was reported for Veterans Affairs (VA) patients when roughly triple the number of potential diseases (29 conditions) was considered (Yu et al., 2003). Thus, MCC prevalence can be underestimated when fewer chronic conditions are investigated.
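The dependence of multimorbidity prevalence on the length of the condition list can be illustrated with a minimal Python sketch. The patient records, the condition names, and the helper `mcc_prevalence` are all hypothetical and chosen only to show the mechanism; they are not drawn from any of the studies cited above.

```python
# Toy illustration only: every patient record and condition name is hypothetical.

def mcc_prevalence(patients, condition_list, threshold=2):
    """Share of patients with at least `threshold` conditions from `condition_list`."""
    considered = set(condition_list)
    # Count a patient only if enough of their diagnoses appear on the list.
    with_mcc = sum(1 for dx in patients if len(set(dx) & considered) >= threshold)
    return with_mcc / len(patients)

# Each set holds one hypothetical patient's chronic conditions.
patients = [
    {"diabetes", "hypertension"},
    {"hypertension", "psoriasis"},
    {"gout", "psoriasis", "migraine"},
    {"diabetes"},
]

short_list = ["diabetes", "hypertension"]
open_list = ["diabetes", "hypertension", "psoriasis", "gout", "migraine"]

print(mcc_prevalence(patients, short_list))  # 0.25
print(mcc_prevalence(patients, open_list))   # 0.75
```

The same four patients yield a prevalence of 25% under the short list and 75% under the longer list, because patients whose conditions fall outside the short list are never counted as having MCC.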

In addition to the number of chronic conditions studied, the specific types of chronic conditions examined differ across studies (e.g., only cardiovascular conditions vs. all possible chronic conditions). This “filtering” phenomenon can be observed by comparing the lists of chronic conditions investigated in two separate studies. For example, when the chronic conditions studied by Newcomer and colleagues (2011) (17 chronic conditions) are compared with those studied by Chen and colleagues (2011) (8 chronic conditions), only three conditions overlap. Although prevalence estimates for single conditions may be comparable across different data collection systems and surveys (Li et al., 2012), multimorbidity prevalence estimates from studies that include different conditions complicate the interpretation, generalizability, and comparability of results.

MCC research has been conducted using a variety of study designs (see Exhibit 12). However, the majority of MCC studies have used retrospective cohort and cross-sectional designs, including secondary data analyses, because of the need for large sample sizes. It is important to note that these study designs have systematic limitations. For example, although retrospective cohorts are longitudinal and usually contain information on a large number of patients, they are often subject to attrition bias and to bias arising from changes in data collection procedures over time. This is an important concern for MCC studies, as prevalence estimates may be directly affected by such changes, for example sampling strategies whose periodicity or target population shifts over time. Cross-sectional designs, by contrast, are not longitudinal and provide only a “snapshot” of information at one point in time. Future MCC research may benefit from longitudinal, prospective studies that provide researchers with large sample sizes as well as the ability to assess potential biases and study limitations as they occur. Preferred study designs for research on less prevalent combinations of MCC produce large sample sizes, are longitudinal, and provide researchers with the ability to assess the accuracy of diagnostic coding over time. Large prospective cohorts are therefore advantageous for research on less prevalent combinations of MCC, although they are usually very expensive. The research questions to be answered may also dictate which study designs are most appropriate for a given MCC study.

Exhibit 12: MCC Study Designs and Considerations

| Author | Study Design | Design Considerations |
|---|---|---|
| Ben-Noun 2001 | Case-Control | Small sample size, prone to recall/retrospective and selection bias, suited for rare conditions. |
| Salisbury et al. 2011 | Retrospective Cohort | Large sample size, prone to attrition bias, potential unknown coding practices and changes in data collection method, longitudinal. |
| Shelton et al. 2000 | Prospective Cohort | Large sample size, prone to attrition bias, known methodology changes, potential for missing data, longitudinal, highly expensive. |
| Wolff et al. 2002 | Cross-sectional | Large sample size, not longitudinal, cannot measure changes over time, cannot draw causal inferences, descriptive in nature. |
| Yu et al. 2003 | Secondary Data Analysis | All types of sample sizes, potential unknown coding practices and data anomalies. |

Other important considerations for MCC research are the limitations of the databases and algorithms used to house and analyze chronic conditions data. Over- and underestimation of chronic disease prevalence may be due to database-specific characteristics. For example, the CMS Chronic Conditions Warehouse (CCW) algorithm, which is used to estimate chronic disease prevalence, has been shown to underestimate the prevalence of chronic conditions that require less frequent healthcare utilization, such as arthritis (Gorina & Kramarow, 2011). The underestimation occurs because the reference period (or look-back period) used in the CCW algorithm does not extend far enough back to capture diagnoses that were reported on earlier healthcare claims but not on more recent ones. Setting (e.g., inpatient, nursing home) and other database characteristics also affect prevalence estimates and the interpretation of multimorbidity. For example, Schram and colleagues (2008) found that multimorbidity prevalence varied significantly across settings, from 22% in the in-hospital setting to 82% in nursing homes. As expected, given the inherent differences between these populations, Fortin and colleagues (2010) found that MCC prevalence was much lower in a general civilian population than among family practice patients. In addition to the effect of setting on chronic disease prevalence estimates, Schram et al. (2008) concluded that prevalence estimates depend on the number of chronic conditions being studied, the data collection method used to capture diagnosis information (i.e., ICD-9 vs. survey), and the time frame being investigated, echoing the concerns raised by Gorina and Kramarow about the CCW’s look-back period.
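The effect of a look-back period on which conditions are detected can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the actual CCW algorithm, which applies condition-specific reference periods and claim criteria; the claims, dates, and the function `flagged_conditions` are hypothetical.

```python
from datetime import date, timedelta

def flagged_conditions(claims, reference_date, look_back_days):
    """Conditions with at least one claim inside the look-back window."""
    window_start = reference_date - timedelta(days=look_back_days)
    return {cond for claim_date, cond in claims
            if window_start <= claim_date <= reference_date}

# Hypothetical claims history for one patient: (claim date, condition).
claims = [
    (date(2005, 3, 1), "arthritis"),    # reported once, years earlier
    (date(2009, 6, 15), "diabetes"),
    (date(2009, 11, 2), "diabetes"),
]
reference = date(2009, 12, 31)

print(sorted(flagged_conditions(claims, reference, 365)))      # ['diabetes']
print(sorted(flagged_conditions(claims, reference, 365 * 5)))  # ['arthritis', 'diabetes']
```

With a one-year window the patient appears to have a single condition; extending the window to five years recovers the arthritis diagnosis and the patient now meets a two-condition definition of MCC.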

Database comprehensiveness, sampling frame, and the patient population being studied all affect results. In drawing conclusions from analyses of CCW data or AHRQ’s National Inpatient Sample (NIS) data, it is important to know that the CCW covers all Medicare patients, while the publicly available version of the NIS covers only 20% of hospital discharges. Understanding these types of database characteristics will help researchers interpret the generalizability of their findings. Because the occurrence and clustering of MCC is time-dependent as patients grow older, longitudinal datasets are best positioned to accumulate a patient’s chronic conditions over time and to provide more accurate estimates of disease prevalence than cross-sectional assessments (France et al., 2011; Wong et al., 2011). Time-dependency is an especially important concept for research on less prevalent combinations of MCC, as less common diseases are more likely to manifest over a long period of time, and diseases have different durations. Cross-sectional studies and analyses of longitudinal datasets covering limited time periods may not contain sufficient diagnostic information to study less prevalent combinations of MCC. Database size is also important. Large administrative datasets provide the best option because of the sheer volume of data and number of patients available for study, whereas less prevalent combinations of MCC are unlikely to appear in small datasets with a limited number of patients and diagnoses. Rare disease researchers face similar challenges.

Longitudinal databases also have limitations. First, false discoveries and spurious associations between chronic diseases, which can arise from too few observed diagnoses, inconsistent findings, and the need to correct for multiple tests, must be addressed (Wong et al., 2011). Additionally, the further back in time longitudinal claims are examined, the less accurately resource use and cost can be predicted for a given condition or combination of conditions, because illness intensity changes over time. Although large administrative databases provide useful, current information on the financial burden of disease (Riley, 2009), researchers need to know which diagnoses are “active” for patients currently receiving care in order to predict resource use and cost more accurately. A laundry list of diagnoses is of little utility without a way to identify “active” conditions. Many patients will have ICD-9 codes on their past claims that represent errors, unconfirmed suspected diseases, and conditions that have been cured or are in remission. Such “non-active” ICD-9 codes captured in longitudinal databases can degrade predictions of resource use and cost associated with MCC. Solutions may include an active problem list for patients and/or the use of supplemental data (e.g., pharmacy and laboratory data) to confirm “active” diagnoses.
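One way the supplemental-data idea above might be operationalized is sketched below in Python. It is only an illustration under stated assumptions: the mapping `SUPPORTING_MEDICATIONS`, the one-year window, and the function `active_diagnoses` are hypothetical and do not represent a validated rule or any method described in the cited studies.

```python
from datetime import date, timedelta

# Hypothetical mapping from a diagnosis to medications that would support it.
SUPPORTING_MEDICATIONS = {
    "diabetes": {"insulin", "metformin"},
    "hypertension": {"ace_inhibitor", "beta_blocker"},
}

def active_diagnoses(diagnoses, pharmacy_fills, as_of, window_days=365):
    """Diagnoses backed by at least one supporting fill inside the window."""
    window_start = as_of - timedelta(days=window_days)
    recent = {drug for fill_date, drug in pharmacy_fills
              if window_start <= fill_date <= as_of}
    # A diagnosis with no mapping cannot be confirmed here; it is left out
    # as unconfirmed, which is not the same as being disproven.
    return {dx for dx in diagnoses
            if SUPPORTING_MEDICATIONS.get(dx, set()) & recent}

# Hypothetical patient: coded diagnoses and (fill date, medication) records.
diagnoses = {"diabetes", "hypertension", "asthma"}
fills = [(date(2009, 10, 1), "metformin"), (date(2006, 1, 5), "beta_blocker")]

print(sorted(active_diagnoses(diagnoses, fills, date(2009, 12, 31))))  # ['diabetes']
```

Only diabetes is confirmed as active in this toy case: the hypertension-related fill is too old to fall inside the window, and asthma has no entry in the hypothetical mapping.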

The challenges associated with conducting research on less prevalent MCC are very similar to those faced by researchers of rare diseases. Within the United States, a disease is considered to be rare when it affects fewer than 1 in 1,000 individuals. Thus, like researchers studying less prevalent MCC, rare disease researchers are limited by small patient samples and by the inability of data sources to capture information on rare diagnoses, which makes it difficult to design clinical trials and test new treatments. In a research environment constrained by limited resources, rare disease research is given lower priority than research on conditions affecting more individuals (Griggs et al., 2009; Ragni et al., 2012). It is important to consider that while any given rare disease by definition is not a prevalent illness, the many rare diseases taken together may affect a significant segment of the population. Finally, the likelihood that a rare chronic condition appears on a claim because of a coding mistake may be similar to the likelihood that a patient truly has the rare disease and that the diagnosis is coded accurately. Although this issue is not well studied, both research on rare diseases and research on less prevalent combinations of MCC may suffer from difficulty in assessing validity.

Lastly, it is important to recognize that traditional statistical approaches may not be applicable to research on low-prevalence MCC. The issue of multiple comparisons is highly relevant for MCC research because of the number of chronic disease combinations that can be considered in the long tail. In fact, there are almost as many chronic disease combinations as there are patients. For example, when working at the three-digit ICD-9 code level with approximately 1,000 diagnosis codes, about one million pairwise comparisons are possible. In this case, correcting for multiple comparisons using the Bonferroni method would require p-values of less than 0.00000005 to be significant. To understand the differences between low-prevalence MCC, new or modified statistical approaches may need to be considered to address the multiple comparison limitation.
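The arithmetic behind the Bonferroni example can be checked with a short Python sketch. The overall significance level of 0.05 is an assumption; the text does not state it, but it is the value implied by a threshold of 0.00000005 across one million comparisons.

```python
from math import comb

n_codes = 1000   # approximate number of three-digit ICD-9 codes (from the text)
alpha = 0.05     # assumed family-wise significance level (not stated in the text)

# Counting every ordered pairing of codes gives the "about one million" figure.
ordered_pairs = n_codes * n_codes
# Counting each distinct pair of different codes once gives roughly half that.
unordered_pairs = comb(n_codes, 2)

# Bonferroni: divide the overall alpha by the number of comparisons.
threshold_ordered = alpha / ordered_pairs
threshold_unordered = alpha / unordered_pairs

print(ordered_pairs, threshold_ordered)
print(unordered_pairs, threshold_unordered)
```

With one million comparisons the per-test threshold is 0.05 / 1,000,000 = 0.00000005, which matches the figure above. If each distinct pair is counted only once, there are 499,500 comparisons and the threshold is roughly 0.0000001, which is still far stricter than conventional levels.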
