The Feasibility of Using Electronic Health Data for Research on Small Populations. Organizational Conditions Required for Research Combining Multiple Data Sources


Because of the previously mentioned limitations with using data from a single organization’s EHR for research, the ability to combine EHR data with other electronic data sources is often needed to strengthen study results, particularly for small populations. Combining EHR data across institutions can allow for a larger sample size to increase the likelihood of being able to study small populations, as well as offer a more complete picture of patients that receive care in more than one place. While providing additional information, using data from multiple data sources for research does come with an additional set of challenges and requires a number of organizational conditions be in place, as described in this section. Examples of multi-organizational efforts such as research networks are described below where organizations are already working together to overcome these challenges. In addition, a number of other data sources that may be combined with EHR data to further facilitate research on small populations are described at the end of this section.

Using EHR and other electronic health data from multiple organizations

In order to conduct research with data from multiple organizations, a rationale and a mechanism are needed for organizations to share the data. The technical and legal issues associated with data sharing have received considerable attention throughout the implementation of provisions in the HITECH Act to promote health information exchange to improve the quality of care. There are two major ways that data can be shared across multiple institutions: through a consolidated warehouse where a copy of the data from each institution is stored, or through some form of distributed network where the data remain stored with each organization but can be queried to retrieve standardized results from multiple databases. An additional criticism of the current legal framework surrounding human subjects research is the lack of guidance around the technical architecture of databases, even though such databases may involve creating multiple copies of a patient’s data.358

While centralizing data in a warehouse may increase efficiency when standardizing and querying the EHR data, it requires resources to build and maintain. In addition, there are privacy and governance issues associated with creating a copy of patient information and storing it outside the organization when these data were collected for the organization’s use in caring for the patient.359 Also, as the data are centrally combined from multiple organizations, they become further removed from the different organizational contexts in which they were collected. Those contexts, such as changes in how the data were collected and documented over time, must be considered when interpreting the data. In addition, centralized data warehouses may be less flexible as all required data elements must be contributed by the organization in advance and then remain in the warehouse, giving organizations less control over which data they want to contribute for what purposes.360

As an alternative to creating a central warehouse or database, a virtual data warehouse may be created where data remains in separate home locations. This alternative may be more viable as it bypasses the need for investment outside the organization in building a separate infrastructure, and also simplifies the issues of data ownership. Virtual warehouses are easier to implement and more private because data remain at the collaborating organizations (referred to as a distributed network). Secure, remote analysis of these separate databases occurs through a central portal that queries and distributes results. Organizations may decide which data they are interested in contributing and what studies they want to participate in. One common type of distributed network is a federated research network, where separate, heterogeneous databases from multiple organizations make up the distributed network and each organization retains control of its own data. 361, 362 For example, ePROS is creating a federated database that links data from multiple organizations in order to allow for queries of de-identified patient data.363 Often the databases include standardized content areas, data dictionaries, and methods to define individuals. 364 While more efficient than a centralized model, investment is still needed in the administrative and governance infrastructure to maintain security and ensure appropriate use of the query function.365 A number of distributed research networks are being piloted to support clinical effectiveness research (CER).366, 367
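The query pattern of a distributed network can be illustrated with a minimal sketch. It assumes a simplified setting in which each site holds its own records and returns only an aggregate count to a central portal. The Site class, the record fields, and the diagnosis codes are hypothetical and are not drawn from any of the networks described in this report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical record layout; a real network would define this in a shared data dictionary.
Record = Dict[str, object]


@dataclass
class Site:
    name: str
    records: List[Record]  # patient-level data never leave the site

    def run_count(self, predicate: Callable[[Record], bool]) -> int:
        """Evaluate the query locally and return only an aggregate count."""
        return sum(1 for r in self.records if predicate(r))


def federated_count(sites: List[Site], predicate: Callable[[Record], bool]) -> Dict[str, int]:
    """Portal side: send the same query to every site and combine the counts."""
    per_site = {s.name: s.run_count(predicate) for s in sites}
    per_site["TOTAL"] = sum(per_site.values())
    return per_site


# Illustrative data only.
sites = [
    Site("site_a", [{"age": 67, "dx": "C50"}, {"age": 54, "dx": "I10"}]),
    Site("site_b", [{"age": 71, "dx": "C50"}, {"age": 49, "dx": "C50"}]),
]
result = federated_count(sites, lambda r: r["dx"] == "C50")
print(result)
```

In this sketch the portal sees only counts, which mirrors the idea that each organization retains control of its own data. An operating network would also need authentication, query review, and governance steps that are omitted here.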

Figure II.2. Example: The Cancer Research Network (CRN) Virtual Data Warehouse

Figure II.2 is a diagram showing how the Cancer Research Network's Virtual Data Warehouse works. The diagram is split into three sections, "Advance Work," "Virtual Data Warehouse," and "Each Project." The "Advance Work" section describes the work CRN investigators and Site Data Managers do at each specific site to derive standardized data specifications from a common data dictionary. This section has an illustration showing differently colored ovals filled with site names, to demonstrate how each site has its own database. The second section, "Virtual Data Warehouse," has the same illustration, but with each oval in the same color, to show how each CRN site has its database set up using a common data dictionary. The final section, "Each Project," describes how CRN investigators develop programs to extract specific variables from the standardized VDW files and then convert them into a project-specific data dictionary. This section has an illustration of the research team.

Source: Hornbrook et al. Building a Virtual Cancer Research Organization. Journal of the National Cancer Institute Monographs. 2005 (35), 12-25.

However, there are some reasons an organization may select a centralized warehouse instead of a virtual one. For example, the Community Health Applied Research Network (CHARN) chose a centralized data network because housing the data where they originate, as in a virtual network, requires each participating organization to have its own infrastructure. Because CHARN’s participants are community health centers with limited resources, they lacked the capacity to make a virtual network an option. Cost would also be a significant barrier for each community health center to maintain its data locally. Finally, data quality was a consideration when CHARN selected a centralized database. Because of the variability among community health centers, were CHARN to request data from each center it would be difficult to know what types of problems there may be in terms of outliers, omissions and commissions in the data. Therefore, the network decided it would be simpler to look at the data all together. The issues faced by community health centers may be common among other under-resourced organizations that provide care for certain small populations, such as health care organizations in rural areas.368

An additional alternative to a distributed warehouse, in which data are still contributed for central analysis, is distributed analytics. This approach is being used by the Massachusetts eHealth Institute, where participating organizations contribute just the minimum information that is needed. While this approach addresses many privacy-related concerns, it does require participating organizations to conduct some of their own analytics before contributing their results.369

No matter which method is chosen for sharing data, each strategy requires significant infrastructure development, both technically and organizationally. One study of research teams that have developed such infrastructure to support CER identified a number of challenges, including the substantial effort required to establish and sustain partnerships for data sharing, understanding the strengths and limitations of their clinical information platforms, and the need for rigorous methods to ensure data quality across multiple sites.370 Another study, based on interviews with multi-site research initiatives, found a number of challenges related to data governance, but also found these initiatives are using strategies to address these barriers such as capitalizing on pre-existing relationships, beginning with smaller studies and then expanding, developing legal and policy documents with broad input, exchanging de-identified data only, and structuring governance bodies with broad representation.371 It is also important that each organization contributing data be represented in the analysis in order to provide context on how the organization has changed, which affects how the data are interpreted. Organizations that care for certain small populations are likely unique as well and need to be able to provide that context. The uniqueness of each organization may result in quality issues once their data are combined, even if data from the individual organizations are of high quality on their own.372

Funding for research infrastructure development is rare, as currently most grants and contracts pay for specific, discrete studies. However, in recent years the availability of this funding has increased. For example, the American Recovery and Reinvestment Act of 2009 allocated $100 million to building infrastructure to use electronic clinical data for CER, patient-centered outcomes research, and quality improvement.373 In addition, in 2013 the Patient-Centered Outcomes Research Institute is investing $68 million to support the initial development of a National Patient-Centered Clinical Research Network to build the capacity needed to support CER. There are currently three funding opportunities related to building this national network.374

In addition, for studies that include data from multiple organizations, approval may have to be obtained from multiple Institutional Review Boards, adding to the time and resources needed to conduct the research. Where organizations are from different states, there may also be different state laws governing health information with which each organization must comply. Some approaches to minimizing this burden have included careful distinctions between quality improvement and research-driven interventions, particularly where projects are low-risk. Another solution may be to negotiate an arrangement in which a central or lead IRB with particular expertise in the area first reviews the study and other IRBs then accept its review.375 In addition, where research is conducted across distributed databases using methods such as distributed regression, the only information exchanged is statistical results rather than the underlying data. This technical strategy is one solution to protecting patient privacy. However, an issue with small populations is that individuals who are unique relative to their surrounding population can potentially be identified. In fact, some researchers are finding that people may re-identify themselves, even when given privacy protection.376
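The idea behind distributed regression can be shown with a minimal sketch for the simplest case, an ordinary least squares line with one predictor. It assumes each site shares only five summary sums and that a coordinating center pools them. The function names and the data are hypothetical, and the methods used in practice are generally more elaborate than this.

```python
from typing import Dict, List, Tuple

KEYS = ("n", "sx", "sy", "sxx", "sxy")


def local_summary(xs: List[float], ys: List[float]) -> Dict[str, float]:
    """Run at each site: reduce patient-level values to five summary sums."""
    return {
        "n": float(len(xs)),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def pooled_fit(summaries: List[Dict[str, float]]) -> Tuple[float, float]:
    """Run centrally: add the site summaries and solve for slope and intercept."""
    total = {k: sum(s[k] for s in summaries) for k in KEYS}
    n, sx, sy, sxx, sxy = (total[k] for k in KEYS)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Illustrative data only; patient-level values stay with each site.
site_a = local_summary([1.0, 2.0], [3.0, 5.0])
site_b = local_summary([3.0, 4.0], [7.0, 9.0])
slope, intercept = pooled_fit([site_a, site_b])
print(slope, intercept)
```

Because the sums are additive, the pooled estimate equals the one that would be obtained from the combined patient-level data. With very small sites, however, even summary statistics may reveal information about individuals, which is the concern raised above for small populations.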

Finally, a process is needed to ensure the quality of multisite data for research, including prioritization of variables and dimensions of quality for assessment, development and use of standardized approaches to assessment, iterative cycles of assessment within and between sites, targeted assessment of data known to be vulnerable to quality problems, and detailed documentation of quality to inform data users—particularly in determining whether the data are fit for use in CER studies.377 Ideally, these efforts should be shared among the collaborating organizations on a continuous basis to keep pace with new versions of existing software and the introduction of new software to manage health care processes.

Interoperability of EHR systems

Research among multiple institutions is facilitated by interoperability of their EHR systems. In its absence, a large amount of effort is needed to integrate data. One of the reasons that building the infrastructure to share data is so challenging from a technical standpoint is the lack of interoperability among different EHR systems. Just among providers who have been able to demonstrate they are meaningfully using their EHRs based on the criteria specified under the Medicare EHR incentive payment program, 333 different EHR vendors have been used, although consolidation is occurring in the EHR industry with the top 5 vendors increasingly being used by a larger share of providers.378 While the industry continues to consolidate, the wide variety of systems currently in use has led to two major challenges: 1) Syntactic interoperability, or the ability for systems to communicate with one another to exchange data; and 2) Semantic interoperability, or the ability for systems to understand the data exchanged. The ability to exchange data is more easily solved. However, differences in vocabulary and classifications are a more difficult problem, particularly when trying to identify members of small populations across multiple institutions.379 Even within a single organization’s EHR, standardizing the data is a challenge. This challenge is amplified across multiple organizations. Even for seemingly well-defined concepts there is variation. For example, what one system may call “high blood pressure” another system may call “elevated blood pressure.”380 Or, systems may use different race/ethnicity categories.
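The semantic problem can be illustrated with a minimal sketch of a terminology crosswalk, assuming a small lookup table that maps each system's local label to one shared concept. The table entries and the concept name are hypothetical. Real efforts rely on maintained vocabularies and clinical review rather than a hand-built dictionary.

```python
from typing import Dict, Optional

# Hypothetical crosswalk; labels and the concept name are illustrative only.
CROSSWALK: Dict[str, str] = {
    "high blood pressure": "HYPERTENSION",
    "elevated blood pressure": "HYPERTENSION",
    "htn": "HYPERTENSION",
}


def to_common_concept(local_term: str) -> Optional[str]:
    """Normalize case and whitespace, then look up the shared concept.

    Returns None for unmapped terms so they can be routed to manual review.
    """
    key = " ".join(local_term.lower().split())
    return CROSSWALK.get(key)


print(to_common_concept("High Blood Pressure"))
print(to_common_concept("Elevated  blood pressure"))
print(to_common_concept("white coat syndrome"))
```

Returning None for unknown terms, rather than guessing, keeps unmapped local vocabulary visible. This matters when a small population is defined by a term that only some systems record.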

There are a number of efforts to create standards for EHR data, including the Health Level Seven International’s (HL7) Continuity of Care Document. HL7 is the global authority on standards for interoperability of health information technology. HL7 developed the Continuity of Care Document in partnership with ASTM International, another developer of voluntary consensus standards, to foster interoperability by promoting standardization across systems through the use of templates representing typical sections of a patient’s EHR.381 While progress is being made in moving toward interoperability standards, the current set of standards is not at a level that solves many of the problems of the researchers we talked to. Many of those we interviewed have been working with their vendors and other health care organizations to develop strategies for sharing data despite the lack of a single standard, universal approach to interoperability.

In addition, five major health systems, including Intermountain Healthcare, Geisinger Health System, Group Health Cooperative, Kaiser Permanente and Mayo Clinic have created the Care Connectivity Consortium as a pioneer effort and have achieved interoperability across multiple vendors to enable the sharing of patient information.382 While primarily motivated by wanting to provide a model by which EHR data can be shared across institutions to improve patient care, the ability of health systems to overcome interoperability challenges will also have significant benefits for research.

Those we interviewed felt that major vendors and federal incentives can both play important roles in promoting standardized data fields and formats across different EHR systems. For example, if Epic includes sexual orientation and gender identity in its system, that could lead to these fields becoming an industry standard. However, some smaller vendors may not invest in including these fields in their products unless they are added to Meaningful Use criteria.383 Meaningful Use requirements as well as quality reporting requirements for accreditation and recognition programs all have the potential to help lead to greater standardization and interoperability across systems.384 While Meaningful Use presents only minimum requirements for standardization, physicians have the added incentive to do more because it enhances the value of their practices to potential purchasers.385

Research agencies also have the opportunity to promote standardization through what they fund. Although Meaningful Use itself may only do so much, in combination with other levers and incentives, the availability of standardized EHR data for research will likely continue to increase.386 In addition to interoperability across EHRs, there is the need to integrate supply chain, financial, and clinical data to provide a fuller picture. For an organization like the Health and Hospitals Corporation, which includes hundreds of systems, many decisions and definitions used by each individual component of the system do not align once information is brought together. For example, in terms of defining a visit or encounter, a clinician may only consider a patient to be discharged if they are alive, but from a financial standpoint, a discharge is someone who is alive or dead. Or, the name of the same doctor may be entered differently in different systems (for example, whether the last name is listed first or second, whether the title Dr. is included, etc.). Going back and standardizing the data across systems is a lot of additional work. In the long run, it will be important to align these different types of systems as well.387

Practice-based research networks

Practice-based research networks (PBRNs) have facilitated much of the research using EHR data from multiple institutions. PBRNs are groups of primary care clinicians and practices that work together to answer community-based health care questions as well as to translate research findings into practice. AHRQ has devoted funding to support PBRNs through targeted grant programs as well as by supporting a resource center, learning groups and conferences. The DARTNet Institute is a growing collaboration of PBRNs (currently including nine of them) that is building a national collection of data from electronic health records, claims, and patient-reported outcomes for the use of quality improvement and research.

Research networks can make a wealth of clinical information available for research through their EHRs. The organizations within a network are often already either sharing a common EHR system or have worked to develop some form of centralized or distributed data warehouse for research purposes. In addition to PBRNs, there are other research networks that expand beyond primary care practices. The Cancer Research Network, a collaboration of integrated delivery settings funded by the National Cancer Institute of the National Institutes of Health, is another example of a network created to facilitate research. Still another example is the Community Health Applied Research Network (CHARN), a network of community health centers and universities established to conduct patient-centered outcome research among underserved populations. Members of CHARN include Kaiser Permanente Center for Health Research (which serves as the coordinating center), the Association of Asian Pacific Community Health Organizations (AAPCHO), Fenway Health in Boston, OCHIN in Oregon, and the Alliance of Chicago Community Health Services.

Research on small populations is increasingly feasible as networks of EHRs with common structures and formats have developed. There is also the potential to link data across systems to identify a cohort of interest.388 For example, within the Cancer Research Network, any of the individual health plans will likely include the numbers of patients needed for research on any of the five to seven most common cancers. However, for pediatric cancers or rarer cancers, data must be pooled from multiple medium-sized sites or perhaps the two KP California regions to obtain a sufficient number of cases for research. Most rare cancers require use of data from California, where KP has 4 million members in its EHR system.389

One challenge for PBRNs is that securing permission from individual practices and their vendors to access their server can take some time to make sure everyone is comfortable with the arrangement.390 Even after practices agree to participate, data use agreements must be established that are specific enough to provide protection, but flexible enough to accommodate research. Often additional, unanticipated data elements are required for research, requiring the revision of data use agreements, as well as working with IRBs at multiple institutions.391

EHR vendors have not yet played a big role in networks, which have mostly been built either by health systems or grant funded. However, it appears vendors are currently trying to better understand this space since there is a potential business model. While the involvement of vendors may provide additional resources and help move forward network technology, there is the danger that as the data becomes perceived as more valuable, it may make data sharing more difficult. This may also pose a threat to the current public/private partnership where the data collection occurs in the private sector without public and private sector researchers paying them to do so.392

Regional health information exchanges

While initially envisioned as another major source of patient data, it is unclear what role regional health information exchanges will play in the future of EHR-based research. One of the original purposes of the Office of the National Coordinator of Health IT was to facilitate the development of regional health information organizations (RHIOs) that would facilitate health information exchange among stakeholders in their region’s health care system. These RHIOs were intended to provide the infrastructure for a national health information exchange. However, their development has faced a number of barriers, including many of the challenges mentioned in this report for EHR-based research, particularly lack of resources for infrastructure.393 Further removal from day-to-day patient care would make data quality and interpretation an additional challenge when using data from these regional exchanges for research. There have been examples, however, where regional health information exchanges have provided data for regional quality improvement efforts.394

Linking EHR and other electronic health data with other data sources

A number of other data sources may be linked with EHR data to provide additional information for research, as well as to validate information in the EHR available to identify and study small populations. Data linkage requires that at least one common identifier be available in both sources that can be used to link records. Unique identifiers that are commonly used to link data at the patient level include social security numbers, health insurance claim numbers, and medical record numbers. Hospital or area level identifiers may also be used for linkage to organizational or geographic level data. Commonly linked administrative databases include disease registries, claims files, survey data, provider files, and area-level data.395 Additional clinical information—such as genetic, care management, and social network information—also has the potential for linkage with EHR data for research. Several examples of additional data sources for EHR-based research are described below.
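Linkage on a common identifier can be illustrated with a minimal sketch, assuming both sources carry the same medical record number under a field named mrn. The field names and records are hypothetical. The sketch performs exact matching only and leaves out the probabilistic matching and identifier cleaning that linkage projects often need.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def link_records(
    ehr: List[Record], registry: List[Record], key: str = "mrn"
) -> Tuple[List[Record], List[Record]]:
    """Exact-match EHR records to registry records on a shared identifier.

    Returns (linked, unlinked). Registry fields are prefixed with 'registry_'
    so they cannot overwrite EHR fields of the same name.
    """
    registry_index = {r[key]: r for r in registry if r.get(key)}
    linked: List[Record] = []
    unlinked: List[Record] = []
    for rec in ehr:
        match = registry_index.get(rec.get(key))
        if match is None:
            unlinked.append(rec)
            continue
        extra = {f"registry_{k}": v for k, v in match.items() if k != key}
        linked.append({**rec, **extra})
    return linked, unlinked


# Illustrative data only.
ehr = [{"mrn": "A1", "dx": "C50"}, {"mrn": "A2", "dx": "I10"}]
registry = [{"mrn": "A1", "stage": "II"}]
linked, unlinked = link_records(ehr, registry)
print(linked)
print(unlinked)
```

Keeping the unlinked records, rather than silently dropping them, makes it possible to check whether the records that fail to link differ from those that do.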

Patient Registries

Patient registries are an electronic data source that may be useful for research in combination with EHRs. In a registry, uniform data are collected from multiple institutions in a central database for a population defined by a particular disease, condition, or exposure. These data may be directly pulled from EHRs or require manual entry based on information from the patient’s record. Registries are a simpler form of consolidated data. They include only a core set of relevant data elements for a specific purpose. Registries may be local, such as immunization registries or vital statistics departments that collect birth and death data. Death records may be particularly important because death is often difficult to determine from an EHR. There are also national registries, such as the CDC’s National Program of Cancer Registries, and the National Cancer Institute collects information on diagnosed cancer cases and cancer deaths simply to measure incidence and mortality.396 The Institute’s tumor registry adheres to national and accreditation standards and has specialized staff that pore through records in local registries looking for evidence of cancer, including blood cancers. Although labor intensive, it is currently more accurate to use a manual process to determine which records should be included in the registry. In contrast, an automated process to query the registry for records of interest may be used if the records included are already well validated. Local registries are often able to accept EHR data and accept edits from providers. One complication is that at times, data can be corrected in the registry but not in the EHR source data. Registries may collect some patient demographic data in order to determine whether certain populations bear a disproportionate burden of the disease.

Information from registries has been linked to EHR data in order to identify patients with specific conditions. For example, in one study a tumor registry was linked to the Cancer Research Network’s distributed data warehouse to identify cancer cases. Race and ethnicity in this study were extracted from cancer registries as well. This study was able to look across eight years of data to examine whether someone’s health care utilization increases directly prior to diagnosis of a new primary cancer.397 The ability to look back to before patients were diagnosed with a certain condition is another unique benefit of research using EHR data and has the potential to improve our ability to identify patients who are at greatest risk of disease to improve targeting for preventive interventions.

Registries can also be linked to EHR data for data validation, such as was done in one study that linked clinical databases with a cancer registry to confirm cases of cancer. In this particular study, they found that 98.9 percent of cases overlapped. The use of multiple data sources presents opportunities to improve data quality for research. For example, addition of death data from a cancer registry to the clinical database allowed for more accurate stage-specific and overall survival figures.398

While registries and EHRs can combine to provide a fuller picture, like EHRs, patient registry data may be incomplete as well. It remains a challenge both to motivate clinicians to participate in registries and to facilitate easy transfer of information from patient records into the registry.399 Some studies have suggested there may be systematic bias when using only records that can be matched between multiple data sources, such as EHRs and registries. A review of the literature around this topic found a number of patient or population factors such as age, sex, race, geography, socio-economic status and health status that may be associated with incomplete data linkage. This association may result in a systematic bias among clinical outcomes reported from such studies.400

An additional limitation of some registries such as the National Cancer Institute’s Surveillance Epidemiology and End Results (SEER) registries is that they do not identify the recurrence of cancer. Researchers at Kaiser Permanente are trying to address this gap by looking for utilization clusters in claims as well as digital images to identify recurrence. The potential to use pattern recognition to analyze digital images may increase the accuracy of automated approaches to identify cancer incidence for registries and other purposes, potentially finding more than the human eye could have recognized.

In addition to registries, other systems that exist for surveillance purposes may provide useful electronic information. For example, the FDA’s Mini-Sentinel Network is a large multi-system collaboration to track exposure to specific drug products and to conduct case-control studies to identify unexpected adverse events. Participating sites agreed to make their patient medical records available to verify any statistically-identified associations. Because this effort is classified as public health surveillance, no IRB compliance is required.

Genetic Data

As the field of genomics has rapidly evolved in recent years, the routine generation of genetic data for individual patients has received much attention from the general public. The clinical utility is now limited by current inability to effectively process, store, update and interpret genetic data while protecting patient privacy.401 However, efforts have begun to integrate genetic data into EHRs,402, 403 opening many additional possibilities for research. For example, the mining of EHRs with genetic data may reveal previously unknown disease correlations based on patient genetic make-up.404

The National Health and Nutrition Examination Survey (NHANES) collected DNA specimens from participants from 1999 to 2002, which may be used for secondary analysis and can be linked with the survey data. For permission to use the data, researchers may submit proposals to the Centers for Disease Control’s Research Data Center (RDC) for approval, and analysis must occur at an RDC location.405 In a study funded by the NIH, Kaiser Permanente in California has been able to link genetic information with its EHRs. By collecting saliva from 100,000 members, Kaiser has examined the associations between genetics and smoking and drinking habits as well as body mass index.406 While these saliva samples were expressly collected for research purposes, there have been other instances where blood or other biospecimens collected for medical purposes were reused for research.407 Instances such as these bring to light the need for clearer consensus and guidelines about the appropriate secondary use of information collected for clinical purposes. One example that may serve as a potential model is the open-consent framework used for the Personal Genome Project, where consent implies research participants accept that their data could be included in a public, open-access database with no guarantee of anonymity and confidentiality.408

Other Data Sources

A number of other data sources provide opportunities for linkages with EHRs. For example, claims data in the Healthcare Cost and Utilization Project (HCUP) databases now feature new linkage capabilities, including the ability to link to clinical data from labs, trauma registries, EMS data and nurse staffing data.409 AHRQ has sponsored a number of clinical data pilots to demonstrate the feasibility of linking hospital lab data with HCUP data.410 Claims data may be an important supplemental source when studying insured populations because they can provide information on care provided across health systems. Claims data may also currently be better than EHRs for identifying utilization such as visits or procedures. Although many health care organizations are now using EHRs to bill, EHRs likely only include their own claims, requiring claims for care received elsewhere to be obtained from another source such as the payer.411 The increase of digital data in all health care settings presents numerous opportunities for research.

In addition, the emergence of care management software programs that track weight, exercise, and medication adherence provides additional information that some providers are entering into EHRs. These programs may download data from pedometers to measure aerobic activity,412 and have been used for employee incentive programs run by employers or insurance companies. There remains much potential to develop interfaces whereby these types of programs can directly link to EHR systems. There has also been interest in incorporating personal health data from social networking websites and applications on mobile devices into health records for medical care as well as research and public health surveillance. For example, entries on Twitter about disease outbreaks have been correlated with official public surveillance data (although both reflect public concern rather than actual documentation of disease). Or, tracking consumers’ online behavior could be linked with bioinformatics. However, use of these data for such purposes presents complications in terms of privacy and consent because the lines between public and private are increasingly blurred online.413

Linking to state and county data sources has allowed some of the organizations we interviewed to better understand their patient population.414 KP often links its data to the California Department of Developmental Services’ database for its ASD patients. However, they are unable to link to the patient’s educational records due to state laws.415 The ability to link EHR data to public school records would be ideal for research on autism spectrum disorders because individuals are often identified in both places and in theory should be managed jointly between the pediatrician and the school.416 Linking to outside data sets also allows research on the population level, for which Essentia has linked its EHR to publicly available state and county data.417 State employee health plans such as the California Public Employees’ Retirement System (CalPERS), which covers active and retired state and local government employees and their family members, may also be a potential data source of demographic and administrative information, diagnosis as well as information on spending.418

There have been a number of recent federal efforts to increase the availability of social, demographic, and behavioral data using a variety of data sources. AHRQ has recently awarded grants from the American Recovery and Reinvestment Act to enhance race/ethnicity information in statewide hospital encounter databases, another source of patient information. State grantees are taking a number of approaches to enhancing data, from standardizing, educating and auditing hospitals as they report R/E/L data to revising administrative codes to include a mandate.419 Also, CMS has recently commissioned a study to examine the barriers to collecting social and behavioral data from EHRs for Stage 3 of the meaningful use program, and how to overcome these obstacles. This study will identify the core social and behavioral domains that should be included in an EHR, possibilities for linking EHRs to public health departments, social service agencies, and other non-health care organizations, as well as case studies where such links have been established and how privacy issues were addressed.420

In addition, as EHR adoption increases, EHR data plays an increasingly important role in national health surveys such as the National Ambulatory Medical Care Survey (NAMCS), which collects information on practice characteristics and patient visits by abstracting data from a sample of patient medical records from each participating practice. While previously limited to national and regional estimates, the Affordable Care Act has funded a sample increase that will allow for state-based estimates of clinical preventive services.421 This survey also collects information on EHR adoption, as previously described.
