Minutes of the Technical Assistance Workshop, May 3-5, 2000. Topic 2: Data Mining


Robert Goerge of Chapin Hall led the session. Goerge asked the audience about their areas of interest in order to tailor his examples. Four of the audience members had health backgrounds. The others represented education, child welfare, and human services in general. A data warehouse was defined as a repository for storing integrated information for efficient querying and analysis. A data mine was defined as a more specific data repository designed for the analysis of particular questions. Goerge noted that many states were making efforts to put data in one place through a data warehouse and said that the focus of the session would be on warehousing individual-level data.

A member of the audience asked for clarification on the definition of a data mine. She thought that a data mine was more targeted than a data warehouse. Goerge indicated that it was not only the specificity of the subject of the data mine that distinguished it from a data warehouse, but also the way in which the data are connected. Goerge diagrammed a data warehouse. In the example, he depicted a warehouse that contained a master index of all individuals and tables on Medicaid recipients, TANF recipients, foster children, children in special education, individuals receiving childcare subsidies, and immunizations. Goerge pointed out that a common identifier linked each table. He suggested that each table or topic area could be thought of as an individual data warehouse, in addition to the entire set functioning as one, and, by way of example, showed a diagram of a database that Chapin Hall created for the Illinois Department of Children and Family Services (DCFS).
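
The layout Goerge diagrammed might be sketched as follows. This is a minimal illustration, not the schema of any actual state system: the table and column names are invented, SQLite merely stands in for a production database, and only a few of the topic areas are shown.

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
    -- Master index: one row per person, keyed by a common identifier.
    CREATE TABLE master_index (
        person_id  INTEGER PRIMARY KEY,
        last_name  TEXT,
        first_name TEXT,
        birth_date TEXT
    );

    -- Each topic area sits in its own table, linked by the identifier.
    CREATE TABLE medicaid_spells (
        person_id  INTEGER REFERENCES master_index(person_id),
        start_date TEXT,
        end_date   TEXT
    );

    CREATE TABLE foster_care_placements (
        person_id      INTEGER REFERENCES master_index(person_id),
        county         TEXT,
        placement_date TEXT,
        placement_type TEXT
    );
    """)
    conn.commit()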

The DCFS Database combines two areas in the child welfare system: abuse and neglect reporting and child welfare services. Goerge indicated that this data warehouse links the two areas by identifying common individuals using probabilistic record linkage. He emphasized that the data can be analyzed only when one is certain that individual records are linked correctly. He added that conducting analysis directly from a full data warehouse is cumbersome, so the practice is to create smaller datasets that contain summary information. These smaller datasets are data mines.
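
Probabilistic record linkage itself was not walked through in detail. The sketch below illustrates the general idea in the Fellegi-Sunter style: each comparison field contributes a log-likelihood-ratio weight, and pairs scoring above a threshold are treated as the same individual. The field weights and the threshold here are invented for illustration.

    import math

    # Illustrative m- and u-probabilities: the chance a field agrees among
    # true matches (m) versus among unrelated pairs (u). Real values are
    # estimated from the data.
    FIELD_PROBS = {
        "last_name":  (0.95, 0.01),
        "first_name": (0.90, 0.02),
        "birth_date": (0.97, 0.001),
        "county":     (0.90, 0.10),
    }

    def match_weight(rec_a, rec_b):
        """Sum log-likelihood-ratio weights over the comparison fields."""
        total = 0.0
        for field, (m, u) in FIELD_PROBS.items():
            va, vb = rec_a.get(field), rec_b.get(field)
            if va is None or vb is None:
                continue  # a missing field contributes no evidence
            if va == vb:
                total += math.log2(m / u)              # agreement weight
            else:
                total += math.log2((1 - m) / (1 - u))  # disagreement weight
        return total

    # Pairs above an upper threshold are linked; a middle band of scores
    # would typically be sent to clerical review.
    a = {"last_name": "Smith", "birth_date": "1992-04-01", "county": "Cook"}
    b = {"last_name": "Smith", "birth_date": "1992-04-01", "county": "Cook"}
    print("link" if match_weight(a, b) > 12 else "no link")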

One of the audience members raised the question of confidentiality concerns across agencies and asked whether confidentiality was also a concern within agencies. Goerge pointed out that one of the best ways to relieve confidentiality concerns is to emphasize that the data will only be presented in an aggregate format, say in groups of 5 or more individuals, and will not be used for case management. He said that he was not aware of any confidentiality breaches and that, because of concerns for confidentiality, some information, such as birthdates, might be excluded from datasets. He also described security procedures to restrict access to data.
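
Continuing the illustrative schema above, aggregate-only reporting of the kind Goerge described can enforce the minimum cell size directly in the query, so that no published count describes a group of fewer than five individuals:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # Publish counts of foster children by county, suppressing any cell
    # with fewer than 5 individuals so no small group is identifiable.
    rows = conn.execute("""
        SELECT county, COUNT(DISTINCT person_id) AS n_children
        FROM foster_care_placements
        GROUP BY county
        HAVING COUNT(DISTINCT person_id) >= 5
    """).fetchall()
    for county, n_children in rows:
        print(county, n_children)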

Goerge offered examples of agencies that have little trouble acquiring outside data. Child support was the best example of an agency that has been granted access to almost all data by the legislature. Children in foster care are Medicaid eligible, so child welfare agencies are granted access to their Medicaid claims data. Child welfare and TANF agencies regularly exchange data in order to determine IV-A funding eligibility.

Updating. To be useful, warehoused data must be unduplicated and up to date. Illinois updates its data warehouse monthly. The decision of how often to update largely determines the effort it takes to maintain the data warehouse.
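
Under the same illustrative schema, a monthly refresh might look like the following; a uniqueness constraint keeps repeated extracts from the source system from duplicating rows:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # One spell per person and start date; repeated monthly extracts from
    # the source system then cannot create duplicate rows.
    conn.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS uq_medicaid_spell
        ON medicaid_spells (person_id, start_date)
    """)
    monthly_extract = [
        (101, "2000-04-01", None),          # spell still open at load time
        (102, "2000-04-03", "2000-04-21"),
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO medicaid_spells VALUES (?, ?, ?)",
        monthly_extract,
    )
    conn.commit()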

Geographic identifiers. An audience member asked if the address is used to link records. Goerge said that addresses are not usually used unless one knows that the address is from the current time period. He added, however, that having some piece of geographic data--such as a county code--improves the quality of the match. To support small-area analysis, the data are geocoded.
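
One common way a geographic field such as a county code enters the linkage, beyond serving as a comparison field, is as a blocking key: candidate pairs are formed only within the same county, which both limits the number of comparisons and screens out pairs that agree on name by coincidence. Goerge did not spell this out at the session; the sketch below is a generic illustration.

    from collections import defaultdict

    def candidate_pairs(records):
        """Compare records only within the same county (blocking)."""
        blocks = defaultdict(list)
        for rec in records:
            blocks[rec["county"]].append(rec)
        for recs in blocks.values():
            for i in range(len(recs)):
                for j in range(i + 1, len(recs)):
                    yield recs[i], recs[j]

    # Only the two Cook County records are compared; the DuPage record
    # never enters the comparison at all.
    people = [
        {"last_name": "Smith", "county": "Cook"},
        {"last_name": "Smyth", "county": "Cook"},
        {"last_name": "Smith", "county": "DuPage"},
    ]
    for a, b in candidate_pairs(people):
        print(a["last_name"], "vs", b["last_name"])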

Standardization. Standardization is required across databases, and how to standardize service events is a major issue. The standardization process involves reconciling the master index and the events across the data warehouse and resolving all inconsistencies before analysis.
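
As an illustration of what reconciling service events involves, codes recorded differently in each source system might be mapped to one standard category through a crosswalk; the codes and categories below are invented:

    # Hypothetical crosswalk from source-system service codes to one
    # standard event category used throughout the warehouse.
    CROSSWALK = {
        ("DCFS",     "FC01"): "foster_care_placement",
        ("DCFS",     "IN02"): "abuse_neglect_investigation",
        ("MEDICAID", "0120"): "inpatient_stay",
    }

    def standardize(source, code):
        try:
            return CROSSWALK[(source, code)]
        except KeyError:
            # Unmapped codes are surfaced rather than silently dropped, so
            # inconsistencies are resolved before any analysis is run.
            raise ValueError(f"no standard category for {source}/{code}")

    assert standardize("DCFS", "FC01") == "foster_care_placement"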

Efficiency. Efficient information storage requires each piece of information to be stored once in the data warehouse. A second key to efficiency is linking events. For example, within a master index, we want to know the relationship between individuals. A link table that presents the relationship between pairs of individuals would significantly boost efficiency. Linking individuals or events and storing those links in link files is a good way to make queries more efficient.
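
A link table of the kind described might be sketched as follows, again continuing the illustrative schema; the table, column, and relation names are hypothetical:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
    -- Each relationship between two people in the master index is stored
    -- once, as a pair of identifiers plus a relationship type.
    CREATE TABLE person_link (
        person_id_1 INTEGER REFERENCES master_index(person_id),
        person_id_2 INTEGER REFERENCES master_index(person_id),
        relation    TEXT   -- e.g. 'mother-of', 'sibling-of'
    );
    """)
    # Finding a child's mother becomes a single lookup in the link table
    # rather than a scan of the underlying case records.
    row = conn.execute("""
        SELECT person_id_1 FROM person_link
        WHERE person_id_2 = ? AND relation = 'mother-of'
    """, (101,)).fetchone()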


A data warehouse is a good way to deliver information to users and to store information safely. It provides end users with a single data model and query language. Goerge said that Florida and Ohio had two of the largest data warehousing efforts that he knew of. He pointed out that, in response to welfare reform, many states were interested in linking their TANF and child welfare system data in order to measure the effect of welfare reform on the child welfare system. Many data warehouses are designed specifically for this research. Regarding standardization of database formats, the audience and Goerge pointed out that, unlike in health and education, there is no national center for social program statistics. It was thought that such a center would benefit the standardization of data definitions and the development of data warehouses.