Jim Scanlon, ASPE
Thank you for participating today and giving us the benefit of your expertise and experience. Let me set the stage for today’s discussion. We have two objectives for today’s meeting. First, we will discuss and assess the extent to which additional risks of re-identification in federal statistical data access initiatives may arise because of the increasing availability of data and datasets from other sources, the so-called “mosaic effect.” The concept of the mosaic effect comes from the security community, where different types and sources of data brought to bear on a problem can yield new insights once the individual pieces are combined. Does the risk of re-identification increase as more datasets become available?
Second, we will discuss the current and emerging body of statistical disclosure avoidance policies and techniques used by federal statistical programs and activities today, as well as promising areas of new research. Our focus is whether the current set of statistical disclosure avoidance techniques and data release mechanisms is adequate to protect confidentiality in light of more data becoming available from all sources, or whether new techniques and access mechanisms are needed. Again, our interest today is in federal statistical data access and release activities, and primarily in microdata releases.
Federal statistical agencies have a long tradition of making data available through a broad continuum of access policies and mechanisms. Agencies constantly balance access to the data with protecting the confidentiality of the individuals who provide the information, weighing increasing demands for data against confidentiality protection in the face of advancing technology and other publicly available data. Several initiatives have been launched in recent years to make federal data more available and accessible: the Federal Open Data Initiative, President Obama’s Executive Order 13642, and other health data initiatives.
As many of you are aware, a large body of effective and well-developed statistical disclosure avoidance policies, techniques and practices exists, and virtually all federal agencies use them in their data release policies. We will discuss those techniques here today, and they are described in OMB Statistical Policy Working Paper 22. Similarly, there is a continuum of mechanisms through which federal agencies release data, including public use data files, de-identified data, data use agreements and restricted access mechanisms such as research data centers (RDCs).
So the goal of today’s meeting is a self-assessment regarding 1) potential new threats to data privacy protection in federal agency statistical practice and 2) our current capacity to address them. Our preliminary sense is that the current portfolio of disclosure avoidance techniques is close to the state of the art and effective in protecting against disclosure, but we are eager to learn of any new issues, risks, approaches or areas of research that we might want to pursue.
This morning we will hear about the current state of the art in federal statistical disclosure avoidance techniques as well as some new developments. This afternoon we’ll hear about specific statistical agency practices. And then we will discuss potential new threats to data protection.
Question from Brad Malin (Vanderbilt): Can you please define “open data”?
Scanlon: We view the concept of open data as a policy of making federal data available for whatever purpose and in whatever form would be helpful, with appropriate confidentiality protections. For example, we have a long tradition of making survey data, research data and administrative data available to a variety of data users. The data is either de-identified or made available through a data use agreement or a restricted access data center.