Minimizing Disclosure Risk in HHS Open Data Initiatives. 3. Secure Remote Access

09/29/2014

A number of federal agencies allow users remote access to agency data that are not released on public use files. This can take a number of different forms. For example, the Census Bureau allows users to request tabulations from decennial census files that include more detail than the numerous tabulations that can be obtained from the bureau website (FCSM 2005). The requests are reviewed to ensure that the tabulations do not present a disclosure risk. The National Center for Health Statistics allows approved RDC users to submit programs remotely, although the software that can be used for this purpose is more limited than what is available in the RDC, and certain functions are not accessible. The advantage to the user lies in not having to travel to the RDC, which can be important when the research can be conducted most efficiently with numerous, intermittent program submissions, each requiring extensive review of the results before the next program can be prepared. Some RDCs charge a daily fee for in-person visits, which can make a series of brief visits very costly.[7]

Lane and Schur (2010) discuss the benefits of establishing secure, remote-access entities or “data enclaves” that enable researchers to access confidential data from their desks. Kinney et al. (2009) discuss technical developments with respect to remote, “query-based access,” where sophisticated software restricts what the remote user can see or obtain from the data. They view this as an emerging research area.

A major challenge for preserving confidentiality through remote access is preventing users from submitting a sequence of requests that, while individually innocuous, can collectively elicit more detailed information from the source data than would be permitted through a single request. For example, Oganian et al. (2009) show that it may be possible to defeat some of the confidentiality protection strategies discussed in the next section through suitably designed queries. Because there may be legitimate reasons why a user would submit an extensive sequence of requests, a technical solution would be preferable to simply restricting the number of requests over a period of time and monitoring similar requests from different users.
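The risk posed by a sequence of individually acceptable requests can be illustrated with a small sketch of a "differencing" attack. Everything here is assumed for illustration: the toy records, the field names, the `count_query` helper, and the minimum cell size of 3 are hypothetical and do not describe any agency's actual system or the specific methods examined by Oganian et al. (2009).

```python
# Toy records invented for illustration; not drawn from any real file.
RECORDS = [
    {"id": 1, "county": "A", "age": 34, "has_condition": True},
    {"id": 2, "county": "A", "age": 45, "has_condition": True},
    {"id": 3, "county": "A", "age": 52, "has_condition": False},
    {"id": 4, "county": "A", "age": 61, "has_condition": True},
    {"id": 5, "county": "A", "age": 87, "has_condition": True},
]

MIN_CELL_SIZE = 3  # assumed threshold applied to each query on its own


def count_query(records, predicate):
    """Return a count, or None (suppressed) if the cell is below the threshold."""
    n = sum(1 for r in records if predicate(r))
    return n if n >= MIN_CELL_SIZE else None


# A direct request about the single resident aged 85+ is suppressed.
direct = count_query(
    RECORDS,
    lambda r: r["county"] == "A" and r["age"] >= 85 and r["has_condition"],
)

# Two broader requests each pass the per-query check...
q1 = count_query(RECORDS, lambda r: r["county"] == "A" and r["has_condition"])
q2 = count_query(
    RECORDS,
    lambda r: r["county"] == "A" and r["has_condition"] and r["age"] < 85,
)

# ...but their difference recovers the suppressed cell.
inferred = q1 - q2

print("direct query:", direct)
print("q1:", q1, "q2:", q2, "inferred count for age 85+:", inferred)
```

In this sketch each request is checked in isolation, which is why the difference leaks the suppressed value. A per-query threshold alone does not address the problem described above; a defense would have to take the history of requests into account.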

A variation on this approach that appears to be growing in popularity might be described as a hybrid of restricted access and restricted data. Using methods described in the next section (but generally relying heavily on the synthetic method), an agency creates a public use file with limited or untested analytical value but with the same structure and many of the same variables as the source file. Analysts can use this public file to write sophisticated programs that address all the features of the data and then submit these programs to be run on the source data. The output will be subject to the same review as other output from remote access, but the process is far more efficient. This approach can also be used to determine whether particular inferences drawn from the public data are valid, that is, supported by the source data.
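The hybrid workflow can be sketched in a few lines. This is a minimal illustration under stated assumptions: the row layout, the `analyst_program`, `review_output`, and `run_remotely` functions, and the cell-size rule standing in for output review are all hypothetical, and actual review may rely on human judgment rather than a fixed rule.

```python
from collections import Counter

MIN_CELL_SIZE = 3  # assumed review rule; real output review may differ


def analyst_program(rows):
    """Analyst's code: tabulate records by region and service type."""
    return Counter((r["region"], r["service"]) for r in rows)


def review_output(table, min_cell=MIN_CELL_SIZE):
    """Stand-in for output review: suppress cells below the minimum size."""
    return {cell: (n if n >= min_cell else None) for cell, n in table.items()}


def run_remotely(program, source_rows):
    """Agency side: run the submitted program on source data, then review."""
    return review_output(program(source_rows))


# Public synthetic file: same structure as the source, values not meaningful.
SYNTHETIC_ROWS = [
    {"region": "East", "service": "inpatient"},
    {"region": "West", "service": "outpatient"},
]

# Confidential source file (held by the agency; invented here for the sketch).
SOURCE_ROWS = (
    [{"region": "East", "service": "inpatient"}] * 4
    + [{"region": "East", "service": "outpatient"}] * 1
    + [{"region": "West", "service": "inpatient"}] * 3
)

# Step 1: the analyst develops and debugs against the synthetic file.
draft = analyst_program(SYNTHETIC_ROWS)

# Step 2: the same program is submitted and run on the source data.
released = run_remotely(analyst_program, SOURCE_ROWS)

print("draft (synthetic):", dict(draft))
print("released (source, after review):", released)
```

The point of the design is that the program is written once against a file with the right structure, so only a working program, rather than a series of debugging attempts, needs to be run and reviewed on the source data.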

Illustrating this approach, Borton et al. (2013) used the synthetic method to develop a public use file of Medicare claims data with “pseudo-analytic” utility. The file retained the structure of the original database, but in generating variables by the synthetic method the team deliberately excluded modeling of certain key relationships. Unlike other synthetic files discussed below, this file was designed explicitly as a test file that would not support valid inference. Rather, the file was designed to enable entrepreneurs to develop applications and researchers to become familiar with the source data. Code can be developed and debugged with the synthetic file with the intention that it would later be run on the source data, accessed on a restricted basis.


[7] Remote access may not be free, however. The new Virtual Research Data Center at CMS charges $40,000 for an annual “seat,” which is defined as an individual user working on one project. The fee includes training, output review, and 500 GB of disk space. Additional users can be added to an existing project at a cost of $15,000 per user.
