- Types of disclosure risk in microdata
- Managing the risks of identification
- Assessing potential identification risks
- Protecting microdata
- Factors that increase the risk of identification

Microdata are unit record data where each record represents observations for a person or an organisation. Microdata contain individual responses to questions on survey questionnaires or administrative forms, and can include identifying information such as name, address, telephone number and age.

Microdata are a valuable resource for researchers and policy makers. The challenge for data custodians is striking the right balance between fulfilling obligations to protect the identity of individuals and organisations, and maximising the information available for statistical and research purposes. This requires careful weighing of the identification risks against the benefits.

Types of disclosure risk in microdata

Spontaneous recognition, an identification made without any deliberate attempt, can occur if individuals with rare characteristics are present in the data. The identification risk this poses depends on how remarkable the characteristics are. For example, the dataset may include people with unusual jobs (e.g. pop star or judge) or very large incomes, which are highly visible in the data and could lead to their identification.

Deliberate attempts to identify a person or an organisation in a dataset may include, for example, list matching (matching unique records to external files using a combination of characteristics common to both datasets) or a ‘record attack’ (where a user tries to find a particular person or organisation with a set of characteristics known to the user).

Managing the risks of identification

The risks associated with providing access to microdata can be mitigated in a number of ways, including:
- treating the data directly (confidentialising);
- deterring any motivation to attempt an identification (e.g. through legally enforceable undertakings not to attempt identification and penalties for breaching the undertakings);
- restricting access;
- educating data users about the importance of protecting privacy and managing the risks of identification, as well as their obligations in relation to these (e.g. by providing training manuals and detailed instructions); and
- ensuring data is accessed safely through an appropriate environment.
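
To make the list matching risk described above more concrete, the following is a minimal Python sketch, not a method prescribed by this information sheet. All records, names, field names and the choice of matching variables (`MATCH_KEYS`) are hypothetical and exist only for illustration. It shows how a record that is unique on a combination of characteristics shared with an external named file can be linked back to a person, while a record that shares its combination with others cannot be linked with certainty.

```python
from collections import Counter

# Hypothetical de-identified microdata: names removed, other characteristics kept.
microdata = [
    {"record_id": 1, "age": 34, "sex": "F", "occupation": "teacher", "income": 62000},
    {"record_id": 2, "age": 34, "sex": "F", "occupation": "teacher", "income": 58000},
    {"record_id": 3, "age": 51, "sex": "M", "occupation": "judge", "income": 310000},
    {"record_id": 4, "age": 27, "sex": "M", "occupation": "nurse", "income": 55000},
]

# Hypothetical external file (for example a public list) that does carry names.
external_list = [
    {"name": "Person A", "age": 51, "sex": "M", "occupation": "judge"},
    {"name": "Person B", "age": 34, "sex": "F", "occupation": "teacher"},
]

# Assumed set of characteristics common to both files.
MATCH_KEYS = ("age", "sex", "occupation")


def key_of(record, keys=MATCH_KEYS):
    """Build the combination of characteristics used for matching."""
    return tuple(record[k] for k in keys)


def list_match(microdata, external_list, keys=MATCH_KEYS):
    """Return (name, record) pairs where an external entry matches
    exactly one microdata record on the chosen characteristics."""
    counts = Counter(key_of(r, keys) for r in microdata)
    unique_records = {
        key_of(r, keys): r for r in microdata if counts[key_of(r, keys)] == 1
    }
    matches = []
    for entry in external_list:
        record = unique_records.get(key_of(entry, keys))
        if record is not None:
            matches.append((entry["name"], record))
    return matches


matches = list_match(microdata, external_list)
for name, record in matches:
    print(name, "-> record", record["record_id"], "income", record["income"])
```

In this made-up example only the record with the rare occupation is linked, because the two teacher records share the same combination of age, sex and occupation. This is why treatments that remove uniqueness reduce the risk from list matching.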

Assessing potential identification risks

Assessing microdata for identification risks is a subjective process which requires a detailed examination of the data. Methods to assess identification risk in microdata include:
- cross-tabulation of variables (for example, looking at age by income or marital status) to determine unique combinations that may enable a person or an organisation to be identified;
- comparing sample data with population data to determine whether the unique characteristics in the sample are unique in the population; and
- acquiring knowledge of other datasets and publicly available information that could be used for list matching.

The risk of identification can also be assessed by considering factors that contribute to the likelihood of identification (see below).

Various software packages are available to help assess potential identification risks. These include:
- Mu-ARGUS: a software package developed by Statistics Netherlands. The software is designed to protect against spontaneous recognition only and does not attempt to protect against list matching.
- SUDA: software developed by the University of Manchester. SUDA stands for 'Special Unique Detection Algorithm'. It examines unit record data files and looks for records that are at risk of identification because they have unique combinations of characteristics.

Protecting microdata

Two common approaches to protecting microdata are confidentialising the file and restricting access to it. The two approaches can be used together.

CONFIDENTIALISING MICRODATA

Data perturbation and data reduction methods are used to confidentialise microdata. These are the same basic principles used to protect aggregate data. Common techniques to confidentialise microdata include:
- limiting the number of variables included in the dataset;
- introducing small amounts of random error (e.g. rounding or data swapping);
- combining categories that are likely to enable identification (e.g. giving age in five year ranges);
- top/bottom coding extreme values of continuous variables like income or age;
- suppressing particular values or records that cannot otherwise be protected from the risk of identification; and
- data swapping: this involves swapping a value in an identifiable record with a value in another record with similar characteristics to hide the uniqueness of the record. For example, a record with a unique language spoken in the region could be swapped with a similar record (based on age, sex, income etc.) in another region where the language is more commonly spoken.

RESTRICTING ACCESS TO THE MICRODATA FILE

Providing controlled access to microdata is important in protecting the data from identification (or disclosure) risk. Access to detailed microdata should only be provided under the strictest conditions to approved researchers for an approved purpose. Generally, researchers must sign undertakings to abide by specified conditions for access and use of the data.

The extent to which the files are confidentialised will determine how the files are accessed. The more detailed the information, the more protection is required when providing access to microdata.

One way of releasing microdata is in the form of Confidentialised Unit Record Files (CURFs). These are files that have been confidentialised to ensure that the direct and indirect identification of individuals or organisations is highly unlikely.
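
As a rough illustration of how two of the data reduction treatments listed under CONFIDENTIALISING MICRODATA (combining categories and top coding) interact with the cross-tabulation check described under 'Assessing potential identification risks', the following Python sketch counts unique combinations of characteristics before and after treatment. The records, the field names, the five year age bands, the income band width and the top-code threshold are all assumptions chosen for illustration. They are not recommended settings, and a real assessment would also compare the sample with population data.

```python
from collections import Counter

# Hypothetical unit records (illustrative values only).
records = [
    {"age": 23, "marital_status": "single", "income": 41000},
    {"age": 24, "marital_status": "single", "income": 43000},
    {"age": 37, "marital_status": "married", "income": 72000},
    {"age": 38, "marital_status": "married", "income": 75000},
    {"age": 36, "marital_status": "married", "income": 950000},
    {"age": 39, "marital_status": "married", "income": 120000},
]

KEYS = ("age", "marital_status", "income")
AGE_BAND_WIDTH = 5         # assumed: age given in five year ranges
INCOME_BAND_WIDTH = 50000  # assumed band width for income
INCOME_TOP_CODE = 100000   # assumed threshold for top coding


def age_band(age, width=AGE_BAND_WIDTH):
    """Combine single years of age into ranges, e.g. 23 -> '20-24'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def income_band(income, width=INCOME_BAND_WIDTH, cap=INCOME_TOP_CODE):
    """Band income and top code everything at or above the cap."""
    if income >= cap:
        return f"{cap}+"
    low = (income // width) * width
    return f"{low}-{low + width - 1}"


def confidentialise(record):
    """Apply the two treatments to one record, leaving other fields as is."""
    return {
        "age": age_band(record["age"]),
        "marital_status": record["marital_status"],
        "income": income_band(record["income"]),
    }


def unique_combinations(rows, keys=KEYS):
    """Cross-tabulate the key variables and return combinations held by one record."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


treated = [confidentialise(r) for r in records]
print("unique before:", len(unique_combinations(records)))
print("unique after: ", len(unique_combinations(treated)))
```

In this made-up data every raw record is unique on age, marital status and income, and none is unique after treatment. The trade-off is a loss of detail, which is why the level of treatment is weighed against how the file will be accessed.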

A highly confidentialised microdata file, such as a CURF, may be released publicly on CD-ROM. However, if more detail is left in the CURF, more secure ways of accessing the data need to be used.

Researchers who need a high level of detail may have to access the data through a very secure environment. Secure on-site data laboratories are one way of achieving this. The microdata held in these laboratories are de-identified and should have some level of confidentialisation to avoid spontaneous recognition, but may still contain data that would allow indirect identification. For this reason, access to these data is available only at data custodian access sites, so that all output generated is confidentialised before it leaves the premises.

Another option is providing access to microdata through a remote access facility. Remote access facilities are used by statistical agencies and research organisations around the world, and mean data users can access microdata from their desktop. Approved researchers can submit data queries through a secure internet-based interface. Requests are generally run against the microdata, which are securely stored within the data custodian’s computing environment. The results of the queries are confidentialised.

For definitions of terms used in this information sheet, see the glossary for the Confidentiality Information Series.

For more information about managing the confidentiality of microdata, or to provide feedback on this series, please email: inquiries@nss.gov.au

Factors that increase the risk of identification