Data processing is the transformation of collected data into statistics and outputs. The processing phase includes activities such as data capture, coding, editing, weighting, estimation, validation and monitoring.
5.1 DATA CAPTURE
Data capture is the process of transferring data provided by respondents to a computer system.
Historically, data capture was carried out by physically keying data from survey forms or questionnaires into a computer. However, most data capture now occurs electronically. Common electronic data capture methods include optical character recognition (OCR), optical mark recognition (OMR), Computer Assisted Telephone Interviewing (CATI) and Computer Assisted Personal Interviewing (CAPI). More recently, with the widespread use of the Internet, electronic forms (e-forms) are increasingly being used for data collection and capture. For example, e-forms were successfully used by 10% of Australian households in the 2006 Population Census.
OCR uses intelligent character recognition technology to scan and capture handwritten or typed responses from survey forms directly into data files. OMR uses scanning technology, together with strictly formatted forms, to interpret selections from predefined response options and translate them into a data file. In CATI and CAPI the interviewer enters responses directly into a computer or portable data capture device rather than onto a paper form. The main advantage of CATI, CAPI and e-forms is that responses are recorded directly in a data file, eliminating a step in the process and the potential errors that step introduces. CATI and CAPI also allow interviewers to query respondents further and correct errors at the data collection stage.
5.2 CODING
Coding is the process of converting questionnaire information into numbers or symbols to facilitate subsequent data processing operations. For closed questions this is a simple mapping: for example, males may be assigned the value one and females two when coding responses to a question on the respondent's sex. Where a respondent answers a question in their own words (for example, 'What is your occupation?'), coding involves interpreting the response and classifying it into one of a set of predetermined categories. Using codes to mark responses greatly assists interpretation and processing.
There are various forms of coding, including manual coding, computer assisted coding and automatic coding. Manual coding involves entering a code from an index (e.g. male equals 1, female equals 2). Computer assisted coding involves entering a truncated form of a response into a computer and selecting from a restricted range of entries displayed on the screen, with the code written automatically to the data file. With automatic coding, the computer system codes information which has been previously captured and assigns a code on the file.
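As an illustration, the following Python sketch shows a highly simplified automatic coder: it looks up free-text occupation responses against an index of codes, falling back to a "not codeable" value for computer assisted or manual follow-up. The index entries and code values are invented for illustration and are not drawn from any actual classification.

```python
# Minimal sketch of automatic coding: map free-text responses to codes
# using a lookup index. The index and codes below are hypothetical
# examples, not an actual statistical classification.

CODE_INDEX = {
    "registered nurse": 2544,
    "nurse": 2544,
    "secondary school teacher": 2414,
    "teacher": 2414,
    "truck driver": 7331,
}

NOT_CODEABLE = 9999  # flag responses for computer assisted or manual coding

def auto_code(response: str) -> int:
    """Return the code for a free-text response, or NOT_CODEABLE."""
    normalised = response.strip().lower()
    return CODE_INDEX.get(normalised, NOT_CODEABLE)

for r in ["Registered Nurse", "truck driver", "astronaut"]:
    print(r, "->", auto_code(r))  # 'astronaut' falls out for manual coding
```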
The use of standards and classifications ensures consistency and assists comparison of data across time and across different collections. A number of standard classifications published by the ABS contain lists of standard codes and collection methods for a range of topics such as occupation, industry, qualification and language. See Chapter 10 – Statistical Infrastructure for further information on the frameworks used in statistical standards and classifications.
5.3 EDITING
Editing is the process of checking, altering or correcting data to ensure that, as far as possible, it is free from errors. This can be done manually and/or automatically. Editing can eliminate or reduce a number of non-sampling errors; the most common are data entry, data processing and interviewer errors. Various types of edits can be used, including balance edits, consistency edits, logical edits and range edits.
An edit is a logical condition or restriction on the value of a data item or response which must be met if the data is to be considered correct. Editing involves applying a set of test conditions that the data must meet. Corrective action can be, but is not always, taken when the data fails a test. A balance edit, for example, checks whether a reported total equals the sum of its reported parts; a failure can be reported for follow-up, or the reported total can be amended automatically.
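A minimal sketch of a balance edit in Python, assuming a record is held as a dictionary; the field names and the zero tolerance are hypothetical choices, not a prescribed implementation.

```python
# Minimal sketch of a balance edit: check that a reported total equals
# the sum of its reported parts. Field names are hypothetical.

TOLERANCE = 0  # exact match required; a small tolerance could absorb rounding

def balance_edit(record: dict, total_field: str, part_fields: list) -> bool:
    """Return True if the record passes the balance edit."""
    parts_sum = sum(record[f] for f in part_fields)
    return abs(record[total_field] - parts_sum) <= TOLERANCE

record = {"employees_total": 52,
          "employees_full_time": 30,
          "employees_part_time": 20}

if not balance_edit(record, "employees_total",
                    ["employees_full_time", "employees_part_time"]):
    # Either report the failure for follow-up...
    print("Balance edit failed")
    # ...or amend the reported total automatically, as described above.
    record["employees_total"] = (record["employees_full_time"]
                                 + record["employees_part_time"])
```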
5.3.1 When Does Editing Occur?
Editing can occur at various stages during data processing. Some of the stages of editing are:
· Data collection editing
Data collection editing covers edits applied while the data is being collected, such as checks built into CATI, CAPI or e-form instruments that allow errors to be queried and corrected with the respondent on the spot.
· Clerical editing
Clerical editing includes all editing undertaken manually before the data is loaded into a computer file. In large collections, clerical editing is normally restricted to a visual scan of forms to ensure that important data items are reported and that related items are completed.
· Input editing
Input editing involves checking unit level respondent data for completeness and internal consistency before individual responses are aggregated. Input editing can also be part of the data entry process.
In designing an input edit, consideration should be given to tolerance levels, clerical scrutiny levels, resource costs, respondent load and timing implications. For example, tolerances should be set at levels that avoid generating large numbers of edit failures.
· Output editing
Output editing includes all edits applied to the data once it has been weighted and aggregated in preparation for publication. Output editing often focuses on identifying the units with the largest effect on collection outputs and ensuring that the data for these units is correct and the consequent effect on outputs appropriate. For example, if a unit contributes a large amount to a subtotal or total, the response for that unit should be confirmed (a simple check of this kind is sketched below).
The amount of editing undertaken at each of these stages should be carefully determined based on the required level of data quality, budget, time and the availability of resources.
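As an illustration of output editing, the following Python sketch flags units whose contribution to an estimated total is large enough to warrant confirmation with the respondent; the data values and the 25% threshold are hypothetical.

```python
# Minimal sketch of output editing: flag units whose contribution to an
# estimated total is large enough to warrant confirmation. Data values
# and the threshold are hypothetical.

contributions = {"unit_A": 120_000, "unit_B": 8_000,
                 "unit_C": 95_000, "unit_D": 2_500}
total = sum(contributions.values())
THRESHOLD = 0.25  # confirm any unit contributing more than 25% of the total

for unit, value in sorted(contributions.items(),
                          key=lambda kv: kv[1], reverse=True):
    share = value / total
    if share > THRESHOLD:
        print(f"{unit} contributes {share:.0%} of the total - confirm response")
```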
5.3.2 Types of Edits
There are four main types of edits: validation edits, missing data edits, logical edits and consistency edits.
· Validation Edits
Validation edits are checks made between fields in a particular record. They include checking every field of every record to ascertain whether it contains a valid entry, and checking that entries are consistent with each other.
· Missing Data Edits
Missing data edits check that data which should have been reported was in fact reported, and that questions which should not have been answered were left unanswered. An answer to one question may determine which other questions are to be answered, and the editing system needs to ensure that the correct sequence of questions has been answered. Computer assisted interviewing programs reduce the need for these edits, as the correct sequencing can be programmed into them.
· Logical Edits
Logical edits are used to ensure that two or more data items do not contradict each other. For example, a 16 year old respondent recorded as receiving the age pension would fail this type of edit.
· Consistency (or Reconciliation) Edits
Consistency edits check determinant relationships, ensuring that arithmetic relationships between variables are obeyed. For example, the edit will fail if reported harvested acres exceed planted acres. Determining these edits sometimes requires knowledge of the subject matter. The sketch below illustrates all four edit types.
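The following Python sketch applies the four edit types to a single hypothetical record; the field names, valid codes and rules (including the pension age used in the logical edit) are invented for illustration.

```python
# Minimal sketch of the four edit types applied to one hypothetical record.
# Field names, codes and rules are invented for illustration.

record = {
    "sex": 1,                 # valid codes: 1 = male, 2 = female
    "age": 16,
    "income_source": "age_pension",
    "has_crops": "no",        # if "no", the crop questions should be blank
    "planted_acres": 100,
    "harvested_acres": 120,
}

failures = []

# Validation edit: every field must contain a valid entry.
if record["sex"] not in (1, 2):
    failures.append("validation: invalid sex code")

# Missing data edit: crop questions should only be answered
# when has_crops == "yes" (correct question sequencing).
if record["has_crops"] == "no" and record["planted_acres"] is not None:
    failures.append("missing data: crop questions answered out of sequence")

# Logical edit: assume the age pension requires age >= 65 (hypothetical rule).
if record["age"] < 65 and record["income_source"] == "age_pension":
    failures.append("logical: age inconsistent with age pension")

# Consistency edit: harvested acres cannot exceed planted acres.
if record["harvested_acres"] > record["planted_acres"]:
    failures.append("consistency: harvested acres exceed planted acres")

print(failures)
```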
Further information on editing is available in the Basic Survey Design Manual (Chapter 10) at www.nss.gov.au.
5.4 DERIVATION AND VALIDATION
Derivation refers to the creation of a new data item from existing data items. For example, total household income can be derived by adding together the incomes of all persons in the household.
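A minimal sketch of this derivation in Python, assuming person-level records carry a household identifier; the identifiers and income values are hypothetical.

```python
# Minimal sketch of derivation: create total household income by summing
# person-level incomes within each household. Data values are hypothetical.

persons = [
    {"household_id": 1, "income": 52_000},
    {"household_id": 1, "income": 38_000},
    {"household_id": 2, "income": 61_000},
]

household_income = {}
for p in persons:
    household_income[p["household_id"]] = (
        household_income.get(p["household_id"], 0) + p["income"]
    )

print(household_income)  # {1: 90000, 2: 61000}
```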
Validation is the comparison of survey or administrative data against other sources of information. It is a method of checking that the survey measures what it is supposed to measure, that is, that it is free of systematic error. A validation study compares data collected using a survey instrument with data considered to represent the "true value". Validation is an important aspect of ensuring data quality and is often referred to as data confrontation.
Generally, validation is not a prescriptive process. Some examples of validation methods are to compare the collected data with the following (a simple comparison of this kind is sketched after the list):
· complementary data sources such as other statistical and administrative collections on the same subject matter
· previous results of the same collection
· comments and forecasts by subject matter experts in the media
· management information produced by processing systems
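As an illustration, the following Python sketch confronts a survey estimate with a figure from a complementary source and flags a large discrepancy; the values and the 5% tolerance are hypothetical.

```python
# Minimal sketch of data confrontation: compare a survey estimate with a
# figure from a complementary source and flag large discrepancies.
# Values and the 5% tolerance are hypothetical.

survey_estimate = 1_240_000   # e.g. estimated total from the survey
benchmark = 1_180_000         # e.g. figure from an administrative collection
TOLERANCE = 0.05              # flag differences greater than 5%

relative_diff = abs(survey_estimate - benchmark) / benchmark
if relative_diff > TOLERANCE:
    print(f"Difference of {relative_diff:.1%} exceeds tolerance - investigate")
else:
    print(f"Difference of {relative_diff:.1%} within tolerance")
```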
5.5 MONITORING
5.5.1 Monitoring Performance
Data processing should be continuously monitored by the collection manager or other responsible staff to ensure that:
· survey forms and data are processed in line with plans
· resources are used effectively and efficiently
· deadlines and benchmarks are met for the different processing stages
· strategies are developed and implemented to deal with unexpected outcomes
· appropriate and timely feedback is provided to staff on any changes to be effected
5.5.2 Monitoring Quality
System-wide monitoring of processing systems and procedures should form an integral part of data processing, both to track the impact of the various processing steps on data quality and to determine whether existing edits are sufficient. Monitoring should look at issues such as the number of incomplete records, the number of units that fail edits and the effect of edits on final estimates. A high edit failure rate, for example, could indicate poor form design or tolerance levels that have been set too stringently.
Some procedures that can be used to monitor the quality of the processing system are:
· Random auditing, where a sample of responses is checked and verified. For example, many collections contact a sample of respondents to confirm the validity of the original responses.
· If data capture and data processing are undertaken electronically, some forms can also be processed manually to verify that systems such as OCR, OMR and automatic edits are performing correctly. For example, some respondents may mark a response box with a tick instead of the advised X mark; manual processing will verify whether the OMR has correctly interpreted such instances.
· Management information reports produced automatically at different stages of data processing. For example, reports could be produced on the number of records failing different edits (a simple report of this kind is sketched after this list).
· A register of problems and their solutions can be maintained. This can be used as a guide to resolve recurring or similar problems, and is also a valuable tool for identifying and improving processes.
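A minimal sketch of such a management information report, assuming each processed record carries a list of the edits it failed; the records and edit names are hypothetical.

```python
# Minimal sketch of a management information report: count edit failures
# by edit type across processed records. Records and edit names are
# hypothetical.

from collections import Counter

records = [
    {"id": 1, "failed_edits": ["balance"]},
    {"id": 2, "failed_edits": []},
    {"id": 3, "failed_edits": ["balance", "range"]},
    {"id": 4, "failed_edits": ["logical"]},
]

failure_counts = Counter(e for r in records for e in r["failed_edits"])
failure_rate = sum(1 for r in records if r["failed_edits"]) / len(records)

print(failure_counts)   # Counter({'balance': 2, 'range': 1, 'logical': 1})
print(f"Edit failure rate: {failure_rate:.0%}")
```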
5.6 WEIGHTING AND ESTIMATION
Sample surveys take information from only a subset of the population. The process by which information from a sample is extrapolated to represent the entire population is called estimation.
To create estimates which represent the entire population, weights are applied to respondent records. A weight is a multiplier applied to the information from a sampled unit; in an equal-probability 5% sample, for example, each unit receives a weight of 20, since it represents 20 population units. Methods for calculating the weights vary depending on the methodology used in sample selection.
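A minimal sketch of weighted estimation of a population total under the equal-probability example above; the sample values are hypothetical.

```python
# Minimal sketch of weighted estimation: each sampled unit's value is
# multiplied by its weight and summed to estimate the population total.
# Assumes an equal-probability 5% sample (weight = 20); data are hypothetical.

sample_values = [12, 7, 15, 9]   # responses from the sampled units
weight = 20                      # 1 / 0.05 sampling fraction

estimated_total = sum(weight * y for y in sample_values)
print(estimated_total)           # (12 + 7 + 15 + 9) * 20 = 860
```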
Measures of the accuracy of estimates are themselves a measure of data quality. A range of measures can be calculated, such as variances, standard errors and relative standard errors. For surveys which collect a large amount of data, it may not be feasible to produce accuracy measures for every data item; in that case, measures can be produced for a smaller, representative set of key data items.
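As an illustration, the following sketch computes the standard error and relative standard error of an estimated total under the simplifying assumption of simple random sampling without replacement; surveys with complex designs require variance formulas matched to the design. The population size and sample values are hypothetical.

```python
# Minimal sketch of accuracy measures under simple random sampling without
# replacement (an assumption for illustration; complex designs need other
# variance formulas).
import statistics

N = 1000                          # population size (hypothetical)
sample = [12, 7, 15, 9, 11, 14]   # sampled values (hypothetical)
n = len(sample)

mean = statistics.mean(sample)
s2 = statistics.variance(sample)  # sample variance, n-1 denominator

estimated_total = N * mean
variance_total = N**2 * (1 - n / N) * s2 / n   # SRSWOR variance of the total
standard_error = variance_total ** 0.5
rse = standard_error / estimated_total * 100   # relative standard error in %

print(f"Estimate: {estimated_total:.0f}, SE: {standard_error:.0f}, RSE: {rse:.1f}%")
```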
More information on weighting and estimation is available in the Basic Survey Design Manual (Chapter 11) at www.nss.gov.au. Weighting for surveys with complex designs can be very involved, so it is useful to consult a statistician.