Basic Survey Design
Introduction

Descriptive statistics are summary measures which are calculated from observations in the population and are used to provide information about the distribution of particular variables in the population. In the case of responses from a sample of the population, the statistics are used to make inferences about the population.

Forming Classes

Table: Employed Persons Retrenched, October 1990-93, Victoria
In this case, the population covered by the data is Victorian persons aged between 18 and 65 years who were employed between October 1990 and October 1993. The variable we are looking at is 'age', and the values of age are grouped into ranges which cover all possible ages in the population. Frequency counts for the ranges are created by counting the number of people that fall into the relevant age range. For example, there were 20,700 responses in the range 50 to 54 years old. The frequency counts add up to the total number of Victorian persons aged between 18 and 65 years who were employed between October 1990 and October 1993, which is 244,400.

It should be noted that, in a census, the frequency counts will be close to the actual number of responses in the particular ranges, with only a small adjustment needed for non-sampling error. In a sample survey, the initial frequency counts will only be a tally of those in the sample, and will need to be inflated to provide estimates for the whole population. This process is called weighting the data.

Producing a set of tables which cross-classify the key variables can give a clear picture of the data trends, and can therefore help determine the type of statistical analysis that could be undertaken on the data. A simple way to summarise the relationship between two variables is to produce a bivariate (two-way) frequency table.

Table: Employed Persons Retrenched, October 1990-93, Victoria
For more information on tables and other forms of data presentation, see chapter 12, Presentation of Results.

Frequency Histograms

Once the class frequencies have been produced, the distribution can be represented graphically by a histogram. Sometimes, instead of plotting frequencies, we plot relative frequencies, which show the percentage of the population within each class interval.

Outliers

Summarising the data can be complicated if there are observations which appear to be inconsistent with the remainder of the data set. It is important to query whether such an observation actually belongs to the data or whether it is a computing, clerical or measurement error. An outlier is an observation that has a major effect on an estimate and which, because of its independently known atypical nature, needs special treatment. Before choosing the treatment for an outlier it is essential to know the reason for its occurrence and whether the outlier provides any information about the population of interest. Some of the techniques available for treating outliers are given in Section 11.4.

Measures of Location (also known as Measures of Central Tendency)

It is often desirable to have summary measures to indicate the location of a frequency distribution on some sort of scale. Often the scale involved in the analysis is a time scale. Such measures help the researcher build up a picture of the distribution and facilitate analysis. Summary measures also enable the comparison of frequency distributions before and after a specified event (e.g. the number of car accidents before and after a change in traffic laws). A change in the measure can indicate a shift in the frequency distribution. The most common measures of location are the median, mean and mode.

Mean

Population Mean

\mu = \frac{1}{N} \sum_{i=1}^{N} X_{i}

where X_{i} is the observed value of the ith member of the population and N is the total population count.

Sample Mean

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}

where x_{i} is the observed value of the ith member of the sample and n is the total number of units in the sample.
For example, the mean of the set of numbers from a sample: 2, 7, 9, 11, 14 is 8.6. The mean is used in many statistical tests (e.g. testing differences between groups) and it is possible to calculate standard errors and construct confidence intervals for the mean. In general it is the most stable measure of location but can be badly affected by extreme values.

Median

The median is the middle value when values are sorted into order of size. If there is an even number of values in the set, the median is the average of the middle two values. For example, the median of the set of numbers: 2, 7, 9, 11, 14 is 9, while the median of the set of numbers: 2, 6, 7, 9, 11, 14 is 8. It is a good measure of location for non-symmetrical data as, in such cases, it is more central than the mean and is not affected by extreme values. It is often used in social science research, particularly in areas of housing prices and income. When analysing samples it is difficult to construct confidence intervals for the median due to the complexities in defining the sampling distribution (the distribution of the estimate of the median over a number of samples).

Percentiles

A measure of location that is linked to the median is the concept of percentiles. A percentile is a value at or below which a given percentage of the data lies. The 50th percentile is also the median, as one half of the population lies below it. Two other important percentiles are the 25th percentile, known as the lower quartile boundary, and the 75th percentile, known as the upper quartile boundary. The lower and upper quartile boundaries for the set of numbers: 2, 6, 7, 9, 11, 14 are 6 and 11. Percentiles can only be formed for quantitative variables.

Splitting the population into percentiles enables some comparison of the characteristics of units in each percentile. For example, the average annual income of wage and salary earners in each quartile could be compared, rather than calculating one overall average.
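The figures quoted above for the mean, median and quartile boundaries can be checked with a short Python sketch. The function name `quartile_boundaries` is hypothetical, and it assumes that each quartile boundary is taken as the median of the lower or upper half of the sorted data. That convention reproduces the values 6 and 11 given in the text; other percentile conventions give slightly different answers.

```python
import statistics


def quartile_boundaries(values):
    """Return the (lower, upper) quartile boundaries of a set of values.

    Assumption: each boundary is the median of the lower or upper half
    of the sorted data (the middle value is left out when the count is odd).
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("at least two values are needed")
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)


odd_set = [2, 7, 9, 11, 14]
even_set = [2, 6, 7, 9, 11, 14]

print(statistics.mean(odd_set))       # 8.6
print(statistics.median(odd_set))     # 9
print(statistics.median(even_set))    # 8.0
print(quartile_boundaries(even_set))  # (6, 11)
```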
Percentiles are also very useful for comparing changes in characteristics of a population over time. For example, by forming income quartiles for 1986 and 1990 we can determine whether the income share of wage and salary earners in each quartile has changed over time.

Mode

The mode of a frequency distribution is the most frequently occurring value. The mode, however, is not necessarily unique and this can cause problems in measuring the 'centre point' of our values. On the other hand, a measure of centre that is not required to take only one value can tell us more about the data than a measure like the mean or the median. In general the mean and median are better measures of location; however, the mode is useful when the values are unevenly spread (e.g. a two-peak distribution).

Measures of Spread (also known as Measures of Variation)

In summarising datasets it is also important to know the variability (spread) of the values, i.e. how spread out the values are around the 'centre'. A measure of location does not provide us with this information, so it has to be supplemented with a measure of spread. The common measures of variability are the range, variance and standard error or standard deviation.
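As an illustration of the measures of spread named above, the sketch below computes the range, variance and standard deviation for the sample 2, 7, 9, 11, 14 used earlier. It assumes the sample form of the variance (divisor n - 1), which is what Python's `statistics.variance` and `statistics.stdev` compute; the helper name `value_range` is hypothetical.

```python
import statistics


def value_range(values):
    """Range: the difference between the largest and smallest values."""
    return max(values) - min(values)


sample = [2, 7, 9, 11, 14]

spread = value_range(sample)            # 14 - 2
variance = statistics.variance(sample)  # sample variance, divisor n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

print(spread)              # 12
print(round(variance, 2))  # 20.3
print(round(std_dev, 2))   # 4.51
```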
Other Estimates

In addition to estimating means it is often of interest to measure other statistics such as totals, proportions, percentiles or even minimum and maximum values. For instance, the total turnover of retail sales for businesses in Australia or the proportion of unemployed in particular regions may be of interest.

Further Information

Statistical analysis can comprise any summarising or presentation of the data, from interpreting confidence intervals about basic summary measures calculated from the survey data to more complex hypothesis testing using such techniques as contingency table analysis, log-linear modelling, regression analysis and time series analysis. Further information about analysis can be obtained from any of the standard texts in your library. However, consulting experienced statisticians is recommended to determine the most appropriate analysis techniques.

The arrangement of the values of a variable is called the distribution of the variable. For example, the percentages of a group of people in different age groups are called the percentage distribution of the variable 'age'. If the actual numbers or frequencies in different age groups are presented instead of percentages, it is called a frequency distribution. Similarly, the distribution which shows the probability that someone will fall into a particular age group is called its probability distribution. A probability distribution therefore shows the chance or probability that the value of a variable will lie in different areas within its range. The curve which shows the probability distribution is called a probability density curve.

In a sample survey, we are always required to make estimates of certain population parameters. This is done in order to make inferences about the population as a whole. Good estimators are generally unbiased.
In other words, theory indicates that across all possible samples the average of the sample estimates is equal to the population value, regardless of the sample size. A good estimator will also have a low variance and thus be very close to the population parameter we wish to estimate no matter which units are included in the sample that we take.

Note that, to avoid drawing spurious conclusions, great care must be taken to weight and aggregate the data correctly. Weighting is the process whereby each unit in the sample has its response inflated to represent the response from all similar units in the population. The weight of a unit reflects the proportion of the population that the sampled unit represents. The weight allocated to each sample observation depends on the process used to select the sample.

The simplest form of weighting is where a simple random sample (SRS) of size n is selected from a known population of size N. If we observe a sequence of n observations y_{1}, ..., y_{n} from a population of size N, then the number-raised estimator for the population total is the sample total multiplied by the ratio of population size to sample size (N/n), that is, \hat{Y} = \frac{N}{n} \sum_{i=1}^{n} y_{i}. Our number-raised estimate is unbiased, as the average of the estimates over all possible samples is the true population total.

The basis of this method is that the average (mean) of a sample is the best estimate of the mean of a population. So if we want to find the average population of Melbourne suburbs, we select a representative sample of suburbs and take the average of this sample as our estimate of the average population in Melbourne suburbs. Then, to obtain the estimate of the total population in Melbourne, the average population in suburbs is multiplied by the total number of suburbs.

Business Example cont'd

Remember that each business has a one in ten chance of selection.
If the total turnover from the sample of 10 cafe and restaurant businesses is $5 million, then the number-raised estimate of total turnover from the population of cafe and restaurant businesses in the City of Melbourne is $5 million × 10 = $50 million.

Advantages

This form of estimation is easy to use and does not require any benchmark information. It is relatively simple to calculate and its variance formula is known.

Disadvantages

Number-raised estimation has problems in that it produces a large sampling error compared to ratio estimation and is badly affected by unrepresentative samples.

Instead of calculating population values from sample values by inflating them by the ratio of the number of population units to the number of sample units, ratio estimation uses a ratio of population to sample totals based on some other variable. For example, it may be useful in a survey of job vacancies to use a ratio of total employment in the stratum to total employment in the selected firms, rather than simply the ratio of total number of firms to selected number of firms. This other variable is known as the benchmark or auxiliary variable. For it to be effective, this variable should be highly correlated with our variable of interest and needs to be known for all units in the population.

If we define y_{i} to be our variable of interest and x_{i} as our benchmark variable, then our ratio estimate is:

\hat{Y}_{R} = X \frac{\sum_{i=1}^{n} y_{i}}{\sum_{i=1}^{n} x_{i}}

(where X is the population total for the auxiliary variable). The average of all possible sample estimates will not be exactly equal to the true value. Thus our ratio estimate is biased.

Business Example cont'd

Total employment was known to be a useful auxiliary variable in estimating total turnover of the cafe and restaurant businesses in the City of Melbourne and is known for every business in the population. The total employment for the population was found to be 1,500 people and the total employment for the sample was found to be 100 people.
The calculation of the estimate of total turnover from the population uses the ratio of population to sample totals based on total employment (1,500 / 100 = 15) as its weight. The ratio estimate of total turnover for cafe and restaurant businesses in the City of Melbourne is therefore $5 million × 15 = $75 million.

Advantages

The value of ratio estimation is that it decreases the standard errors of the estimates when the benchmark variable is highly correlated with the variable of interest. The ratio estimate also remains relatively unaffected by unrepresentative samples.

Disadvantages

As we have seen, ratio estimates have the problem that they are biased. This means that, for small samples, the estimates derived may be systematically larger (or smaller) than they should be. Ratio estimates can be less accurate than number-raised estimates if the auxiliary variable has a low correlation with the variable of interest. As a result of poor correlation, ratio estimates can also be adversely affected by outliers (unusual observations) in either the variable of interest or the benchmark variable.

An observation should only be treated as an outlier if:
Example
In a sample survey from a population, an outlier is rarely treated by removal. This is because every unit provides some information about the population, as the unit is itself a member of the population. Letting such a unit represent itself and no other unit is a common way of treating an outlier.
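The number-raised and ratio estimates from the business example, together with the outlier treatment just described, can be sketched in Python as follows. The function names and the unit-level turnover and employment figures are hypothetical, chosen only so that the sample totals match the $5 million turnover and 100 employees quoted in the text; a population of N = 100 businesses is implied by the one in ten selection chance and the sample of 10. The outlier function assumes that the remaining n - 1 sampled units are re-weighted to represent the other N - 1 population units, which is one common convention rather than something the text specifies.

```python
def number_raised_total(sample_y, population_size):
    """Sample total inflated by N / n."""
    return sum(sample_y) * population_size / len(sample_y)


def ratio_total(sample_y, sample_x, population_x_total):
    """Sample total of y inflated by X / (sample total of x)."""
    return sum(sample_y) * population_x_total / sum(sample_x)


def outlier_adjusted_total(sample_y, population_size, outlier_index):
    """Number-raised total in which one outlier represents only itself.

    Assumption: the outlier gets weight 1 and the other n - 1 sampled
    units share the remaining N - 1 population units equally.
    """
    rest = [y for i, y in enumerate(sample_y) if i != outlier_index]
    rest_estimate = sum(rest) * (population_size - 1) / len(rest)
    return sample_y[outlier_index] + rest_estimate


# Hypothetical unit data (turnover in $'000, employment in persons).
turnover = [200, 300, 400, 500, 500, 500, 600, 600, 700, 700]  # total 5,000
employment = [4, 6, 8, 10, 10, 10, 12, 12, 14, 14]             # total 100

print(number_raised_total(turnover, 100))       # 50000.0, i.e. $50 million
print(ratio_total(turnover, employment, 1500))  # 75000.0, i.e. $75 million

# Hypothetical sample of 4 from a population of 41, last unit an outlier.
print(outlier_adjusted_total([10, 10, 10, 100], 41, 3))  # 500.0
```

For comparison, treating the last unit like any other in the small hypothetical sample would give 130 × 41 / 4 = 1,332.5, which shows how strongly a single atypical unit can drive a number-raised estimate.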
