# Sampling errors and confidence intervals

## Calculating confidence intervals around sample-based survey estimates

LFS data is used to make inferences about the whole population. When data obtained from a sample is used in this way, the sample estimate carries an element of sampling error, or uncertainty. Sampling error arises because the chosen sample is only one of a very large number of samples that could have been chosen, each giving rise to a different sample estimate. All LFS-based population estimates are therefore subject to sampling error, since they are based on a sample of individuals rather than the whole population.

Using statistical theory, it is possible to say how precise a population estimate is by constructing a confidence interval around it. In the absence of bias, the interval shows the range of values within which the true population value is expected to lie (ie the value that would have been found if the entire population had been surveyed). Confidence intervals based on LFS sample estimates are presented as 95% confidence intervals. In practice, this means that in 19 samples out of 20 we would expect the true value to lie within the 95% confidence interval constructed from that sample. A 95% confidence interval for a population estimate spans about +/-2 standard errors around the estimate calculated from the sample (where the standard error is a measure of how much estimates vary from one sample to another).

When the estimate is based on very few sample cases, the confidence interval can include a negative lower confidence limit. For these cases, the lower limits have been set to 0.
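The interval construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `confidence_interval_95`, the `multiplier` parameter and the figures in the usage lines are hypothetical and are not taken from the published tables. The sketch uses the approximate multiplier of 2 standard errors given in the text and sets a negative lower limit to 0.

```python
def confidence_interval_95(estimate, standard_error, multiplier=2.0):
    """Return (lower, upper) for an approximate 95% confidence interval.

    The interval is estimate +/- multiplier * standard_error. A negative
    lower limit is set to 0, as described for estimates based on very few
    sample cases.
    """
    if standard_error < 0:
        raise ValueError("standard_error must be non-negative")
    half_width = multiplier * standard_error
    lower = max(0.0, estimate - half_width)  # truncate at zero
    upper = estimate + half_width
    return lower, upper


# Hypothetical figures, for illustration only.
print(confidence_interval_95(100_000, 10_000))  # lower limit is positive
print(confidence_interval_95(12_000, 7_000))    # lower limit truncated to 0
```

Note that truncating the lower limit makes the interval asymmetric around the estimate, so the half-width should not be read back from the lower limit alone.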

## Reliability thresholds

The main factor that determines the size of the confidence interval around an estimate is the size of the group for which the estimate is being derived: the smaller the group, the (proportionally) less precise the estimate. For estimates based on fewer than 30 sample cases (which in recent years equates to an incidence/prevalence estimate of about 25,000 cases), confidence intervals should be quoted in preference to the prevalence or incidence central estimate or rate; these figures are shown in italics within the tables. To reflect some of the variability in the days lost estimates (measured from person to person) as well as the sample numbers involved, confidence intervals should also be quoted for days lost estimates and rates based on fewer than 40 cases taking time off; these are also shown in italics. Estimates based on fewer than 20 sample cases (which in recent years equates to an incidence/prevalence of about 15,000 cases) are not published, as they are likely to be unreliable.
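The thresholds above amount to a small decision rule, sketched here in Python. The function name `publication_status` and the labels it returns are hypothetical. The sketch also assumes that the 20-case suppression rule applies to days lost estimates as well as to incidence/prevalence estimates, which the text does not state explicitly.

```python
def publication_status(sample_cases, measure="prevalence"):
    """Classify an estimate by the number of sample cases behind it.

    measure is "prevalence" (incidence/prevalence estimates and rates) or
    "days_lost" (days lost estimates and rates, where sample_cases is the
    number of cases taking time off).
    """
    if measure not in ("prevalence", "days_lost"):
        raise ValueError("measure must be 'prevalence' or 'days_lost'")
    if sample_cases < 0:
        raise ValueError("sample_cases must be non-negative")

    # Fewer than 20 sample cases: not published.
    # Assumption: applied to both kinds of measure.
    if sample_cases < 20:
        return "not_published"

    # Below 30 cases (40 for days lost): quote the confidence interval
    # in preference to the central estimate (shown in italics in tables).
    interval_threshold = 40 if measure == "days_lost" else 30
    if sample_cases < interval_threshold:
        return "quote_interval"

    return "publish_estimate"


for cases in (15, 25, 35, 45):
    print(cases, publication_status(cases), publication_status(cases, "days_lost"))
```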

## Calculating confidence intervals around sample-based survey estimates of change

Estimates of change from one year to the next are calculated simply by subtracting the earlier year's estimate from the latest year's estimate. However, just as annual survey estimates are subject to uncertainty or sampling error, so too are estimates of change. Where two independent estimates are being compared, the standard error of the estimate of change is calculated as:

Standard error (estimate1 - estimate2) = √(variance(estimate1) + variance(estimate2))

A 95% confidence interval around the population estimate of change is about +/-2 standard errors around the estimate of change calculated from the sample. If the 95% confidence interval excludes zero, then we describe the difference as "statistically significant" at the 5% level (ie if there were no real change, a difference this large would arise from sampling error alone less than 5% of the time). All tests of statistical significance of change are made at the 5% level.
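As a worked illustration of the calculation for change, the sketch below takes the change as the latest estimate minus the earlier one, combines the two standard errors using the formula for independent estimates, and checks whether the resulting interval excludes zero. The function name `change_with_interval` and all figures are hypothetical. The variance of each estimate is taken to be the square of its standard error.

```python
import math


def change_with_interval(latest, se_latest, earlier, se_earlier, multiplier=2.0):
    """Return (change, se_change, lower, upper, significant).

    Assumes the two estimates are independent, so that
    SE(change) = sqrt(variance(latest) + variance(earlier)).
    """
    change = latest - earlier
    se_change = math.sqrt(se_latest ** 2 + se_earlier ** 2)
    lower = change - multiplier * se_change
    upper = change + multiplier * se_change
    # Significant at the 5% level if the interval excludes zero.
    significant = lower > 0 or upper < 0
    return change, se_change, lower, upper, significant


# Hypothetical figures, for illustration only.
print(change_with_interval(120_000, 3_000, 100_000, 4_000))
print(change_with_interval(105_000, 3_000, 100_000, 4_000))
```

Unlike the interval around a single estimate, the lower limit is not truncated at zero here, because a change can legitimately be negative.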

## Improving the precision of sample estimates by pooling annual data

One way of increasing the reliability of survey data is to increase the sample size on which it is based. Whilst the annual sample size is fixed, several years' worth of data can be pooled to produce estimates for the average of the combined years. Injury and ill health measures by demographic and employment-related variables are generally presented in this way by pooling three years' worth of data. Results by occupation and industry, where the number of sample cases at the detailed levels tends to be low, are also presented as five-year averages. The formulae for the different measures of injury and illness remain the same as for annual estimates (see 'The different measures of work-related ill health, workplace injury and working days lost'), except that estimates are now based on annual average results over three or five years, rather than on results for a single year.
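To show why pooling improves precision, the sketch below averages several annual estimates and derives a standard error for that average. It rests on an assumption that is not stated in the text: that the annual estimates are independent of one another, so the variance of the average is the sum of the annual variances divided by the square of the number of years. The function name `pooled_average` and the figures are hypothetical, and the published pooled estimates may be derived differently.

```python
import math


def pooled_average(estimates, standard_errors):
    """Return (average, standard_error) for a pooled annual average.

    Assumption (not from the source): the annual estimates are
    independent, so Var(average) = sum(Var_i) / n**2.
    """
    n = len(estimates)
    if n == 0 or n != len(standard_errors):
        raise ValueError("need one standard error per annual estimate")
    average = sum(estimates) / n
    se_average = math.sqrt(sum(se ** 2 for se in standard_errors)) / n
    return average, se_average


# Hypothetical three-year example: each year has a standard error of 6,000.
average, se = pooled_average([90_000, 100_000, 110_000], [6_000, 6_000, 6_000])
print(average, round(se))
```

Under this assumption, pooling three years with similar standard errors reduces the standard error by a factor of about √3, which narrows the confidence interval around the average accordingly.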