International Journal of Epidemiology 2000;29:536-541
© International Epidemiological Association 2000
How many data sources are needed to determine diabetes prevalence by capture-recapture?
a Division of Tropical Medicine, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK.
b University Department of Medicine, University Hospital Aintree, Liverpool L9 7AL, UK.
c School of Health, Liverpool John Moores University, 79 Tithebarn Street, Liverpool L2 2ER, UK.
Reprint requests to: Dr GV Gill, Department of Medicine, University Hospital Aintree, Lower Lane, Liverpool L9 7AL, UK. E-mail: G.Gill{at}liv.ac.uk
| Abstract |
|---|
Background Capture-recapture (CR) methods are increasingly used to estimate the size of human populations, including those with diabetes. Few studies have examined the demographic details needed to match patients on the lists used in these techniques, or to determine the optimum number of lists.
Methods Six lists of known diabetic patients attending different medical settings during the study year were obtained. The effects on total enumeration after aggregation of these lists were examined using increasing numbers of demographic data items as patient identifiers. The CR estimates of prevalence were obtained using 15 different combinations of two lists. Estimates were obtained after log-linear modelling for interdependence between different combinations of three and four lists, and after combining the six available lists into three logical lists.
Results For matching patients, adding date of birth to first name and family name as matching criteria increased the total of identified patients from 2500 to 2585 (3% increase), corresponding to a period prevalence of 1.5% (95% CI : 1.41–1.52). Addition of further identifiers, such as partial postcode, only increased the estimate by a further 15 patients (0.5%), and more detailed matching with full postcode introduced uncertainty. The use of two-list CR yielded widely varying estimates of the total diabetic population from 1379 (95% CI : 435–2273) to 9554 (95% CI : 7291–10 983). Log-linear modelling using different combinations of three and four lists produced estimates of 5074 (95% CI : 4417–5947) and 5578 (95% CI : 4918–7081), respectively, after compensating for statistical interdependence between the lists used. The appropriate condensation of six available lists into three lists for modelling yielded estimates of 5492 (95% CI : 4870–6285), corresponding to a CR-adjusted period prevalence of 3.1% (95% CI : 3.03–3.19%).
Conclusions In a Western population, the only demographic data required for matching patients on lists used for CR methods are first name, family name and date of birth, if unique identifiers such as social security numbers are not available. Two lists alone do not produce reliable data, and at least three lists are needed to allow for modelling for dependence between datasets. The use of more than three lists does not substantially alter the absolute value or confidence of enumeration, and multiple lists (if available) should be condensed into three lists for use in CR calculations.
Accepted 1 December 1999
| Introduction |
|---|
Capture-recapture (CR) methods are gaining popularity for determining the incidence and prevalence of diabetes,1,2 and are supported by the World Health Organization.1 The techniques use multiple data sources to estimate the size of the total population concerned, and rest on four basic assumptions: that the sources are independent; that each individual has the same chance of being included in each source; that all individuals can be identified and matched; and that the population does not change. In CR studies, two, three or four data sources are commonly used.3,4 Estimation of the prevalence of Type 2 diabetes by CR methods is particularly appropriate, as care of patients is typically split between hospital clinics and primary care facilities. To estimate the prevalence of diabetes, the data sources usually used include lists from general practitioners, family physicians or other health professionals, hospital admissions or discharges, specialist clinic lists, diabetes registers, prescription lists and membership lists from local diabetic associations. The choice of data sources is important in determining the accuracy of the estimates of prevalence.
Issues sometimes arise on how to fulfil the mathematical assumptions, and what will happen if any are violated.5 The optimal number of data sources that should be used has not been adequately explored. The aim of this study was to examine the effects of using different numbers of data sources, and the effects of using dependent data sources, on CR population estimates, using data collected to estimate the prevalence of diabetes.
| Subjects and Methods |
|---|
The study was conducted in an urban area of North Liverpool with an estimated mid-year population of 176 682.6 The target population was all known people with diabetes, alive during a one-year study period from October 1995 to September 1996. Diabetes was diagnosed clinically using WHO criteria.7 The sources of data used to identify patients with diabetes were general practices (GP) in the area, outpatients attending the hospital Diabetes Centre, hospital discharge data, patients attending a hospital Retinal Clinic, a research list of stroke inpatients with diabetes, and a list of children with diabetes attending a local children's hospital. The information used to create each of the six lists of patients was family name, first name, postcode, date of birth and sex.
The information collected was entered on to a computer using the database software Epi-Info version 6.04.8 The analysis used SPSS for Windows9 and Generalised Linear Interactive Modelling (GLIM) 4.10 The data were checked and corrected for key-punching errors. Records without the information mentioned above were removed from the list.
Cases were matched by using the patient identifiers (i.e. family name, first name, date of birth and postcode). This was done by using a sort and aggregate command in SPSS for Windows. This process identified cases that appeared on single, double or multiple lists.
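The matching step can be sketched in a few lines. This is a minimal illustration only, not the SPSS procedure used in the study: the function name `aggregate_cases`, the field names and the records are all hypothetical and invented for the example. It shows how the number of distinct cases depends on which identifiers form the matching key.

```python
from collections import defaultdict

def aggregate_cases(lists, key_fields):
    """Merge records from several lists; records sharing a key count as one case.

    `lists` maps a list name to an iterable of record dicts (hypothetical layout).
    Returns a dict mapping each key tuple to the set of list names it appears on.
    """
    cases = defaultdict(set)
    for list_name, records in lists.items():
        for record in records:
            # Normalise case and whitespace so trivial typing differences still match.
            key = tuple(str(record[f]).strip().lower() for f in key_fields)
            cases[key].add(list_name)
    return dict(cases)

# Invented records, for illustration only.
example_lists = {
    "GP": [
        {"family": "Smith", "first": "John", "dob": "1940-02-01"},
        {"family": "Jones", "first": "Ann", "dob": "1955-07-19"},
    ],
    "DiabetesCentre": [
        {"family": "SMITH", "first": "John", "dob": "1940-02-01"},
        {"family": "Smith", "first": "John", "dob": "1962-11-30"},
    ],
}

two_keys = aggregate_cases(example_lists, ["family", "first"])
three_keys = aggregate_cases(example_lists, ["family", "first", "dob"])
print(len(two_keys), len(three_keys))  # prints: 2 3
```

With names alone, the two different John Smiths collapse into one case; adding date of birth separates them, which is the same effect that raises the aggregated count when a third identifier is used.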
Using two-list CR, the total diabetic population with 95% CI can be determined using formulae described elsewhere.11,12 For three or more lists, log-linear modelling13,14 was used to determine the number of missing cases. The observed number of cases was used as the dependent variable, and presence or absence on each list as the independent variables. The analysis began by fitting the simplest model and continued until a model was found whose residual deviance (referred to the χ² distribution) indicated that the fitted values did not differ significantly from the observed data. GLIM provides fitted values under the model for the number of individuals in each observed category and in the unobserved category. If the model provides a reasonable fit for all the observed data, then the expected value of the 'missing cases' may be extracted directly. These statistical techniques have been fully described and validated elsewhere.14 The 'missing cases' are the cases in the population that have not been captured by any source. The total diabetic population was calculated by adding the number of missing cases to the aggregated cases identified by all the lists.
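For three lists, some log-linear models give the unobserved cell in closed form, which makes the idea of 'extracting the missing cases' concrete. The sketch below is not the GLIM fitting procedure described above: it covers only two special cases with known closed-form solutions, the function names are hypothetical, and the counts are invented. Cell keys are three characters giving presence (1) or absence (0) on lists 1, 2 and 3.

```python
def missing_all_pairwise(n):
    """Unobserved cell under the model with all two-way interactions
    and no three-way interaction."""
    return (n["111"] * n["100"] * n["010"] * n["001"]) / (
        n["110"] * n["101"] * n["011"]
    )

def missing_cond_indep_given_first(n):
    """Unobserved cell when lists 2 and 3 are independent given list 1
    (a model with the interactions L1.L2 and L1.L3 only)."""
    return n["010"] * n["001"] / n["011"]

def total_population(n, missing):
    """Total = cases observed on at least one list + estimated missing cases."""
    return sum(n.values()) + missing

# Invented counts for the seven observable cells, for illustration only.
counts = {
    "111": 10, "110": 20, "101": 20, "011": 20,
    "100": 60, "010": 40, "001": 40,
}

m_pairwise = missing_all_pairwise(counts)
m_cond = missing_cond_indep_given_first(counts)
print(m_pairwise, total_population(counts, m_pairwise))
print(m_cond, total_population(counts, m_cond))
```

The two models give different totals from the same observed counts, which is why the choice of model is made on goodness of fit and not fixed in advance.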
The 95% CI for missing cases was estimated using a software macro in GLIM. For each value of the interval, the change in log-likelihood above minimum is equal to 3.84.15 A similar range of intervals was then calculated for the total estimated population. The prevalence of diabetes was determined by dividing the number of cases by the total population in the group or subgroup.
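The prevalence calculation is a simple ratio, sketched below with the figures reported in this paper. The text does not state how the prevalence intervals were computed, so the normal (Wald) approximation used here is an assumption; the function name is hypothetical.

```python
import math

def prevalence_with_ci(cases, population, z=1.96):
    """Prevalence as a proportion with a normal-approximation (Wald) interval.

    The Wald interval is an assumption made for this sketch.
    """
    p = cases / population
    se = math.sqrt(p * (1 - p) / population)
    return p, p - z * se, p + z * se

# Aggregated known cases (2585) and the CR estimate (5492), population 176 682.
for cases in (2585, 5492):
    p, lo, hi = prevalence_with_ci(cases, 176682)
    print(f"{cases}: {100 * p:.2f}% ({100 * lo:.2f}-{100 * hi:.2f})")
```

Under this assumption the two calls give intervals of 1.41 to 1.52 and 3.03 to 3.19 per 100, in line with the crude and CR-adjusted six-list figures reported in the Results.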
The study was approved by the Ethics Committee of the Sefton Health Authority. Confidentiality of information was maintained at all times. Only one author (AAI) had access to the full identity of patients and this information was removed after the matching procedure.
| Results |
|---|
The total number of people identified as having diabetes from all six lists was 2585. This was determined from a pool of 25 GP lists (1469 cases), the Diabetes Centre (1252 cases), hospital admissions (454 cases), Retinal Clinic (351 cases), Children's Hospital (64 cases) and the Stroke Database (38 cases). The distribution of the cases between lists is shown in Table 1.
Matching criteria
The aggregate command in SPSS for Windows was used to match cases between lists, and the number of criteria used for matching affected the number of cases obtained. Using two criteria (family name and first name), the aggregated number of cases was 2500. Three criteria (family name, first name and date of birth) produced 2585 cases, four criteria (adding the first part of the postcode) produced 2600, and five criteria (adding both parts of the postcode) produced 2720. It was decided to use three criteria (as above) for the matching in this study. Some cases (85 of 2585, or 3%) will be missed if only the first two criteria (family name and first name) are used. Adding the first part of the postcode made little difference to the aggregated case estimate. Adding the second part of the postcode added 100 to the aggregated case estimate, but this may introduce inaccuracy, as some patients moved house within the area during the study period.
Estimation of known diabetes population
Two lists
Using various pairs of lists, the total number of the known diabetes population was estimated. Data from Table 1 were rearranged into 2 × 2 contingency tables and the total known diabetes population (N) with 95% CI was determined. There were 15 possible combinations of two lists, which provided different estimates of the total N and the number of missing cases (m), as shown in Figure 1. It was not possible to estimate the number of diabetic patients using the Children's Hospital list in combination with the other hospital-based lists because children with diabetes in the area are looked after exclusively at the local Children's Hospital and there were no duplicates between these lists. The remaining combinations yielded estimates of N varying from 9554 (95% CI : 7291–10 983) to 1379 (95% CI : 485–2273). As the number of overlapping cases (duplicates) decreased, both the estimated N and missing values (m) increased, and were considerably bigger than the number of cases identified by all lists (2585), with wide CI.
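The two-list calculation can be sketched as follows. The paper refers to formulae described elsewhere without reproducing them, so the estimator used here (Chapman's nearly unbiased form of the Lincoln-Petersen estimator, with a normal-approximation interval) is an assumption, as are the function name and the counts in the example. The list names are those of the six study lists.

```python
import math
from itertools import combinations

def two_list_estimate(n1, n2, both, z=1.96):
    """Two-list capture-recapture estimate of total population size.

    n1, n2: number of cases on each list; both: cases found on both lists.
    Uses Chapman's estimator (an assumption; the paper cites its formulae).
    """
    if both == 0:
        # No duplicates means the pair carries no information on coverage,
        # mirroring the pairs for which the text says no estimate was possible.
        raise ValueError("no overlap between lists; cannot estimate")
    n_hat = (n1 + 1) * (n2 + 1) / (both + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - both) * (n2 - both)) / (
        (both + 1) ** 2 * (both + 2)
    )
    half_width = z * math.sqrt(var)
    return n_hat, n_hat - half_width, n_hat + half_width

lists = ["GP", "Diabetes Centre", "Hospital Admission",
         "Retinal Clinic", "Children's Hospital", "Stroke Database"]
pairs = list(combinations(lists, 2))
print(len(pairs))  # prints: 15

# Invented counts, for illustration only.
n_hat, lo, hi = two_list_estimate(200, 150, 50)
print(round(n_hat), round(lo), round(hi))
```

Because the overlap appears in the denominator, a small number of duplicates inflates both the estimate and its interval, which matches the pattern described above.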
Three lists
Using the General Practice, Diabetes Centre and Hospital Admission lists, the aggregated number of the known diabetes population was 2422. Using log-linear modelling, the first model tested, permitting no interaction between the lists, fitted the data poorly (P = 0.00) (Table 2).
Four lists
The total aggregated number of patients identified by four lists (GP, Diabetes Centre, Hospital Admission and Retinal Clinic) came to 2538. Using log-linear modelling, the first model with four lists independent fitted the data poorly, as did models with one, two or three pairs related. The simplest model that fitted the data was given by an equation with the terms L1, L2, L3, L4 and the interactions L1.L2 and L2.L3.L4 (Table 3).
Six lists
With more than four lists, CR calculations become increasingly complex, and we preferred to collapse the six lists into three by combining the Diabetes Centre list with the Children's Hospital list (known as Lb), and combining the Hospital Admission list with the Retinal Clinic list and the Stroke Database (known as Lc), while the GP list remained as L1. These lists were combined based on the results of the two-list analysis, and on the basis that they had similar characteristics and would produce more homogeneous populations in terms of age. The aggregated number of cases identified by the combination of lists was 2585. These data were cross-tabulated according to the presence or absence of cases in each of the three new lists into a 2 × 2 × 2 contingency table. Once again, a model with two interactions (like 3b above) fitted the data best (P > 0.05). The estimated number of missing cases was 2907 (95% CI : 2285–3700), and this gave an estimate of the total diabetic population of 5492 (95% CI : 4870–6285).
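The step from missing cases to total population is a simple shift by the aggregated count, sketched here with the figures reported above. The function name is hypothetical.

```python
def total_from_missing(aggregated, missing, missing_ci):
    """Add the aggregated (observed) cases to the missing-case estimate
    and to each end of its confidence interval."""
    low, high = missing_ci
    return aggregated + missing, (aggregated + low, aggregated + high)

total, total_ci = total_from_missing(2585, 2907, (2285, 3700))
print(total, total_ci)  # prints: 5492 (4870, 6285)
```

The interval for the total has the same width as the interval for the missing cases, because the aggregated count is treated as known.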
Prevalence rates
The estimated mid-year population of South Sefton in 1994 was 176 682 (Merseyside Information Services, 1996). The CR estimated diabetic population ranged from 5074 to 5578, depending on the number of sources used in the calculation (Table 4). The crude period prevalence rates of known diabetes, expressed per 100, using three, four and combinations of six lists were 1.4 (95% CI : 1.32–1.43), 1.4 (95% CI : 1.38–1.49), and 1.5 (95% CI : 1.41–1.52), respectively, compared with CR estimated period prevalences of 2.9 (95% CI : 2.79–2.95), 3.2 (95% CI : 3.08–3.24) and 3.1 (95% CI : 3.03–3.19), respectively. Although there were slight differences in the estimated total diabetic population after increasing the number of lists, there were no significant differences in the prevalence rate estimates. The number of diabetic patients estimated using the first three lists was the closest to the aggregated cases (5074 versus 2585) and had the highest ascertainment rate (47.7%). These lists, however, do not include diabetic patients registered at the Children's Hospital, patients attending the Retinal Clinic or those on the Stroke Database. It was important to include these lists in the calculation, in order to have a homogeneous age distribution. Using the combination of six lists produced a prevalence rate of 3.1% (95% CI : 3.03–3.19).
| Discussion |
|---|
It is important to emphasize that our study is one of known diabetes cases; we cannot of course estimate numbers of those Type 2 diabetic patients as yet undiagnosed. We found that the crude period prevalence of known diabetes in South Sefton ranged from 1.4% (95% CI : 1.32–1.43) to 1.5% (95% CI : 1.41–1.52), and CR-adjusted rates ranged between 2.9% (95% CI : 2.79–2.95) and 3.1% (95% CI : 3.03–3.19) depending on the number of lists used. The CR-adjusted period prevalence in those <30 years was 0.4% (95% CI : 0.34–0.41) and the CR-adjusted period prevalence amongst adults aged ≥30 was 5.2% (95% CI : 5.09–5.36%). These figures are within the expected ranges of prevalence in the UK and in some parts of Europe, which range between 1.6% and 5%.16–19 We used family name, first name and date of birth for the matching process. An attempt was made to use fewer criteria (family name and first name), but this resulted in some patients having similar matching criteria. Overall, adding the date of birth as a matching criterion increased the estimate of cases by 3.6%, and adding the first part of the postcode made little further difference apart from ensuring that occasional patients from outside the study area were not included. This is a smaller difference than was observed in a study in Thailand, performed using data on drug misuse and police arrest. In that study the total number of cases varied by a factor of 45% overall, depending on whether two, three, four or five specific identifiers were used.20
The choice and quality of lists is a very important factor in obtaining good estimates, but our data show that increasing the number of lists does not necessarily produce significant improvements in prevalence estimates. Analysis using a two-list CR technique can be used as a guide to determining whether lists are independent or not,21 but two lists on their own are not sufficiently accurate in most circumstances. As our results show (see Results, Two lists), widely varying and erroneous estimates of population may be obtained, depending on excessive or inadequate overlap. Fulfilling the second assumption of valid CR, i.e. that subjects have an equal chance of appearing on each list, is always difficult, and represents the problem of dependency, which is very important. Our Children's Hospital list is a good example, as this list is highly independent from all but the GP lists. Children are unlikely to develop retinopathy and therefore will not appear on the Retinal Clinic list, and in our area are admitted to their own hospital rather than the local adult hospital. Similarly, the lists of Hospital Admission and Stroke Database, and Hospital Admission and Retinal Clinic, were positively dependent, so that CR estimates using these were far less than the actual identified cases.
Our results demonstrate interactions between lists, the degree of which may differ in different age groups. In the sex-specific analyses, the simplest model that fitted the data was also the model which had an interaction between GP, Diabetes Centre and Hospital Admission lists. When data were further divided into age groups, the interaction between lists diminished. The simplest model that fitted most age group data was the model in which there was no interaction between lists.
If lists are dependent, three choices can be made: to discard the lists, to combine them, or to use log-linear modelling.14 In our study, we combined several lists to produce three lists for CR, which we believe is appropriate in a medical setting, because of difficulties in finding other independent and reliable lists, and also because calculations are simpler. However, even combined three-list CR cannot entirely remove dependency problems, and thus the use of log-linear modelling is vital to overcome such difficulties.
In summary, our analysis has shown that relatively few patient identifiers are needed to match patients in the lists used for CR in Western populations.22,23 Full name and date of birth appear sufficient, though of course a unique identification number (such as national insurance number or social security code) would make identification much simpler. Our data suggest that three lists are sufficient. Interdependence is still a potential problem, but appropriate combination of lists and log-linear modelling can overcome this. Complex analysis of more than three lists does not appear to add more precision to estimates of population size. The narrow CI of numbers of cases and prevalence estimates we obtained would appear to support this. Although we have concerned ourselves specifically with the enumeration of a diabetic population, we believe that our findings are applicable to the use of CR methods in other medical areas.
| Acknowledgments |
|---|
We thank Miss Liz Harsnape, Mr Keith Jones, Ms Susan Kerr, Dr Colin Smith, Dr Anil Sharma, Mr Kevin McDonald, and Professor Ronald LaPorte, as well as the staff of the Walton Hospital Diabetes Centre. This work was part of a PhD project undertaken by Dr A Ismail at the University of Liverpool, funded by the University of Science, Malaysia.
| References |
|---|
1 WHO Study Group. Prevention of diabetes mellitus. WHO Tech Rep Ser 844. Geneva: WHO, 1994.
2 McCarty DJ, Tull ES, Moy CS, Kwoh CK, LaPorte RE. Ascertainment corrected rates: application of capture-recapture methods. Int J Epidemiol 1993;22:559–65.
3 Bruno G, LaPorte RE, Merletti F, Biggeri A, McCarty D, Pagano G. National Diabetes Programs, application of capture-recapture to count diabetes? Diabetes Care 1994;17:538–55.
4 Wadsworth E, Shield J, Hunt L, Baum D. Insulin dependent diabetes in children under 5: incidence and ascertainment validation for 1992. Br Med J 1995;310:700–03.
5 Papoz L, Balkau B, Lellouch J. Case counting in epidemiology: limitation of methods based on multiple data sources. Int J Epidemiol 1996;25:474–78.
6 Office for National Statistics (ONS). ONS Population and Health Monitor. London: HMSO, 1998.
7 WHO. Diabetes mellitus: report of a WHO Study Group. WHO Tech Rep Ser 727. Geneva: WHO, 1985.
8 Dean AG, Dean JA, Coulombier D et al. EpiInfo Version 6: A Word Processing Database and Statistics Program for Epidemiology on Microcomputers. Atlanta: Centers for Disease Control and Prevention, 1994.
9 Norusis MJ. SPSS for Windows Base System User's Guide, Release 6.0. Chicago: SPSS Inc., 1993.
10 Francis B, Green M, Payne C. GLIM 4: The Statistical System for Generalized Linear Interactive Modelling. New York: Oxford Science Publications, 1993.
11 LaPorte RE, McCarty D, Bruno G, Tajima N, Baba S. Counting diabetes in the next millennium: application of capture-recapture technology. Diabetes Care 1993;16:528–34.
12 LaPorte RE. Assessing the human condition: capture-recapture techniques. Br Med J 1994;308:5–6.
13 Bishop YMM, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975, pp. 229–56.
14 Cormack RM. Log-linear models for capture-recapture. Biometrics 1989;45:395–413.
15 Cormack RM. Interval estimation for mark-recapture studies of closed populations. Biometrics 1992;48:567–76.
16 Unwin N, Alberti KGMM, Bhopal R, Harland J, Watson W, White M. Comparison of the current WHO and new ADA criteria for the diagnosis of diabetes mellitus in the three ethnic groups in the UK. Diab Med 1998;15:554–57.
17 McKeigue PM, Pierpoint P, Ferrie JE, Marmot MG. Relationship of glucose intolerance and hyperinsulinaemia to body fat pattern in South Asian and Europeans. Diabetologia 1992;35:785–91.
18 Currie CJ, Peters JR. Estimation of unascertained diabetes prevalence: different effects on calculation rates and resource utilisation. Diab Med 1997;13:477–81.
19 Gatling W, Hill RD. General characteristics of a community-based diabetic population. Pract Diabetes 1988;5:104–07.
20 Mastro TD, Kitayaporn D, Weniger BG. Estimating the number of HIV-infected injection drug users in Bangkok: a capture-recapture method. Am J Public Health 1994;84:1094–99.
21 International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: history and theoretical development. Am J Epidemiol 1995;142:1047–58.
22 Squires NF, Beeching NJ, Schlecht BJM, Ruben SM. An estimate of the prevalence of drug misuse in Liverpool and a spatial analysis of known addiction. J Public Health 1995;17:103–09.
23 Devine M, Syed Q, Tocque K, Bellis M. Capture-recapture estimates of whooping cough in the Merseyside area. Commun Dis Public Health 1998;1:121–25.