WWW-publications from the WHO MONICA Project
April 1999¹
Vladislav Moltchanov², Kari Kuulasmaa² and Jorma Torppa² for the WHO MONICA Project³
¹ Misprints in Table 4 (column for Source of data) were corrected on 4 April 2000. In Table 4 for MCC 28, RU 1, years 82-84, the Reported population size was corrected on 3 September 2002. The correction had no implications for the results of the quality assessment, which had been done initially using the correct data.
² MONICA Data Centre, National Public Health Institute, Helsinki, Finland
³ Annex: Sites and key personnel of the WHO MONICA Project
This document includes the main findings of unpublished reports:
Thanks are due to Alun Evans and Hermann Wolf who commented on the text.
The MONICA Centres are funded predominantly by regional and national governments, research councils, and research charities. Coordination is the responsibility of the World Health Organization (WHO), assisted by local fund raising for congresses and workshops. WHO also supports the MONICA Data Centre (MDC) in Helsinki. Not covered by this general description is the ongoing generous support of the MDC by the National Public Health Institute of Finland, and a contribution to WHO from the National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA for support of the MDC and the Quality Control Centre for Event Registration in Dundee. The completion of the MONICA Project is generously assisted through a Concerted Action Grant from the European Community. Likewise appreciated are grants from ASTRA Hässle AB, Sweden, Hoechst AG, Germany, Hoffmann-La Roche AG, Switzerland, the Institut de Recherches Internationales Servier (IRIS), France, and Merck & Co. Inc., New Jersey, USA, to support data analysis and preparation of publications.
The population demographic data (PDD) are one of the core data components of the WHO MONICA Project (1). Their main use is as the denominator for the calculation of coronary and stroke event rates in the populations. The correctness and accuracy of the demographic data are crucial for the quality of the event rates and their trends over time.
The demographic data in MONICA were derived from routinely available local demographic statistics. Each MONICA Collaborating Centre (MCC) was to collect the best available estimates of the mid-year (30 June) population size and structure and report them annually to the MONICA Data Centre (MDC) in Helsinki.
The quality of demographic data in MONICA was first assessed in 1992 and again in 1996. During the preparation of the quality assessment reports, and after they had been distributed to the MONICA Collaborating Centres, major shortcomings of the data were corrected by several MCCs. Furthermore, many MCCs corrected the intercensal estimates after data from a new census became available. The purpose of the current report is to assess the quality of the demographic data which will be used for the estimation of ten-year trends in coronary and stroke events in the Project.
The population unit for which data were collected in MONICA is called a Reporting Unit (RU). It is defined as the residents of an area delineated by clear geopolitical boundaries. The current report covers all RUs which were considered in the quality assessment reports for coronary and stroke event registration data (2, 3). In the quality assessment reports for event registration data, small RUs were combined to form Reporting Unit Aggregates, but in the current report the individual RUs are considered. The RUs as well as their Reporting Unit Aggregates in the quality assessment reports for event registration data are listed in Table 1. Altogether 79 RUs are examined.
The report considers data for all calendar years which were available in the MDC on 15 September 1998, with a few exceptions:
The years covered are shown in Table 2.
The demographic data were collected for each calendar year by sex and 5-year age group. The overall age range is 25-64 or 25-74 years depending on whether the MCC provided data for the optional age group 65-74.
The sources of data for this report are the annual data received from the MCCs on the Population Demographics Reporting Form (Form A) and other communication with the MCCs. The data transfer form is given in Appendix 1 and the instructions for completing it in the MONICA Manual (4).
The quality assessment will address two main aspects of the demographic data:
To summarise the findings, a score was defined for each data based consistency (DBC) indicator. These scores were then combined into a summary score. Each score has values:
The scores and their definitions will be introduced in the following sections.
The data were transferred to the MDC on paper forms (see Appendix 1). Once the data had been received at the MDC and keyed into the MONICA database, routine checks were made for consistency with the row and column totals of the forms. Furthermore, the age distribution and sex difference were visually checked for any peculiarities and compared with the previous years. Whenever an inconsistency or something unexpected was found, the problem was communicated to the MCC and a correction or explanation was sought. Such routine checking procedures minimise coding errors in completing the data collection forms and in keying the data into the computer, provided that the MCC has completed the forms and, in particular, calculated the row and column totals according to the instructions given in the MONICA Manual (4).
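The row and column total check described above can be illustrated with a minimal sketch. The layout assumed here (one row per 5-year age group, one column per sex) and the function name `check_totals` are hypothetical; the actual layout of Form A is given in Appendix 1 and is not reproduced in this sketch.

```python
def check_totals(cells, row_totals, col_totals):
    """Return a list of (kind, index, reported, computed) mismatches.

    cells      -- list of rows, each a list of counts (assumed layout:
                  one row per 5-year age group, one column per sex)
    row_totals -- total reported on the form for each row
    col_totals -- total reported on the form for each column
    """
    problems = []
    # Compare each reported row total with the sum of the keyed cells.
    for i, (row, reported) in enumerate(zip(cells, row_totals)):
        computed = sum(row)
        if computed != reported:
            problems.append(("row", i, reported, computed))
    # Compare each reported column total with the sum down the column.
    for j, reported in enumerate(col_totals):
        computed = sum(row[j] for row in cells)
        if computed != reported:
            problems.append(("column", j, reported, computed))
    return problems


# Illustrative numbers only, not MONICA data.
cells = [[1200, 1250], [1100, 1180], [900, 990]]
row_totals = [2450, 2280, 1980]  # third total has two digits transposed
col_totals = [3200, 3420]
print(check_totals(cells, row_totals, col_totals))
# prints [('row', 2, 1980, 1890)]
```

A mismatch only flags that either a cell or a total was mis-keyed or mis-calculated; as in the procedure above, resolving it still requires going back to the MCC.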
It is important that the demographic data refer to exactly the same population as the event registration. The MCCs were asked for this in site visit forms (which were self-completed by most MCCs in 1986) and in a questionnaire in 1987. Furthermore, for a number of MCCs the issue was double checked more recently in connection with the definition of eligibility for the MONICA surveys (5). To our knowledge the demographic and event registration data refer to exactly the same populations in all RUs.
According to the MONICA Manual (4), the population demographic data should represent the best available estimates of the mid-year (30 June) population size and structure of each RU. This was also checked with the MCCs in 1987, and later re-checked with some MCCs. In some RUs the routinely available demographic data refer to another date, typically 1 January or 31 December, but mid-year estimates were entered in the MONICA database. The current data in the MONICA database have the following exceptions:
For MCCs 39, 46, 47 and 49 the intercensal estimates were produced in the MDC, based on census data provided by the MCCs. The reference dates of demographic data for each year were determined by the reference date of the census. A closer examination of these data has shown that they could be used for analyses, since in each case re-calculating the data into mid-year estimates would have a negligible effect.
We will use the term system for demographic data collection (SDDC) for the full process through which the demographic data are collected in a population. It includes components such as the individual enumeration and collecting the data in a census, population registers and the processing of the data to get the final published reports on the annual figure and characteristics of the population. More details on the general concepts and examples from individual countries can be found elsewhere (6). Furthermore, a brief description of the system for many countries is maintained in the World Wide Web by the International Monetary Fund (7).
For estimates of the population size for the years when census data are not available we will use the following terminology (6, p26).
For our purposes the essential characteristics of an SDDC are:
In some countries the direct count of the target population is provided by a census, while in others an annual report is produced from a computerised population register which contains records of all individuals in the population and is updated continuously. In many countries and local areas a special agency registers births, deaths and immigration/emigration events, and thus provides useful information for the inter- and postcensal estimates. Sometimes the data on births, deaths and migration are only available at the level of large administrative units. For example, migration statistics may be collected for a big city but not for its districts. In such a case the intercensal estimates for the districts may be biased. The extent to which directly measured vital and migration data are used for intercensal estimation varies from country to country. In extreme cases, the intercensal estimates are based exclusively on data from two subsequent censuses, and no other information is used. A fresh census always provides a good basis for revising the previously made estimates.
The SDDCs which provided demographic data for the MONICA RUs can be roughly classified into three categories:
Population register (PR) involves:
Register adjusted by census (RC) involves:
Census/intercensal estimates (CE) involves:
The classification of the RUs into these three categories of SDDC is shown in Table 2. A more specific description of the SDDC for problematic RUs is given in Section 10. Among the 79 RUs considered, 18 had a PR, 31 had an RC and 30 had a CE. Note that the borderlines between the categories are often obscure, and our knowledge of the actual SDDC in each RU is not necessarily complete. No effort was put into obtaining the details of the SDDC except in the cases where it was essential for deciding whether or not the Summary Score (see Section 7) should be zero.
It is obvious that there can be, and there are, errors in the figures coming from the population registers and censuses, but we have no means to check their quality in most of the MONICA populations. In general, in well-established systems, they provide reasonably good estimates of the population size. Therefore, we will assume that the quality of the population figures produced by registers and censuses is reasonably good, and consider the ability of the different SDDCs to provide reliable population estimates outside the census years.
Well established individual based population registers (PR) usually give good and reliable population estimates.
When the data are based on intercensal estimation (CE) or on a register using census data for adjustment (RC), the frequency of censuses is important for the accuracy of the population figure. If there is a census every five years, the intercensal estimates can be fairly accurate half-way between the censuses. When the time span between censuses is ten years, the intercensal estimates become less accurate, but can usually still be acceptable. If additional supportive measurements, such as so-called micro-censuses, are performed during the intercensal period, the credibility of the SDDC will be increased.
RC uses an annual or regular direct count of components, allowing for annual changes in population size. This makes the estimates even more reliable. RC systems may evolve into PR systems. In general, RCs may have longer intercensal periods than CE systems. However, we have found that in some RUs in Germany an RC system accumulated an error of up to 20% in 17 years, as compared with the census data in 1987.
The census years in the RUs where the SDDC was CE or RC are given in Table 2.
Among the 61 RUs where a census is used either for adjustment of the register or for updating the intercensal estimates, 10 have a census every five years, 3 have a census every eight years, and 30 have a census every 10 years. Six RUs in Germany had the last two censuses in 1970 and 1987. In most Centres there was a census around 1990. In most of these the census data are available, and updated intercensal estimates for the previous years were sent to the MDC. In one MCC (MONICA East Germany) the last census was performed in 1981.
The postcensal years (i.e. the years after the last census for which census data are already available) are of particular concern. A 5-year span after the last census has a potential for a much higher error than the year in the middle of a 10-year intercensal period.
Several MCCs had their last census in 1996, but the census data are not yet available from the local demographic agencies/statistical offices.
The number of years reported by the MCCs since the last census year for which data are already available in the MDC is given in Table 2 (column "years after last census"). For 20 RUs this time span after the last census is 6 years or more.
The main part of the data based quality assessment is checking the consistency of population dynamics using the demographic data. This is approached by modelling the yearly changes of the one year birth cohort sizes. The modelling provides estimates for one-year cohort sizes and annual net migration rates, which:
The detailed mathematical description of the method is given in Appendix 2.
Our main interest in the results of the modelling is in the estimates of the net migration rates for the 5-year age groups. A very high or irregular pattern of migration rates over calendar years and age groups indicates possible problems in the data.
The results of the modelling are given in Table 4. The table gives the following information for each sex and RU:
By calendar year:
In addition, the table shows the maximum absolute values of certain rows and columns to help to identify large values.
| S_MIGR = | 2 | if SDDC = PR or MIGR =< 6 |
| | 1 | if 6 < MIGR =< 10 |
| | 0 | if MIGR > 10 |
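As a minimal sketch, the thresholds in the table above can be written as a small function. The function name and the encoding of the SDDC category as the strings "PR", "RC" and "CE" are assumptions made for illustration; the manual upgrading of the score when a good explanation exists is not modelled.

```python
def s_migr(migr, sddc):
    """Migration score from the MIGR indicator and the SDDC category.

    migr -- value of MIGR on the scale used in the thresholds above
    sddc -- "PR", "RC" or "CE" (string encoding is an assumption)
    """
    if sddc == "PR" or migr <= 6:
        return 2
    if migr <= 10:
        return 1
    return 0


# Illustrative values only.
print(s_migr(12.0, "PR"), s_migr(8.0, "RC"), s_migr(12.0, "CE"))
```

Note that a population register always scores 2 regardless of MIGR, which reflects the view stated elsewhere in the report that irregular figures from a true register are still regarded as reliable.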
If there is a good explanation for a high MIGR, S_MIGR was modified to a higher value. Such explanations are given in Section 10.
The values of MIGR and S_MIGR for men and women are shown in Table 3. Four RUs got value "0" for S_MIGR; two of them were modified to "1".
During the preparation of the first quality report it was found that in some MCCs the intercensal estimates were obtained by linear interpolation within the age groups. Such a simple procedure is logically inconsistent and usually gives an incorrect population figure. To identify the RUs where such a simplified intercensal interpolation was used, a Linearity Index LIN was defined in two steps:
100×LIN is then approximately the average deviation in percent of the actual population size from its linear estimate. If the population sizes for all years lie exactly on a straight line, LIN will be zero.
Table 3 shows the results of the linearity test for the data reported. In the column "Linearity charts" the patterns for men and women represent the value of LIN for the 8 age groups 25-29, 30-34, .... , 60-64 years using symbols:
| "*" | for LIN =< 1.5 |
| rounded value of LIN | for 1.5 < LIN =< 8.5 |
| "9" | for LIN > 8.5 |
Linearity score S_LIN is then defined as:
| S_LIN = | 2 | if Na =< 5 |
| | 1 | if 5 < Na =< 8 |
| | 0 | if 9 =< Na |
where Na is the number of age/sex groups for which LIN =< 1.5 (i.e. Na is the number of symbols '*').
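The chart symbols and the linearity score can be sketched as follows. This is only an illustration under stated assumptions: the input values are taken to be on the same scale as the thresholds above, the tie rule for the "rounded value" is not given in the text (Python's built-in `round` is used here), and the function names are hypothetical.

```python
def lin_symbol(lin):
    """Chart symbol for one age/sex group."""
    if lin <= 1.5:
        return "*"
    if lin <= 8.5:
        # Tie rule is an assumption: round() rounds halves to even.
        return str(round(lin))
    return "9"


def s_lin(lin_men, lin_women):
    """Return (chart_men, chart_women, Na, S_LIN) for 8 + 8 age groups."""
    chart_men = "".join(lin_symbol(v) for v in lin_men)
    chart_women = "".join(lin_symbol(v) for v in lin_women)
    # Na counts the '*' symbols over all age/sex groups.
    na = (chart_men + chart_women).count("*")
    if na <= 5:
        score = 2
    elif na <= 8:
        score = 1
    else:
        # Na of 9 or more is scored 0 here; treating Na = 9
        # this way is an assumption.
        score = 0
    return chart_men, chart_women, na, score


# Illustrative values only, for age groups 25-29, ..., 60-64.
men = [0.5, 1.0, 2.4, 3.6, 9.2, 1.5, 4.4, 7.7]
women = [2.0] * 8
print(s_lin(men, women))
```

A high count of '*' symbols means that many age/sex groups follow a nearly straight line over time, which is what the score penalises.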
In the case of a good explanation for a low value of the linearity index, S_LIN was modified to a higher value. Such explanations are given in Section 10.
The linearity checking, if applied to intercensal periods, is good for detecting inappropriate intercensal estimation. A few such cases were detected in the early years of MONICA, and the intercensal estimation was redone properly. For most RUs, the current demographic data include many intercensal periods, and therefore the linearity checking is not as powerful as before. Nevertheless, for simplicity, the linearity checking was applied to the full period of data, but whenever S_LIN was "1" or "0", the linearity was checked more thoroughly and the results are reported in Section 10.
In the current checking no RU got value "0" for S_LIN, and none of the 3 RUs with score "1" appeared to use a simple linear interpolation.
This test is aimed at detecting situations when sex-specific figures for each calendar year were not available, but were generated by applying a constant male-female ratio. Such a situation may arise if the sex-specific data are available for a larger age group only. The test uses the ratios of the population sizes for men and women within the 5-year age groups in each calendar year. We calculated the standard deviation of the logarithms of the ratios over the age groups by calendar year. The Sex Ratio Index (SRI) was defined as the second largest standard deviation over all calendar years, and is expressed as a percentage (i.e. multiplied by 100).
A low value of SRI indicates little variation in the sex ratio between the age groups. If the value is 1%, it is very likely that a constant sex ratio was applied.
The results considering age groups 25-29,...., 60-64 years are shown in Table 3. A Sex Ratio Score S_SRI was defined as:
| S_SRI = | 2 | if SRI > 4 |
| | 1 | if 4 >= SRI > 2 |
| | 0 | if 2 >= SRI |
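The Sex Ratio Index and its score, as described above, can be sketched in a few lines. The text does not say whether natural logarithms or the sample standard deviation were used; both are assumptions here, as are the function names and the dictionary layout of the input.

```python
import math
import statistics


def sex_ratio_index(men, women):
    """SRI in percent.

    men, women -- dict mapping calendar year to a list of population
                  sizes for the 5-year age groups 25-29, ..., 60-64
                  (at least two calendar years are needed)
    """
    sds = []
    for year in men:
        # Log of the male/female ratio in each age group
        # (natural log is an assumption).
        logs = [math.log(m / w) for m, w in zip(men[year], women[year])]
        # Sample standard deviation over age groups (an assumption).
        sds.append(statistics.stdev(logs))
    sds.sort(reverse=True)
    return 100 * sds[1]  # second largest over calendar years


def s_sri(sri):
    """Sex Ratio Score from the thresholds above."""
    if sri > 4:
        return 2
    if sri > 2:
        return 1
    return 0


# Illustrative data only: women generated from men with a constant ratio,
# which is the situation the test is designed to detect.
men = {1985: [1000, 2000, 1500], 1986: [1010, 1990, 1520]}
women = {y: [1.05 * n for n in sizes] for y, sizes in men.items()}
sri = sex_ratio_index(men, women)
print(round(sri, 6), s_sri(sri))
```

Taking the second largest rather than the largest standard deviation makes the index insensitive to a single calendar year with unusual data.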
A value of 0 or 1 of S_SRI raises the need for a more detailed analysis of all relevant information. No RU had S_SRI="0", but it was "1" in three RUs (in MCCs 10, 11 and 28). These are considered in more detail in Section 10.
After the results of a census became available, most of the MCCs where the SDDC was CE or RC replaced postcensal estimates of the demographic data in the MDC by intercensal estimates (and new postcensal estimates for the years after the new census). For those MCCs, the difference between the postcensal estimates and the intercensal estimates can be used as an indicator of the potential error in the postcensal estimates of the last years of the Project. Even if such estimates of the error cannot be used directly to correct the data, they bring information about the quality of the system used for the estimation of the population size. Table 5.1 lists such MCCs and RUs, together with the years near a census where such replacement of estimates was done, the year of the previous census on which the postcensal estimates were based, and the magnitude of the correction of the demographic data.
Using the information on the correction, Potential Error (PER) was defined as:
PER = SQRT [hanging span/(correction year - previous census year)] × correction,
where
(The square root in the formula corresponds to the assumption that there is an annual random error in the postcensal estimates. Alternatively, without the square root, we would assume that there is a constant annual error. The difference between the two assumptions is very small from the point of view of our conclusions.)
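The formula and the alternative without the square root can be compared with a short sketch. The parameter meanings are assumptions made for illustration: `hanging_span` is taken as the number of years between the last census with available data and the last reported year, and `correction` as the magnitude of the correction in percent. The function name is hypothetical.

```python
import math


def potential_error(correction, correction_year, previous_census_year,
                    hanging_span, random_error=True):
    """PER under the random-error (square root) or constant-error assumption."""
    ratio = hanging_span / (correction_year - previous_census_year)
    # Independent annual errors accumulate like the square root of the
    # elapsed time; a constant annual error accumulates linearly.
    factor = math.sqrt(ratio) if random_error else ratio
    return factor * correction


# Illustrative values only: a 6% correction found 10 years after the
# previous census, scaled to a hanging span of 5 years.
print(potential_error(6.0, 1991, 1981, 5))
print(potential_error(6.0, 1991, 1981, 5, random_error=False))
```

When the hanging span is shorter than the period over which the correction accumulated, the square root version gives the larger, and therefore more cautious, value.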
The values of PER for the last year for which the demographic data are available in the MDC are shown in Table 5.2, separately for men and women. The table also includes the RUs where census data were used but the MDC does not have both postcensal and intercensal estimates for any year. For such RUs, it was assumed:
This assumption gives a relatively large correction, which however, does not lead to a score of zero (see below) if the hanging span is five years or less.
A Potential Error Score (S_PER) is defined as:
| S_PER = | 2 | if PER =< 5 or SDDC = "PR" |
| | 1 | if 5 < PER =< 10 |
| | 0 | if 10 < PER |
Table 5.2 shows the values of S_PER, separately for men and women for the last year for which the demographic data are available, and a combined score, which is defined as the minimum of the scores for men and women.
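The sex-specific and combined Potential Error Scores can be sketched as below. The function names and the string encoding of the SDDC category are hypothetical.

```python
def s_per(per, sddc):
    """Potential Error Score for one sex."""
    if sddc == "PR" or per <= 5:
        return 2
    if per <= 10:
        return 1
    return 0


def s_per_combined(per_men, per_women, sddc):
    """Combined score: the minimum of the scores for men and women."""
    return min(s_per(per_men, sddc), s_per(per_women, sddc))


# Illustrative values only.
print(s_per_combined(4.2, 7.5, "CE"))
```

Taking the minimum means that a problem in either sex is enough to lower the combined score.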
For the RUs where S_PER is zero, Table 5.2 also shows the last calendar year for which S_PER would be more than zero.
The results of the quality assessment are summarised in a Summary Score (SS) which is defined in two steps using the scores defined above.
First, an intermediate Data Based Consistency Score (S_DBC) is defined as the minimum of S_MIGR, S_LIN and S_SRI.
The results of S_DBC (DBC score) are shown in Table 3.
Only 2 RUs have score 0 (in MCCs 25 and 27).
The Summary Score (SS) is defined as the minimum of S_DBC and S_PER.
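The two-step definition can be written compactly. This is a minimal sketch with a hypothetical function name; the inputs are the four scores defined in the preceding sections, each taking the value 0, 1 or 2.

```python
def summary_score(s_migr, s_lin, s_sri, s_per):
    """Return (S_DBC, SS) from the four component scores."""
    # Step 1: data based consistency score.
    s_dbc = min(s_migr, s_lin, s_sri)
    # Step 2: summary score.
    ss = min(s_dbc, s_per)
    return s_dbc, ss


# Illustrative values only: consistent data but a long postcensal period.
print(summary_score(2, 2, 2, 0))
```

Because both steps take a minimum, SS is zero as soon as any single component score is zero.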
The values of SS are given in Table 6. This table also shows the contributing scores. If SS is zero, but it would be one if the last years of the data were not considered, such a "Last year with SS = 1" is also given in the table.
Five MCCs have at least one RU with SS zero. They are:
The main use of demographic data in MONICA is as the denominator for estimating coronary and stroke event rates in the population. Our experience of working with the demographic data in MONICA shows that the range of possible errors in the population size of a 5-year age group may be over 20%, and in extreme cases up to 50%. Usually, some weighted aggregates rather than 5-year age group indicators are used for event rates. In this case an error may be up to 20%. The MONICA hypotheses explore 10-year trends in event rates. It is clear that a bias of 20% in the trend of the population size makes any such estimates irrelevant. The purpose of the demographic data quality assessment was to identify and quantify problems with the data.
Since the earlier versions of the demographic data quality assessment report, intensive work has been done, both in the MCCs and the MDC, to resolve problems that have been detected in the data and to learn more about the systems for collecting the data in the different countries. Such descriptive information on the sources of the data is crucial for making inferences on data reliability. If we know for sure that a demographic figure with high irregularity and sudden jumps came from the "true" population register, and in addition there is an explanation for the irregularity, this figure is regarded as fully reliable.
Despite the efforts made to investigate the systems for collecting the data, at least for MCCs from which the data were questionable, we do not know all the details for all MCCs. In many cases the official statement: "the estimates were produced using the vital statistics and allowing for migration in and out of the population" covers a wide range of techniques for obtaining the population estimates in different MONICA RUs. This is confirmed by the experience we obtained during site visits to Novosibirsk and Moscow. We consider the information currently available in the MDC as adequate and sufficient for quality assessment of the MONICA data.
Since the previous versions of the report, the data based checking procedures have been improved. The methods for detecting linearity and equality of the sex-specific population figures are now more adequate than formerly. The approach for checking the consistency of the population dynamics has not been changed, but the computer algorithm has been revised to allow the processing of much longer periods of calendar years than during the early stages of the project.
For some MCCs, where intercensal estimates were not readily available, the estimates were made in the MDC using the census data provided by the MCC. For this purpose, a technique was developed, based on an inversion of the method used for checking the consistency of the demographic dynamics.
In general, the quality of demographic data improved significantly over the years. It became possible to solve major problems and uncertainties in the quality of the data for some years in several MCCs, and hence to improve the quality of their data for all years. Furthermore, more accurate population estimates became available in many MCCs after their latest censuses. At this final stage, five MCCs have RUs with Summary Score SS = 0, suggesting major concern about the quality of the data (see Table 6). Two of these five MCCs also have major limitations in the utility of their event registration data available for MONICA. In the remaining three MCCs the data show good internal consistency, but the low score is due to the long postcensal period for the last reported years. The data may be good, but we lack information to confirm this. For two of these three MCCs, only one year of MONICA event registration is affected. In the remaining MCC (MCC 23), where the uncertainty about the demographic data is most problematic, the last official census was carried out in the year 1981.
The following list includes only the RUs with specific findings or exceptional background information relevant to the use of the data. The list is by MCC code.