This paper was originally published in Bulletin of the International Statistical Institute: Proceedings of the 51st Session; 1997 Aug 18-26; Istanbul, Turkey. Voorburg: International Statistical Institute; 1997. Book 1; 295-8. The International Statistical Institute has granted the authors permission to reproduce it on the World Wide Web.
The motivation for the current paper comes from the WHO MONICA Project (1988). It is a population based monitoring study for cardiovascular diseases, involving about 40 populations in 21 countries. The objectives include:
The members of the populations are those of a certain age with their permanent residence in a geographically defined area. Membership of the populations therefore changes over time. There is no follow-up of individuals.
The objectives concern changes at the population level. It is therefore natural that the units of analysis for objective 3 are the populations rather than individuals. Nevertheless, we do have individual level data for estimating the incidence and risk factor trends, and therefore have statistical information about the accuracy of the data for the units of analysis of objective 3.
In the study, particular attention was paid to quality control and quality assessment of the data collection. This provided information, even though mostly qualitative, on any shortcomings in the standardization and quality of the data.
The data from the study will soon be available for the final analyses. This paper considers statistical issues related to estimating the trends in incidence, the trends in risk factors and the association between these, taking into account the known information about the accuracy of the trend estimates. Earlier developments on the same topic have been considered by Dobson et al. (1996).
We define the annual age-specific incidence rate simply as $r_t = e_t/n_t$, where $e_t$ is the number of disease events in year $t$ and $n_t$ is the mid-year population size. The simplest model for a trend in the incidence rate is

$$\log r_t = \alpha + \beta t, \qquad (1)$$

which implies directly that the change rate of the incidence rate is $\beta$. For the relatively short periods in populations like those in MONICA, it is hardly worth trying to model non-linearity in the change rate of the incidence rate.
Two basic approaches for estimating the trend ($\beta$) are the generalized linear model assuming Poisson variation in the number of annual events, and the ordinary linear regression of $\log r_t$ on $t$. Both approaches usually yield a very similar estimate of $\beta$, but the standard errors differ. If the observed annual rates lie nearly on a straight line, which is quite possible by chance for data of 10 years or less, the standard error of $\hat\beta$ from ordinary regression becomes much smaller than is reasonable given the randomness in which individuals get the disease. Poisson regression takes this randomness into account and therefore never gives such small standard errors. On the other hand, Poisson regression assumes that all variation around the straight line comes from the Poisson variation, which is not necessarily the case. Therefore, we have used the regression model

$$\log r_t = \alpha + \beta t + \varepsilon_t, \qquad (2)$$

where the variance of the error term has two components, $\operatorname{Var}(\varepsilon_t) = \sigma_t^2 + \sigma^2$, with $\sigma_t^2$ representing the Poisson variation and $\sigma^2$ the variance of possible additional deviation from the regression line. Estimating $\sigma_t^2$ by $1/e_t$ and considering it fixed, the parameters can be estimated conveniently using the algorithm described in the Appendix.
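The contrast between the two standard errors can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the event counts are hypothetical, the function names are invented for this example, and the "Poisson-based" fit is a weighted regression of the log rate that treats 1/e_t as the known variance, which approximates but is not identical to a Poisson generalized linear model.

```python
import math

def wls_line(t, y, var):
    """Weighted least squares fit of y = a + b*t with known error variances.

    Returns (a, b, se_b); se_b treats the supplied variances as the true ones.
    """
    w = [1.0 / v for v in var]
    sw = sum(w)
    t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
    sxy = sum(wi * (ti - t_bar) * (yi - y_bar) for wi, ti, yi in zip(w, t, y))
    b = sxy / sxx
    return y_bar - b * t_bar, b, math.sqrt(1.0 / sxx)

def ols_line(t, y):
    """Ordinary least squares; the slope's standard error comes from the residuals."""
    n = len(t)
    a, b, _ = wls_line(t, y, [1.0] * n)
    rss = sum((yi - a - b * ti) ** 2 for ti, yi in zip(t, y))
    t_bar = sum(t) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    return a, b, math.sqrt(rss / (n - 2) / sxx)

# Hypothetical data: ten years in which the counts happen to fall
# almost exactly on a log-linear curve.
years = list(range(10))
population = [50000] * 10
events = [100, 97, 94, 91, 89, 86, 84, 81, 79, 76]
log_rate = [math.log(e / n) for e, n in zip(events, population)]

_, b_ols, se_ols = ols_line(years, log_rate)
_, b_wls, se_pois = wls_line(years, log_rate, [1.0 / e for e in events])

print("slope (OLS):          ", b_ols, " SE:", se_ols)
print("slope (Poisson-based):", b_wls, " SE:", se_pois)
```

With counts this close to a straight line, the two slopes nearly coincide, while the residual-based standard error is far smaller than the one implied by Poisson variation alone.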
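When comparing incidence between populations in wide age groups, it is common to consider age standardized incidence rates:

$$r_t = \frac{\sum_k u_k\, e_{kt}/n_{kt}}{\sum_k u_k}, \qquad (3)$$

where $k$ refers to age group, $e_{kt}$ and $n_{kt}$ are the number of events and the mid-year population size in age group $k$ in year $t$, and $u_k$ is the size of the so-called standard population. A trend in the age standardized incidence rate can be estimated as described above, but using, in place of $1/e_t$, the estimate of the Poisson variance of $\log r_t$

$$\hat\sigma_t^2 = \frac{\sum_k u_k^2\, e_{kt}/n_{kt}^2}{\left(\sum_k u_k\, e_{kt}/n_{kt}\right)^2}. \qquad (4)$$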
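A minimal Python sketch of direct age standardization for a single year follows. It assumes independent Poisson counts in the age groups and uses a delta-method approximation for the variance of the log rate, which reduces to 1/e_t when there is only one age group. The function name and the numbers are hypothetical.

```python
def age_standardized(events, population, standard):
    """Directly age-standardized rate for one year, together with the
    approximate Poisson variance of its logarithm (delta method).

    events[k], population[k]: events and mid-year population in age group k
    standard[k]: size of the standard population in age group k
    """
    total_u = sum(standard)
    rate = sum(u * e / n for u, e, n in zip(standard, events, population)) / total_u
    # Var(e_k) is estimated by e_k under the Poisson assumption.
    var_rate = sum(u * u * e / (n * n)
                   for u, e, n in zip(standard, events, population)) / total_u ** 2
    return rate, var_rate / rate ** 2

# Hypothetical example with three age groups.
rate, var_log_rate = age_standardized(
    events=[12, 40, 95],
    population=[30000, 28000, 25000],
    standard=[12000, 11000, 8000],
)
print(rate, var_log_rate)
```

The returned variance can be used in place of 1/e_t when regressing the log of the standardized rate on calendar year.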
Dobson et al. (1996) calculated the change in a risk factor between two surveys on independent samples from the same population as the difference between the risk factor mean values divided by the distance between the surveys. This approach, however, has a number of complications:
These complications can be solved in a simple way by pooling the data from each survey and calculating the trend from the individual observations by simple regression on the date of examination.
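The pooled regression can be sketched in Python as follows. This is an illustration only: the examination dates (in decimal years) and blood pressure values are hypothetical, the function name is invented here, and no adjustment for age, sex or sampling design is attempted.

```python
def pooled_trend(exam_dates, values):
    """OLS slope of individual measurements on examination date,
    pooling all surveys of one population.

    Returns (slope per year, standard error of the slope).
    """
    n = len(values)
    d_bar = sum(exam_dates) / n
    v_bar = sum(values) / n
    sxx = sum((d - d_bar) ** 2 for d in exam_dates)
    slope = sum((d - d_bar) * (v - v_bar)
                for d, v in zip(exam_dates, values)) / sxx
    intercept = v_bar - slope * d_bar
    rss = sum((v - intercept - slope * d) ** 2
              for d, v in zip(exam_dates, values))
    return slope, (rss / (n - 2) / sxx) ** 0.5

# Hypothetical individuals from three surveys of the same population.
dates = [1985.2, 1985.4, 1985.7, 1990.1, 1990.3, 1990.8, 1995.0, 1995.5, 1995.6]
sbp = [138.0, 131.0, 142.0, 135.0, 129.0, 137.0, 128.0, 133.0, 126.0]

slope, se = pooled_trend(dates, sbp)
print("trend (mmHg per year):", slope, "SE:", se)
```

Because each individual enters with the actual date of examination, surveys of unequal length or with uneven recruitment over time need no special handling.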
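This is the statistically most challenging part of the analysis. For simplicity, we will consider the case of only one risk factor, say systolic blood pressure. A straightforward approach is provided by the linear regression model:

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad (5)$$

where $y_i$, $i = 1, \ldots, n$, are the estimated trends in incidence in the $n$ populations, $x_i$ are the estimated trends in blood pressure, and $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. We do, however, have additional relevant information: the standard errors of the incidence and risk factor trend estimates for each population. Therefore, instead of $y_i$ and $x_i$ we observe

$$Y_i = y_i + \delta_i, \qquad X_i = x_i + \gamma_i, \qquad (6)$$

where $\delta_i$ and $\gamma_i$ are independent $N(0, \sigma_{yi}^2)$ and $N(0, \sigma_{xi}^2)$, respectively. We can consider the $\sigma_{yi}^2$ and $\sigma_{xi}^2$ known, having the values of the variances of the trend estimates. The regression model now becomes:

$$Y_i = \alpha + \beta X_i + \varepsilon_i^*, \qquad (7)$$

where the $\varepsilon_i^*$ are independent $N(0, s_i^2)$, with $s_i^2 = \sigma^2 + \sigma_{yi}^2 + \beta^2 \sigma_{xi}^2$.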
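The step from (5) and (6) to (7) can be spelled out as a short derivation. The notation is an assumption made for this sketch: $Y_i$ and $X_i$ denote the observed trend estimates, $\delta_i$ and $\gamma_i$ their estimation errors with known variances $\sigma_{yi}^2$ and $\sigma_{xi}^2$, and $\varepsilon_i$ the regression error with variance $\sigma^2$; all three error terms are taken to be mutually independent.

```latex
\begin{align*}
Y_i &= y_i + \delta_i
     = \alpha + \beta x_i + \varepsilon_i + \delta_i \\
    &= \alpha + \beta (X_i - \gamma_i) + \varepsilon_i + \delta_i \\
    &= \alpha + \beta X_i + (\varepsilon_i + \delta_i - \beta \gamma_i), \\[4pt]
\operatorname{Var}(\varepsilon_i + \delta_i - \beta \gamma_i)
    &= \sigma^2 + \sigma_{yi}^2 + \beta^2 \sigma_{xi}^2 .
\end{align*}
```

The error variance therefore differs between populations and depends on the unknown slope $\beta$, which is why the estimation has to be iterative.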
There is yet additional relevant information for the analysis. Although we are happy with the quality of the data from most of the populations, there is a concern with the quality of some of them:
Our information on the bias involved in these concerns is mostly qualitative. One way of using this information is to exclude from the analysis the populations whose data we are concerned about. Although this is a safe approach, it would also lead to a loss of relevant information, in particular in analyses involving several risk factors. As a solution to this problem, we suggest assigning each population a weight $W_i$, based on the quality of the data, which determines the contribution of the population to the analysis. The model to be considered is then the same as (7) above, but with the error terms independent $N(0, s_i^2/W_i)$, where $s_i^2$ is the error variance in (7). The unknown parameters $\alpha$, $\beta$ and $\sigma^2$ of the model can be estimated conveniently using the algorithm described in the Appendix.
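How a quality weight changes a population's contribution can be illustrated with a small Python sketch. It rests on an assumption made for this example: the weight W_i divides the error variance, so that the variance for population i is (sigma2 + var_y + beta^2 * var_x) / W_i and each population contributes in proportion to the inverse of that variance. The function names and numbers are hypothetical.

```python
def error_variances(beta, sigma2, var_y, var_x, weights):
    """Assumed per-population error variance: (sigma2 + var_y + beta^2 var_x) / W."""
    return [(sigma2 + vy + beta ** 2 * vx) / w
            for vy, vx, w in zip(var_y, var_x, weights)]

def relative_contributions(variances):
    """Inverse-variance weights, normalised to sum to 1."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [x / total for x in inv]

# Two hypothetical populations with equally precise trend estimates;
# the second has a quality weight of 0.5.
v = error_variances(beta=0.5, sigma2=0.1,
                    var_y=[0.2, 0.2], var_x=[0.4, 0.4],
                    weights=[1.0, 0.5])
print(relative_contributions(v))
```

Under this assumption a weight of 0.5 halves the population's influence relative to an otherwise identical population with weight 1, rather than excluding it altogether.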
This paper uses the MONICA project as reference, but the issues addressed are relevant to a large variety of population based disease and risk factor monitoring. It is perhaps surprising that some of the simple approaches for estimating incidence and risk factor trends have not been in common use earlier. For a large study like MONICA or in routine monitoring it is important that the methods of estimation apply to a large variety of populations and are computationally fast.
Changes in the population trends in cardiovascular risk factors, like cholesterol or blood pressure, which may be very relevant for the prevention of the disease, are much smaller than the heterogeneity of the risk factor levels between the populations. Therefore, many methodological factors which are not crucial when comparing levels between populations may be very important for the estimation of trends within the populations. This requires that particular attention is paid not only to measures for attaining data of as high quality as possible, but also to assessing retrospectively the quality of the data actually achieved. The information on quality is important for the interpretation of the results, but it also challenges the statistician to develop methods which can use this information, which is often qualitative, in an effective and adequate way.
In this paper, weighting of populations according to data quality is suggested. In the MONICA Project, the construction of the weights is proceeding together with the final assessment of the quality of the full ten-year data, and so far the weighting scheme looks feasible. There will no doubt be details in the weighting procedures where the solution is partly arbitrary. Assessing the sensitivity of the results to such details will have to be an inherent part of applying the approach and will determine its ultimate usefulness.
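In the paper we have considered special cases of the regression model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad (8)$$

where $i = 1, \ldots, n$, the $\varepsilon_i$ are independent $N(0, s_i^2/W_i)$, $s_i^2 = \sigma^2 + \sigma_i^2$, the $\sigma_i^2$ are fixed functions of $\beta$ and the $W_i$ are fixed weights. The unknown parameters $\alpha$, $\beta$ and $\sigma^2$ can be estimated iteratively using the approaches described by Pocock et al. (1981) and Breslow (1984) for special cases. The algorithms are based on the findings that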
(9)
Writing equation (9) without the expectation, we get:

(10)

For fixed $\alpha$ and $\beta$ we can estimate $\sigma^2$ iteratively from formula (10), using the previous value of $\sigma^2$ on the right-hand side but keeping $\sigma^2 \ge 0$.
To estimate $\alpha$, $\beta$ and $\sigma^2$ we can now use the following algorithm:

(11)
The variance of the estimates of $\alpha$ and $\beta$ can be estimated from the inverse of the information matrix, assuming that the weights are fixed. We get

(12)
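One possible iteration in the spirit of the moment approach of Breslow (1984) can be sketched in Python. It is an illustration under stated assumptions, not necessarily the algorithm used in MONICA: the error variance of population i is taken to be (sigma2 + var_y[i] + beta^2 * var_x[i]) / W[i], alpha and beta are updated by weighted least squares, and sigma2 is chosen so that the weighted residual sum of squares equals n - 2, truncated at zero. All names are hypothetical, and the sketch requires n >= 3 and strictly positive var_y.

```python
def wls(x, y, var):
    """Weighted least squares line; the SE of the slope treats var as fixed."""
    w = [1.0 / v for v in var]
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
    b = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y)) / sxx
    return yb - b * xb, b, (1.0 / sxx) ** 0.5

def fit_weighted_regression(x, y, var_y, var_x, W, n_iter=50):
    """Return (alpha, beta, sigma2, se_beta) under the assumed variance model."""
    n = len(y)

    def variances(beta, s2):
        return [(s2 + vy + beta * beta * vx) / w
                for vy, vx, w in zip(var_y, var_x, W)]

    def pearson(a, b, s2):
        # Weighted residual sum of squares; decreasing in s2.
        return sum((yi - a - b * xi) ** 2 / v
                   for xi, yi, v in zip(x, y, variances(b, s2)))

    a, b, se = wls(x, y, variances(0.0, 0.0))
    s2 = 0.0
    for _ in range(n_iter):
        if pearson(a, b, 0.0) <= n - 2:
            s2 = 0.0  # no evidence of extra variation: keep sigma2 at zero
        else:
            lo, hi = 0.0, 1.0
            while pearson(a, b, hi) > n - 2:
                hi *= 2.0
            for _ in range(60):  # bisection for pearson(a, b, s2) = n - 2
                mid = 0.5 * (lo + hi)
                if pearson(a, b, mid) > n - 2:
                    lo = mid
                else:
                    hi = mid
            s2 = 0.5 * (lo + hi)
        a, b, se = wls(x, y, variances(b, s2))
    return a, b, s2, se

# Hypothetical trends for six populations.
alpha, beta, sigma2, se_beta = fit_weighted_regression(
    x=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    y=[0.0, 3.0, 1.0, 5.0, 3.0, 7.0],
    var_y=[0.01] * 6, var_x=[0.0] * 6, W=[1.0] * 6,
)
print(alpha, beta, sigma2, se_beta)
```

In this example the risk factor trends are treated as error-free and all populations are equally weighted, so the fit coincides with ordinary least squares; unequal variances or weights change both the estimates and their standard errors.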
Breslow N.E. (1984) Extra-Poisson variation in log-linear models. Appl. Statist., 33, 38-44.
Dobson A., Filipiak B., Kuulasmaa K., Beaglehole R., Stewart A., Hobbs M., Parsons R., Keil U., Greiser E., Korhonen H. and Tuomilehto J. (1996) Relations of changes in coronary disease rates and changes in risk factor levels: Methodological issues and a practical example. Am. J. Epidemiol., 143, 1025-34.
Pocock S., Cook D.G. and Beresford S.A.A. (1981) Regression of area mortality rates on explanatory variables: what weighting is appropriate? Appl. Statist., 30, 286-295.
WHO MONICA Project Principal Investigators, prepared by Tunstall-Pedoe H. (1988) The World Health Organization MONICA Project (Monitoring Trends and Determinants in Cardiovascular Disease): A major international collaboration. J. Clin. Epidemiol., 41, 105-14.
The paper considers the common situation where data on populations have been obtained by measurement of the individuals in a series of independent samples (e.g. risk factors of cardiovascular disease) and by continuous counting of events in the individuals of the entire population (e.g. population mortality or incidence), but the risk factors of those getting the events are not known. Methods are described for the estimation of changes in the population mean of the risk factors, estimation of changes in the disease incidence and estimation of the extent to which the observed incidence changes are explained by the observed changes in the risk factors when data are available from a large number of populations. The units of the last analysis are the populations rather than the individuals, but the variances of the variables (i.e. risk factor changes and incidence changes) as well as qualitative information on the quality of the data are available for each population.
Ce papier considère la situation courante selon laquelle les données ont été obtenues dans des populations par la mesure de sujets dans une série d'échantillons indépendants (par exemple, les facteurs de risque des maladies cardio-vasculaires) et par le dénombrement continu des événements survenus parmi les sujets de la population totale (par exemple, la mortalité ou l'incidence de la population), mais où les facteurs de risque des sujets ayant eu un épisode ne sont pas connus. Les méthodes sont décrites pour l'estimation des variations des facteurs de risque dans la moyenne de la population, pour l'estimation des variations de l'incidence de la maladie et l'évaluation du degré selon lequel les variations de l'incidence observée sont expliquées par les variations observées dans le niveau des facteurs de risque, quand les données sont disponibles pour un nombre élevé de populations. Les unités de la dernière analyse sont les populations plutôt que les individus, mais la variance des variables (c'est à dire les variations des facteurs de risque ou de l'incidence), ainsi que les informations qualitatives sur la qualité des données, sont disponibles pour chaque population.