This paper was originally published in Bulletin of the International Statistical Institute: Proceedings of the 51st Session; 1997 Aug 18-26; Istanbul, Turkey. Voorburg: International Statistical Institute; 1997. Book 1; 295-8. The International Statistical Institute has granted the authors permission to reproduce it on the World Wide Web.


Statistical issues related to following populations rather than individuals over time

Kari Kuulasmaa, MONICA Data Centre, Department of Epidemiology and Health Promotion, National Public Health Institute, Mannerheimintie 166, 00300 Helsinki, Finland
Annette Dobson, Hunter Region Heart Disease Prevention Programme, Centre for Clinical Epidemiology and Biostatistics, University of Newcastle, New South Wales 2308, Australia
for the WHO MONICA Project

1. Introduction

The motivation for the current paper comes from the WHO MONICA Project (1988). It is a population based monitoring study for cardiovascular diseases, involving about 40 populations in 21 countries. The objectives include:

  1. measurement of ten-year changes in the incidence rates of acute myocardial infarction and coronary deaths in each population through population based registration;
  2. measurement of ten-year changes in the population distributions of the major risk factors of smoking, blood pressure and blood total cholesterol through two or three independent sample surveys, one in the beginning, one at the end and an optional one in the middle of the ten-year period; and
  3. assessment of the extent to which the incidence changes are explained by changes in the known risk factors.

The members of each population are the people of a certain age who have their permanent residence in a geographically defined area. The membership of the populations therefore changes over time. There is no follow-up of individuals.

The objectives concern changes at the population level. It is therefore natural that the units of analysis for objective 3 are the populations rather than individuals. Nevertheless, we do have individual-level data for estimating the incidence and risk factor trends, and therefore have statistical information about the accuracy of the data for the units of analysis of objective 3.

In the study, particular attention was paid to quality control and quality assessment of the data collection. This provided information, even though mostly qualitative, on any shortcomings in the standardization and quality of the data.

The data from the study will soon be available for the final analyses. This paper considers statistical issues related to estimating the trends in incidence, the trends in risk factors and the association between these, taking into account the known information about the accuracy of the trend estimates. Earlier developments on the same topic have been considered by Dobson et al. (1996).

2. Estimating trends in incidence rates

The annual age-specific incidence rate we define simply as $r_t = e_t/n_t$, where $e_t$ is the number of disease events in year $t$ and $n_t$ is the mid-year population size. The simplest model for a trend in the incidence rate is

$$\log r_t = \alpha + \beta t \tag{1}$$

which implies directly that the relative rate of change of the incidence rate is $\beta$. For the relatively short periods in populations like those in MONICA, it is hardly worth trying to model non-linearity in the rate of change of the incidence rate.

Two basic approaches for estimating the trend ($\beta$) are the generalized linear model assuming Poisson variation in the number of annual events, and ordinary linear regression of $\log r_t$ on $t$. The two approaches usually yield very similar estimates of $\beta$, but the standard errors differ. If the observed annual rates lie nearly on a straight line, which is quite possible by chance for data of 10 years or less, the standard error of $\hat\beta$ from ordinary regression becomes much smaller than is reasonable given the randomness in which individuals get the disease. Poisson regression takes this randomness into account and therefore never gives such small standard errors. On the other hand, Poisson regression assumes that all variation around the straight line comes from the Poisson variation, which is not necessarily the case. Therefore, we have used the regression model

$$\log r_t = \alpha + \beta t + \varepsilon_t \tag{2}$$

where the variance of the error term has two components, $\mathrm{Var}(\varepsilon_t) = \tau_t^2 + \sigma^2$, with $\tau_t^2$ representing the Poisson variation and $\sigma^2$ the variance of possible additional deviation from the regression line. Estimating $\tau_t^2$ by $1/e_t$ and considering it fixed, the parameters can be estimated conveniently using the algorithm described in the Appendix.
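
As a minimal sketch of the quantities that enter model (2), the following Python computes the log rates, the Poisson variance estimates $1/e_t$ and the ordinary least-squares trend. The function name `log_rate_trend` and its arguments are hypothetical, chosen for illustration only. The sketch does not estimate the additional variance component, which is left to the Appendix algorithm.

```python
import math


def log_rate_trend(events, population, years):
    """Ordinary least-squares fit of log(e_t / n_t) on t.

    Returns (alpha, beta, tau2), where tau2[t] = 1 / e_t is the
    estimated Poisson variance of the log rate in year t.
    """
    log_rates = [math.log(e / n) for e, n in zip(events, population)]
    tau2 = [1.0 / e for e in events]  # Poisson variance of log r_t

    t_mean = sum(years) / len(years)
    y_mean = sum(log_rates) / len(log_rates)
    sxx = sum((t - t_mean) ** 2 for t in years)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(years, log_rates))

    beta = sxy / sxx                 # annual relative change in the rate
    alpha = y_mean - beta * t_mean   # log rate at t = 0
    return alpha, beta, tau2
```

A call such as `log_rate_trend([412, 398, 371], [250000, 251000, 252500], [0, 1, 2])` (made-up counts) returns the intercept, the slope and the list of $1/e_t$ values that would then be treated as fixed.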

When comparing incidence between populations in wide age groups, it is common to consider age standardized incidence rates:

$$r_t = \frac{\sum_k u_k \, e_{kt}/n_{kt}}{\sum_k u_k} \tag{3}$$

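where $k$ refers to age group, $e_{kt}$ and $n_{kt}$ are the number of events and the mid-year population size in age group $k$ in year $t$, and $u_k$ is the size of the so-called standard population in age group $k$. A trend in the age-standardized incidence rate can be estimated as described above, but estimating the Poisson component of the variance of $\log r_t$ by

$$\hat\tau_t^2 = \frac{\sum_k u_k^2 \, e_{kt}/n_{kt}^2}{\left(\sum_k u_k \, e_{kt}/n_{kt}\right)^2}. \tag{4}$$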
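
The following Python sketch shows one way to compute an age-standardized rate for a single year together with a Poisson variance for its logarithm. The variance formula is an assumption: it uses the delta-method approximation Var(log r) ≈ Var(r)/r², with each age-specific event count treated as Poisson. The function name `age_standardized_rate` is hypothetical.

```python
def age_standardized_rate(events_by_age, population_by_age, standard_population):
    """Directly age-standardized rate for one year and the approximate
    Poisson variance of its logarithm (delta method, an assumption)."""
    total_standard = sum(standard_population)

    # Weighted mean of the age-specific rates e_k / n_k with weights u_k.
    rate = sum(
        u * e / n
        for u, e, n in zip(standard_population, events_by_age, population_by_age)
    ) / total_standard

    # Var(e_k) = e_k under the Poisson assumption, so Var(e_k / n_k) = e_k / n_k^2.
    var_rate = sum(
        u ** 2 * e / n ** 2
        for u, e, n in zip(standard_population, events_by_age, population_by_age)
    ) / total_standard ** 2

    tau2 = var_rate / rate ** 2  # approximate variance of log(rate)
    return rate, tau2
```

With a single age group the variance reduces to $1/e_t$, the estimate used for the age-specific rate.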

3. Estimating trends in risk factors

Dobson et al. (1996) calculated the change in a risk factor between two surveys on independent samples from the same population as the difference between the risk factor mean values divided by the time between the surveys. This approach, however, has a number of complications:

  1. estimating the time between the surveys is not straightforward because the examinations are not necessarily distributed uniformly over the duration of the survey;
  2. the surveys may last several years, and the trend within a survey may itself carry useful information;
  3. the approach has no obvious extension to the case of three surveys instead of two.

These complications can be resolved in a simple way by pooling the data from all the surveys in a population and calculating the trend from the individual observations by simple regression on the date of examination.
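
A minimal Python sketch of this pooled regression is given below. It assumes that each examined individual contributes one examination date and one measured value, and it ignores age and sex adjustment and survey sampling design, which a real analysis would need to consider. The function name `risk_factor_trend` is hypothetical, and at least three observations on at least two distinct dates are required.

```python
import math
from datetime import date


def risk_factor_trend(exam_dates, values):
    """Simple regression of individual risk factor values on examination date.

    Returns (slope_per_year, standard_error), pooling the observations
    from all surveys of one population.
    """
    origin = min(exam_dates)
    # Time in years since the first examination.
    t = [(d - origin).days / 365.25 for d in exam_dates]

    n = len(t)
    t_mean = sum(t) / n
    v_mean = sum(values) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (vi - v_mean) for ti, vi in zip(t, values))

    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    rss = sum((vi - intercept - slope * ti) ** 2 for ti, vi in zip(t, values))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, standard_error
```

Because the regression uses the actual examination dates, it needs no separate estimate of the time between surveys and extends unchanged from two surveys to three.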

4. Estimating the association between trends in event rates and trends in risk factors

This is statistically the most challenging part of the analysis. For simplicity, we consider only the case of a single risk factor, say systolic blood pressure. A straightforward approach is provided by the linear regression model:

$$y_i = \alpha + \beta x_i + \varepsilon_i \tag{5}$$

where $y_i$, $i = 1, \ldots, n$, are the estimated trends in incidence in the $n$ populations, $x_i$ are the estimated trends in blood pressure, and $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. We do, however, have additional relevant information: the standard errors of the incidence and risk factor trend estimates for each population. Therefore, instead of $y_i$ and $x_i$ we observe

$$X_i = x_i + \mu_i \quad\text{and}\quad Y_i = y_i + \nu_i \tag{6}$$

where $\mu_i$ and $\nu_i$ are independent $N(0, \phi_i)$ and $N(0, \theta_i)$, respectively. We can consider the $\phi_i$ and $\theta_i$ known, having the values of the variances of the trend estimates. The regression model now becomes:

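$$Y_i - \nu_i = \alpha + \beta (X_i - \mu_i) + \varepsilon_i \quad\text{or, in another form,}\quad Y_i = \alpha + \beta X_i + \varepsilon_i' \tag{7}$$

where the $\varepsilon_i'$ are independent $N(0, \tau_i^2 + \sigma^2)$, with $\tau_i^2 = \theta_i + \beta^2 \phi_i$.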
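
The variance of the combined error term can be checked by substituting (6) into (5). The short derivation below is a sketch. It writes $\mu_i$ and $\nu_i$ for the errors in the observed risk factor and incidence trends, with known variances $\phi_i$ and $\theta_i$; the symbol $\nu_i$ is assumed notation. It also assumes that $\mu_i$, $\nu_i$ and $\varepsilon_i$ are mutually independent.

```latex
\begin{align*}
Y_i &= y_i + \nu_i = \alpha + \beta x_i + \varepsilon_i + \nu_i \\
    &= \alpha + \beta (X_i - \mu_i) + \varepsilon_i + \nu_i \\
    &= \alpha + \beta X_i + \varepsilon_i',
       \qquad \varepsilon_i' = \varepsilon_i + \nu_i - \beta \mu_i, \\
\operatorname{Var}(\varepsilon_i')
    &= \sigma^2 + \theta_i + \beta^2 \phi_i = \sigma^2 + \tau_i^2 .
\end{align*}
```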

There is further relevant information for the analysis. Although we are happy with the quality of the data from most of the populations, there is concern about the quality of some of them:

  1. In some surveys the blood pressure measurers were not trained properly or there was no adequate quality control during the survey. This may lead to a serious bias in the estimated blood pressure trends;
  2. In some populations there are changes in the equipment used for measurement or in the measurement procedures. Some of these changes shift the level of the measurements;
  3. The response rates to the surveys vary between about 50% and 90%.

Our information on the bias involved in these concerns is mostly qualitative. One way of using this information is to exclude from the analysis the populations about whose data we have concerns. Although this is a safe approach, it would also lead to a loss of relevant information, in particular in analyses involving several risk factors. As a solution to this problem, we suggest assigning each population a weight $W_i$, based on the quality of the data, which would determine the contribution of the population to the analysis. The model to be considered is then the same as (7) above, but with the $\varepsilon_i'$ independent $N(0, (\tau_i^2 + \sigma^2)/W_i)$. The unknown parameters $\alpha$, $\beta$ and $\sigma^2$ of the model can be estimated conveniently using the algorithm described in the Appendix.

5. Discussion

This paper uses the MONICA project as reference, but the issues addressed are relevant to a large variety of population based disease and risk factor monitoring. It is perhaps surprising that some of the simple approaches for estimating incidence and risk factor trends have not been in common use earlier. For a large study like MONICA or in routine monitoring it is important that the methods of estimation apply to a large variety of populations and are computationally fast.

Changes in the population levels of cardiovascular risk factors, like cholesterol or blood pressure, which may be very relevant from the point of view of prevention of the disease, are much smaller than the heterogeneity of the risk factor levels between the populations. Therefore, many methodological factors which are not very crucial in the comparison of levels between different populations may be very important for the estimation of trends within the populations. This requires that particular attention is paid not only to the measures taken to attain data of as high quality as possible, but also to assessing retrospectively the quality of the data actually achieved. The information on the quality is important for the interpretation of the results, but also challenges the statistician to develop methods which can use the information, which is often qualitative, in an effective and adequate way.

In this paper weighting of populations according to the data quality is suggested. In the MONICA Project, the construction of the weights is going on together with the final assessment of the quality of the full ten-year data, but so far the weighting scheme looks feasible. There will no doubt be details in the weighting procedures where the solution will be partly arbitrary. Assessment of the sensitivity of the results to such details will have to be an inherent part of the application of the approach and will determine its ultimate usefulness.

6. Appendix

In the paper we have considered special cases of the regression model

$$Y_i = \alpha + \beta X_i + \varepsilon_i \tag{8}$$

where $i = 1, \ldots, n$, $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = (\tau_i^2 + \sigma^2)/W_i$, $\tau_i^2$ are fixed functions of $\beta$ and $W_i$ are fixed weights. The unknown parameters $\alpha$, $\beta$ and $\sigma^2$ can be estimated iteratively using the approaches described by Pocock et al. (1981) and Breslow (1984) for special cases. The algorithms are based on the findings that

  1. For fixed $\sigma^2$ we can estimate $\alpha$ and $\beta$ iteratively by weighted least squares using weights $w_i = W_i/(\tau_i^2 + \sigma^2)$, with $\beta$ of the previous stage in the denominator.
  2. If $(\tau_i^2 + \sigma^2)/W_i$ are the true variances of $\varepsilon_i$,
(9)

Writing equation (9) without the expectation, we get:

. (10)

For fixed $\alpha$ and $\beta$ we can estimate $\sigma^2$ iteratively from formula (10), using the previous value of $\sigma^2$ on the right-hand side but keeping $\sigma^2 \ge 0$.

To estimate $\alpha$, $\beta$ and $\sigma^2$ we can now use the following algorithm:

  1. Obtain initial estimates $\alpha_0$ and $\beta_0$ by unweighted linear regression.
  2. Obtain initial estimate
    (11)
  3. Obtain new $\sigma_j^2$ iteratively using equation (10), as described above.
  4. Obtain new $\alpha_j$ and $\beta_j$ by weighted least squares as described above.
  5. Repeat steps 3 and 4 until convergence.
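
The alternating structure of the algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation:

- It takes $\tau_i^2 = \theta_i + \beta^2 \phi_i$, where $\phi_i$ and $\theta_i$ are the known variances of the risk factor and incidence trend estimates, and requires $\tau_i^2 > 0$.
- It starts from $\sigma^2 = 0$ instead of formula (11).
- It swaps the fixed-point iteration of formula (10) for a bisection search for the $\sigma^2 \ge 0$ at which the weighted residual sum of squares equals $n - 2$. The $n - 2$ allows for the two fitted parameters and is itself an assumption.

All function names are hypothetical.

```python
def weighted_line(x, y, w):
    """Weighted least-squares fit of y = alpha + beta * x with fixed weights w."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    beta = sxy / sxx
    return ybar - beta * xbar, beta


def solve_sigma2(resid2, tau2, W, df):
    """Find sigma2 >= 0 such that sum(W_i * r_i^2 / (tau_i^2 + sigma2)) = df.

    Uses bisection (a swapped-in technique, not formula (10)); returns 0
    when the residual variation is already no larger than expected.
    """
    def q(s2):
        return sum(Wi * r2 / (t2 + s2) for Wi, r2, t2 in zip(W, resid2, tau2))

    if q(0.0) <= df:
        return 0.0
    lo, hi = 0.0, 1.0
    while q(hi) > df:          # q decreases in sigma2, so expand until bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if q(mid) > df:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_weighted_trend_model(X, Y, phi, theta, W, max_iter=100, tol=1e-12):
    """Alternate between updating sigma2 and (alpha, beta) until convergence."""
    n = len(X)
    alpha, beta = weighted_line(X, Y, [1.0] * n)  # step 1: unweighted regression
    sigma2 = 0.0                                  # assumed start, not formula (11)
    for _ in range(max_iter):
        tau2 = [th + beta ** 2 * ph for th, ph in zip(theta, phi)]
        resid2 = [(yi - alpha - beta * xi) ** 2 for xi, yi in zip(X, Y)]
        new_sigma2 = solve_sigma2(resid2, tau2, W, n - 2)        # step 3
        w = [Wi / (t2 + new_sigma2) for Wi, t2 in zip(W, tau2)]
        new_alpha, new_beta = weighted_line(X, Y, w)             # step 4
        change = (abs(new_alpha - alpha) + abs(new_beta - beta)
                  + abs(new_sigma2 - sigma2))
        alpha, beta, sigma2 = new_alpha, new_beta, new_sigma2
        if change < tol:                                         # step 5
            break
    return alpha, beta, sigma2
```

Setting all $W_i = 1$ gives the unweighted model (7). Setting some $W_i$ below 1 inflates the assumed error variance for those populations and so reduces their influence on the fitted line.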

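The variance of $\hat\beta$ can be estimated from the inverse of the information matrix, assuming that the weights $w_i = W_i/(\tau_i^2 + \sigma^2)$ are fixed. We get

$$\widehat{\mathrm{Var}}(\hat\beta) = \frac{\sum_i w_i}{\sum_i w_i \sum_i w_i X_i^2 - \left(\sum_i w_i X_i\right)^2}. \tag{12}$$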
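
For reference, the step from the information matrix to the variance of the slope can be written out as follows. This is the standard weighted least-squares result for normal errors, given as a sketch rather than as the authors' own derivation. It treats the weights as fixed and writes $w_i$ for them.

```latex
\begin{align*}
\ell(\alpha, \beta) &= -\tfrac{1}{2} \sum_i w_i (Y_i - \alpha - \beta X_i)^2 + \text{const}, \\
I(\alpha, \beta) &=
  \begin{pmatrix}
    \sum_i w_i     & \sum_i w_i X_i \\
    \sum_i w_i X_i & \sum_i w_i X_i^2
  \end{pmatrix}, \\
\widehat{\operatorname{Var}}(\hat\beta) &= \left[ I^{-1} \right]_{22}
  = \frac{\sum_i w_i}{\sum_i w_i \sum_i w_i X_i^2 - \left( \sum_i w_i X_i \right)^2}
  = \frac{1}{\sum_i w_i (X_i - \bar X_w)^2},
  \qquad \bar X_w = \frac{\sum_i w_i X_i}{\sum_i w_i}.
\end{align*}
```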

BIBLIOGRAPHY

Breslow N.E. (1984) Extra-Poisson variation in log-linear models. Appl. Statist., 33, 38-44.

Dobson A., Filipiak B., Kuulasmaa K., Beaglehole R., Stewart A., Hobbs M., Parsons R., Keil U., Greiser E., Korhonen H. and Tuomilehto J. (1996) Relations of changes in coronary disease rates and changes in risk factor levels: Methodological issues and a practical example. Am. J. Epidemiol., 143, 1025-34.

Pocock S., Cook D.G. and Beresford S.A.A. (1981) Regression of area mortality rates on explanatory variables: what weighting is appropriate? Appl. Statist., 30, 286-295.

WHO MONICA Project Principal Investigators, prepared by Tunstall-Pedoe H. (1988) The World Health Organization MONICA Project (Monitoring Trends and Determinants in Cardiovascular Disease): A major international collaboration. J. Clin. Epidemiol., 41, 105-14.

SUMMARY

The paper considers the common situation where data on populations have been obtained by measurement of the individuals in a series of independent samples (e.g. risk factors of cardiovascular disease) and by continuous counting of events in the individuals of the entire population (e.g. population mortality or incidence), but the risk factors of those getting the events are not known. Methods are described for the estimation of changes in the population mean of the risk factors, estimation of changes in the disease incidence and estimation of the extent to which the observed incidence changes are explained by the observed changes in the risk factors when data are available from a large number of populations. The units of the last analysis are the populations rather than the individuals, but the variances of the variables (i.e. risk factor changes and incidence changes) as well as qualitative information on the quality of the data are available for each population.
