Quality Assessment of Demographic Data in the WHO MONICA Project

Appendix 2. Method for checking the consistency of the population dynamics

1. Overview

In the WHO MONICA Project, annual data on population size were collected for a period of about ten years. The age range covered was 25-64 years or optionally 25-74 years, and the data were reported by 5-year age group. We describe here the method which was used for checking the consistency of the population dynamics as part of the data quality assessment in the MONICA Data Centre (MDC).

The approach is to estimate the migration pattern from the reported data on population size. If the migration pattern shows year specific or isolated jumps, very large annual migration, or other abnormalities, the data are considered to be inconsistent, and concern is aroused as to the quality of the data. Here we will use the term "migration" for the sum of all changes in the population size, including conventional migration and mortality. (Birth is not included because it is irrelevant in the age groups considered in MONICA.)

To facilitate the estimation of migration, we will need a model and a number of assumptions. The scheme for estimating migration is based on four elements:

These elements together with the reported data enable the estimation of one year population sizes and migration. The next three sections of this appendix give a mathematical description of the model and the three assumptions. The last two sections set up the optimisation problem, the solution of which yields the required estimate of the migration pattern, and introduces some computational aspects. The details of the optimisation algorithm and the computational scheme are beyond the scope of this appendix.

2. Notation

Matrices and vectors are denoted by bold-faced letters such as B and y, and scalars by unbold lower case letters.

Specific notation from set theory and theory of optimisation will be incorporated in the text. We will use one- and two-dimensional scalar functions defined on integer-valued domains. Notation [.,.] indicates a contiguous range of a function argument or index of summation: the first dot indicates the position of the value or variable for the lower limit of the range, and the second one the upper limit. A letter with subscript 0 denotes the lower limit of a parameter range,  while a capital letter denotes its upper limit, unless otherwise specified. For example, Img00001.gif (985 bytes) means, that parameter a belongs to the set of integer numbers {ao, ao+1, ... , A}.

Symbol Img00002.gif (843 bytes) means "is defined as". For example, expression Img00003.gif (1187 bytes) means that vector po is defined as a set of the specified values of function p.

In some cases where capital letters are used to denote  the value of a parameter, the corresponding lower case letter denotes its logarithm. Such cases will be specified in the text.

A certain amount of linear algebra is assumed to be known by the reader.

3. A model of demographic dynamics

Our model for the population dynamics, using integer valued calendar time and age, specifies the yearly changes in the population size of one year specific cohort groups:

Img00004.gif (1297 bytes) (1)

where

By introducing

Img00005.gif (1332 bytes)

we may rewrite (1) in form:

Img00006.gif (1184 bytes)

Taking the logarithm of both sides of the equation, we get

Img00007.gif (1191 bytes) (2)

where p = log(P) and c = log(C).

Here c(t,a) is actually the rate of change in the population size. For example c(t,a)=0.05 corresponds approximately to a 5% increase in the following year in the size of the population (t,a).

For the younger age groups the change reflects predominantly migration, and the contribution of mortality to the change increases with age. In the following analysis we call c the "migration rate" or just "migration", regardless of the fact that it also reflects mortality.

For simplicity, we set t=0 for the first calendar year and count the later years from the first one. We consider the "squared area”:

Img00008.gif (1120 bytes) (3)

and assume that c(t,a) is defined everywhere in the area (3), except the last year T. Furthermore, we assume that p(0,a) is given for all Img00009.gif (1005 bytes) (initial condition), and p(t,ao) is given for all Img00010.gif (982 bytes) (boundary condition). Using (2), we can then recursively express p(t,a) in area (3) for each t>0 and  a>ao. For notational convenience,  we extend the domains of c and p beyond the area (3):

Img00011.gif (1490 bytes) (4)

Based on (2) and (4) we can now express the logarithm of the population size p(t, a) in area (3) for calendar years t>0 using the population size of the first year and the annual migration rates:

Img00012.gif (1351 bytes) (5)

We will base further consideration in this formulation of the model of demographic dynamics . It defines the logarithm of the size of any one year birth cohort in any calendar year as a linear function of the migration rates c and the logarithm of its population size in the first year.

4. Consistency criteria

Our model (5) specifies the population size in one year age groups, but the MONICA data were reported by five year age groups, which cover the age range [ao,A] for every calendar year Img00010.gif (982 bytes). Let j be an index for the five year age groups: j=1 denotes the youngest age group (i.e. 25-29) and J the oldest age group (i.e. 60-64 or 70-74). By a(j) we denote the first year of age in age group j (e.g. a(1)=25). By R(t,j) we denote the reported population size of the five-year age group j in year t.

As model (5) deals with the logarithms of the population size, at first sight it seems that we should use r(t,j), defined as the logarithm of R(t,j). However, as model (5) uses the logarithms of the population sizes of the one year age groups, r(t,j) should be analogous to the sums of the logarithms of the five one year age group sizes. This is not the case with the logarithm of R(t,j), but using a simple approximation we may define:

Img00013.gif (1144 bytes) (6)

From now on we refer to r(t,j) as the "reported data".

For the given reported data r(t,j), our aim is to find a vector

Img00014.gif (1185 bytes)

and a matrix

Img00015.gif (1259 bytes)

such that the the degree of variability in their elements is feasible and, when substituted in (5), they provide a satisfactory fit with the reported data. In order that we can set exact criteria for these conditions, we construct three indicators:

We construct indicator S1 as the mean squared relative deviation of the 1-year age group sizes in the first calendar year:

Img00016.gif (1413 bytes) (7)

where aF = ao-T, and

Img00017.gif (2170 bytes)

Thus µ(a) is the moving average of the logarithms of the five one-year age group sizes.

We construct indicator S2 as the mean squared relative discrepancy between the population size estimates from the model and the reported data:

Img00018.gif (1325 bytes) (8)

where Img00019.gif (1309 bytes)

and  Img00020.gif (1348 bytes)

Before constructing indicator S3 we simplify our model by assuming that at any given year, matrix c is constant within the reported 5-year age group. Indicator S3 is then defined as:

Img00021.gif (2087 bytes) (9)

This indicator shows how far matrix c is from the matrix with all elements equal to zero, which corresponds to the case of no migration.

5. Setting up the optimisation problem

For the further analysis it is convenient to present all indicators as functions in a vector space, using vectors:

x = po,
u = (c(0,a(1)),...,c(0,a(J)),...,c(T-1,a(1)),...,c(T-1,a(J)))T,
z = (xT,uT)T,
y2 =( r(0,1),...,r(0,J),...,r(T,1),...,r(T,J))T.

Let k=dim(x), m=dim(u), n=k+m=dim(z).

We can define matrices B1x, B2x, B2u and B3u such that

S1(z) = S1(x, u) = ||B1xx||2;

S2(z) = S2(x, u) = ||B2xx+B2uu-y2||2;

S3(z) = S3(x, u) = ||B3uu||2 = ||Iu||2, where I is the m×m identity matrix.

(10)

where ||.|| denotes the Euclidean norm and I is the m×m identity matrix. One can identify the elements of matrices B1x, B2x, B2u and B3u by inspecting equations (7), (8) and (9). Note that all indicators S1, S2 and S3 are quadratic forms in the finite dimensional vector space En = Ek×Em with elements z .

Our optimisation problem, which we call "Problem P1", is to:

minimise Img00022.gif (1019 bytes)
subject to Img00023.gif (1125 bytes)

where Img00027.gif (872 bytes) and Img00028.gif (873 bytes) are positive constants.

6. Computational aspects

An analysis of Problem P1 has shown that there exists a unique solution for practically any values of constants Img00027.gif (872 bytes) and Img00028.gif (873 bytes), except when they both are too small. After computational experimentation, the constants were fixed to Img00027.gif (872 bytes) = 0.0025 and Img00028.gif (873 bytes) = 0.01.

The theory of optimisation provides an iterative algorithm for obtaining a solution. At each step an auxiliary optimisation problem is solved.This is minimising Lagrangian function for the Problem P1:

Img00024.gif (1447 bytes)

with non-negative parameters Img00025.gif (869 bytes) and Img00026.gif (874 bytes). Such an optimisation can be performed iteratively as the by-product of the weighted least squares estimation, where the matrix of observations is composed of matrices B1x, B2x, B2u and B3u and vector y2, and the weights are determined by Img00025.gif (869 bytes) and Img00026.gif (874 bytes) of the previous round of iteration.

The computation in the MDC was done using SAS. The program prepared for this includes tools for controlling the convergence and a more advanced method for approximating the logarithmic analogue of the reported data than that described in (6).