FINNISH RESEARCH PROGRAMME
ON ENVIRONMENTAL HEALTH
SYTTY
 
 

SMALL-AREA ANALYSES OF CANCER INCIDENCE AROUND A POINT SOURCE

Project leader: Antti Penttinen, University of Jyväskylä, Department of Statistics, P.O.Box 35 (MaD), FIN-40351 Jyväskylä, Finland, tel. +358-14-60 2987, e-mail: Penttine@maths.jyu.fi
 
 

Researchers:
Juha Pekkanen, National Public Health Institute, tel. +358-17-201 368, e-mail: Juha.Pekkanen@ktl.fi
Esa Kokki, National Public Health Institute and COMAS Graduate School, University of Jyväskylä, tel. +358-17-201 393, e-mail: Esa.Kokki@ktl.fi

Financing SYTTY organisation: The Academy of Finland
Funding from SYTTY / Total funding of project (€): 97902 / 123889
Person-months of work funded by SYTTY / Total person-months of work: 38 / 49

KEY WORDS: small-area methods, GIS, Bayesian hierarchical modeling, Markov chain Monte Carlo (MCMC), cancer
 

EXTENDED ABSTRACT

1 Introduction

Many environmental exposures originate from point sources such as factories, power plants, oil refineries and dump sites. With the rapid development of geographical information systems (GIS), point sources of environmental pollution have become interesting objects of study in epidemiology.

Most studies that use geographical information to examine exposure from a point source focus purely on the distance from the source. The statistical methodology for such analyses is still under development. Most of these studies do not adjust for confounding factors other than age and sex. However, populations living near a pollution source often belong to lower social groups, whose health status tends to be poorer irrespective of the exposure. Nevertheless, strong conclusions have sometimes been drawn from statistical analyses whose results are suggestive at best.

In the present project, the geographical distribution of lung cancer risk around a fixed pollutant point source is studied using fully Bayesian hierarchical modelling. Our objective is to investigate how lung cancer risk varies with the distance from the source. As a model we use spatial log-linear regression, which has been applied successfully in disease mapping. This model combines covariate information, correlation between neighbouring areas in the response, and unexplained heterogeneity in the data. In the hierarchical Bayesian approach, the posterior is computed with the Markov chain Monte Carlo (MCMC) method. The simulations are then used to calculate the data summaries of interest.

2 Methods

The register-based data have been aggregated into adjacent small square areas of size 0.5 km x 0.5 km, called "pixels", each containing the number of cancer cases and the population at risk. Sex, age group and socio-economic class are used as covariates in calculating the pixelwise expected number of cancer cases, which reflects the national risk level. In the study area of 50 km x 50 km only 20% of the pixels were inhabited, so the population distribution is highly inhomogeneous. This is a typical feature of high-resolution spatial population data in Finland.
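The calculation of a pixelwise expected count can be illustrated by indirect standardisation. The following is only a minimal sketch under stated assumptions: the stratum labels, the rates, the person-years and the function name `expected_cases` are all hypothetical and are not taken from the project's registers or software.

```python
# Indirect standardisation (illustrative): the expected count in a pixel is
# the sum over strata of (person-years at risk in the pixel) x (national rate).
# All numbers below are invented for illustration only.

def expected_cases(pixel_person_years, national_rates):
    """Return the expected number of cases in one pixel.

    pixel_person_years: dict (sex, age_group, social_class) -> person-years at risk
    national_rates:     dict with the same keys -> cases per person-year nationwide
    """
    return sum(person_years * national_rates[stratum]
               for stratum, person_years in pixel_person_years.items())

# Hypothetical national rates per stratum.
national_rates = {
    ("M", "60-69", "manual"): 0.0020,
    ("M", "60-69", "non-manual"): 0.0010,
    ("F", "60-69", "manual"): 0.0004,
}

# Hypothetical population of one pixel.
pixel = {
    ("M", "60-69", "manual"): 150.0,
    ("F", "60-69", "manual"): 100.0,
}

E_i = expected_cases(pixel, national_rates)
print(round(E_i, 4))
```

With counts this small per pixel, most expected values are well below one case, which is one reason why smoothing across pixels is needed.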

Lung cancer can be regarded as a rare, non-infectious disease. Let $O_i$ denote the observed number of cases in pixel $i$ and $\mu_i$ its expectation. Because the disease is non-infectious, we assume that the $O_i$ are conditionally independent given the $\mu_i$, and conditionally Poisson distributed. The assumption of conditional independence leads to the familiar Poisson likelihood

$$ L(\mu; O) = \prod_i \frac{\mu_i^{O_i} e^{-\mu_i}}{O_i!}. $$

If the number of cancer cases in pixel $i$ follows the national risk level, the expectation of $O_i$ is $E_i$, which is calculated from the nationwide register taking into account sex, age group, socio-economic status and the population in pixel $i$. We assume further that cancer risks in nearby pixels tend to be similar. This contextuality is modelled by means of a random field $\{\lambda_i\}$ and, using a multiplicative model, the expectation $\mu_i$ of $O_i$ becomes

$$ \mu_i = E_i \lambda_i, \tag{1} $$

where $E_i$ is treated as an offset that includes the covariates. The role of $\lambda_i$ is to smooth the $\mu_i$ and to exchange information between nearby pixels. The simple model (1) can be extended to account for the distance of pixel $i$ from a fixed point source by introducing a distance covariate $h_i$. The extended model is then

$$ \mu_i = E_i h_i \lambda_i, \tag{2} $$

where $h_i$ can be either a conventional parametric model or based on a subdivision defined by expert opinion. The latter is defined as follows. The distance effect is specified by choosing $K$ risk areas, each defined in relation to the point source (e.g. concentric circles). The distance covariate can then be defined as

$$ h_i = \sum_{k=1}^{K} \beta_k 1_k(i). \tag{3} $$

The indicator $1_k(i)$ is 1 if pixel $i$ belongs to subarea $k$ and 0 otherwise. The parameter $\beta_k$ gives the risk in area $k$ relative to the expected level.
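The expert-defined distance covariate and its use in the Poisson likelihood can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the zones are taken to be concentric rings, the extended model is assumed to have Poisson means mu_i = E_i * h_i * lambda_i (lambda_i being the random-field term), and the function names, radii and relative risks are hypothetical.

```python
import math

def distance_effect(pixel_xy, source_xy, outer_radii_km, beta):
    """Return h_i = sum_k beta_k * 1_k(i) for concentric zones around the source.

    outer_radii_km: increasing outer radii of zones 1..K-1; zone K is everything farther away.
    beta:           K relative risks, one per zone.
    """
    d = math.hypot(pixel_xy[0] - source_xy[0], pixel_xy[1] - source_xy[1])
    for k, radius in enumerate(outer_radii_km):
        if d < radius:
            return beta[k]      # pixel lies in zone k (0-based index)
    return beta[-1]             # pixel lies in the outermost zone

def poisson_loglik(O, E, h, lam):
    """Poisson log-likelihood with means mu_i = E_i * h_i * lambda_i (assumed form)."""
    total = 0.0
    for o, e, h_i, lam_i in zip(O, E, h, lam):
        mu = e * h_i * lam_i
        total += o * math.log(mu) - mu - math.lgamma(o + 1)
    return total

source = (0.0, 0.0)                              # source location in km (invented)
pixels = [(1.0, 1.0), (3.0, 0.0), (3.0, 4.0)]    # pixel centres in km (invented)
beta = [1.5, 1.2, 1.0]                           # invented relative risks for K = 3 zones
h = [distance_effect(p, source, [2.0, 5.0], beta) for p in pixels]
print(h)  # [1.5, 1.2, 1.0]

# Log-likelihood of invented counts with the random field fixed at 1.
print(poisson_loglik([1, 0, 2], [0.4, 0.3, 0.9], h, [1.0, 1.0, 1.0]))
```

Setting every h_i to 1 gives the likelihood of the simple model without a distance effect.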
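The prior for the logarithm of the random field, $\eta_i = \log \lambda_i$, follows the conventional choice: $\eta$ is an intrinsic Gaussian random field defined by the joint distribution

$$ p(\eta \mid \kappa) \propto \exp\Big\{ -\frac{\kappa}{2} \sum_{i \sim j} w_{ij} (\eta_i - \eta_j)^2 \Big\}, \tag{4} $$

where the sum runs over pairs of neighbouring pixels and the proportionality is with respect to $\eta$. Here $\kappa$ is the smoothing parameter and $w_{ij}$ is a weight relating the neighbouring pixels $i$ and $j$. The hyperprior for $\kappa$ is a Gamma($a$, $b$) distribution, where $a$ and $b$ are treated as fixed hyperparameters.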
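The smoothing role of the intrinsic Gaussian prior can be seen from its quadratic form. The sketch below evaluates the unnormalised log prior density for a given log random field; it assumes a symmetric neighbour list and a common weight for all neighbour pairs, and the names `igmrf_log_density`, `eta` and `kappa` are hypothetical.

```python
def igmrf_log_density(eta, neighbours, kappa, weight=1.0):
    """Unnormalised log density of an intrinsic Gaussian random field:
    -(kappa / 2) * sum over neighbour pairs of weight * (eta_i - eta_j)^2.

    eta:        list of log random-field values, one per pixel
    neighbours: dict pixel index -> list of neighbouring pixel indices (symmetric)
    kappa:      smoothing (precision) parameter
    weight:     common weight w_ij, assumed equal for all pairs in this sketch
    """
    total = 0.0
    for i, nbrs in neighbours.items():
        for j in nbrs:
            if i < j:                      # count each neighbour pair once
                total += weight * (eta[i] - eta[j]) ** 2
    return -0.5 * kappa * total

# Three pixels in a row: 0 - 1 - 2 (invented example).
neighbours = {0: [1], 1: [0, 2], 2: [1]}
smooth = igmrf_log_density([0.0, 0.1, 0.2], neighbours, kappa=2.0)
rough = igmrf_log_density([0.0, 1.0, 3.0], neighbours, kappa=2.0)
print(smooth > rough)  # True: smoother fields receive higher prior density
```

A larger smoothing parameter penalises differences between neighbouring pixels more strongly.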

In Bayesian modelling, statistical inference for model (1) is based on the posterior distribution, that is, the distribution of the unknowns given the observations. Its construction requires suitable prior distributions for the unknowns. For example, the random components are modelled a priori by (4), and the logarithms of the regression coefficients $\beta_k$ by normal distributions with zero mean and variances $\sigma_k^2$.

The posterior is calculated by Markov chain Monte Carlo simulation. In practice, we construct a Markov chain that has the posterior as its invariant distribution. Simulating this chain yields a sequence of parameter values that can be regarded as a dependent sample from the posterior. Posterior expectations of the parameters are then approximated by sample means of the sequence. In constructing the Markov chain we apply single-site Metropolis updates for the log random field $\eta$ and the coefficients $\beta$, and Gibbs updates for the smoothing parameter $\kappa$.
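The updating scheme can be sketched for the simple model without the distance effect. This is an illustrative toy under several assumptions, not the project's C implementation: unit neighbour weights, a Gamma(a, b) hyperprior read as shape a and rate b, a prior normalising factor proportional to kappa^((n-1)/2) for a connected neighbourhood graph, and invented data, step size and hyperparameter values.

```python
import math
import random

def local_log_target(i, value, eta, O, E, neighbours, kappa):
    """Log full conditional of eta_i (up to a constant) when mu_i = E_i * exp(eta_i)."""
    loglik = O[i] * value - E[i] * math.exp(value)
    smooth = sum((value - eta[j]) ** 2 for j in neighbours[i])
    return loglik - 0.5 * kappa * smooth

def run_chain(O, E, neighbours, a=1.0, b=0.01, n_iter=2000, step=0.5, seed=1):
    """Single-site Metropolis for eta, Gibbs for kappa. Returns a list of (eta, kappa) draws."""
    rng = random.Random(seed)
    n = len(O)
    eta = [0.0] * n
    kappa = 1.0
    samples = []
    for _ in range(n_iter):
        # Metropolis: update one pixel at a time with a random-walk proposal.
        for i in range(n):
            proposal = eta[i] + rng.gauss(0.0, step)
            log_ratio = (local_log_target(i, proposal, eta, O, E, neighbours, kappa)
                         - local_log_target(i, eta[i], eta, O, E, neighbours, kappa))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                eta[i] = proposal
        # Gibbs: with the assumptions above, kappa given eta is Gamma distributed
        # with shape a + (n - 1) / 2 and rate b + S / 2.
        S = sum((eta[i] - eta[j]) ** 2
                for i in range(n) for j in neighbours[i] if i < j)
        shape = a + 0.5 * (n - 1)
        rate = b + 0.5 * S
        kappa = rng.gammavariate(shape, 1.0 / rate)   # second argument is the scale
        samples.append((list(eta), kappa))
    return samples

# Four pixels in a row with invented observed and expected counts.
O = [0, 1, 3, 2]
E = [1.0, 1.0, 1.5, 1.5]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

samples = run_chain(O, E, neighbours, n_iter=500)
kept = samples[100:]                                  # discard a short burn-in
post_mean_lambda = [sum(math.exp(eta[i]) for eta, _ in kept) / len(kept)
                    for i in range(len(O))]
print(post_mean_lambda)
```

The sample means over the retained draws approximate the posterior expectations; because the draws are dependent, far longer runs and convergence checks would be needed in a real analysis.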

3 Results and discussion

The simple model (1) and the extended model (2) have been constructed, an algorithm for computing their posterior has been implemented in C, the modelling has been tested with simulated data, and the case study data have been analysed. A new statistical result of the work is the introduction of the confirmatory small-area model, together with its implementation and its application to cancer data; see [3].

The project has revealed statistical problems not recognized in corresponding disease-mapping studies, which are typically based on relatively large administrative regions. The main problem is the small number of cases per pixel and, especially, the large number of uninhabited pixels. This leads to instability of the MCMC algorithm used to simulate the posterior distribution. The second problem concerns the neighbourhood relation among the pixels. Because the population is heterogeneously distributed, 4% of the pixels have no neighbours under the conventional neighbourhood of the 8 adjacent pixels. This leads to undersmoothing, which causes unstable parameter estimates in those pixels. One remedy is to enlarge the neighbourhood, at the cost of spatial precision.
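The neighbourhood problem can be made concrete with a small sketch that builds the 8-neighbourhood among inhabited pixels and lists those left without neighbours. The grid coordinates and the function name `eight_neighbours` are hypothetical and serve only as an illustration.

```python
def eight_neighbours(inhabited):
    """Map each inhabited pixel (row, col) to its inhabited neighbours
    among the 8 adjacent pixels."""
    cells = set(inhabited)
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    return {(r, c): [(r + dr, c + dc) for dr, dc in offsets if (r + dr, c + dc) in cells]
            for (r, c) in cells}

# Invented example: three adjacent inhabited pixels and one remote pixel.
inhabited = [(0, 0), (0, 1), (1, 1), (5, 5)]
nbrs = eight_neighbours(inhabited)

isolated = [pixel for pixel, adjacent in nbrs.items() if not adjacent]
share_isolated = len(isolated) / len(inhabited)
print(isolated, share_isolated)  # [(5, 5)] 0.25
```

An isolated pixel receives no information from other pixels through the spatial prior, so its estimate rests on its own, usually very small, counts.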

We have suggested a solution to the first problem. It is based on constraining the high-resolution model by a suitably chosen low-resolution model. We have implemented the idea as follows: the expected case counts from the high-resolution model (2) are aggregated into the expert-defined confirmatory areas and compared with the corresponding observed SIRs (standardised incidence ratios), and large deviations are penalized. Simulation experiments show that the constraint stabilizes the computation of the posterior distribution. The new method also yields risk estimates that are more reliable than those given by the basic high-resolution model; see [5].
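The idea of penalizing deviations at the low-resolution level can be sketched as below. The text does not give the functional form of the penalty, so the quadratic penalty on the SIR scale, the strength parameter `tau`, the function name and all numbers are assumptions made purely for illustration.

```python
def constraint_log_penalty(mu, O, E, area_of_pixel, tau=10.0):
    """Hypothetical log-penalty comparing model-based and observed SIRs per area.

    mu:            model-based expected counts per pixel (high-resolution model)
    O, E:          observed and reference expected counts per pixel
    area_of_pixel: confirmatory-area label of each pixel
    tau:           assumed penalty strength (larger values tighten the constraint)
    """
    sums = {}
    for m, o, e, area in zip(mu, O, E, area_of_pixel):
        s = sums.setdefault(area, [0.0, 0.0, 0.0])
        s[0] += m          # aggregated model-based count
        s[1] += o          # aggregated observed count
        s[2] += e          # aggregated reference expected count
    # Model SIR minus observed SIR in each area, squared and summed.
    return -0.5 * tau * sum(((m - o) / e) ** 2 for m, o, e in sums.values())

# Invented example: three pixels in two confirmatory areas.
penalty = constraint_log_penalty(mu=[1.0, 1.0, 2.0], O=[1, 2, 2],
                                 E=[1.0, 1.0, 2.0], area_of_pixel=[0, 0, 1])
print(penalty)  # -1.25
```

In a sampler such a term would be added to the log posterior, so that parameter values whose aggregated risks stray far from the observed area-level SIRs are rarely accepted.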

The resulting modelling is computationally intensive. Therefore a simpler Bayesian change-point model has been introduced, and the case study data have been analysed with it. The results are promising and are the subject of the manuscript [7].

One objective of the project is to train an expert in the statistical modelling of the health effects of environmental exposures and in the associated computation. The Ph.D. programme of the researcher (Esa Kokki) includes courses in biostatistics organized by Finnish universities and by StatNet (a nationwide organization for graduate and postgraduate studies in statistics), as well as international workshops and researcher training courses. The project has supported this participation. The publications [1], [6] and the manuscripts [5], [7] will be included in the Ph.D. thesis of Esa Kokki. The thesis will be completed during 2002.

4 Conclusions

The model suggested for estimating the spatial distribution of cancer incidence is well defined, utilizes the available covariate information, and models both contextuality and the relation of the cases to the fixed source of exposure. The data are allowed to be heterogeneous.

From a statistical point of view, this modelling is an ingenious approach to small-area problems in general. (It has recently also been introduced in the field of small-area surveys.) The model for the spatial distribution of cancers can be developed further. For example, applying a parametric model for the distance effect would be an interesting choice. Additional covariates can easily be added to the model. New models for analysing two diseases jointly have recently been suggested in the literature. Such joint models would increase the information available on the effect of the exposure on the health of the population.

The commonly applied Poisson regression cannot model spatial dependence. The present approach is a solution to this problem. The Bayesian approach allows us to calculate posterior probability intervals, which quantify the uncertainty of the estimates and remain valid even when the data are dependent (which is not the case with Poisson regression). In addition, one can calculate posterior probabilities of events that are important from the epidemiological point of view. For example, one can calculate the probability that a fixed subarea near the point source has the highest risk among a set of subareas.
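Such a posterior probability is estimated directly from the simulated sample as a relative frequency. The sketch below assumes that each posterior draw is a list of relative risks, one per subarea; the draws and the function name `prob_highest` are invented for illustration.

```python
def prob_highest(draws, k):
    """Estimate P(subarea k has the highest risk) as the share of posterior
    draws in which the k-th relative risk is the largest."""
    hits = sum(1 for draw in draws
               if max(range(len(draw)), key=draw.__getitem__) == k)
    return hits / len(draws)

# Four invented posterior draws of relative risks for three subareas,
# ordered from the subarea nearest the source to the farthest.
draws = [
    [1.4, 1.1, 1.0],
    [1.2, 1.3, 0.9],
    [1.6, 1.0, 1.1],
    [1.3, 1.2, 1.0],
]
p_nearest = prob_highest(draws, 0)
print(p_nearest)  # 0.75
```

In practice the draws would come from the MCMC output after burn-in, and many more than four would be used.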

Further extensions, which are of importance in the development of the new monitoring system, will be considered and discussed within the group of international experts associated with the project.

5 Cooperation

The project has close connections and cooperation with several research groups and researchers:
Rolf Nevanlinna Institute, University of Helsinki: cooperation in hierarchical Bayesian modelling and research in disease mapping (Prof. Elja Arjas, Ph.Lic. Jukka Ranta)
Finnish Cancer Registry: expertise in cancer epidemiology and cancer databases (Doc. Eero Pukkala)
Statistics Finland: population databases
Geological Survey of Finland: software programming (M.Sc. Esa Kauniskangas)
EU 5th Framework project EUROHEIS (European Environment and Health Information System): a project in 7 countries (UK, Finland, Sweden, Denmark, Italy, Ireland and Spain), in which the present project is a partner
University of Florence: exchange of experience in disease mapping (Ph.D. Fabio Divino)

6 References

[1] Haikonen A: Sairauden alueellisen ryvästymisen tutkiminen spatiaalisella tapaus-verrokkiasetelmalla. (Study of spatial clustering of disease using a spatial case-control design.) M.Sc. thesis in statistics, University of Jyväskylä, 1998.

[2] Kokki E: Tilastolliset pienalueanalyysit keuhkosyövän ilmaantuvuudesta päästölähteen ympäristössä. (Statistical small-area analysis of lung cancer incidence around a point source.) Manuscript of Licentiate's thesis in statistics, University of Jyväskylä, 2000.

[3] Kokki E, Ranta J, Penttinen A, Pukkala E, Pekkanen J: Small area estimation of incidence of cancer around a known source of exposure with fine resolution data. Occup. Environ. Med. 2001; 58:315-20.

[4] Kokki E, Pukkala E: Koordinaateilla tarkempi kuva syövän esiintyvyydestä. (Coordinates give a more precise picture of the incidence of cancer.) Positio 2001; 1: 8-9.

[5] Kokki E, Penttinen A, Ranta J, Pekkanen J: Constrained Bayesian calculation of disease risk around a point source. Manuscript, 2001, submitted to Stat. Med.

[6] Kokki E, Pukkala E, Verkasalo P, Pekkanen J: Small Area Statistics on Health (SMASH): A system for rapid investigations of cancer in Finland. In press, 2002.

[7] Kokki E, Penttinen A: Estimation of risk of disease around a point source with the changepoint method. Manuscript, 2002.

[8] Penttinen A: Small-area statistics in mapping of georeferenced data. The yearbook of the Finnish Statistical Society 1999-2000, 39-47, Helsinki, 2000.
 
