Contents Previous (Part III Section 5) Next (Part III Section 7)
Data management is one of the cornerstones of a well-organized risk factor survey. Good data management ensures that:
Point one above involves data collection, checking, error correction, documentation and back-up of the study database. Point two concerns the analysis of the data from the database, to obtain survey results. Separating the data management into these two stages is highly recommended. If the database is not complete and the quality of the data is not documented before the data are analyzed for the survey results, the analysis is likely to reveal problems which should have been detected earlier, during the preparation of the database. This usually results in much longer delays in the final analysis than if more care had been taken in preparing the database. Furthermore, the use of unchecked and uncorrected data will lead to incorrect results.
Well-planned data management facilitates good quality data and ready availability of the data for analysis. Poorly planned data management is usually expensive, increases risk of loss of data, decreases data quality and delays the data analysis.
We will consider here issues which have been found crucial for survey data management. There are good methods for many of the stages of the data management, often ranging from manual methods to computerized ones. The choice of the methods will depend on the local facilities, existing practice and the expertise available, and will not be considered here.
The first data management issue in a survey is usually the management of the documentation of the survey planning. The next main data management issue relates to sample selection and recruitment. As a minimum, the following information needs to be recorded for every subject selected for the sample:
The subjects selected in the sample form the basis for controlling the completeness of the data throughout the data management process. The survey history of every subject should be verifiable from the final database.
At the recruitment stage, the attempts made to contact each subject need to be monitored, and the eventual success of the recruitment needs to be recorded. For the subjects who did not participate in the survey examinations, the reason should be recorded using a classification which includes at least the following categories:
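The completeness control described above can be sketched as a comparison of two sets of subject identifiers. This is only an illustration: the function name, the dictionary keys and the identifier format are assumptions made for the example, not part of the survey protocol.

```python
def completeness_report(sample_ids, database_ids):
    """Compare the selected sample with the subject IDs found in the database."""
    sample = set(sample_ids)
    database = set(database_ids)
    return {
        # Selected subjects with no record at all: their survey history is unverifiable.
        "missing_from_database": sorted(sample - database),
        # Records whose ID does not belong to the sample: possibly a mistyped ID.
        "not_in_sample": sorted(database - sample),
    }


# Hypothetical identifiers, for illustration only.
report = completeness_report(
    sample_ids=["S001", "S002", "S003"],
    database_ids=["S001", "S003", "S999"],
)
print(report)
```

Running such a comparison after every data transfer, and not only at the end of the survey, makes it easier to trace where a record went astray.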
Categories 'a' to 'c' are often called ineligible for the survey, whereas 'd' and 'e' are true non-respondents. The eligibility of category 'f' is often uncertain, and this category should be allocated sparingly.
Recording the survey measurements and getting the data from the different survey sites to a common database is a challenging part of the data management. All steps where data are transferred from one form to another or from one place to another require specific attention at the planning stage. Such steps include:
Three main challenges for data management in these steps are to ensure that:
Errors and incompleteness of the records can be prevented by good design of record forms and by routine checking of the forms and data. The earlier the errors can be detected, the easier they are to remedy. In particular, where feasible, detection of errors should be attempted when the subject is still in the interview or examination site.
Relevant data which were not obtained from the subject should never be left blank on the form; a specific code for missing data should be used instead. Any blank that remains on a form then clearly indicates an incomplete record and can be detected easily.
Computer-based data collection systems have the advantage that they reduce the number of manual data transfers and facilitate extensive data checking at an early stage. However, such systems should be used only if they have been tested in the field and found reliable. Otherwise there is an increased risk of losing records or delaying the examination schedule because of a breakdown of the system.
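A minimal sketch of this convention follows. It assumes a hypothetical missing-data code of "999" and hypothetical field names; real surveys define their own codes for each item. A field holding the missing-data code counts as complete, and only a truly blank field is reported.

```python
MISSING_CODE = "999"  # hypothetical code meaning "not obtained from the subject"


def find_blank_fields(record, required_fields):
    """Return the required fields that were left blank in one record."""
    blanks = []
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            blanks.append(field)
    return blanks


# "smoker" carries the missing-data code, so it is complete; "weight" is blank.
record = {"subject_id": "S001", "sbp": "132", "weight": "", "smoker": MISSING_CODE}
blanks = find_blank_fields(record, ["subject_id", "sbp", "weight", "smoker"])
print(blanks)
```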
The use of paper forms has proven to be reliable over decades, but they have the problem that on-site data checking is more difficult. If paper forms are used, the keying of the data into electronic format will need to be done carefully. The traditional method of double keying by different persons is still worth considering.
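Double keying can be checked by comparing the two keyed versions field by field. The sketch below assumes that each keying is held as a dictionary mapping a subject identifier to that subject's fields; this layout and all names in it are illustrative assumptions.

```python
def compare_keyings(first, second):
    """List every disagreement between two independent keyings.

    Each keying maps subject_id -> {field: value}. The result is a list of
    (subject_id, field, value_in_first, value_in_second) tuples.
    """
    discrepancies = []
    for subject_id in sorted(set(first) | set(second)):
        rec1 = first.get(subject_id, {})
        rec2 = second.get(subject_id, {})
        for field in sorted(set(rec1) | set(rec2)):
            if rec1.get(field) != rec2.get(field):
                discrepancies.append(
                    (subject_id, field, rec1.get(field), rec2.get(field))
                )
    return discrepancies


# Hypothetical data keyed by two different persons.
keying_a = {"S001": {"sbp": "132", "dbp": "84"}, "S002": {"sbp": "118", "dbp": "76"}}
keying_b = {"S001": {"sbp": "123", "dbp": "84"}, "S002": {"sbp": "118", "dbp": "76"}}
differences = compare_keyings(keying_a, keying_b)
print(differences)
```

Each disagreement is then resolved by going back to the paper form, not by trusting either keying.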
To prevent the loss of records, it is important that the subject identification is recorded correctly at all stages (see Section 6.1). Furthermore, whenever data or samples are transferred from one place to another, the transfer must be logged properly. The recipient of the data or samples should be able to check immediately that he or she has received exactly the records which were sent, and the person sending the data should make sure that everything was received.
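For electronic records, one way to make such a check immediate is to send a manifest with every batch. The sketch below is one possible approach under stated assumptions: each record is a dictionary with a "subject_id" key, and a SHA-256 digest of the record stands in for its contents. The function names are hypothetical.

```python
import hashlib
import json


def build_manifest(records):
    """Map each subject ID to a SHA-256 digest of its record."""
    manifest = {}
    for rec in records:
        # sort_keys gives the same byte string regardless of key order.
        canonical = json.dumps(rec, sort_keys=True).encode("utf-8")
        manifest[rec["subject_id"]] = hashlib.sha256(canonical).hexdigest()
    return manifest


def verify_transfer(sent_manifest, received_records):
    """Compare what arrived with the manifest prepared by the sender."""
    received = build_manifest(received_records)
    common = sent_manifest.keys() & received.keys()
    return {
        "lost": sorted(set(sent_manifest) - set(received)),
        "unexpected": sorted(set(received) - set(sent_manifest)),
        "altered": sorted(k for k in common if sent_manifest[k] != received[k]),
    }


# Hypothetical batch: S002 is lost and S003 is changed on the way.
sent = [
    {"subject_id": "S001", "sbp": 132},
    {"subject_id": "S002", "sbp": 118},
    {"subject_id": "S003", "sbp": 141},
]
received = [{"subject_id": "S001", "sbp": 132}, {"subject_id": "S003", "sbp": 114}]
result = verify_transfer(build_manifest(sent), received)
print(result)
```

The result, returned to the sender, also serves as the confirmation that everything was received.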
The data should be checked as soon as possible after the data collection for:
A visual check of the key items can be done at the interview or examination site even if paper forms are used, and extensive checking should take place as soon as the data have been computerized. When potential errors are detected, they should be investigated, and corrected only if they are found to be real errors. It is good practice to authorize only those who made the errors to correct them, because they are usually in the best position to say whether there really is an error, and they are usually the only ones who know the correct value. Each error and its possible correction should be documented.
The frequency of errors which could not be remedied should be documented. The same applies to the results of the quality control during the data collection, any deviations from the survey protocol, and any other information which may be relevant to the interpretation of the results. Knowledge of these issues is essential to those who analyze the data and interpret the results. Examples of routine error checking criteria and of documentation of data quality in a multinational setting can be found in the MONICA quality assessment reports, which are available through the internet at http://www.ktl.fi/publications/monica/.
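The principle of flagging first and correcting only after investigation can be sketched as follows. The plausibility limits, field names and status labels are hypothetical and serve only to illustrate the idea; the check never alters the data, it only opens a query which is resolved and documented later.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical plausibility limits; a real survey defines its own criteria.
PLAUSIBLE_RANGES = {"sbp": (60, 300), "height_cm": (100, 230)}


@dataclass
class Query:
    subject_id: str
    field: str
    recorded_value: float
    status: str = "open"  # later set to "confirmed_correct" or "corrected"
    corrected_value: Optional[float] = None


def flag_out_of_range(records):
    """Open a query for each implausible value without changing the data."""
    queries = []
    for rec in records:
        for field, (low, high) in PLAUSIBLE_RANGES.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                queries.append(Query(rec["subject_id"], field, value))
    return queries


records = [
    {"subject_id": "S001", "sbp": 132, "height_cm": 171},
    {"subject_id": "S002", "sbp": 1180, "height_cm": 165},  # likely keying slip
]
queries = flag_out_of_range(records)

# After checking the original form, the person who recorded the value resolves it.
queries[0].status = "corrected"
queries[0].corrected_value = 118
print(queries)
```

Keeping the recorded value next to the corrected one means the list of queries doubles as the documentation of errors and corrections.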
The database can be structured in different ways, depending on the available facilities and expertise, and size or complexity of the survey. Often the data management facilities of statistical computer packages (such as SAS or SPSS) are sufficient, but specific database management systems (such as ORACLE RDB) may provide more versatility. More important than the selection of the database management system is usually the fact that expertise on data management should be present from the beginning of the planning of the survey. There are too many examples of surveys where the data quality control and the data security had to be compromised because expertise in data management was acquired too late.
All data in electronic format have to be backed up routinely to guard against accidental failure of the storage devices, failures in data transfer and unintentional deletion of data files. Common situations where important data have been lost, although some back-up was in place, are:
Back-up is not only needed for data in electronic format, but also for some paper documents, such as log books for the survey examinations.
Only authorized persons should have access to the data, and all of them must understand the importance of the confidentiality of the data. After the data collection, the information from which a person can be identified, and the code which connects this information to the subject identification of the survey records, have to be stored separately from the survey data. Usually very few people need access to the person identification information, whereas the rest of the survey data will need to be accessed by all who analyze the data. Specific precautions should also be defined for the handling and storage of paper forms at the examination site and elsewhere.
Routine statistical program packages usually carry out correctly the analyses which the user instructs them to do. However, any typing errors in the user's instructions, or misunderstandings of the language of the programs, lead to incorrect results. Therefore, important analyses should be checked by another person, and the results of all analyses should be reviewed critically by the programmer and the other persons involved in reporting the findings. Where possible, the results should be compared with the results of similar analyses, and the reasons for any inconsistencies investigated. It is recommended that a data book be prepared as part of the preparation of the data, consisting of the distributions of all data items and basic descriptive statistics of the data. Such a data book will be an invaluable reference when checking the correctness of most later data analyses.
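A simple way to picture this separation is to split each collected record into two parts which are stored in different places. The field names below are assumptions chosen for the example; which items count as identifying depends on the survey and on local regulations.

```python
# Hypothetical list of items from which a person could be identified.
IDENTIFYING_FIELDS = {"name", "address", "national_id"}


def split_record(record):
    """Split one record into an identity part and a survey-data part.

    The identity part holds the identifying items plus the subject ID that
    links them to the survey records. The survey part holds the subject ID
    and the measurements only.
    """
    identity = {
        k: v for k, v in record.items()
        if k in IDENTIFYING_FIELDS or k == "subject_id"
    }
    survey = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    return identity, survey


collected = {
    "subject_id": "S001",
    "name": "Example Person",
    "address": "1 Example Street",
    "sbp": 132,
    "smoker": 0,
}
identity_part, survey_part = split_record(collected)
print(survey_part)
```

Only the survey part would be made available to the analysts; the identity part would be kept under restricted access.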
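One entry of such a data book could be produced along the following lines. This is a sketch using only the Python standard library; the function name, the layout of the entry and the missing-data code 999 are assumptions for illustration.

```python
import statistics
from collections import Counter


def data_book_entry(values, missing_code=None):
    """Summarize one data item: counts, distribution and basic statistics."""
    observed = [v for v in values if v != missing_code]
    entry = {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "distribution": dict(Counter(observed)),
    }
    # Descriptive statistics are added only for numeric items.
    if observed and all(isinstance(v, (int, float)) for v in observed):
        entry["min"] = min(observed)
        entry["max"] = max(observed)
        entry["mean"] = statistics.mean(observed)
        entry["median"] = statistics.median(observed)
    return entry


# Hypothetical systolic blood pressure values; 999 is the assumed missing code.
sbp_entry = data_book_entry([120, 130, 130, 999, 140], missing_code=999)
print(sbp_entry)
```

A later analysis which reports, for example, a different number of observations or an impossible minimum can then be checked quickly against the corresponding entry.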
The documentation of certain issues relating to all data analyses which lead to reporting or publication will be valuable. The analyses may need to be repeated because of changes made to the data, or if similar analyses will be performed on other data, or if uncertainties about the correctness of the analysis emerge. The most important issues to be documented are the:
The analysis documentation can be very simple, and therefore need not take much time. It can be kept on paper or in electronic form, but it should be readily available to all those who may need it.