Introduction

The rest of the article is laid out as follows. First, we review some mathematical methods that might be considered useful for counteracting hypothesis fishing. We then explain our method in detail and compare it to two-stage analysis of microarray data. A case study follows, in which the method is applied to an analysis of low back pain. The article ends with a discussion and conclusions.

Mathematical remedies (that fail to solve the problem)

If one considers hypothesis fishing to be a mathematical problem, it is reasonable to look for mathematical solutions. There is a large mathematical literature on model building and multiple hypotheses, and we will only try to point out the main themes.

We mention methods such as information criteria and cross validation mainly to point out that they are not very relevant for controlling hypothesis fishing. The use of information criteria or cross validation helps in trading model fit against model size, but does not produce adjustments for the total number of variables. The cross validation procedure of splitting the data set, using one part to estimate model parameters and the other for validation, is deceptively similar to the method we apply. But the purpose of validating a model’s predictions is entirely different from our purpose of conducting sound hypothesis testing.

The solution: data splitting

Models: hypothesis variables and confounders

Epidemiological hypotheses are usually formulated within the framework of a model. Assume the hypothesis is that eating mushrooms increases the risk of cancer. To test this hypothesis, one would build a model with predictors such as age, gender, and smoking status, in addition to mushroom habits, in order to control for these confounding factors.
(If old people eat more mushrooms, excluding age from the model would produce a spurious positive association between mushroom eating and cancer.) From a purely computational point of view, there is no difference between predictors associated with hypotheses and confounders, yet their semantics are very different. The confounders are included only as a means of estimating the causal link between the cause (mushrooms) and its hypothesized effect (cancer).

Size of Part 1 and Part 2

When deciding upon the relative size of Parts 1 and 2, a trade-off exists between the need to identify hypotheses by exploration in Part 1 and the need to achieve statistical significance in Part 2. An even split may be reasonable when the need for exploration is high, particularly if the data set is large, so that half of the data set is sufficient to achieve statistical significance for stronger effects. When greater domain knowledge is available from the existing literature, a smaller Part 1 is reasonable, especially when the sample size is small.

Multiple hypotheses

It is possible to investigate multiple hypotheses within our splitting regime by using Bonferroni corrections. Assume, in the mushroom-cancer example, that the analysis of Part 1 also provided strong support for the hypothesis that eating bananas protects against cancer. One might then choose to include both mushroom habits and banana habits as hypothesis variables, and consequently divide alpha by 2 (the number of hypotheses). If either mushroom or banana habits fail the significance test in Part 2, the variable will remain in the model as a confounder.

The mushroom-banana example is a clear case in which investigators should reduce the alpha level to account for multiple hypotheses. At the other extreme, if independent research groups investigate different research questions based on independent data sets, their combined effort is obviously not a case of ‘multiple hypothesis testing’.
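
The Bonferroni rule described under "Multiple hypotheses" can be illustrated with a minimal Python sketch. The function name `bonferroni_decisions`, the alpha of 0.05, and the two p-values are assumptions made up for illustration; they are not results or code from the study.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Compare each Part 2 p-value with alpha divided by the number of hypotheses."""
    if not p_values:
        raise ValueError("at least one hypothesis is required")
    # Bonferroni correction: every hypothesis is tested at alpha / k.
    threshold = alpha / len(p_values)
    decisions = {name: p < threshold for name, p in p_values.items()}
    return threshold, decisions


# Hypothetical Part 2 p-values, invented purely for illustration.
part2_p_values = {"mushrooms": 0.012, "bananas": 0.040}
threshold, decisions = bonferroni_decisions(part2_p_values)

print(f"per-hypothesis threshold: {threshold:.3f}")
for name, significant in decisions.items():
    status = "significant" if significant else "not significant, kept as a confounder"
    print(f"{name}: {status}")
```

With two hypotheses, each is tested at 0.025 rather than 0.05, so in this invented example only the mushroom hypothesis would pass, while banana habits would stay in the model as a confounder.
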
A grey area exists with partially overlapping data sets, hypotheses, and research groups, which often makes it difficult to decide whether Bonferroni corrections are called for. A pragmatic solution may be to view a published article as a unit, and to apply Bonferroni corrections within each one.

Relation to two-stage analysis in genetics

There are a few differences, however. In a microarray context the number of possible hypotheses is given by the number of genes, and the false discovery rate (FDR) method is normally used to limit the number of incorrect findings. Rather than primarily counteracting hypothesis fishing, microarray two-stage analysis is usually motivated by cost effectiveness: by first screening for promising candidates and then evaluating them, researchers can make more valuable discoveries for each monetary unit spent. In a microarray setting the procedure is also likely to be more automatic, as interesting genes are selected in two more or less mechanical steps of analysis. In our epidemiological application, on the other hand, there will be a human in the loop, as the researcher builds a model with hypothesis variables and confounders based on a combination of domain knowledge and Part 1 of the data.

Case study of low back pain in the Ullensaker study

Study sample and setting

Outcome measures

Independent variables (potential risk factors)

In 1990, the survey questionnaire contained a number of socio-demographic and health-related factors that could be included as risk factors in the present study. Socio-demographic variables were gender, age, marital status, and work status.
Health-related variables were body mass index (BMI), smoking status, number of musculoskeletal pain (MSP) sites other than the low back, duration of previous MSP, use of medication due to MSP, having been examined by a health care provider due to MSP during the last year, comorbidity, family history of musculoskeletal problems, emotional distress, leisure physical activity, participation in competitive sports, sleeping problems, and self-perceived health.

Model and hypotheses

We hypothesized that smoking would be positively associated with low back pain (LBP). Therefore, a 1-tailed hypothesis test was conducted. It was also hypothesized that individual pain sites would be positively associated with LBP. To limit the number of hypotheses, however, we hypothesized that the total number of pain sites would affect the probability of LBP, rather than running analyses for each level of the variable.

Hypothesis testing

Table 1 Parameter estimates from Part 1, controlling for age, gender, and marital status

Predictor                   OR estimate   95% CI for OR    P
Number of pain sites (a)                                   0.012
  1 or 2 pain sites         2.292         (1.248–4.208)    0.007
  3 or 4 pain sites         2.690         (1.406–5.147)    0.003
  5 or more pain sites      2.944         (1.193–7.262)    0.019
Smoking                     2.079         (1.285–3.363)    0.003

a Reference category: no pain sites

Table 2 Parameter estimates from Part 2, controlling for age, gender, and marital status

Predictor                   OR estimate   95% CI for OR    P
Number of pain sites (a)                                   0.015
  1 or 2 pain sites         1.328         (0.793–2.224)    0.281
  3 or 4 pain sites         1.598         (0.857–2.979)    0.141
  5 or more pain sites      3.941         (1.700–9.136)    0.001
Smoking                     0.993         (0.627–1.571)    0.487*

a Reference category: no pain sites

Parameter estimates

Table 3 Number of pain sites

Predictor                   OR estimate   95% CI for OR    P
Number of pain sites (a)                                   0.000
  1 or 2 pain sites         1.637         (1.116–2.400)    0.012
  3 or 4 pain sites         1.983         (1.285–3.061)    0.002
  5 or more pain sites      3.346         (1.846–6.067)    0.000

a Reference category: no pain sites

Discussion

In this study, the investigators had free access to the entire data set prior to data splitting and during model development.
However, any temptation to “peek” at the material was successfully avoided, as indicated by the discrepant results for smoking status. This suggests that the data splitting procedure can function properly in the absence of strict external control of the data. Nevertheless, we recommend that the data set be handled by an independent party, so that researchers can document claims that only Part 1 of the data set was used for model and hypothesis development. Ideally, an independent international body would be established to manage the splitting of survey data. A fixed date for releasing Part 2 of the data would be agreed upon, so that only those hypotheses specified prior to the release date would undergo a true significance test. Although several challenges and practical issues would inevitably need resolution (e.g., data collection, confidentiality, release of data), such an organization should be feasible and acceptable to the scientific community.

Conclusions

Our results show that the number of musculoskeletal pain sites significantly predicts low back pain at a 14-year follow-up, when controlling for age, gender, marital status, and smoking. The application of the data splitting method in our study indicates its potential as an effective and useful method to counteract hypothesis fishing in population surveys. In our opinion, systematic data splitting administered by an independent party would accomplish for statistical surveys what pre-registration has already done for clinical trials.
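
As a supplementary illustration, the splitting step that the method relies on can be sketched in a few lines of Python. The text does not say how the split was drawn, so random assignment with a fixed seed is an assumption here, and the names `split_dataset`, `part1_fraction`, and `seed` are hypothetical.

```python
import random


def split_dataset(record_ids, part1_fraction=0.5, seed=2024):
    """Randomly assign record IDs to Part 1 (exploration) and Part 2 (testing)."""
    if not 0.0 < part1_fraction < 1.0:
        raise ValueError("part1_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the split reproducible and auditable
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * part1_fraction)
    # Part 1 is released for model and hypothesis development;
    # Part 2 is withheld until the hypotheses have been fixed.
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


# Ten hypothetical respondent IDs with a smaller Part 1, as might be chosen
# when domain knowledge is strong and the sample is small.
part1, part2 = split_dataset(range(1, 11), part1_fraction=0.3)
print("Part 1 size:", len(part1))
print("Part 2 size:", len(part2))
```

An independent party could run such a routine, keep the seed and Part 2 under its own control, and release Part 2 only after the hypotheses have been registered.
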