Research Special Paper

Exploratory analyses in aetiologic research and considerations for assessment of credibility: mini-review of literature

BMJ 2022; 377 doi: https://doi.org/10.1136/bmj-2021-070113 (Published 03 May 2022) Cite this as: BMJ 2022;377:e070113
Kim Luijken, doctoral student1; Olaf M Dekkers, professor1; Frits R Rosendaal, professor1; Rolf H H Groenwold, professor1 2
1 Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, Netherlands
2 Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
Correspondence to: K Luijken k.luijken{at}umcutrecht.nl
Accepted 22 March 2022

Abstract

Objective To provide considerations for reporting and interpretation that can improve assessment of the credibility of exploratory analyses in aetiologic research.

Design Mini-review of the literature and account of exploratory research principles.

Setting This study focuses on a particular type of causal research, namely aetiologic studies, which investigate the causal effect of one or multiple risk factors on a particular health outcome or disease. The mini-review included aetiologic research articles published in four epidemiology journals in the first issue of 2021: American Journal of Epidemiology, Epidemiology, European Journal of Epidemiology, and International Journal of Epidemiology, specifically focusing on observational studies of causal risk factors of diseases.

Main outcome measures Number of exposure-outcome associations reported, grouped by type of analysis (main, sensitivity, and additional).

Results The journal articles reported many exposure-outcome associations: a mean number of 33 (range 1-120) exposure-outcome associations for the primary analysis, 30 (0-336) for sensitivity analyses, and 163 (0-1467) for additional analyses. Six considerations were discussed that are important in assessing the credibility of exploratory analyses: research problem, protocol, statistical criteria, interpretation of findings, completeness of reporting, and effect of exploratory findings on future causal research.

Conclusions Based on this mini-review, exploratory analyses in aetiologic research were not always reported properly. Six considerations for reporting of exploratory analyses in aetiologic research were provided to stimulate a discussion about their preferred handling and reporting. Researchers should take responsibility for the results of exploratory analyses by clearly reporting their exploratory nature and specifying which findings should be investigated in future research and how.

Introduction

Reports of aetiologic studies often have results of multiple exploratory analyses, with the aim of identifying topics for future research. Although this form of reporting might seem reasonable, it is not without risk, because compared with the results of a confirmatory study, assessing the credibility of exploratory findings is generally more complicated.

The origin of exploratory data analysis can be traced back at least to Tukey in the 1960s and 1970s,1 2 who encouraged statisticians to develop visualisation techniques for representing and capturing structures in datasets to establish new research questions. These new research questions should subsequently be answered with independent datasets (often termed confirmatory analysis). For example, when a new biomarker is thought to be part of a known causal pathway, performing a small preparatory exploratory study before conducting a full blown large cohort study seems worthwhile, because the cohort study is financially expensive and requires large investments of resources. Similarly, if a known exposure-outcome effect is thought to vary across subgroups of the population, exploring this idea first before embarking on confirmatory analyses of effect heterogeneity seems appropriate.

Even when researchers consider an analysis to be exploratory, a hypothesis is easily promoted to a fact. For example, findings in journal articles can be exaggerated to more certain statements in press releases and news articles.3 In medical science in particular, where results are sometimes quickly implemented in clinical practice, researchers should take responsibility for the results they report. The Hippocratic oath (“First, do no harm”) applies as well to medical research as it does to clinical practice.

In this paper, we discuss issues that complicate the interpretation of exploratory analyses in causal studies. Causal research can refer to different types of research, such as randomised studies or intervention studies. We do not address these studies in our manuscript; we focus on aetiologic research, in which causes of disease are investigated. Specifically, the causal effect of risk factors on a health outcome or disease is studied, typically in an observational setting. We provide practical pointers for researchers on how to report exploratory analyses in aetiologic research and how to clarify what the exploratory results imply for future research and implementation in practice. We hope to encourage a discussion about the preferred handling and reporting of these analyses.

Methods

Exploratory analyses in aetiologic research

The term exploratory analysis typically refers to analyses for which the hypothesis was not specified before the data analysis.4 Considering exploratory analyses in a broader sense, however, is probably more relevant in aetiologic research, because of the observational data and clustering of analyses within cohorts. We use the term exploratory analyses here to indicate analyses that are initial and preliminary steps towards solving a research problem. Exploratory analyses are often conducted in addition to planned primary analyses of a study. We do not consider sensitivity analyses, where the main hypothesis is evaluated under different assumptions, to be exploratory in this paper. We also do not consider outcomes that are evaluated as a secondary objective but are correlated with the primary outcome to be exploratory, because these analyses contribute to the investigation of the primary research question. Genome-wide association studies, where the exploratory nature of analyses is commonly accounted for by looking at multiple testing,5 are beyond the scope of this paper.

Mini-review and overview of existing reporting guidance

Before we discuss considerations about the reporting of exploratory aetiologic studies, we wanted to illustrate some of the aspects of exploratory studies that need explicit reporting. Hence we performed a small review of published aetiologic studies. We identified all articles on original research in four journals in their first issue of 2021: American Journal of Epidemiology, Epidemiology, European Journal of Epidemiology, and International Journal of Epidemiology. We excluded studies that did not look at an aetiologic research question, such as prediction studies, studies on therapeutic interventions, and randomised trials. For each article, we counted the number of primary analyses, sensitivity analyses, and additional analyses that were performed. The unit of counting was the association estimator, where we counted only one association if the association was reported on different scales (eg, absolute and relative scales for binary endpoints).
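
The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the stated unit of counting, not the procedure the authors used: the record layout, the field names (including `analysis_id`), and the example entries are all hypothetical. The sketch treats estimates that differ only in scale as one association and tallies the result per type of analysis.

```python
from collections import Counter
from typing import NamedTuple


class Estimate(NamedTuple):
    # All fields are hypothetical; they are not taken from the review itself.
    analysis_type: str  # "primary", "sensitivity", or "additional"
    analysis_id: str    # label for one specific analysis within an article
    exposure: str
    outcome: str
    scale: str          # for example "relative" or "absolute"


def count_associations(estimates):
    """Count unique associations per analysis type, ignoring the reporting scale."""
    unique = {
        (e.analysis_type, e.analysis_id, e.exposure, e.outcome)
        for e in estimates
    }
    return Counter(key[0] for key in unique)


# Invented example: one primary association reported on two scales,
# one sensitivity analysis, and two subgroup analyses.
example = [
    Estimate("primary", "main", "smoking", "stroke", "relative"),
    Estimate("primary", "main", "smoking", "stroke", "absolute"),
    Estimate("sensitivity", "complete-case", "smoking", "stroke", "relative"),
    Estimate("additional", "subgroup-women", "smoking", "stroke", "relative"),
    Estimate("additional", "subgroup-men", "smoking", "stroke", "relative"),
]

counts = count_associations(example)
print(dict(counts))
```

In this invented example the two primary estimates collapse into a single association because they differ only in scale, whereas the two subgroup analyses are counted separately.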

Also, we reviewed existing reporting guidance documents on aspects relevant to exploratory analyses, specifically the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) statement,6 RECORD (REporting of studies Conducted using Observational Routinely collected health Data) statement,7 STROBE-MR (Strengthening the Reporting of Observational Studies in Epidemiology Using Mendelian Randomisation) for mendelian randomisation studies,8 STREGA (Strengthening the Reporting of Genetic Association Studies) for genome association studies,9 and the CONSORT (Consolidated Standards of Reporting Trials) extension to randomised pilot and feasibility trials.10

Patient and public involvement

Involving patients or the public in the design, conduct, reporting, or dissemination plans of our research was not appropriate or possible.

Results

Mini-review

The mini-review included 25 original aetiologic articles. These articles reported a mean number of 33 (range 1-120) exposure-outcome associations for the primary analysis, 30 (0-336) for sensitivity analyses, and 163 (0-1467) for additional analyses, mainly concerning subgroup or interaction analyses (supplementary file). Most articles did not explicitly report which analyses were prespecified, and only one study referred to a publicly available protocol.11 The methodological scrutiny of the subgroup analyses varied from thoughtful evaluations of exposure effect heterogeneity in well established subgroups to evaluations of exposure effects across subgroups that seemed to have been formed exhaustively across many potential risk factors. Although our review included only a small sample of studies, the picture that emerges is that many results were presented but insufficient information was reported to fully judge the validity and merits of the results.

Existing reporting guidance

The STROBE6 and RECORD7 statements provide checklists of items to report in observational studies that are relevant to exploratory analyses (table 1). Extensions of STROBE, such as STROBE-MR8 and STREGA,9 provide additional guidance for reporting of studies where many analyses are performed. Guidance for reporting randomised trials also provides helpful information for reporting exploratory analyses in aetiologic research, in particular the CONSORT extension to randomised pilot and feasibility trials.10 Not all of these recommendations can be directly applied to observational aetiologic studies, however, because the procedures for generating and testing of hypotheses are more established in randomised studies than in observational settings.

Table 1

Considerations for reporting of exploratory aetiologic research

Exploratory research principles

Inspired by the existing recommendations for reporting, we list six considerations for reporting and interpretation that can improve the assessment of the credibility of exploratory analyses in aetiologic research (table 1). The list is not exhaustive but we hope it will encourage further discussion on the reporting of exploratory research.

Consideration 1: explicitly state the objective of all analyses, including exploratory analyses

Stating the objective of an aetiologic study clarifies how to interpret the results. The objectives of confirmatory aetiologic research ideally contain a well defined targeted effect of a specific aetiologic factor on a specific outcome in a specific population.13 14 In early discovery research, objectives are not always rigorously defined but could be specified more generally (eg, understanding the origin of a particular outcome). An implication of stating the objective in general terms, however, is that the methodological handling of the analysis becomes less clear and the number of researchers’ degrees of freedom becomes large.15 Consequently, interpreting results without deriving spurious (causal) conclusions requires thought and effort because the analysis does not necessarily provide information towards a causal effect (see consideration 4).16 17 18 The more generally an objective is stated, the more provisional the analysis becomes. This caveat includes machine learning approaches where no explicit causal modelling assumptions are made.

Because exploratory analyses in aetiologic research often aim to inform a future in-depth causal analysis, reporting both the objective of the provisional exploratory analysis and the (future) confirmatory analysis is important. This reporting is in line with the CONSORT reporting checklist for pilot randomised controlled trials which recommends that researchers state the objective of the eventual trial in the manuscript of a pilot study.10 The rationale and need for the exploratory analysis in aetiologic research should be outlined together with uncertainties that need to be dealt with before performing an independent confirmative analysis of the causal mechanism. Reporting the position of provisional analyses relative to future research clarifies the level of credibility of the findings from exploratory analyses.

Consideration 2: establish a study protocol before data analysis and make the protocol available to readers

Preregistered protocols help distinguish which analyses were planned before observing the data and which analyses were performed post hoc, thereby avoiding hypothesising after the results are known. For randomised trials, preregistration of the study protocol is considered the norm.19 Preregistration does not seem as widespread in observational aetiologic research, but is increasingly encouraged,20 21 and explicitly recommended in the RECORD reporting checklist.7 Because aetiologic research often uses existing cohort data that have been analysed for related research questions, preregistration of aetiologic studies does not ensure the same level of credibility of statistical evidence as preregistration before collecting the data.

Nosek and colleagues22 have provided preliminary guidance on preregistration of analyses conducted with existing data. These authors suggest that what was known in advance about the dataset should be transparently reported so that the credibility of statistical findings can be assessed, taking into account analyses that have been performed previously. Implementing this advice is probably challenging in large epidemiological cohort studies because of the many analyses that might have been performed. But trying to clarify why and how an analysis is conducted before observing the data is a laudable practice that can be implemented directly in aetiologic studies. This practice is ideally accompanied by work on developing guidance for preregistration of aetiologic studies that use existing data.

Preregistration of analyses that are exploratory in nature is even less common, possibly contradicting the definition of exploration. We consider exploratory analysis, however, as discovery work that serves to motivate funding for larger studies that are, for example, better able to control confounding or to collect data rigorously. Given this important probing role, simply stating in a research protocol that certain relations will be explored is not enough; time and effort must be invested in designing the analysis appropriately. Not every detail can be specified in advance, but interpretation of the results provided by data can be challenging and unintentionally overconfident when no question was clearly articulated before seeing the answer.

Consideration 3: do not base judgments on significance values only

Only reporting the results of analyses that provided a P value below the prespecified α level (eg, 0.05) is discouraged throughout all scientific disciplines (for example, as discussed in a 2019 supplementary issue of The American Statistician).23 Avoiding selective reporting based on significance values is particularly relevant to exploratory findings because the statistical properties of exploratory tests are less well known than those of confirmatory tests.24 For example, the probability of a false positive finding (that is, the type I error rate) is probably increased when the choice of a statistical test was based on patterns in the observed data. Although procedures have been developed for correction of multiple testing in confirmatory settings, consensus on how to prevent false positive findings in exploratory settings has not yet been established.24 25 26
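
As a rough illustration of why the number of tests matters, the following Python sketch computes the probability of at least one false positive and the expected number of false positives when several tests are run without correction. It rests on simplifying assumptions that are not taken from the paper: all null hypotheses are true and the tests are independent. The function names are illustrative.

```python
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive among m independent tests
    when every null hypothesis is true and each test uses level alpha."""
    return 1 - (1 - alpha) ** m


def expected_false_positives(m, alpha=0.05):
    """Expected number of false positives among m tests of true nulls."""
    return m * alpha


for m in (1, 10, 100):
    print(m, round(familywise_error_rate(m), 3), expected_false_positives(m))
```

Under these assumptions, 10 uncorrected tests at α=0.05 already give a probability of roughly 0.40 of at least one false positive. Tests chosen after inspecting the data do not satisfy the independence assumption, so the calculation is only indicative of the direction of the problem.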

Increasing the number of exploratory analyses, without correction for multiple testing, raises the risk of deriving false positive conclusions, but overly strict correction for multiple testing increases the probability of false negative findings (that is, the type II error rate).27 A raised type II error rate could occur, for example, when an analysis of various positively correlated hypotheses is corrected for multiple testing as if all of the hypotheses were independent (eg, by applying a Bonferroni correction). The decision to statistically correct for multiple testing depends, among other issues, on the total number of tests performed in the same dataset, correlation between the hypotheses being tested, and sample size. Reporting each of these considerations clarifies the analytical context of findings and helps to assess the credibility of the results. This form of reporting is in line with the STROBE-MR8 and STREGA9 checklists which recommend stating how multiple comparisons were managed, although recommendations for the handling of multiple testing seem more established in genome-wide association studies than in clinical aetiologic cohort studies.5
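
The point about positively correlated hypotheses can be illustrated with a small simulation. This is a sketch under assumptions that are not from the paper: test statistics are standard normal under the null, every pair of statistics shares the same correlation `rho`, and tests are two sided. It estimates the familywise error rate that a Bonferroni correction actually achieves. The function name, the number of tests, and the correlation values are hypothetical choices.

```python
import random
from statistics import NormalDist


def simulated_fwer(m, rho, alpha=0.05, n_sim=20000, seed=1):
    """Estimate the familywise error rate of a Bonferroni correction when
    m null test statistics are equicorrelated with correlation rho."""
    rng = random.Random(seed)
    # Two-sided critical value for a per-test level of alpha / m.
    crit = NormalDist().inv_cdf(1 - alpha / (2 * m))
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    hits = 0
    for _ in range(n_sim):
        shared = rng.gauss(0, 1)  # component common to all m statistics
        # Each statistic has unit variance and pairwise correlation rho.
        if any(abs(a * shared + b * rng.gauss(0, 1)) > crit for _ in range(m)):
            hits += 1
    return hits / n_sim


independent = simulated_fwer(m=20, rho=0.0)
correlated = simulated_fwer(m=20, rho=0.9)
print("independent tests:", independent)
print("correlated tests: ", correlated)
```

When the statistics are strongly positively correlated, the estimated familywise error rate falls well below the nominal α. The correction is then stricter than needed, which is one way to see why it costs power and raises the type II error rate.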

Consideration 4: interpret findings in line with the nature of the analysis

Interpreting and communicating results in line with the exploratory nature of an analysis is challenging because an accurate representation of the degree of tentativeness of the results is required. Assessing this degree of tentativeness based on only the results of an analysis (that is, based on the numerical estimates) is complicated because seemingly convincing results can be misleading and a clinical explanation can be found that does not follow from the statistical evidence.28 29 Cognitive biases, such as hindsight bias, can distort the interpretation of findings.

Reporting of findings from exploratory analyses starts with indicating whether the analysis was planned before or after observing the data, which is recommended in the CONSORT extension to randomised pilot and feasibility trials.10 Results of exploratory analyses can be interpreted by focusing on what is reported about the objectives and applied methodology rather than overstepping the findings. The specificity with which findings are interpreted should match the generality with which the objective is stated (see consideration 1).16 17 18 For example, when various subgroup analyses are performed with the general aim of identifying possible subgroups from the available data where an exposure effect was different, researchers should report that many subgroups were explored, including characterisation of the subgroups and description of the presence or absence of effect heterogeneity, rather than discussing only one or two specific subgroups where the effect size was extreme. Furthermore, exploratory analyses often fail to support strong conclusions. Recommendations for clinical practice or generalisations based on exploratory analyses should generally be avoided.

Consideration 5: report (summarised) results of all exploratory analyses that were performed

When findings are selectively reported, especially when reporting is guided by significant findings (see consideration 3), the credibility of reported findings is probably overstated.30 Reporting the results of all of the exploratory analyses that were conducted (possibly in a supplementary file) provides a transparent and honest report of the analysis and facilitates better interpretation of the findings. This approach is in line with the STROBE extension in STREGA, which recommends that all results of analyses should be presented, even if numerous analyses were undertaken.9

Reporting all analyses that have been conducted seems simple, but can be challenging in practice, mainly because the process of performing a study is typically iterative. A framework for initial data analysis by Huebner and colleagues could help keep track of all subanalyses that are conducted as part of a main analysis.31 This framework distinguishes exploratory analyses that are part of a primary analysis from additional exploratory analyses that require separate reporting. Another helpful practice could be to have a reflection period after performing analyses to establish whether the analyses look at (slightly) different research questions and to report separate analyses for each research question.

Consideration 6: accompany exploratory analyses by a proposed research agenda

The credibility of exploratory findings can be communicated through a research agenda that prioritises future research and describes how this research should be set up. Reporting a research agenda is similar to the recommendation in the CONSORT extension to randomised pilot and feasibility trials to report which future confirmatory trials can be informed by the pilot study, and how.10

Formulating a research agenda allows researchers to take responsibility for the exploratory findings presented and future research that should be performed, avoiding the empty statement that “more research is needed”. In medical science in particular, where study results are sometimes quickly implemented into clinical practice, researchers are encouraged to take responsibility for the results they report by clearly explaining which exploratory findings should be investigated in future research and how.

Discussion

Our mini-review showed that exploratory analyses in aetiologic research were not always reported optimally. The credibility of exploratory results is affected by a combination of the theoretical rationale for the analysis, clarity of the defined research problem, applied methodology, and degree to which analytical decisions are driven by the data. Choosing a particular analysis based on observed patterns in the data complicates statistical inferences. Moreover, the design and methods applied in an exploratory analysis might be less optimal than the primary analysis of the study, which further complicates interpretation of exploratory analyses. Therefore, information on these aspects should be clearly reported.

Exploration is essential to the progress of science. Strict confirmatory studies are a powerful mechanism for final evaluations before implementation in clinical practice, but will probably not stimulate new ideas.32 33 Open minded exploratory analyses can lead to unexpected discoveries and resourceful innovations of epidemiological science, but effort is required to accurately interpret the results. Because exploratory analyses are usually done to generate new research questions, quickly performing a statistical test (or multiple tests) to get the first answer to the problem is tempting. When quick test results are presented in a research article, however, their interpretation might be ad hoc and unintentionally overconfident.

To show their full value, exploratory analyses of aetiologic research need to be conducted and interpreted correctly. We have provided six considerations for reporting of exploratory analyses to encourage a discussion on exploratory analyses and how the credibility of these analyses is ideally assessed in aetiologic research. Continuation of this discussion will contribute to the understanding of inferences that can be made from exploratory analyses in aetiologic research and will help strike a balance between their opportunities and risks.

What is already known on this topic

  • Exploratory analyses in aetiologic research are initial steps towards solving a research problem and are often conducted in addition to planned primary analyses of a study

  • Exploratory analyses might lead to new discoveries in aetiologic research, but effort is needed to accurately interpret the results because these analyses are often conducted with few data resources and insufficient adjusting for confounding

  • Statistical properties of exploratory tests are less well known than those of confirmatory tests

What this study adds

  • This study focuses on a particular type of causal research, namely aetiologic studies, which investigate the causal effect of one or multiple risk factors on a particular health outcome or disease

  • Six considerations for reporting of exploratory analyses in aetiologic research were provided to stimulate a discussion about their preferred handling and reporting

  • Researchers should take responsibility for results of exploratory analyses by clearly reporting their exploratory nature and specifying which findings should be investigated in future research and how

Ethics statements

Ethical approval

Not required.

Data availability statement

No additional data available.

Footnotes

  • Contributors: KL was involved in the conceptualisation, investigation, methodology, visualisation, and writing (original draft) of the article. OMD was involved in the conceptualisation, investigation, methodology, and writing (review and editing) of the article. FRR was involved in the conceptualisation, investigation, methodology, and writing (review and editing) of the article. RHHG was involved in the conceptualisation, investigation, methodology, supervision, and writing (review and editing) of the article. KL, OMD, FRR, RHHG gave final approval of the version to be published and are accountable for all aspects of the work. KL is the main guarantor of this study. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: RHHG was supported by grants from the Netherlands Organisation for Scientific Research (ZonMW, project 917.16.430) and from Leiden University Medical Centre. The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: support from the Netherlands Organisation for Scientific Research and Leiden University Medical Centre for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

  • The lead author (the manuscript’s guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

  • Dissemination to participants and related patient and public communities: An abstract was submitted to the annual Dutch epidemiology conference (www.weon.nl). The authors aim to share their work with stakeholders at this conference and at institutional meetings, and will post a link to a plain language summary on their personal websites (www.rolfgroenwold.nl).

  • Provenance and peer review: not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
