The amassing of enormous data sets in genomics, proteomics and imaging has led a number of scientists to envision a future in which automated data-mining techniques, or ‘data-driven discovery’, will eventually rival the traditional hypothesis-driven research that has dominated biomedical science for at least the past century. It is no surprise that prominent scientists have expressed their scepticism, to say the least, about this point of view (Allen, 2001). However, I believe that framing the debate in terms of hypotheses versus informatics, with the subtext of man versus machines, misses an important point: currently available informatics techniques can greatly assist traditional hypothesis-driven research, but only if investigators slightly alter their practice to take advantage of this opportunity.

For example, informatics tools exist that can assist investigators in formulating, assessing and prioritising their hypotheses. Many hypotheses are, in fact, straightforward extrapolations from current findings: for example, knowing that apolipoprotein E4 is a risk factor for Alzheimer’s disease, it is almost an automatic process to ask whether E4 may also be a risk factor for other neurological diseases or whether it interacts with other known risk factors; if one knows that RNA interference occurs in plants and lower organisms, it is logical to wonder whether it may occur in mammals as well. Publicly available tools, such as Arrowsmith (http://arrowsmith.psych.uic.edu), do not attempt to bypass scientists, but rather help them to integrate knowledge that is retrievable from the scientific literature in order to formulate hypotheses quickly, systematically and comprehensively (Swanson and Smalheiser, 1997; Smalheiser and Swanson, 1998). These tools can be thought of as analogous to word processors: they do not write manuscripts, and they do not do anything that people cannot do by themselves, but they do promise a new standard of efficiency and productivity.

Likewise, data mining of research databases need not be thought of as bypassing the traditional hypothesis-driven analysis of data, but rather as providing significant ‘added value’. Consider a commercial database consisting of credit-card transactions: its purpose is to keep track of individual accounts, and most of the queries to the database are specific, focused and initiated individually. In contrast, automated data-mining techniques permit the same database to be characterised in terms of significant large-scale correlations that provide a rich array of market research data. More importantly, one can search on an ongoing basis for anomalous patterns of activity that raise the possibility of fraud; in fact, a commercial database that does not carry out such automated ‘data-driven discovery’ might even be considered negligent. I suggest that research databases that are populated and analysed according to specific hypotheses (Valencia, 2002) should also benefit from being monitored by computer programs that search for unanticipated correlations and anomalous patterns.

One of the basic concepts of informatics is the ‘future value of primary data’. It is envisioned that the primary data and, if possible, the actual samples collected by one investigator will be archived and made available to other investigators, who may re-analyse the data from a different point of view, employ part of the data set not relevant to the first investigator, pool data with other studies or conduct new measurements on the original samples (Koslow, 2000). This is entirely compatible with hypothesis-driven research. Indeed, a good hypothesis is not one that is likely to be correct, but one that opens up a new arena of investigation. Since this arena cannot be fully perceived in advance, one must be prepared to carry out new analyses not included in the original hypothesis. Yet, most current experimental design simply ignores this fact: the investigator collects only those data that are deemed relevant to the original hypothesis, and when new information causes the original hypothesis to change, the investigator must plan a new experiment from scratch.

Ultimately, informatics should be viewed neither as a bag of tools and programmes nor as inextricably linked to the idea of artificial intelligence, but rather as pointing to a new approach to experimental design that takes into account the future use of primary data. If investigators and funding agencies simply included archiving of samples and data into research projects together with the metadata needed to understand how the data were collected, the increased efficiency and productivity that would accrue via data recycling should allow them to recoup their investments many-fold. Admittedly, most fields within biomedical science still lack an effective infrastructure for data archiving, sharing and collaboration. But this only means that investigators need to become actively involved to make this a reality and not retreat in the belief that informatics represents a threat to hypothesis-driven research.

References

Allen, J.F. (2001) In silico veritas. Data-mining and automated discovery: the truth is in there. EMBO rep., 2, 542–544.

Koslow, S.H. (2000) Should the neuroscience community make a paradigm shift to sharing primary data? Nat. Neurosci., 3, 863–865.

Smalheiser, N.R. and Swanson, D.R. (1998) Using Arrowsmith: a computer-assisted approach to formulating and assessing scientific hypotheses. Comput. Methods Programs Biomed., 57, 149–153.

Swanson, D.R. and Smalheiser, N.R. (1997) An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artif. Intell., 91, 183–203.

Valencia, A. (2002) Search and retrieve. Large-scale data generation is becoming increasingly important in biological research. But how good are the tools to make sense of the data? EMBO rep., 3, 396–400.

Neil R. Smalheiser

Neil R. Smalheiser is at the UIC Psychiatric Institute in Chicago, IL.
E-mail: smalheiser@psych.uic.edu
DOI: 10.1093/embo-reports/kvf164