Bioinformatics in The Pharmaceutical Industry

NICHOLAS J. COLE
(nickc@searcher.demon.co.uk) Celltech Therapeutics Ltd, 216 Bath Road, Slough, Berkshire SL1 4EN
and
DAVID BAWDEN
(d.bawden@is.city.ac.uk) Department of Information Science, City University, Northampton Square, London EC1V 0HV

A review was carried out of the 'information landscape' within the pharmaceuticals-based molecular biology community, which examined the research problems requiring biological-sequence data, important sources of information, methods of access, information-seeking behaviour of end users and the role of libraries and information centres. This work concentrated on the practical aspects of how biological sequence information is managed and used in a research setting and was carried out as part of the MSc in Information Science at the City University. Fifteen questionnaires were sent to information scientists in the UK pharmaceutical industry and a user study was carried out amongst scientists at Celltech. Most of the important primary data are available freely or cheaply via the Internet and molecular biologists were found to be self-reliant in their use of these resources. Currency of information was found to be very important in the research process and the issue of Internet security was taken very seriously. Most questionnaire respondents saw a productive role in the future for information workers in the field of molecular biology, citing end-user training and data integration as possible roles, although the degree of involvement will depend on the particular mix of skills and experience that exist within an information department.
INTRODUCTION
Journal of Documentation, vol. 52, no. 1, March 1996, pp. 51-68

Background and aims
Pharmaceutical research is highly information-intensive, and information professionals have a long tradition of helping the R&D effort within drug companies, where they have used their skills in information management, database design, online searching etc. to good effect. A pharmaceutical research information department would typically employ a number of information scientists who are subject specialists, to search biological, biomedical and chemical online databases on behalf of end users. These databases would normally be classified as bibliographic, full-text, numeric, directory, chemical structure or chemical reaction-type, although, until recently, information workers would not have found it necessary to carry out, for example, a homology search in a biological sequence database. This situation is now beginning to change due to the increasing role of biotechnology in drug research.

The aim of this research project was to map out the 'information landscape' within the pharmaceuticals-based molecular biology community, using the activities at Celltech as an example and by interviewing relevant information workers from the UK pharmaceutical industry. Much has been written surveying the different databanks in molecular biology, but the practical aspects of managing and using this information in an organisational setting are rarely discussed. The findings should be useful for anyone involved in setting up molecular biology information systems, in framing information management policies and in developing strategies for meeting the needs of scientists.

What is bioinformatics?
The origin of the word 'bioinformatics' is hard to trace, although it is normally used to encompass the generation, handling, storage and retrieval of biological sequence data, i.e. the sequences of nucleotides that make up the DNA of genes and the sequences of amino acids that form the primary structure of the proteins for which the genes code. According to Boguski [1], information science and technology (informatics) became a serious issue for biologists in the mid 1970s following the development of rapid DNA sequencing techniques. Since then, the amount of sequence data (and also gene mapping and protein crystal structure data) has grown exponentially and is now also being fuelled by data emerging from the world-wide Human Genome Mapping Project.
The growth rate of the GenBank and EMBL (European Molecular Biology Laboratory) databases has been exponential for the last five years; the latest release of GenBank (release 80.0) contains 164 megabases of sequence and the size is currently doubling every twenty-one months [2]. It is expected that, over the next decade, biomolecular databanks will grow between seven and sixty-fold [3].

Boguski considers the term bioinformatics to be wide in scope, involving computational analysis, databases and 'everything from laboratory automation and data acquisition to electronic publishing'. Andrew Lyall of Glaxo (personal communication) says the Glaxo interpretation of the word is made up of three elements as follows:
1. Computational genetics - encompassing activities such as the Human Genome Mapping Project and covering physical and logical genetic mapping.
2. Computation relating to molecular genetics, including sequence determination. This would also cover the automatic reading of data from automatic gene-sequencing machines.
3. Computation relating to three-dimensional protein structure determination.
Both of these definitions illustrate that the word bioinformatics can be interpreted to describe virtually any information-related activity applied to the sciences of genetics or molecular biology. Whilst both are equally valid, for the purpose of the work described in this article we shall confine our definition to the storage, retrieval and analysis of nucleic acid and amino acid sequence data, and will exclude three-dimensional structure computation, automated laboratory procedures and anything relating to the Human Genome Project.

Molecular biology resources
Databases
It would be inappropriate to attempt to give a comprehensive review of the publicly available molecular sequence and structure databanks - some very good articles have been written on the subject [4-6]. However, some of the key databanks should be mentioned and others will be specifically referred to in the later sections.

GenBank [7] was established in 1982 at the Los Alamos National Laboratory and contains nucleic acid sequences derived from the published literature. The database also contains bibliographic data, CAS (Chemical Abstracts Service) Registry numbers and other data such as the sequence length and source organism. Whereas GenBank has a US bias, the EMBL Nucleotide Sequence Databank [8] provides a similar service to Europe and the structure of the database is very similar. In fact, GenBank and EMBL share all of their data with each other and with the DNA Databank of Japan (DDBJ) - so in effect these three databases are one and the same (albeit with some time differences with respect to updates) and comprise the most comprehensive collection of nucleotide sequences.

For amino acid sequences, SwissProt [9] (established in 1986) is a key resource. This is also a collaboration, between the Department of Medical Biochemistry at the University of Geneva and EMBL.
SwissProt data come from the Protein Information Resource (PIR) database [10], from the translation (via the genetic code, from nucleic acid sequence to protein sequence) of entries from the EMBL database and directly from the literature. Entries consist of the 'core' (primary) sequence, literature citations, taxonomic data and annotation data (protein function, secondary structure information, diseases associated with the protein, etc.).

The Protein Data Bank (PDB), maintained by Brookhaven National Laboratory (Long Island, New York, USA), contains all publicly available solved 3-D protein structures. Data include atomic co-ordinates and other data relating to how the structure was elucidated (e.g. crystallographic and NMR data).

CAS has registered bio-sequences from the journal literature since 1957 although, until 1990, they were stored in electronic 'connection tables' which define a molecule in terms of the connectivity between individual atoms and therefore could only be searched by chemical sub-structure. The protein sequence data were enhanced in 1990 [11] with the computer generation of amino acid sequences for all of its (approximately 150,000) protein structures; thus proteins and peptides were additionally searchable using the common shorthand amino acid abbreviations. An 'exact' protein/peptide search retrieves only exact matches to the sequence query, whereas a 'sub-sequence'
search looks for a string of amino acids anywhere in a chain (analogous to substructure chemical searching). Sub-sequence or exact 'family' searching is also possible, whereby each amino acid residue in a query is matched to any of its functional family members in the file structure, i.e. those that have similarities with respect to their acidity, hydrophobicity or aromaticity. The system also caters for uncommon amino acids and multi-chain systems. Since 1992, the CAS Registry file has also included nucleic acids from GenBank, as well as making the entire GenBank file available on the STN host.

Searching databases for sequence similarity ('homology')
The commonest questions about a given sequence that require recourse to molecular sequence databases are 'has this sequence been described in the literature before?' and 'are there any other known sequences that are similar to my own sequence (and how similar are they?)'. Homology searches involve the use of computer programs which use algorithms to calculate a similarity score between two different stretches of DNA, which is at least partly based on the summation of the number of matching nucleotide pairs within a defined local region of the complete sequence. 'Hit' sequences can therefore be ranked, exact matches having a maximum score. Many algorithms have been designed (e.g. BLAST - Basic Local Alignment Search Tool) and they all differ with respect to their computational speed and their sensitivity.

There has been a growing need in recent years for an integrated approach towards gaining access to actual sequence data via cross-references in the literature. Entrez, developed at the NCBI, provides this capability and contains sequence records from a variety of database sources, including GenBank, EMBL, DDBJ, PIR, SwissProt, and the PDB. The sequence records are linked to the relevant literature citations from the sequence-associated subset of Medline.
The retrieval software and databases are distributed on CD-ROM or as a free Internet service (Network Entrez). In addition to the 'core' databases mentioned above, there are a great number of specialised databases, such as those dealing with a particular chromosome in the human genome, types of cell receptor or vectors.
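
The scoring idea described above for homology searching (counting matching nucleotide pairs within a defined local region, then ranking the hits) can be illustrated with a short Python sketch. It is a deliberately naive, ungapped comparison and is not the BLAST algorithm or any other published method; the function names, the window length and the toy sequences are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


def best_local_score(query: str, subject: str, window: int = 8) -> int:
    """Return the highest match count found in any pair of ungapped windows.

    Every window of `window` bases in the query is compared with every
    window of the same length in the subject; the score of a pair of
    windows is the number of positions at which the bases agree.
    """
    query, subject = query.upper(), subject.upper()
    # Shrink the window if either sequence is shorter than requested.
    window = min(window, len(query), len(subject))
    if window == 0:
        return 0
    best = 0
    for i in range(len(query) - window + 1):
        for j in range(len(subject) - window + 1):
            matches = sum(
                1
                for a, b in zip(query[i:i + window], subject[j:j + window])
                if a == b
            )
            best = max(best, matches)
    return best


def rank_hits(query: str, database: Dict[str, str],
              window: int = 8) -> List[Tuple[str, int]]:
    """Score every database entry against the query, best score first."""
    scored = [(name, best_local_score(query, seq, window))
              for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database: an exact hit, a one-mismatch hit, an unrelated one.
toy_database = {
    "exact": "TTACGTACGTTT",
    "near": "GGACGTTCGTGG",
    "unrelated": "GGGGGGGGGGGG",
}
print(rank_hits("ACGTACGT", toy_database))
# [('exact', 8), ('near', 7), ('unrelated', 2)]
```

An exact match reaches the maximum score (here the window length), as the text notes. Real search programs differ mainly in how they avoid the exhaustive window-by-window comparison used here, which is where the trade-off between computational speed and sensitivity arises.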
COMMUNICATIONS WITH EXTERNAL INFORMATION PROFESSIONALS
Methodology
The aims of the external questionnaires were as follows: to ascertain the types of workers (by job title) who were involved with bioinformatics, the degree to which they were involved and the most important information sources; methods of current awareness used; the types of problems that require the use of sequence databases and how they impact upon the pharmaceutical research process; the requirement for specialist knowledge and the role of information scientists.
Fifteen questionnaires were sent to information workers in the pharmaceutical field and one in-depth interview was carried out. The selection of appropriate candidates was partly through recommendation and partly by scanning the TFPL directory Who's who in the UK information world 1994 [12]. One specific question was posted to an Internet Usenet news group.

There were eleven replies to the questionnaire. All respondents were providers of research information within UK pharmaceutical companies - ten operated from within a library/information department and one was a bioinformatics consultant in an IT department. All except one (information officer/assistant) had at least a first degree in chemistry, biochemistry or pharmacology; five (two bioinformatics specialists, one department head, the IT consultant and one biomedical information scientist) had a life science related PhD; three (all biomedical information scientists) had an information related MSc. The breakdown of job titles was as follows:

Job title                                     Frequency
Information scientist (biomedical)            4
Bioinformatics analyst                        2
Head of department (scientific information)   2
Information officer/assistant                 2
IT consultant (bioinformatics)                1
All of the organisations were fully integrated pharmaceutical companies except one which was a specialised bio-pharmaceuticals (biotechnology) company. On average, the information departments made up 2% of the headcount, in a range from 1% to 5%. The size distribution of UK operations by number of employees was as follows:

Employees      Frequency
>3,000         3
1,000-3,000    3
200-1,000      5
Level of involvement with bioinformatics
Six respondents were actively involved with searching molecular biology sequence and/or structure information. Of these, three were occasional users with a chemistry or biochemistry background and three were bioinformatics specialists. Of the five who were not directly involved with molecular biology information, one stated that such work was carried out by a group within another department (and it was they who completed the rest of the questionnaire), another stated that there was a need but that this was a new field for the company (a molecular biology department was formed last year) and work was not done due to a lack of adequate knowledge of the information sources. The remaining three respondents cited lack of demand as their reason for not finding this kind of information relevant.
Knowledge of sources
The bioinformatics specialists were, not surprisingly, satisfied with their level of knowledge of the relevant information sources, although one mentioned that specialist bioinformatics training would be used to help others in the organisation if it were available. Of the occasional users, one was satisfied with his knowledge but the other was not, saying that he or she would like a better knowledge of sources and the content of databases.

Access to information
On the question of access to information, six out of the eight information intermediaries were satisfied with their degree of access to the relevant information. One stated that he was not satisfied due to very restricted access to the Internet at their organisation and the other because molecular biology was a new field and so the necessary knowledge was not yet available.

Databases used
Twenty-five databases/search programs were mentioned in all and Figure 1 shows the citation frequencies given by the questionnaire respondents, together with the data formats used, i.e. whether online, CD-ROM, hard copy, etc. The chart clearly shows the central importance of certain databases to molecular biology, especially Entrez, EMBL/GenBank and PIR/SwissProt - these resources were used by most of the respondents. Databases which are more specialist in nature, such as Rebase and TFD, occur lower down the order and were only used by the bioinformatics specialists. Other points to notice are the predominance of access via the Internet, the low usage of hard copy data, and the fact that many databases were bought and maintained locally, with updates (presumably) coming in on magnetic tape.

Frequency of molecular biology database use was one to five times a week for six out of the eight molecular biology information users. One user (a 'bioinformatician') used them more than five times a day and one (occasional user) only one to five times a month.
Use of CAS databases
Only one reply stated that CAS was a useful source of biosequence information, mentioning that it was 'a good starting point'. To put this reply into perspective, the respondent was an information scientist with a chemistry background who had a minor involvement with sequence searching but who stated that 'work of this nature is mainly carried out by end users who are subject specialists'. All of the bioinformatics-oriented users expressed negative opinions such as: 'too expensive'; 'I am not aware that there is any added value for primary data analysis'; 'I am unaware of the searching capabilities'.

A Usenet news posting was made to try to obtain further comments and opinions regarding the CAS Registry file. In addition, the question of CAS under-use
was put to members of the STN International Help desk. To summarise, the following reasons were given for the under-use of CAS:
1. expense;
2. lack of knowledge of its availability;
3. the same facilities are available elsewhere;
4. the types of searches available do not meet the typical needs of a researcher;
5. the database is available directly from NCBI;
6. NCBI has designed convenient interfaces for sequence retrieval (via Entrez and WWW) and has many tools for sequence searching (BLAST via email, WWW or network client);
7. it is not compatible with large-scale use, i.e. searching hundreds of sequences automatically;
8. limited annotations compared with GenBank entries;
9. the advent of Entrez, with its facility for linking gene sequence, protein sequence and Medline references.
The following positive reasons were given for searching CAS:
1. the ability to cross over (to other files) and get references;
2. searching patent literature.

Research problems requiring the use of databases
Typical replies to this section included: 'searching for new sequences that are not in a local database'; 'what is this (my own) sequence? Is it known in the literature? What is it similar to?'; 'what sequences have been identified for organism X?'; 'is this sequence similar or related to a human sequence?'; 'what genes are associated with this disease?'
Current awareness
Respondents who were information specialists but who never or infrequently searched molecular biology information cited traditional current awareness methods, e.g. customised 'SDIs' set up on conventional hosts (databases included Medline, Biosis and Derwent Biotechnology Abstracts), CCOD and journal scanning. All three bioinformatics specialists cited journals and Usenet newsgroups, whereas two cited scientific conferences and only one mentioned the traditional methods such as CCOD.

All respondents said that currency of genetic data was important or essential, especially for primary DNA analysis. One occasional user used it to justify full access to the Internet and another said that their company recently discovered a sequence on the Internet that was critical to their current work but which would have taken another two months to reach their internal database if they had had to wait for the update via CD-ROM. One of the bioinformatics specialists stressed
that whereas currency was definitely important, it was the successful integration of genetic data with other types of data (e.g. protein structure, pharmacology and toxicology data) which was crucial to the success of a research project.

Sequence analysis
Tasks given as requiring calculation included alignment, pattern searching and protein homology modelling, i.e. very similar to the responses gained from the internal interviews. Six respondents said that their organisation used automatic sequencing machines. One predicted that data storage requirements would increase exponentially until the year 2000 and another said that the increase would be 'dramatic'. Five respondents used in-house databases to store sequences of interest.

The Internet
Figure 2 summarises the amount of Internet connectivity enjoyed by the questionnaire respondents, the tools used to gain access to the information (WWW and/or Gopher) and the principal activities conducted on the Net (email, database searching etc.). This shows that permanent access is almost entirely correlated with those companies that were active in the field of molecular biology. Of the four respondents without Internet access within their organisation, three comprised the group with very little demand for molecular biology information (one of these said that access was planned in the future). The remaining one said 'security issues are to be resolved before we have full access in the UK to the Internet'. All respondents from organisations with an interest in molecular biology as part of their drug discovery programmes had full access to the Internet; the vast majority (seven out of eight) having a permanent dedicated link with appropriate 'firewall' security features.

The issue of security was taken very seriously by all respondents who were Internet users, although one did not comment. Another user who had personal concerns about network security implied that their organisation was less worried than it should have been.
Searching remote databases necessarily involves the loss of control over some data when a query is uploaded and this problem is magnified due to the dispersed and uncontrolled nature of the Internet. Two respondents said that only internal databases were searched for highly sensitive sequence information. The following security risks were identified with the use of the Internet: the ease with which computer viruses can be distributed via networks; the risk of unauthorised access to one's own machines; the possibility of obtaining deliberately inaccurate or misleading information.
The impact of molecular biology on pharmaceutical research
One respondent simply said 'every area' of pharmaceutical research is affected; however, specific examples included:
assay development; gene therapy; a basis for understanding disease mechanisms; high throughput receptor screening; anti-sense oligonucleotide approaches; rational drug design.
Requirement for specialist knowledge
Only three respondents (all 'occasional' searchers who were subject generalists but not practitioners) thought specialist subject knowledge was not essential for sequence searches, although one said that it helps to have a knowledge of naming conventions and the basic relationships between nucleic acids and proteins. One who did believe in the necessity of specialist knowledge thought that the ideal situation would be the existence of information-knowledgeable practitioners of molecular biology. However, it was conceded that such people are exceptional; realistically, therefore, the problem is best solved by good communication between science practitioners and subject-knowledgeable information specialists. The balance of searching responsibility between these two groups would depend on the particular skills mix in an organisation, good communication being the key to success (such a balance might also prevent over-reliance on one or two 'information gurus' within the scientific departments who might leave at any time). It was considered very important that the information department should keep its knowledge of sources up to date (communication with scientists would also be very useful here).

The role for information scientists
Four out of the six respondents involved with biosequence searching were information specialists working from within a library or information department, whereas the remaining two were 'information gatekeepers', i.e. hybrid scientist/information specialists working from within the laboratory (questionnaires were passed on by the initial library-based contact). One molecular biologist said that with the emergence of user-friendly Internet access tools, molecular biologists who have informatics interests or skills can cover most routine needs. An information scientist (who was not involved with sequence information) hoped that there could be a role but was not convinced.
The bioinformatics IT consultant was not sure and said that 'they [information scientists] have been slow to respond to the changing demand'. Another bioinformatics specialist provided molecular biology information services to all users on behalf of the Research Information Department. Major responsibilities included database searching and the training of end users to carry out their own searches using in-house or external systems. The remaining six respondents who answered the question were positive about the role that information scientists could play and gave the following examples: not all molecular biologists are information or IT aware - possible training role;
integration/interpretation of genetic data; general application of skills in database searching etc.; provision of a service when molecular biology information is required outside of the research area, e.g. clinical and patents.
Patents - searching sequence databases for novelty or infringement
The seven respondents who did sequence data patent searches mostly cited the well known sequence databases such as GenBank, EMBL etc. (for novelty) and Derwent's World Patents Index (WPI), CAS and Geneseq (for infringement and novelty searching).
CELLTECH USER STUDY
Celltech - a brief profile
Celltech, founded in 1980, is one of the largest specialised biotechnology companies in Europe and is dedicated to finding novel therapeutics for cancer and immune disorders using its expertise in molecular biology, protein engineering and medicinal chemistry. Celltech Group Plc consists of two independent companies. Pharmaceutical research is carried out by Celltech Therapeutics Limited. Celltech Biologics Plc produces biopharmaceutical development products and specialises in antibody engineering, mammalian cell line development, manufacturing process development and industrial scale manufacture of such products for third parties, including Celltech Therapeutics Limited. Celltech floated on the stock exchange in December 1993 and currently has promising compounds in development for the treatment of cancer, rheumatoid arthritis, asthma and septic shock. The Information and Library Service at Celltech currently subscribes to 160 journal titles, has a collection of approximately 8,000 books and has a staff of three.

Whilst the external study produced valuable information and demonstrated the range of approaches towards biological sequence data handling taken by pharmaceutical companies throughout the industry, it was of necessity carried out at 'arm's length' via questionnaires and interviews with only one company representative. The wish to gain a detailed insight into how such information was obtained and used within a single site prompted the Celltech user study. It was for illustrative purposes and was not intended to be representative of the rest of the industry, even though some of the findings were consistent with the other companies examined.
The aims of the user study were: to assess the level of user knowledge of the sources of publicly available data; to examine the computing infrastructure in use by scientists; to examine the information requirements in relation to specific areas/activities of work; to see how scientists at Celltech keep up to date with regard to sequence information.
Interviews were carried out with seven key Celltech scientists. All of the interviewees were molecular biologists except one, who was a chemist involved with molecular modelling. Typical research functions of the biologists included the cloning, sequencing and expression (in mammalian systems) of genes of therapeutic interest and subsequent analysis, e.g. of the proteins translated from those genes. One biologist carried out protein engineering and the computer modelling of macromolecules. The chemist was involved with all aspects of molecular modelling, from small to large (bio-) molecules, but also internal consultancy (general scientific computing) and systems administration for a Silicon Graphics (SG) computer.

At the time of writing, there is no company (dedicated) Internet connection, although two users (both of whom were interviewed) had dial-up modem access. These users acted as 'gatekeepers' of the information held on the Net and performed searching and other activities on behalf of the others.

Knowledge of sources
Five out of the seven users considered that they had a good and adequate knowledge of molecular biology information sources. One had 'moderate' knowledge and another wanted to know more about different sequence analysis software packages. No users had any knowledge of the bio-sequence searching capabilities offered by CAS-Online and thus had never used the library's existing sequence-searching facilities (which consisted entirely of access to non-Internet 'conventional' commercial databases like CAS), although one thought it potentially useful and requested further information.

Use of computers
The two scientists with dial-up Internet access had their own desktop machines; the others used a shared machine. All of the machines were Macintoshes, linked via the local area network. Apart from general purpose applications such as word processing and graph drawing, the most important software was MacVector and the Entrez CD-ROM.
MacVector was used to carry out alignment, hydrophobicity, accessibility (how accessible a particular protein structural feature is to ligand binding), prediction of secondary protein structure and other calculations, and also to store sequences of interest. All of these derived data could be used for comparison against published material in sequence databases and would also be helpful in defining the various parameters when uploading data to remote servers for processing, e.g. for BLAST calculations. One user intended to set up a database of unknown sequences discovered in-house, but nothing had been implemented yet. The chemist's machine also acted as a terminal to the SG computer for molecular modelling, energy minimisation and molecular dynamics calculations. All interviewees made use of Entrez on CD-ROM, although Network Entrez was sometimes used for the most up to date information.
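
To give a flavour of the kind of hydrophobicity calculation mentioned above, the following Python sketch computes a sliding-window hydropathy profile for a protein sequence. It uses the widely published Kyte-Doolittle hydropathy scale; that choice of scale, the window length and the function name are assumptions made for illustration, and nothing here is taken from MacVector or claims to reproduce its method.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the twenty standard amino acids
# (positive = hydrophobic, negative = hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 7) -> List[float]:
    """Return the mean hydropathy of each window along the sequence.

    The profile has len(protein) - window + 1 values, one per window
    position. A residue outside the twenty standard one-letter codes
    raises KeyError rather than being silently skipped.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDEILVVLAIGDEKK", window=5)
peak = max(range(len(profile)), key=lambda i: profile[i])
print(f"most hydrophobic window starts at residue {peak + 1}")
```

Peaks in such a profile are conventionally read as candidate buried or membrane-spanning regions, which is one way derived data of this kind can guide later comparison against published sequences.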
Information requirements in relation to specific areas/activities of work
One molecular biology technique that is beginning to make an impact in drug discovery is 'differential display' (or 'novel gene' cloning), where previously unknown genetic sequences are produced by stressing cells, for example with toxins. Genes expressed that are novel (i.e. appear only in the stressed or altered cell) are potential targets for therapy. It is necessary to check these novel sequences against the most up to date collection of known sequences to ensure that the considerable amount of time and resources involved in assessing therapeutic potential is not wasted by duplicating earlier work. Novel gene cloning was considered by three interviewees to be a good example of work for which it would be essential to have access to the most up to date genetic data.

To answer the question 'is this a known gene?', BLAST similarity searches were carried out via email (with the parameters set fairly tightly to retrieve exact or near exact matches) in GenBank using the NCBI server. For the sake of currency, it was always considered preferable to use the 'parent' Internet servers to access databases, even though most of the databases can be found at a variety of other locations on the Internet, because of the time it takes for updates to filter through to the other locations. If the gene was known, the GenBank accession number was used to extract the relevant bibliographic information, e.g. from Entrez. If the gene was not known, then it was necessary to search protein sequence databases for the derived amino acid sequence.

For macromolecular modelling, the most important resource was the PDB. Magnetic tapes (quarterly updates) were purchased and loaded into Insight 2 software on a Silicon Graphics computer for general use, although very recent PDB information was accessed via the Internet.
Internet access was thought preferable as there are between one and two new protein structures added per day; also, errors detected in older structures are continually corrected, and the corrections show up only later in the magnetic tape version.
For the protein engineering of antibodies, sequence databases are essential. The initial selection of an antibody molecule to be worked on would normally be carried out in specialised databases such as KABAT (which contains only sequences of immunological interest) rather than in the larger comprehensive databases such as GenBank. Molecular modelling was then carried out on the chosen candidate. A typical scenario would be the selection of a human antibody which is most similar in sequence to a cloned murine antibody with a particular specificity. The human antibody would then be used as a framework for CDR-grafting.
Current awareness
The methods used to keep up-to-date can be summarised as follows:

Information source                          Number of users
Browsing of journals                        7
Medline                                     3
CCOD                                        2
Bionet newsgroups                           2
Online 'SDIs' organised by the library      1

One scientist stated that pharmaceutical companies have become more reticent in recent years about publishing sequences in the journal literature until patents have been filed and subsequently published. This is because a nucleic acid sequence itself is 'enabling', i.e. it can easily be synthesised, cloned and expressed. This runs contrary to the prevailing culture in the molecular biology community: molecular biologists have traditionally been unique amongst scientific workers in the degree to which they share information on a 'goodwill' basis with workers from other institutions. Such give and take is considered essential if the information on which everyone thrives is to be maintained as a meaningful resource. Recently, the National Institutes of Health (NIH) in the US and the Medical Research Council (MRC) in the UK have reached an agreement that they will not automatically file patent applications for every new DNA sequence that is discovered as a result of the human genome project. It is hoped that companies will follow this lead in order to free up the flow of information to repositories like GenBank and EMBL. In the meantime, most pharmaceutical companies will continue to rely on secrecy and will tend not to release sequence data into the public domain until they can be sure that it will not compromise their intellectual property.
DISCUSSION
Although the sample size was quite small, response to the questionnaire was very encouraging and a lot of useful data were obtained. The mix of job titles also enabled comparisons to be made between full-time bioinformatics specialists and the more generalist information scientists working in the field. All respondents (internal or external) said that the currency of genetic data was important or essential, and several current awareness methods were identified. According to Boguski [1] 'the only way to keep aware of important new developments is to master some of the instruments ... and to use them regularly'. User-friendly Internet access tools have been designed in recent years which have made searching the Net easier; however Boguski also foresees that 'intelligent agents' (software robots) will be programmed with individual interests and will 'continually scan the information space, automatically notifying us when any relevant data or observations become available'. With molecular biology, it is not just the information per se that needs to be followed but the entire information landscape.
Much interest was focused on the place of CAS in molecular biology research, as it was known from past experience at Celltech that this very large repository of protein and nucleic acid sequences was hardly ever used. Questions put to external information professionals, to CAS itself (or STN, their UK representatives) and broadcast on the Internet confirmed that it was indeed ignored by most of the molecular biology community, mainly for reasons of cost, although there was also a large amount of ignorance about the searching facilities amongst practising molecular biologists. It is possible that one of the main reasons for this is that the Chemical Abstracts database held on STN and other hosts has traditionally been searched by information professionals acting as intermediaries and that knowledge of biosequence database enhancements did not filter through to the scientists themselves. By contrast, one database which has become essential for all users is Entrez, which became available only two years ago; it integrates the sequence information from many databases and enables cross-referencing to the relevant biomedical references from Medline. Its user-friendly interface, powerful information retrieval features and low price have put information into the hands of end users that only a few years ago would have required the running of complex algorithms on a remote supercomputer.
The Celltech user study showed in general terms how scientists are using biosequence databases and computation in pharmaceutical research, for example to recognise similarities between a totally new sequence and sequences with known properties and function, in order to gain a 'handle' on the underlying disease process. The majority of users were satisfied with their knowledge of the necessary information sources, although there was somewhat less satisfaction with access to them, mostly due to the lack of a dedicated Internet link. This situation led to the existence of two information 'gatekeepers' who provided a service to the rest. The information itself consisted primarily of sequence data (both nucleic acid and peptide/protein) and protein structure information. The former can be either 'primary' information (the sequence itself, for example as submitted to one of the large public databanks prior to publication) or 'secondary', i.e. evaluated information which is found in journal articles or specialised databases, for example the Kabat database of proteins of immunological interest.
Nearly all of the important bioinformatics resources are available freely or cheaply via the Internet and the scientific community have been users of this medium for communication and other uses for many years.
Thus, it is natural that molecular biologists in universities and industry have exploited (and contributed to) those resources as they have become available. This is especially true as the information which makes up the resources can be both the virtual raw material and the end product of further experiments. These circumstances have led to a self-reliance for information among molecular biologists, although considerable work is involved in the retrieval and analysis of these data. Both the internal and the external surveys show that this has often led to the existence of a small number of information gatekeepers within the laboratory or research centre. These gatekeepers are true scientist/information scientist hybrids - proof of a converging role of information provider and end user; however when viewed from the perspective of the whole organisation they appear somewhat as an 'island', isolated from other functions and departments. It is my conclusion that more generalist information specialists, e.g. those working in information centres or libraries, can be increasingly helpful in integrating these data with other information from various sources and bridging that gap.
The fact that four out of the six respondents (i.e. not the two gatekeepers) who were active in bioinformatics operated from within information/library departments showed that such departments were already playing an important role. Indeed, most questionnaire respondents saw a productive role in the future for information workers in the field of molecular biology, citing end-user training and data integration as areas where they might be involved. On the basis of this research, it can be concluded that to a certain extent 'horses for courses' applies to the appropriate level of IS involvement at a particular site, i.e. it will depend on the particular mix of skills and experience that exists within an information department. Any biomedical information department would be well advised to become knowledgeable about the main sources of molecular sequence data even if current research projects do not have much of a molecular biology component, as there is a clear trend towards this type of research in the pharmaceutical industry. If research information professionals are to continue providing a complete service to workers in the pharmaceutical industry then they will need to gain a practical knowledge of bioinformatics.
Major factors that have to be considered in developing an information policy will include available infrastructure (e.g. hardware, software, networking issues and data security), and the skills available in-house. For the training of end users in the knowledge of sources, use of the Internet and resources such as Entrez, most life-science degrees should provide enough subject knowledge. More detailed analyses such as alignments or homology searches will require at least such a first degree with a high molecular biology content but would probably best be carried out by the practitioners themselves, or by hybrid scientist/information workers, acting as 'local experts' and providing a service to other research scientists in the laboratory.
ACKNOWLEDGEMENTS
Many thanks to Mark Boguski for help with defining bioinformatics and for sending the offprint. Thanks also to Tina Jones for help with the graphics and to all those who gave time to be interviewed, replied to the questionnaires and answered the Usenet postings.
REFERENCES
1. BOGUSKI, M.S. Bioinformatics. Current Opinion in Genetics and Development, 4, 1994, 383-388.
2. ALTSCHUL, S.F., BOGUSKI, W. and WOOTON, J.C. Issues in searching molecular sequence databases. Nature Genetics, 6, 1994, 119-129.
3. SILLINCE, M. and SILLINCE, J.A.A. Sequence and structure databanks in molecular biology: the reasons for integration. Journal of Documentation, 49(1), 1993, 1-28.
4. KEHOE, K. Specialised databases in molecular biology and genetics: the nucleic acid and protein sequence databases. Science and Technology Libraries, 11(1), 1990, 99-105.
5. FUCHS, R., RICE, P. and CAMERON, G.N. Molecular biological databases present and future. Trends in Biotechnology, 10, 1992, 61-66.
6. BUNTROCK, R.E. Sequence databases: what's in it for me? Database, June 1991, 107-109.
7. BENSON, D., LIPMAN, D.J. and OSTELL, J. GenBank. Nucleic Acids Research, 21(13), 1993, 2963-2965.
8. STOEHR, P. and CAMERON, G.N. The EMBL data library. Nucleic Acids Research, 19 (supplement), 1991, 2227-2230.
9. BAIROCH, A. and BOECKMANN, B. The SwissProt protein sequence data bank. Nucleic Acids Research, 20 (supplement), 1992, 2019-2022.
10. BARKER, W.C., GEORGE, D.G., MEWES, H. and TSUGITA, A. The PIR-International protein sequence database. Nucleic Acids Research, 20 (supplement), 1992, 2023-2026.
11. LIU-JOHNSON, H.N., HAINES, R. and HACKETT, W. Searching for protein sequences in CAS Online. Biotech Forum Europe, 5(4), 1991, 204-209.
12. Who's Who in the UK Information World 1994, 4th edition. TFPL Publishing, 1994.
(Revised version received 24 October 1995)