
BIOINFORMATICS

Introduction

1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu


Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
What is Bioinformatics?

• (Molecular) Bio - informatics

• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical chemistry) and
then applying “informatics” techniques (derived
from disciplines such as applied math, CS, and
statistics) to understand and organize the
information associated with these molecules on a
large scale.
• Bioinformatics is “MIS” for Molecular Biology
Information. It is a practical discipline with many
applications.
Organizing Molecular Biology Information: Redundancy and Multiplicity
• Different Sequences Have the Same Structure
• Organism has many similar genes
• Single Gene May Have Multiple Functions
• Genes are grouped into Pathways
• Genomic Sequence Redundancy due to the Genetic Code
• How do we find the similarities?
Integrative Genomics - genes ↔ structures ↔ functions ↔ pathways ↔ expression levels ↔ regulatory systems ↔ …
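The sequence redundancy that comes from the genetic code can be made concrete with a small sketch. It builds the standard codon table (NCBI translation table 1) and shows two different DNA sequences that encode the same peptide. The helper name `translate` and the two example sequences are illustrative assumptions, not part of the original slides.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order (standard genetic code, '*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon (hypothetical helper).

    Trailing bases that do not fill a whole codon are ignored.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Number of synonymous codons per amino acid (the degeneracy of the code).
DEGENERACY = Counter(CODON_TABLE.values())

# Two made-up DNA sequences that differ at the third codon positions.
seq_a = "ATGGCTTTA"
seq_b = "ATGGCATTG"
print(translate(seq_a), translate(seq_b))
print("codons for Leu, Met, Trp:", DEGENERACY["L"], DEGENERACY["M"], DEGENERACY["W"])
```

Because several codons map to one amino acid, comparing at the protein level can reveal similarities that a base-by-base DNA comparison would understate.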
A Parts List Approach to Bike Maintenance
• How many roles can these play?
• How flexible and adaptable are they mechanically?
• What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
• What are the common parts - types of parts (nuts & washers)?
• Where are the parts located?
General Types of “Informatics” techniques in Bioinformatics
• Databases
◊ Building, Querying
◊ Object DB
• Text String Comparison
◊ Text Search
◊ 1D Alignment
◊ Significance Statistics
◊ Alta Vista, grep
• Finding Patterns
◊ AI / Machine Learning
◊ Clustering
◊ Datamining
• Geometry
◊ Robotics
◊ Graphics (Surfaces, Volumes)
◊ Comparison and 3D Matching (Vision, recognition)
• Physical Simulation
◊ Newtonian Mechanics
◊ Electrostatics
◊ Numerical Algorithms
◊ Simulation
New Paradigm for Scientific Computing
• Because of increase in data and improvement in computers, new calculations become possible
• But Bioinformatics has a new style of calculation...
◊ Two Paradigms
• Physics
◊ Prediction based on physical principles
◊ Exact Determination of Rocket Trajectory
◊ Supercomputer, CPU
• Biology
◊ Classifying information and discovering unexpected relationships
◊ globin ~ colicin ~ plastocyanin ~ repressor
◊ networks, “federated” database
Bioinformatics Topics -- Genome Sequence
• Finding Genes in Genomic DNA
◊ introns
◊ exons
◊ promoters
• Characterizing Repeats in Genomic DNA
◊ Statistics
◊ Patterns
• Duplications in the Genome
Bioinformatics Topics -- Protein Sequence
• Sequence Alignment
◊ non-exact string matching, gaps
◊ How to align two strings optimally via Dynamic Programming
◊ Local vs Global Alignment
◊ Suboptimal Alignment
◊ Hashing to increase speed (BLAST, FASTA)
◊ Amino acid substitution scoring matrices
• Multiple Alignment and Consensus Patterns
◊ How to align more than one sequence and then fuse the result in a consensus representation
◊ Transitive Comparisons
◊ HMMs, Profiles
◊ Motifs
• Scoring schemes and Matching statistics
◊ How to tell if a given alignment or match is statistically significant
◊ A P-value (or an e-value)?
◊ Score Distributions (extreme val. dist.)
◊ Low Complexity Sequences
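As an illustration of aligning two strings optimally via dynamic programming, here is a minimal sketch of global alignment in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `global_align` are assumptions chosen for the example; real protein alignments would use a substitution matrix and usually affine gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = global_align("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the sequence lengths; this cost is what motivates the hashing heuristics listed alongside BLAST and FASTA.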
Bioinformatics Topics -- Sequence / Structure
• Secondary Structure “Prediction”
◊ via Propensities
◊ Neural Networks, Genetic Alg.
◊ Simple Statistics
◊ TM-helix finding
◊ Assessing Secondary Structure Prediction
• Tertiary Structure Prediction
◊ Fold Recognition
◊ Threading
◊ Ab initio
• Function Prediction
◊ Active site identification
• Relation of Sequence Similarity to Structural Similarity
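One way to read "via Propensities" is as a simple frequency statistic: how much more often a residue type occurs in a given secondary-structure state than expected from that state's overall frequency. The sketch below computes such propensities from labeled data. The function name `state_propensities` and the toy sequence and labels are made up for illustration; they are not real measurements or published propensity values.

```python
from collections import Counter

def state_propensities(sequence, states, state="H"):
    """Propensity of each residue type for `state`.

    propensity = (fraction of that residue type found in `state`)
                 / (fraction of all residues found in `state`)
    Values above 1 mean the residue is over-represented in the state.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = Counter(sequence)
    in_state = Counter(aa for aa, s in zip(sequence, states) if s == state)
    overall = states.count(state) / len(states)
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}

# Toy, made-up training data: H = helix, C = coil.
toy_sequence = "AAEELLKKGGPPGG"
toy_states = "HHHHHHHHCCCCCC"
propensity = state_propensities(toy_sequence, toy_states)
for aa in sorted(propensity):
    print(aa, round(propensity[aa], 2))
```

A predictor built on these numbers would average the propensities over a sliding window and call a helix where the average exceeds a threshold; that step is omitted here.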
Topics -- Structures
• Basic Protein Geometry and Least-Squares Fitting
◊ Distances, Angles, Axes, Rotations
  • Calculating a helix axis in 3D via fitting a line
◊ LSQ fit of 2 structures
◊ Molecular Graphics
• Calculation of Volume and Surface
◊ How to represent a plane
◊ How to represent a solid
◊ How to calculate an area
◊ Docking and Drug Design as Surface Matching
◊ Packing Measurement
• Structural Alignment
◊ Aligning sequences on the basis of 3D structure.
◊ DP does not converge, unlike sequences, what to do?
◊ Other Approaches: Distance Matrices, Hashing
◊ Fold Library
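The most basic geometric quantities in this list, distances and angles between atoms, need only vector arithmetic. The following is a small sketch using plain tuples for 3D coordinates; the function names and the example coordinates are illustrative assumptions.

```python
import math

def norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))

def distance(p, q):
    """Distance between two points."""
    return norm([a - b for a, b in zip(p, q)])

def angle_deg(a, b, c):
    """Angle at vertex b formed by the points a-b-c, in degrees."""
    u = [x - y for x, y in zip(a, b)]
    v = [x - y for x, y in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

# Made-up coordinates for three points.
p1, p2, p3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
print(angle_deg(p1, p2, p3))
```

Least-squares superposition of two structures and fitting a helix axis build on the same primitives but additionally need a matrix decomposition, which is outside this sketch.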
Topics -- Databases
• Relational Database Concepts
◊ Keys, Foreign Keys
◊ SQL, OODBMS, views, forms, transactions, reports, indexes
◊ Joining Tables, Normalization
  • Natural Join as "where" selection on cross product
  • Array Referencing (perl/dbm)
◊ Forms and Reports
◊ Cross-tabulation
• Protein Units?
◊ What are the units of biological information?
  • sequence, structure
  • motifs, modules, domains
◊ How classified: folds, motions, pathways, functions?
• Clustering and Trees
◊ Basic clustering
  • UPGMA
  • single-linkage
  • multiple linkage
◊ Other Methods
  • Parsimony, Maximum likelihood
◊ Evolutionary implications
• The Bias Problem
◊ sequence weighting
◊ sampling
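The idea of a natural join as a "where" selection on a cross product can be shown directly: form every pairing of rows from two tables, then keep only the pairs that agree on their shared columns. This is a minimal sketch with tables as lists of dictionaries; the table contents, column names, and the function name `natural_join` are illustrative assumptions, and a real database engine would not materialize the full cross product.

```python
from itertools import product

def natural_join(left, right):
    """Natural join of two tables given as lists of dicts."""
    joined = []
    for l_row, r_row in product(left, right):          # cross product
        shared = l_row.keys() & r_row.keys()           # columns with the same name
        if all(l_row[k] == r_row[k] for k in shared):  # the "where" selection
            joined.append({**l_row, **r_row})
    return joined

# Hypothetical example tables keyed by protein_id.
proteins = [
    {"protein_id": 1, "name": "globin"},
    {"protein_id": 2, "name": "plastocyanin"},
]
folds = [
    {"protein_id": 1, "fold": "all-alpha"},
    {"protein_id": 2, "fold": "all-beta"},
]
rows = natural_join(proteins, folds)
for row in rows:
    print(row)
```

Here `protein_id` plays the role of a key in `proteins` and a foreign key in `folds`; of the four pairings in the cross product, only the two with matching ids survive the selection.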
Topics -- Genomics
• Expression Analysis
◊ Time Courses clustering
◊ Measuring differences
◊ Identifying Regulatory Regions
• Large scale cross referencing of information
• Function Classification and Orthologs
• The Genomic vs. Single-molecule Perspective
• Genome Comparisons
◊ Ortholog Families, pathways
◊ Large-scale censuses
◊ Frequent Words Analysis
◊ Genome Annotation
◊ Trees from Genomes
◊ Identification of interacting proteins
• Structural Genomics
◊ Folds in Genomes, shared & common folds
◊ Bulk Structure Prediction
• Genome Trees
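Frequent words analysis can be sketched as counting every overlapping substring of length k (a "word" or k-mer) in a sequence and ranking the counts. The function name `count_words` and the short example sequence are assumptions for illustration; a genome-scale analysis would also compare the counts against an expected background.

```python
from collections import Counter

def count_words(sequence, k):
    """Count all overlapping words of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up example sequence.
counts = count_words("ATATAT", 2)
for word, n in counts.most_common():
    print(word, n)
```

The same counting step, applied with a hash table keyed on words, underlies the "hashing to increase speed" idea used by fast database search programs.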

Topics -- Simulation
• Molecular Simulation
◊ Geometry -> Energy -> Forces
◊ Basic interactions, potential energy functions
◊ Electrostatics
◊ VDW Forces
◊ Bonds as Springs
◊ How structure changes over time?
  • How to measure the change in a vector (gradient)
◊ Molecular Dynamics & MC
◊ Energy Minimization
• Parameter Sets
• Number Density
• Poisson-Boltzmann Equation
• Lattice Models and Simplification
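The chain Geometry -> Energy -> Forces, with bonds treated as springs, can be sketched for a single two-atom bond. The energy is taken as the harmonic form E = 0.5 * k * (r - r0)^2, the forces are the negative gradient of that energy, and minimization moves the atoms a small step along the forces. The force constant, rest length, step size, and function names are arbitrary illustrative assumptions, not values from any real parameter set.

```python
import math

def bond_energy_and_forces(p1, p2, k=1.0, r0=1.0):
    """Harmonic bond between two atoms: returns (energy, force_on_1, force_on_2)."""
    d = [b - a for a, b in zip(p1, p2)]        # vector from atom 1 to atom 2
    r = math.sqrt(sum(x * x for x in d))       # geometry: current bond length
    energy = 0.5 * k * (r - r0) ** 2           # energy: spring potential
    # Forces: negative gradient of the energy; pulls atoms together when r > r0.
    f2 = [-k * (r - r0) * x / r for x in d]
    f1 = [-x for x in f2]
    return energy, f1, f2

def minimize(p1, p2, step=0.1, n_steps=100, k=1.0, r0=1.0):
    """Steepest-descent energy minimization: move each atom along its force."""
    p1, p2 = list(p1), list(p2)
    for _ in range(n_steps):
        _, f1, f2 = bond_energy_and_forces(p1, p2, k, r0)
        p1 = [x + step * f for x, f in zip(p1, f1)]
        p2 = [x + step * f for x, f in zip(p2, f2)]
    return p1, p2

start_1, start_2 = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)   # stretched bond, r = 3
e0, _, force_2 = bond_energy_and_forces(start_1, start_2)
end_1, end_2 = minimize(start_1, start_2)
e1, _, _ = bond_energy_and_forces(end_1, end_2)
print(e0, e1)
```

Molecular dynamics uses the same forces but integrates Newton's equations over time instead of simply stepping downhill, so the system keeps kinetic energy rather than settling at the minimum.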
Major Application I:
Designing Drugs
• Understanding How Structures Bind Other Molecules (Function)

• Designing Inhibitors
• Docking, Structure Modeling
(From left to right, figures adapted from Olson Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center.)
Major Application II: Finding Homologs

Major Application III:
Overall Genome Characterization
• Overall Occurrence of a Certain Feature in the Genome
◊ e.g. how many kinases in Yeast
• Compare Organisms and Tissues
◊ Expression levels in Cancerous vs Normal Tissues
• Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)
Schematic
Bioinformatics

Bioinformatics - History
(Timeline figure; axis marked 1980, 1985, 1990, 1995, 2000, 2005)
• Single Structures
◊ Modeling & Geometry
◊ Forces & Simulation
◊ Docking
• Sequences, Sequence-Structure Relationships
◊ Alignment
◊ Structure Prediction
◊ Fold recognition
• Genomics
◊ Dealing with many sequences
◊ Gene finding & Genome Annotation
◊ Databases
• Integrative Analysis
◊ Expression & Proteomics Data
◊ Datamining
◊ Simulation again…
