ESSEM 2015/2016 Bioinformatics
Pedro Borrego
November 2015
1.6. Use a text editor (e.g. WordPad) to open the downloaded file and see what a FASTA sequence looks like.
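For readers who want to inspect the file programmatically as well, here is a minimal sketch of a FASTA reader. It assumes the usual layout: a header line starting with ">" followed by one or more lines of sequence. The function name parse_fasta and the two example records are made up for illustration and are not real GenBank entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example text, not a real database record.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTTAGC
>seq2 another hypothetical record
TTGACC
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

Each record is reduced to its header text and the sequence joined into a single string, which is usually the form needed for later analysis.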
1.7. You can use Boolean operators (AND, OR, NOT) to make your searches more specific. Search for new records by combining the terms human immunodeficiency virus type 1 and protease with these operators. Another option to restrict your search is to use filters (Species, Molecule types, etc.); choose additional filters with "Show additional filters". Search for human immunodeficiency virus type 1 AND protease and filter your results by Molecule type; choose "Genomic DNA/RNA". Take a look at the "Results by taxon" list on the right; it is a simple way to select records from specific organisms.
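The Boolean search above can also be written as a query string. The sketch below builds the term and a matching URL for NCBI's E-utilities esearch service; it only constructs the URL and does not contact the server. Putting the multi-word term in double quotes (to search it as a phrase) and the helper names combine and esearch_url are assumptions for illustration, not part of the handout.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def combine(operator, *terms):
    """Join search terms with AND, OR or NOT, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return f" {operator} ".join(quoted)


def esearch_url(term, db="nucleotide", retmax=20):
    """Build (but do not send) an esearch request for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = combine("AND", "human immunodeficiency virus type 1", "protease")
print(term)
print(esearch_url(term))
```

Swapping "AND" for "OR" or "NOT" gives the other two combinations asked for in this step.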
1.8. Select the first 20 records of your search and save them as a FASTA file (Step 1.5) with the name hiv1ProteaseGb.fasta.
1.9. If you already know the Accession numbers you are interested in, the search is more direct. You can even enter several Accession numbers in a single search. Try to find the records AJ302212, AJ302213 and AJ302214 and save them in FASTA format (hiv1ProteaseGbAcc.fasta).
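The same three records can, in principle, be requested by Accession number through NCBI's E-utilities efetch service. This is a sketch under that assumption: the helper names are hypothetical, the download function needs network access, and it is left commented out so that running the block only prints the request URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions, db="nucleotide"):
    """Build an efetch URL that asks for the records in FASTA text format."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_fasta(accessions, path, db="nucleotide"):
    """Fetch the records and write them to path (requires network access)."""
    with urlopen(efetch_fasta_url(accessions, db)) as response:
        data = response.read().decode("utf-8")
    with open(path, "w") as handle:
        handle.write(data)


accessions = ["AJ302212", "AJ302213", "AJ302214"]
print(efetch_fasta_url(accessions))
# download_fasta(accessions, "hiv1ProteaseGbAcc.fasta")  # uncomment to fetch
```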
1.10. Open a text editor (e.g. WordPad) and copy/paste these three sequences into hiv1ProteaseGb.fasta.
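The copy/paste step can be sketched in code as appending one FASTA file to another. The function name append_fasta is hypothetical, and the demonstration works on placeholder records in a temporary folder rather than on the real downloaded files.

```python
import os
import tempfile


def append_fasta(source_path, target_path):
    """Append the records in source_path to the end of target_path."""
    with open(source_path) as source:
        new_records = source.read().strip()
    if not new_records:
        return  # nothing to add
    try:
        with open(target_path) as target:
            existing = target.read()
    except FileNotFoundError:
        existing = ""
    with open(target_path, "a") as target:
        # Make sure the first new header starts on its own line.
        if existing and not existing.endswith("\n"):
            target.write("\n")
        target.write(new_records + "\n")


# Demonstration with placeholder records in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "hiv1ProteaseGb.fasta")
    source = os.path.join(folder, "hiv1ProteaseGbAcc.fasta")
    with open(target, "w") as handle:
        handle.write(">record1 placeholder\nACGT")  # no final newline
    with open(source, "w") as handle:
        handle.write(">record2 placeholder\nTTGA\n")
    append_fasta(source, target)
    with open(target) as handle:
        merged = handle.read()

print(merged)
```

The newline check matters because a header pasted onto the end of a sequence line would be read as part of that sequence.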
1.11. The instructions above were applied to the Nucleotide collection. Similar actions can be performed on the Protein (protein sequences), EST (Expressed Sequence Tags database) and GSS (Genome Survey Sequences database) collections.
1.12. When you are interested in retrieving a large dataset of sequences and you already know their Accession numbers or GenInfo Identifiers (GI, a sequence identifier that tracks sequence histories in GenBank; it changes every time a change is made to a sequence), you can use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez). Just choose the database you're interested in (e.g. Nucleotide), upload a text file with a list of Accession numbers or GIs (e.g. create a text file in WordPad with the Accession numbers of Step 1.9) and press Retrieve to [...]
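The identifier list for Batch Entrez can also be produced with a short script instead of WordPad. This sketch assumes the upload expects plain text with one identifier per line; the function name write_id_list and the file name accessions.txt are hypothetical.

```python
import os
import tempfile


def write_id_list(identifiers, path):
    """Write identifiers one per line, dropping blanks and duplicates."""
    cleaned = []
    for item in identifiers:
        item = item.strip()
        if item and item not in cleaned:
            cleaned.append(item)
    with open(path, "w") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return cleaned


# Demonstration in a temporary folder, using the Accession numbers of Step 1.9.
with tempfile.TemporaryDirectory() as folder:
    list_path = os.path.join(folder, "accessions.txt")  # hypothetical file name
    raw = ["AJ302212", " AJ302213", "AJ302214", "AJ302212", ""]
    kept = write_id_list(raw, list_path)
    with open(list_path) as handle:
        content = handle.read()

print(content)
```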
[...] can be ordered by any column. Max score is the highest alignment score from that database sequence, while Total score is the sum of the alignment scores from all alignment segments. The E value is an estimate of the number of false positives (matches) one can expect to find by chance. The lower the E value, the more significant the match is. An E value < 0.1 gives a good level of confidence that the hit is homologous (shares a common ancestor) to the query sequence. If 0.1 < E value < 10, the hit might be homologous to the query, but you should be cautious. If E value > 10, there is not enough confidence to accept the result.
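The score and E value rules above can be summarised in a small sketch. The thresholds are the ones given in this handout; the text does not say how to treat values of exactly 0.1 or 10, so placing them in the cautious middle category is an assumption. The function names and the segment scores are made up for illustration.

```python
def summarize_hit(segment_scores):
    """Max score is the best segment; Total score is the sum of all segments."""
    return {
        "max_score": max(segment_scores),
        "total_score": sum(segment_scores),
    }


def interpret_e_value(e_value):
    """Apply the handout's rule of thumb to a single E value."""
    if e_value < 0.1:
        return "likely homologous"
    if e_value <= 10:  # boundary values are assigned here by assumption
        return "possibly homologous, be cautious"
    return "not enough confidence"


# Made-up segment scores for one database sequence with two aligned segments.
print(summarize_hit([120.0, 45.5]))
for e_value in (1e-50, 0.5, 25):
    print(e_value, interpret_e_value(e_value))
```

When a hit has a single aligned segment, Max score and Total score are equal.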
In this search, the first sequence is the query sequence. A detailed explanation of this output can be found in the "Blast report description" (top right corner of the web page) and in the Help tab. You can select any sequences of interest and Download them or examine their GenBank records; they can be used for further phylogenetic analysis with more sensitive methods. You should not draw conclusions about the evolutionary relationships between sequences based solely on BLAST results!
2.4. Go back to the BLAST Home page and select "protein blast" to search the protein database using an HIV-1 envelope protein sequence (Accession number AAC55466). Use the default settings. Select one sequence and click on the GenPept link to open the GenBank record. Under the "Related Information" heading (right column of the record), follow the Related Structure links to find three-dimensional structure records that contain one or more protein molecules similar in sequence to the current protein. The Structure database allows you to visualize each structure with the corresponding annotations. Any structure of interest can be downloaded and used for future analysis (e.g. homology modelling).
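For reference, the protein search in this step could also be submitted through NCBI's BLAST URL API rather than the web form. The sketch below only builds the submission URL. It assumes the API parameters CMD, PROGRAM, DATABASE and QUERY, and it assumes that the nr database corresponds to the web page's default setting. The function name is hypothetical, and retrieving the results (which requires polling with the returned request ID) is not shown.

```python
from urllib.parse import urlencode

BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blastp_submit_url(query, database="nr"):
    """Build (but do not send) a protein BLAST submission request."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against a protein database
        "DATABASE": database,  # assumed to match the web default
        "QUERY": query,        # an Accession number or a raw sequence
    }
    return BLAST_CGI + "?" + urlencode(params)


url = blastp_submit_url("AAC55466")
print(url)
```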
3. Retrieving data from HIV Databases

In the HIV Databases you will find all HIV genetic sequences contained in GenBank, as well as data on immunological epitopes, drug resistance-associated mutations, and vaccine trials.
3.1. Go to HIV Databases (www.hiv.lanl.gov).
3.2. Click on "Sequence Database". Alternatively, you can follow the "Other Viruses" link to search for sequences of Hepatitis C and Haemorrhagic Fever Viruses.
3.3. The Sequence Database has a comprehensive set of information regarding HIV sequences, including premade alignments of reference sequences and useful tools for sequence analyses. You should explore these resources after this module; for now, let's just take a look at the Search Interface and the Geographical Search Interface.
3.4. The Search Interface allows you to make a more generic search (e.g. Virus, Subtype, Find all sequences for a specific gene or region, etc.), a specific search (e.g. GenBank Accession number, etc.), or an advanced search (e.g. Sample tissue, Patient Information, etc.). You can also add Geographical Information to your search. In this case we are interested in HIV-1 protease sequences sampled in Europe: Virus: HIV-1 > Genomic region: protease > Geographical information, Geographic region: Europe > Search.
3.5. The results table summarizes the information for each sequence. For each sequence you can run a BLAST search (Blast), get the GenBank record (Accession) or look at the annotated map of the genomic region (Genomic Region). For now, just select a couple of sequences of 1041 bp from Switzerland and click "Download Sequences" (keep the default options).
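Once sequences have been downloaded, the length criterion used in this step can be checked with a short script. This is a minimal sketch: it assumes the download is in FASTA format and that any alignment gaps are written as "-" and should not count towards the length. The function names and the three records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_length(records, length):
    """Keep records whose ungapped sequence has exactly the given length."""
    return [(h, s) for h, s in records if len(s.replace("-", "")) == length]


# Made-up records: 1041 bp, 900 bp, and 1041 bp with three gap characters.
lines = [
    ">example_a", "ACG" * 347,
    ">example_b", "ACG" * 300,
    ">example_c", "---" + "ACG" * 347,
]
selected = keep_length(read_fasta(lines), 1041)
print([header for header, _ in selected])
```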
3.6. Open a text editor (e.g. WordPad) and copy/paste these sequences into hiv1ProteaseGb.fasta (the same way you did in Step 1.10).
3.7. Alternatively, if your main interest is to retrieve sequences based on geographical distribution, go back to the Sequence Database main page and click on "Geographical Search Interface". This is a very intuitive interface. For instance, use the map to find Portugal or use the search fields below (Select Europe > Select Portugal > Show All). The pie chart shows the distribution of sequences per subtype and recombinant form. You can either get all sequences or click a pie slice to retrieve sequences from a specific subtype or recombinant form.
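The proportions shown in such a pie chart are simple to compute from a list of subtype labels. In the sketch below the labels are invented for illustration and do not reflect the actual data for Portugal or any other country; the function name subtype_shares is hypothetical.

```python
from collections import Counter


def subtype_shares(subtypes):
    """Return the fraction of sequences per subtype, most common first."""
    counts = Counter(subtypes)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}


# Invented labels, one per sequence; not real database content.
labels = ["B", "G", "B", "CRF02_AG", "G", "B", "C", "B"]

for name, share in subtype_shares(labels).items():
    print(f"{name}: {share:.1%}")
```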
THE END