
ESSEM

2015/2016 Bioinformatics

How to get HIV sequences from databases?



1. Retrieving data from GenBank
GenBank is a database maintained by the NCBI (USA) that contains all publicly available DNA sequences. It
is part of an international collaboration with similar databases in Europe (EMBL) and Japan (DDBJ). These
organizations exchange their information every day.

1.1. Let's start with the Global Cross-database NCBI Search, GQuery
(http://www.ncbi.nlm.nih.gov/sites/gquery).

1.2. Search for the term "human immunodeficiency virus type 1". This runs a cross-database
search for this term in all NCBI databases. Alternatively, you can search a specific
database (e.g. Nucleotide).

1.3. Your search found 87452 results in PubMed. Click on the link to display these citations.
There are also 95908 free full-text journal articles in PubMed Central for this search term.

1.4. In the Nucleotide database (the GenBank database, http://www.ncbi.nlm.nih.gov/genbank/ or
http://www.ncbi.nlm.nih.gov/nucleotide/) there are 612193 DNA and RNA sequences. They vary
in length and origin. The first results are RNA structures, the 17th result is a complete
genome sequence, and the 18th, 19th and 20th are partial sequences of the genes coding for
reverse transcriptase and envelope.

1.5. Click on the 17th result to examine the record of a complete HIV-1 genome. Here you can
find several pieces of information about this HIV-1 complete genome, including the definition,
source, references and coding sequences. An important field is the Accession number, the unique
identifier of each sequence. To see a description of the various fields, go to the Sample GenBank
Record (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#AccessionB). You can highlight
biological features in the sequence by clicking on the links in Features. For example, click on the
CDS (coding DNA sequence) of the env gene and navigate through the Feature Highlight Bar at
the bottom of the page. To download this HIV-1 genome sequence in FASTA format, press Send
> File > Format FASTA > Create File and save the file as hiv1Genome.fasta.

November 2015
Pedro Borrego

1.6. Use a text editor (e.g. WordPad) to open the downloaded file and see what a FASTA
sequence looks like.
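
As a rough illustration of the FASTA layout seen in Step 1.6, the sketch below reads FASTA-formatted text in Python. It assumes the usual convention that each record starts with a header line beginning with ">" followed by one or more sequence lines. The function name parse_fasta and the example record are hypothetical and are not taken from GenBank.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical record, for illustration only (not a real GenBank entry).
example = """>EXAMPLE1 hypothetical record
ATGGGTGCGA
GAGCGTCAGT
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

To read the file saved in Step 1.5, the same function could be applied to an open file handle, for example parse_fasta(open("hiv1Genome.fasta")).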

1.7. You can use Boolean operators (AND, OR, NOT) to make your searches more specific. Search
for new records by combining the terms "human immunodeficiency virus type 1" and "protease"
with these operators. Another option to restrict your search is to use filters (Species, Molecule
types, etc.); choose additional filters with Show additional filters. Search for "human
immunodeficiency virus type 1 AND protease" and filter your results by Molecule type; choose
Genomic DNA/RNA. Take a look at the Results by taxon list on the right; it is a simple way to
select records from specific organisms.
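
The Boolean search in Step 1.7 can also be written as a URL for NCBI's E-utilities ESearch service. The sketch below only builds the URL and does not contact NCBI. The helper name build_search_url and the retmax value of 20 are assumptions chosen for illustration; the filters applied in the web page (e.g. Molecule type) are not reproduced here.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(terms, operator="AND", db="nucleotide", retmax=20):
    """Join search terms with one Boolean operator and build an ESearch URL."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    # Quote multi-word terms so that they are searched as phrases.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    query = f" {operator} ".join(quoted)
    params = {"db": db, "term": query, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_search_url(["human immunodeficiency virus type 1", "protease"])
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers rather than the formatted results page.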

1.8. Select the first 20 records of your search and save them as a FASTA file (as in Step 1.5)
with the name hiv1ProteaseGb.fasta.

1.9. If you already know the Accession numbers you are interested in, the search is more direct.
You can even enter several Accession numbers at once. Try to find the records AJ302212,
AJ302213 and AJ302214, and save them in FASTA format (hiv1ProteaseGbAcc.fasta).
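
When the Accession numbers are known, as in Step 1.9, the records can also be requested from NCBI's E-utilities EFetch service. The sketch below is a minimal example that only builds the request URL; the helper name build_fetch_url is hypothetical, and the download itself is left as a comment because it needs network access.

```python
from urllib.parse import urlencode

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fetch_url(accessions, db="nucleotide"):
    """Build an EFetch URL that returns the given records as FASTA text."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # comma-separated list of identifiers
        "rettype": "fasta",
        "retmode": "text",
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


fetch_url = build_fetch_url(["AJ302212", "AJ302213", "AJ302214"])
print(fetch_url)

# To download (not run here, requires network access):
# from urllib.request import urlopen
# fasta_text = urlopen(fetch_url).read().decode()
```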

1.10. Open a text editor (e.g. WordPad) and copy/paste these three sequences into
hiv1ProteaseGb.fasta.
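
The copy/paste in Step 1.10 can also be done with a few lines of Python. This is a sketch only: the function name append_fasta is hypothetical, and the demonstration uses small made-up files in a temporary folder instead of the real downloads.

```python
import tempfile
from pathlib import Path


def append_fasta(source: Path, target: Path) -> None:
    """Append the records in source to the end of target."""
    text = source.read_text()
    existing = target.read_text() if target.exists() else ""
    with target.open("a") as handle:
        # Make sure the first appended header starts on its own line.
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write(text if text.endswith("\n") else text + "\n")


# Demonstration with made-up records (not real GenBank sequences).
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "hiv1ProteaseGb.fasta"
    source = Path(folder) / "hiv1ProteaseGbAcc.fasta"
    target.write_text(">a\nACGT")      # note: no trailing newline
    source.write_text(">b\nTTGA\n")
    append_fasta(source, target)
    merged = target.read_text()

print(merged)
```

The newline check matters because a header line that ends up glued to the previous sequence line would no longer be recognised as the start of a new record.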

1.11. The instructions above were applied to the Nucleotide collection. Similar actions can be
performed on the Protein (protein sequences), EST (Expressed Sequence Tags database) and GSS
(Genome Survey Sequences database) collections.

1.12. When you are interested in retrieving a large dataset of sequences and you already know
their Accession numbers or GenInfo Identifiers (GI, a sequence identifier that tracks sequence
histories in GenBank; a new GI is assigned every time the sequence changes), you can use Batch
Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez). Just choose the database you're
interested in (e.g. Nucleotide), upload a text file with a list of Accession numbers or GIs (e.g.
create a text file in WordPad with the Accession numbers of Step 1.9) and press Retrieve to
get the records you are searching for.
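
For long lists, the text file for Batch Entrez in Step 1.12 can be generated rather than typed. The sketch below assumes a layout of one identifier per line, and the function name accession_list_text is hypothetical. It also drops repeated entries so that each record is requested only once.

```python
def accession_list_text(accessions):
    """Return the accessions as text, one per line, without duplicates."""
    seen = set()
    unique = []
    for accession in accessions:
        accession = accession.strip()
        if accession and accession not in seen:
            seen.add(accession)
            unique.append(accession)  # keep the original order
    return "\n".join(unique) + "\n"


# Accession numbers from Step 1.9, with one deliberate repeat.
text = accession_list_text(["AJ302212", "AJ302213", "AJ302214", "AJ302213"])
print(text)

# To save the list for upload (the file name is an arbitrary choice):
# with open("accessions.txt", "w") as handle:
#     handle.write(text)
```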



1.13. Another useful resource is PopSet (http://www.ncbi.nlm.nih.gov/popset). A PopSet is a set
of DNA sequences collected in a population study; the sequences may come from different
variants of the same isolate, different members of the same species or several organisms of
different species. Use the search term "C2-V3-C3" and you will find 21 HIV-1 or HIV-2
population studies. You can examine the corresponding sequences and save them in a single dataset.

1.14. These few steps are just a quick start guide for GenBank. Please take your time to explore
this database. For instance, you can Manage Filters in the search page or customise the
information shown in each GenBank record by changing the features selected in the Customize
view and Change region shown (right side of the web page).


2. Similarity searches using BLAST (Basic Local Alignment Search Tool)
The BLAST algorithm compares our nucleotide or protein sequence (the query sequence) with all
the sequences existing in public databases. It allows us to find the most similar sequences based
on the score and statistical significance of each match.

2.1. Go to the BLAST search engine (http://blast.ncbi.nlm.nih.gov/Blast.cgi).

2.2. Use nucleotide blast to search for nucleotide sequences similar to the HIV-1 complete
genome saved from GenBank in Step 1.5 (hiv1Genome.fasta). To this end:
upload your file in Enter Query Sequence (or copy/paste the sequence into the dialog box,
including the header line that starts with >);
select Choose Search Set > Database > Nucleotide collection (nr/nt) to make sure you search
the entire collection;
select Program Selection > Optimize for > megablast for a faster search;
set Algorithm parameters > General Parameters > Max target sequences to 1000.
Press BLAST and wait for the results. Alternatively, you can adjust your search parameters, for
example by choosing blastn instead of megablast for a more sensitive (and slower!) search, or by
changing the word size to regulate the sensitivity.

2.3. The Descriptions table summarizes the results. They are usually ordered by Max score, but
can be ordered by any column. Max score is the highest alignment score among the aligned
segments of that database sequence, while Total score is the sum of the alignment scores of all
its aligned segments. The E value is an estimate of the number of matches (false positives) one
can expect to find by chance. The lower the E value, the more significant the match. An E value
< 0.1 gives a good level of confidence that the hit is homologous to the query sequence (i.e.
they share a common ancestor). If 0.1 < E value < 10, the hit might be homologous to the query,
but you should be cautious. If E value > 10, there is not enough confidence to accept the result.
In this search, the first hit is the query sequence itself. A detailed explanation of this output
can be found in the Blast report description (top right corner of the web page) and in the Help
tab. You can select any sequences of interest and Download them or examine their GenBank
records; they can be used for further phylogenetic analysis with more sensitive methods. You
should not draw conclusions about the evolutionary relationships between sequences based solely
on BLAST results!
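
The rules of thumb in Step 2.3 can be written down as a small Python sketch. The function names and the example scores are hypothetical. The text does not say how to treat an E value of exactly 0.1 or exactly 10, so placing both in the "be cautious" group is an assumption made here.

```python
def interpret_evalue(e_value: float) -> str:
    """Classify a BLAST hit using the E value thresholds of Step 2.3."""
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    if e_value < 0.1:
        return "likely homologous"
    if e_value <= 10:  # boundary values 0.1 and 10 placed here by assumption
        return "possibly homologous, be cautious"
    return "not enough confidence"


def summarize_hit(segment_scores):
    """Return (Max score, Total score) for the aligned segments of one hit."""
    return max(segment_scores), sum(segment_scores)


# Made-up scores for a hit with two aligned segments.
max_score, total_score = summarize_hit([200.0, 50.0])
print(max_score, total_score)
print(interpret_evalue(1e-50))
print(interpret_evalue(3.0))
print(interpret_evalue(25.0))
```

For a hit with a single aligned segment, Max score and Total score are the same number.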

2.4. Go back to the BLAST Home page and select protein blast to search the protein database using
an HIV-1 envelope protein sequence (Accession number AAC55466). Use the default settings.
Select one sequence and click on the GenPept link to open its GenBank record. Under the
Related Information heading (right column of the record), follow the Related Structure links to
find three-dimensional structure records that contain one or more protein molecules similar in
sequence to the current protein. The Structure database allows you to visualize each structure
with the corresponding annotations. Any structure of interest can be downloaded and used for
future analysis (e.g. homology modelling).


3. Retrieving data from HIV Databases
In HIV databases you will find all HIV genetic sequences contained in GenBank, and also data on
immunological epitopes, drug resistance-associated mutations, and vaccine trials.

3.1. Go to the HIV Databases (www.hiv.lanl.gov).

3.2. Click on Sequence Database. Alternatively you can follow the Other Viruses link to search
for sequences of Hepatitis C and Haemorrhagic Fever Viruses.

3.3. The Sequence Database has a comprehensive set of information regarding HIV sequences,
including premade alignments of reference sequences and useful tools for sequence analyses.
You should explore these resources after this module; for now, let's just take a look at the
Search Interface and the Geographical Search Interface.

3.4. The Search Interface allows you to make a more generic search (e.g. Virus, Subtype, Find all
sequences for a specific gene or region, etc.), a specific search (e.g. GenBank Accession number,
etc.), or an advanced search (e.g. Sample tissue, Patient Information, etc.). You can also add
Geographical Information to your search. In this case we are interested in HIV-1 protease
sequences sampled in Europe: set Virus to HIV-1, Genomic region to protease and, under
Geographical information, Geographic region to Europe, then press Search.

3.5. The results table summarizes the information for each sequence. For each sequence you can
run a BLAST search (Blast), get the GenBank record (Accession) or look at the annotated map of
the genomic region (Genomic Region). For now, just select a couple of sequences of 1041 bp from
Switzerland and click Download Sequences (keep the default options).

3.6. Open a text editor (e.g. WordPad) and copy/paste these sequences into
hiv1ProteaseGb.fasta (the same way you did in Step 1.10).

3.7. Alternatively, and if your main interest is to retrieve sequences based on geographical
distribution, go back to the Sequence Database main page and click on Geographical Search
Interface. This is a very intuitive interface. For instance, use the map to find Portugal or use the
search fields below (Select Europe > Select Portugal > Show All). The pie chart shows the
distribution of sequences per subtype and recombinant form. You can either get all sequences
or click a pie slice to retrieve sequences from a specific subtype or recombinant.

THE END
