Perl Regexp

Perl for biologists
Pattern matching with Regular Expressions

if ($text =~ /King/) {
    print "found\n";
}

if ($text =~ /King/i)
{
    print "found\n";
}

Sample text ($text):

KING HENRY V
We do not mean the coursing snatchers only,
But fear the main intendment of the Scot,
Who hath been still a giddy neighbour to us;
For you shall read that my great-grandfather
Never went with his forces into France
But that the Scot on his unfurnish'd kingdom
Came pouring, like the tide into a breach,
With ample and brim fulness of his force,
Galling the gleaned land with hot assays,
Girding with grievous siege castles and towns;
That England, being empty of defence,
Hath shook and trembled at the ill neighbourhood.
CANTERBURY
She hath been then more fear'd than harm'd, my liege;
For hear her but exampled by herself:
When all her chivalry hath been in France
And she a mourning widow of her nobles,
She hath herself not only well defended
But taken and impounded as a stray
The King of Scots; whom she did send to France,
To fill King Edward's fame with prisoner kings...
if ($text =~ /King/i) {
    print "found\n";
}

• =~ is the binding operator and in an if means “contains”
• / / is used to denote a pattern (called a “regular expression” or RE) which can be used to search within textual data
• i is a “modifier” and means ignore case
• A regular expression can be just a piece of text, but usually it is more complex, representing a pattern or template.
• Regular expressions are case sensitive and by default look for strings of characters, not whole words.
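
The two defaults described in these bullets can be checked with a short sketch. The sample sentence and the variable names are made up for illustration, and the \b word-boundary marker is standard Perl that these slides do not otherwise cover.

```perl
use strict;
use warnings;

# Hypothetical sample text, for illustration only
my $text = "The King of Scots was taken prisoner";

# Case sensitive by default: /king/ does not match "King"
my $lower_match = ($text =~ /king/)  ? 1 : 0;
my $nocase      = ($text =~ /king/i) ? 1 : 0;

# Patterns match strings of characters, not whole words:
# /Scot/ is found inside "Scots"
my $substring  = ($text =~ /Scot/)     ? 1 : 0;
# \b marks a word boundary, so this matches only the whole word "Scot"
my $whole_word = ($text =~ /\bScot\b/) ? 1 : 0;

print "king: $lower_match, king/i: $nocase, Scot: $substring, whole word Scot: $whole_word\n";
```

Only the case-insensitive and the substring searches succeed here, because the text contains "King" and "Scots" but neither "king" nor the stand-alone word "Scot".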

Example-1

print "Enter a motif to search for: ";
$motif = <STDIN>;
# remove the newline at the end of $motif
chomp $motif;
# look for the motif
if ($protein =~ /$motif/) {
    print "Found $motif\n";
} else {
    print "Could not find $motif\n";
}

OUTPUT
Enter a motif to search for: SVLQ
Found SVLQ
Enter a motif to search for: svlq
Could not find svlq
Example-2

print "Enter a motif to search for: ";
$motif = <STDIN>;
# remove the newline at the end of $motif
chomp $motif;
# look for the motif
if ($protein =~ /$motif/i) {
    print "Found $motif\n";
} else {
    print "Could not find $motif\n";
}

OUTPUT
Enter a motif to search for: svlq
Found svlq

i is a modifier, i.e. it alters the standard behaviour of the RE. The i modifier means ignore case.
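
One point to keep in mind when the pattern comes from user input, as in the two examples: any RE metacharacters the user types are interpreted as part of the pattern. The sketch below assumes a made-up protein string and a hard-coded motif standing in for the value read from <STDIN>; \Q...\E is Perl's standard way to quote an interpolated variable so that it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical protein sequence, for illustration only
my $protein = "MKTAYSVLQERLG";

# Stands in for a motif read from <STDIN> and chomped
my $motif = "S.LQ";

# Interpolated as a pattern: the . matches any character, so "SVLQ" matches
my $as_pattern = ($protein =~ /$motif/) ? 1 : 0;

# \Q...\E quotes metacharacters, so only the literal text "S.LQ" would match
my $as_literal = ($protein =~ /\Q$motif\E/) ? 1 : 0;

print "as pattern: $as_pattern, as literal: $as_literal\n";
```

Whether the input should be treated as a pattern or as literal text depends on what the program is for; the point is only that the two behave differently.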
Other uses of regular expressions: the s operator
The substitution operator s uses a RE to replace part of a string with something else:

$DNA = "ACGTCGGACCCGGAATTACTA";
print "Starting DNA\n$DNA\n";
# Transcribe the DNA to RNA by replacing T's with U's
$RNA = $DNA;
$RNA =~ s/T/U/g;
# print RNA to screen
print "Transcribed RNA\n$RNA\n";

OUTPUT
Starting DNA
ACGTCGGACCCGGAATTACTA
Transcribed RNA
ACGUCGGACCCGGAAUUACUA
$RNA =~ s/T/U/g;

• $RNA is the variable
• =~ is the binding operator
• T is the RE giving the pattern to look for
• U is the replacement text
• the g modifier means “global”, i.e. everywhere in the string

Naturally, the pattern to be found and the replacement text can both be variables.
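
As a minimal sketch of a substitution driven by variables, using the sequence from the output shown earlier. The names $from, $to and $count are illustrative only.

```perl
use strict;
use warnings;

my $DNA = "ACGTCGGACCCGGAATTACTA";

# Pattern and replacement held in variables (illustrative names)
my $from = "T";
my $to   = "U";

my $RNA = $DNA;
# s/// returns the number of substitutions it made
my $count = ($RNA =~ s/$from/$to/g);

print "$RNA ($count substitutions)\n";
```

Capturing the return value of s///g is a convenient way to count how many replacements took place.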
Restricting the search

By default regular expressions look for matches anywhere within the input string. Sometimes it is convenient to restrict the search to particular portions of the data, e.g. the beginning or the end.

# program to look for the stop codon TAG in a DNA sequence
$DNA = "GTCCGTAAGTCAGTCATTACGTCAGTCGGTAA" .
       "GTCAGTCATTACGTAGGTCGGTAAGTCAGTCATTAGAT" .
       "CAGGGCCTAAGTAG";
if ($DNA =~ /TAG/) {
    print "Stop codon TAG found\n";
}
# program to look for the stop codon TAG at the end of a DNA sequence
$DNA = "GTCCGTAAGTCAGTCATTACGTCAGTCGGTAA" .
       "GTCAGTCATTACGTAGGTCGGTAAGTCAGTCATTAGAT" .
       "CAGGGCCTAAGTAG";
if ($DNA =~ /TAG$/) {
    print "Stop codon TAG found\n";
}

The $ metacharacter means find the string at the end of the text.

It is also possible to restrict the search to the beginning of the data with ^ :

# look for ATG at the start of a sequence
if ($DNA =~ /^ATG/) {
    print "Found start codon\n";
}
...
# look for FASTA header
while ($line = <DB>) {
    if ($line =~ /^>/) {
        print "new sequence found\n";
        # read sequence
        ..
# skip comments in a UNIX file
while (..) {
    next if ($line =~ /^#/);
    ...

^ and $ are called anchors; they restrict the scope of the search.
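
The anchor fragments can be tried out together in a self-contained sketch. The short sequence and the in-memory list of lines are hypothetical stand-ins for real data and for a file handle such as DB.

```perl
use strict;
use warnings;

# Hypothetical sequence: starts with ATG and ends with TAG
my $DNA = "ATGGTCCGTAGGTCATTAG";

my $has_start   = ($DNA =~ /^ATG/) ? 1 : 0;  # ATG at the very beginning
my $ends_in_tag = ($DNA =~ /TAG$/) ? 1 : 0;  # TAG at the very end
my $ends_in_atg = ($DNA =~ /ATG$/) ? 1 : 0;  # ATG occurs, but not at the end

# Count FASTA header lines in a small in-memory "file"
my @lines = (">seq1 example", "ATGGTC", "# a comment", ">seq2 example", "GGTAG");
my $headers = 0;
foreach my $line (@lines) {
    next if ($line =~ /^#/);        # skip comment lines
    $headers++ if ($line =~ /^>/);  # header lines start with >
}

print "start: $has_start, ends TAG: $ends_in_tag, ends ATG: $ends_in_atg, headers: $headers\n";
```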
Making the search more flexible
Consider the following:

# see if either of 2 motifs ADV and ASV are in a protein sequence
if (($protein =~ /ADV/) or ($protein =~ /ASV/)) {
    print "Found ADV or ASV \n";
}

This can be simplified using character classes:

# see if either of 2 motifs ADV and ASV are in a protein sequence
if ($protein =~ /A[DS]V/) {
    print "Found ADV or ASV \n";
}

The [ ] enclose a list or range of characters which can be selected at that point.

Although the two alternatives are equivalent, using [] is generally more efficient.
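
A small sketch comparing the two forms on three made-up sequences, to confirm that they give the same answers. The array @results is only there to collect the outcomes.

```perl
use strict;
use warnings;

# Hypothetical test sequences: two contain a motif, one does not
my @proteins = ("MKADVLL", "MKASVLL", "MKAGVLL");

my @results;
foreach my $protein (@proteins) {
    my $two_tests  = (($protein =~ /ADV/) or ($protein =~ /ASV/)) ? 1 : 0;
    my $char_class = ($protein =~ /A[DS]V/) ? 1 : 0;
    push @results, [$protein, $two_tests, $char_class];
    print "$protein: two tests=$two_tests, character class=$char_class\n";
}
```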
Other examples:

# translate codon to amino acid
if    ($codon =~ /GC./)    { $aa = "ala"; }  # alanine
elsif ($codon =~ /TG[TC]/) { $aa = "cys"; }  # cysteine
elsif ($codon =~ /GA[TC]/) { $aa = "asp"; }  # aspartic acid
elsif ($codon =~ /GA[AG]/) { $aa = "glu"; }  # glutamic acid
..
# non bases
if ($dna =~ /[^acgt]/) {
    print "Invalid base pairs found\n";
}
# check that the input contains a digit
if ($input[$i] !~ /[0-9]/) {
    print "Invalid input\n";
    exit;
}

• The . means any char except newline.
• A ^ at the start of a character class negates it: [^acgt] matches any character other than a, c, g or t.
• [0-9] is a character range.
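
The checks above only ask whether a string contains some character from a class. To require that every character belongs to the class, anchors and a character class can be combined. The sketch below is one possible way to do this; the helper names is_whole_number and is_dna are hypothetical, and the + quantifier ("one or more") is standard Perl that these slides do not cover.

```perl
use strict;
use warnings;

# Returns 1 if the string consists only of digits (hypothetical helper)
sub is_whole_number {
    my ($input) = @_;
    return ($input =~ /^[0-9]+$/) ? 1 : 0;
}

# Returns 1 if the string consists only of the bases a, c, g, t in either case
sub is_dna {
    my ($seq) = @_;
    return ($seq =~ /^[acgt]+$/i) ? 1 : 0;
}

foreach my $input ("42", "12abc", "") {
    print "'$input' is ", (is_whole_number($input) ? "" : "not "), "a whole number\n";
}
foreach my $seq ("acgtTGCA", "acgnt") {
    print "'$seq' is ", (is_dna($seq) ? "" : "not "), "valid DNA\n";
}
```

Input read from a file or the keyboard should be chomped first, since $ also matches just before a trailing newline.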
Escaping characters

# look for the $$$$ separator in a .sd structure file
# will this work ?
if ( $line =~ /$$$$/) {
    print "New structure found\n";
}

Characters with special meanings in REs (e.g. $, ^, [, etc.) need to be “escaped” with \ if you need to search for them.
Escaping characters

# look for the $$$$ separator in a .sd structure file
# This should work
if ( $line =~ /\$\$\$\$/) {
    print "New structure found\n";
}

The $ has special meaning, so in this example it needs to be escaped with a backslash (\).
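
A short sketch of the escaped search, with sample lines typed in directly rather than read from a structure file. It also shows Perl's built-in quotemeta function, which adds the backslashes automatically; the variable names are illustrative.

```perl
use strict;
use warnings;

# A separator line; single quotes prevent any interpolation of $
my $line = '$$$$';

# Each $ escaped by hand, as on the slide
my $escaped = ($line =~ /\$\$\$\$/) ? 1 : 0;

# quotemeta returns the string with every metacharacter backslashed
my $pattern = quotemeta('$$$$');
my $quoted  = ($line =~ /$pattern/) ? 1 : 0;

# A line without the separator does not match
my $other = ("M  END" =~ /\$\$\$\$/) ? 1 : 0;

print "pattern: $pattern, escaped: $escaped, quoted: $quoted, other: $other\n";
```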
Special characters
Characters which cannot be typed in or are not easily visible also use the \ notation. In addition, Perl defines certain metacharacters for REs.

\n      newline
\t      tab
\s      whitespace (space, tab, newline, etc.)
\r      carriage return
\033    character given by its octal code (here 033, the escape character)
\w      match a word char (letter, digit or underscore)
\W      match a non-word char
\d      match a digit
\D      match a non-digit
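
A few of these metacharacters in action, as a minimal sketch on a made-up tab-separated record.

```perl
use strict;
use warnings;

# Hypothetical tab-separated record with a trailing newline
my $record = "PCCX1\t42\tH. sapiens\n";

my $has_tab     = ($record =~ /\t/) ? 1 : 0;             # contains a tab
my $has_digit   = ($record =~ /\d/) ? 1 : 0;             # contains a digit
my $only_word   = ("PCCX1" =~ /^\w\w\w\w\w$/) ? 1 : 0;   # exactly five word characters
my $has_nonword = ("H. sapiens" =~ /\W/) ? 1 : 0;        # "." and space are non-word characters

# \s matches tabs, spaces and the newline at the end of the record
my $copy = $record;
$copy =~ s/\s//g;        # remove every whitespace character
print "$copy\n";         # PCCX142H.sapiens
```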
Functions using REs: split

The split function “breaks up” a string according to a regular expression and puts the pieces into an array. It is often used in reading databases or tabular output.

@fields = split(/\s/, $record);
..
# for parsing lines of the form a=b
($param, $value) = split(/=/, $line);

Example lines of the form a=b:

/sex="male"
/cell_line="HuS-L12"
/organism="H. Sapiens"
/dev_stage="embryo"
/gene="PCCX1"
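
The a=b parsing can be sketched end to end on a few of the example lines. The hash %qualifier, the clean-up substitutions and the use of split's optional third argument (a limit on the number of pieces) are illustrative choices rather than part of the slides.

```perl
use strict;
use warnings;

# Qualifier lines in the style shown above (straight quotes assumed)
my @lines = ('/sex="male"', '/cell_line="HuS-L12"', '/gene="PCCX1"');

my %qualifier;
foreach my $line (@lines) {
    # The third argument limits the split to 2 pieces,
    # so a value that itself contains = stays intact
    my ($param, $value) = split(/=/, $line, 2);
    $param =~ s/^\///;     # drop the leading slash
    $value =~ s/"//g;      # drop the quotes
    $qualifier{$param} = $value;
}

foreach my $param (sort keys %qualifier) {
    print "$param => $qualifier{$param}\n";
}
```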
Functions using REs: split

# Example 2: Reading tabular output of blast (-m8)
while ($line = <>) {
    chomp($line);    # remove newlines
    @fields = split(/\s/, $line);
    $score  = $fields[11];
    $length = $fields[3];
    $pcid   = $fields[2];
    ...
BE090210 gi|42406291|ref|NC_000007.10|NC_000007 98.40 500 6 2 23 521 107552639 107552141 0.0 848.4
BQ379587 gi|42406304|ref|NC_000017.8|NC_000017 97.55 204 5 0 24 227 40141778 40141575 1.4e-78 298.4
BQ379587 gi|42406304|ref|NC_000017.8|NC_000017 94.69 207 5 3 24 227 40115930 40115727 1.6e-60 238.4
BQ379587 gi|42406304|ref|NC_000017.8|NC_000017 95.00 200 6 2 24 221 20569436 20569239 2.6e-56 224.4
BQ379587 gi|42406304|ref|NC_000017.8|NC_000017 95.96 198 8 0 24 221 16936304 16936107 4.1e-55 220.4
BQ379587 gi|42406304|ref|NC_000017.8|NC_000017 94.00 200 8 2 24 221 18544194 18544391 4.5e-43 180.4
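
A runnable sketch of the same parsing on two of the hits listed above, held in an array instead of being read from a file. It assumes the columns are separated by tabs, as in BLAST tabular output, and splits on /\t/ so that each tab starts exactly one new field; @summary is an illustrative name.

```perl
use strict;
use warnings;

# Two of the hits listed above, with tabs between the columns (assumed)
my @hits = (
    "BE090210\tgi|42406291|ref|NC_000007.10|NC_000007\t98.40\t500\t6\t2\t23\t521\t107552639\t107552141\t0.0\t848.4",
    "BQ379587\tgi|42406304|ref|NC_000017.8|NC_000017\t97.55\t204\t5\t0\t24\t227\t40141778\t40141575\t1.4e-78\t298.4",
);

my @summary;
foreach my $line (@hits) {
    chomp($line);
    my @fields = split(/\t/, $line);   # split on tabs only
    # An array slice picks out several fields at once
    my ($query, $pcid, $length, $score) = @fields[0, 2, 3, 11];
    push @summary, "$query: $pcid% identity over $length, score $score";
}
print "$_\n" for @summary;
```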
join
The join function does the reverse of split: it sticks fragments together using a string (strictly speaking, join is not a RE function).

# @DNA is an array of nucleotides
# more useful to have a string
@DNA = <DNAFILE>;
$DNA = join('', @DNA);
..
# create a tab-delimited record (double quotes are needed for \t to mean a tab)
$record = join("\t", $field1, $field2, $field3);
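
A minimal sketch of join on made-up data. It also removes the newlines that lines read from a file would carry, and shows that split reverses the join.

```perl
use strict;
use warnings;

# Lines as they might be read from a sequence file (hypothetical data)
my @DNA = ("ACGT\n", "GGCC\n", "TTAA\n");

chomp(@DNA);                     # chomp works on every element of an array
my $DNA = join('', @DNA);        # one continuous string

# Double quotes are needed for \t to mean a tab
my $record = join("\t", "BQ379587", "97.55", "204");

# split reverses the join
my @fields = split(/\t/, $record);

print "$DNA\n$record\n", scalar(@fields), " fields\n";
```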
Some useful regular expressions..

# remove leading spaces
$line =~ s/^\s*//g;
# remove trailing spaces
$line =~ s/\s*$//g;
# remove line numbers from DNA sequence in GenBank
$dna =~ s/[0-9]//g;
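
These substitutions can be tried on sample strings. The sequence line below imitates the layout of a GenBank sequence block with made-up bases, and the extra s/\s//g step for removing the spaces between blocks is an addition to what the slide shows.

```perl
use strict;
use warnings;

# Leading and trailing spaces, removed with the substitutions from the slide
my $line = "   ATG TAG   ";
$line =~ s/^\s*//g;
$line =~ s/\s*$//g;
print "[$line]\n";     # [ATG TAG]

# A numbered sequence line in GenBank style (made-up bases)
my $dna = "       61 gtccgtaagt cagtcattac gtcagtcggt\n";
$dna =~ s/[0-9]//g;    # remove the line number
$dna =~ s/\s//g;       # remove all whitespace, including spaces between blocks
print "$dna\n";        # gtccgtaagtcagtcattacgtcagtcggt
```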
Summary
• Text in Perl can be searched and manipulated with regular expressions (REs)
• By default REs are enclosed by two / , e.g. /word/
• Modifiers such as i or g can be added to change the default behaviour, e.g. /word/i searches for “word” regardless of case
• REs can be just text but they can also contain metacharacters to be more/less specific. Examples include
• ^ and $ (anchors) to restrict the search to the ends of the text
• char classes with [], e.g. [0-9] to allow a selection of chars
• full stop . means any character (except newline)
• Characters which mustn’t be interpreted as metacharacters need to be escaped, e.g. \$, \^ , \[ , etc.
• Useful functions include split and join; useful operators include the binding operator =~ and the substitution operator s.