Perl for biologists

Pattern matching with Regular Expressions


if ($text =~ /King/) {
    print "found\n";
}

if ($text =~ /King/i) {
    print "found\n";
}

Text being searched ($text):

KING HENRY V
We do not mean the coursing snatchers only,
But fear the main intendment of the Scot,
Who hath been still a giddy neighbour to us;
For you shall read that my great-grandfather
Never went with his forces into France
But that the Scot on his unfurnish'd kingdom
Came pouring, like the tide into a breach,
With ample and brim fulness of his force,
Galling the gleaned land with hot assays,
Girding with grievous siege castles and towns;
That England, being empty of defence,
Hath shook and trembled at the ill neighbourhood.

CANTERBURY
She hath been then more fear'd than harm'd, my liege;
For hear her but exampled by herself:
When all her chivalry hath been in France
And she a mourning widow of her nobles,
She hath herself not only well defended
But taken and impounded as a stray
The King of Scots; whom she did send to France,
To fill King Edward's fame with prisoner kings...


if ($text =~ /King/i) {
    print "found\n";
}

• =~ is the binding operator and in an if means “contains”
• / / is used to denote a pattern (called a “regular expression” or RE) which can be used to search within textual data
• i is a “modifier” and means ignore case

• A regular expression can be just a piece of text, but usually it is more complex, representing a pattern or template
• Regular expressions are case sensitive and by default look for strings of characters, not whole words
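The two points above can be checked with a minimal sketch. The sample text is hypothetical (it is not the passage from the slide) and was chosen so that the pattern only occurs inside a longer word:

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen for illustration only.
my $text = "The kingdom was empty of defence";

# Case sensitive by default: "King" with a capital K does not occur.
my $exact = ($text =~ /King/) ? 1 : 0;

# With the i modifier the pattern matches, but only as part of "kingdom":
# the RE looks for a string of characters, not a whole word.
my $nocase = ($text =~ /King/i) ? 1 : 0;

print "exact=$exact nocase=$nocase\n";
```

Here the first test fails and the second succeeds, even though the text never contains the separate word "king".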


Example-1

print "Enter a motif to search for: ";
$motif = <STDIN>;
# remove the newline at the end of $motif
chomp $motif;
# look for the motif
if ($protein =~ /$motif/) {
    print "Found $motif\n";
} else {
    print "Could not find $motif\n";
}

OUTPUT

Enter a motif to search for: SVLQ
Found SVLQ

Enter a motif to search for: svlq
Could not find svlq

Example-2

print "Enter a motif to search for: ";
$motif = <STDIN>;
# remove the newline at the end of $motif
chomp $motif;
# look for the motif
if ($protein =~ /$motif/i) {
    print "Found $motif\n";
} else {
    print "Could not find $motif\n";
}

OUTPUT

Enter a motif to search for: svlq
Found svlq

i is a modifier, i.e. it alters the standard behaviour of the RE. The i modifier means ignore case.


Other uses of regular expressions: the s operator

The substitution operator s uses an RE to replace part of a string with something else:

$DNA = "ACGTCGGACCCGGAATTACTA";
print "Starting DNA\n$DNA\n";
# Transcribe the DNA to RNA by replacing T's with U's
$RNA = $DNA;
$RNA =~ s/T/U/g;
# print RNA to screen
print "Transcribed RNA\n$RNA\n";

OUTPUT

Starting DNA
ACGTCGGACCCGGAATTACTA
Transcribed RNA
ACGUCGGACCCGGAAUUACUA


$RNA =~ s/T/U/g;

• $RNA is the variable
• =~ is the binding operator
• T is the RE giving the pattern to look for
• U is the replacement text
• g is a modifier that means “global”, i.e. everywhere in the string

Naturally, the pattern to be found and the replacement text can both be variables.
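As a rough illustration of using variables on both sides of the substitution, the sketch below holds the pattern and the replacement in two hypothetical variables, $from and $to (these names are assumptions, not from the slides):

```perl
use strict;
use warnings;

my $DNA  = "ACGTCGGACCCGGAATTACTA";  # example sequence
my $from = "T";                      # pattern held in a variable (hypothetical name)
my $to   = "U";                      # replacement held in a variable (hypothetical name)

my $RNA = $DNA;
# s/// returns the number of substitutions it made
my $count = ($RNA =~ s/$from/$to/g);

print "$RNA ($count replacements)\n";
```

Because of the g modifier every T is replaced; without it only the first one would be.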


Restricting the search


By default regular expressions look for matches anywhere within the input
string. Sometimes it is convenient to restrict the search to portions within the
data, e.g. at the beginning or at the end.

# program to look for the stop codon TAG in a DNA sequence
# (the sequence is joined with . so that it contains no newlines)
$DNA = "GTCCGTAAGTCAGTCATTACGTCAGTCGGTAA"
     . "GTCAGTCATTACGTAGGTCGGTAAGTCAGTCATTAGAT"
     . "CAGGGCCTAAGTAG";
if ($DNA =~ /TAG/) {
    print "Stop codon TAG found\n";
}


Restricting the search

# program to look for the stop codon TAG at the end of a DNA sequence
# (the sequence is joined with . so that it contains no newlines)
$DNA = "GTCCGTAAGTCAGTCATTACGTCAGTCGGTAA"
     . "GTCAGTCATTACGTAGGTCGGTAAGTCAGTCATTAGAT"
     . "CAGGGCCTAAGTAG";
if ($DNA =~ /TAG$/) {
    print "Stop codon TAG found\n";
}

The $ metacharacter means find the string at the end of the text.



Restricting the search
It is also possible to restrict the search to the beginning of the data with ^ :

^ and $ are called anchors; they restrict the scope of the search.

# look for ATG at the start of a sequence
if ($DNA =~ /^ATG/) {
    print "Found start codon\n";
}
...
# look for FASTA header
while ($line = <DB>) {
    if ($line =~ /^>/) {
        print "new sequence found\n";
        # read sequence
        ..
# skip comments in a UNIX file
while (..) {
    next if ($line =~ /^#/);
    ...
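The effect of the two anchors can be sketched on a short, hypothetical sequence (invented here for illustration) in which TAG occurs twice but never at the start:

```perl
use strict;
use warnings;

# Hypothetical short sequence, for illustration only.
my $DNA = "ATGGCCTAGGTCTAG";

my $starts_with_atg = ($DNA =~ /^ATG/) ? 1 : 0;  # anchored at the start
my $ends_with_tag   = ($DNA =~ /TAG$/) ? 1 : 0;  # anchored at the end
my $starts_with_tag = ($DNA =~ /^TAG/) ? 1 : 0;  # TAG occurs, but not at the start

print "$starts_with_atg $ends_with_tag $starts_with_tag\n";
```

The third test fails even though TAG is present, because ^ only allows a match at the very beginning of the string.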


Making the search more flexible

Consider the following:

# see if either of 2 motifs ADV and ASV are in a protein sequence
# (the whole condition needs its own pair of parentheses)
if (($protein =~ /ADV/) or ($protein =~ /ASV/)) {
    print "Found ADV or ASV \n";
}



Making the search more flexible
This can be simplified using character classes

# see if either of 2 motifs ADV and ASV are in a protein sequence
if ($protein =~ /A[DS]V/) {
    print "Found ADV or ASV \n";
}

The [ ] enclose a list or range of characters which can be selected at that point.

Although the two alternatives are equivalent, using [] is generally more efficient.
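A small sketch of the character class in action, using three hypothetical peptide fragments (invented for illustration) of which only the first two contain ADV or ASV:

```perl
use strict;
use warnings;

# Hypothetical peptide fragments, for illustration only.
my @proteins = ("MKADVLL", "GGASVPP", "TTAEVQQ");

# grep keeps the elements for which the pattern matches
my @hits = grep { /A[DS]V/ } @proteins;

print "Matched: @hits\n";
```

The third fragment contains AEV, and E is not in the list [DS], so it is not selected.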

Making the search more flexible
Other examples:
# translate codon to amino acid
if    ($codon =~ /GC./)    { $aa = "ala"; }  # alanine
elsif ($codon =~ /TG[TC]/) { $aa = "cys"; }  # cysteine
elsif ($codon =~ /GA[TC]/) { $aa = "asp"; }  # aspartic acid
elsif ($codon =~ /GA[AG]/) { $aa = "glu"; }  # glutamic acid
..
# non bases: ^ inside [ ] means any character NOT in the list
if ($dna =~ /[^acgt]/) {
    print "Invalid base pairs found\n";
}
# check input is a number
# (strictly, this only checks that the input contains a digit)
if ($input[$i] !~ /[0-9]/) {
    print "Invalid input\n";
    exit;
}

• The . means any char except newline
• [0-9] is a character range
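The codon patterns above could be wrapped in a small helper, sketched here under some assumptions: the sub name codon_to_aa is hypothetical, it covers only the four codon families shown on the slide, and it adds the ^ and $ anchors so that only a complete three-letter codon matches.

```perl
use strict;
use warnings;

# Hypothetical helper covering only the four codon families from the slide.
sub codon_to_aa {
    my ($codon) = @_;
    return "ala" if $codon =~ /^GC.$/;      # . accepts any third base
    return "cys" if $codon =~ /^TG[TC]$/;   # third base must be T or C
    return "asp" if $codon =~ /^GA[TC]$/;
    return "glu" if $codon =~ /^GA[AG]$/;
    return "unknown";                       # anything not covered above
}

foreach my $codon ("GCT", "TGC", "GAT", "GAG", "TGG") {
    print "$codon => ", codon_to_aa($codon), "\n";
}
```

TGG falls through to "unknown" because G is not in the class [TC].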


Escaping characters
# look for the $$$$ separator in a .sd structure file

# will this work?
if ($line =~ /$$$$/) {
    print "New structure found\n";
}

Characters with special meanings in REs (e.g. $, ^, [, etc.) need to be “escaped” with \ if you need to search for them.


Escaping characters
# look for the $$$$ separator in a .sd structure file

# This should work
if ($line =~ /\$\$\$\$/) {
    print "New structure found\n";
}

The $ has special meaning, so in this example it needs to be escaped with \
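A minimal sketch of the escaped pattern at work. It also shows Perl's \Q...\E construct, which escapes metacharacters held in a variable; this goes slightly beyond the slide, and the variable name $separator is an assumption made for the example.

```perl
use strict;
use warnings;

# Single quotes: no interpolation, so this is four literal dollar signs.
my $line = '$$$$';

# Each $ is escaped by hand, as on the slide.
my $escaped = ($line =~ /\$\$\$\$/) ? 1 : 0;

# \Q...\E escapes any metacharacters in the interpolated variable.
my $separator = '$$$$';   # hypothetical variable name
my $quoted = ($line =~ /\Q$separator\E/) ? 1 : 0;

print "escaped=$escaped quoted=$quoted\n";
```

Both tests succeed; the second form is convenient when the text to search for comes from user input or a file.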


Special characters
Characters which cannot be typed in or are not easily visible also use the \
notation. In addition, Perl defines certain metacharacters for REs.

\n      newline
\t      tab
\s      whitespace (space, tab, newline, etc.)
\r      carriage return
\033    character given by its octal code
\w      match a word char (letter, digit or underscore)
\W      match a non-word char
\d      match a digit
\D      match a non-digit
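A short sketch of a few of these metacharacters, applied to a hypothetical record (an identifier, a tab, then a number) made up for the example:

```perl
use strict;
use warnings;

# Hypothetical record: identifier, tab, number.
my $record = "seq_01\t245";

my $has_tab = ($record =~ /\t/) ? 1 : 0;

# Copy the record, then delete every non-digit from the copy.
(my $digits = $record) =~ s/\D//g;

# Copy the record, then delete every non-word char (here only the tab).
(my $words = $record) =~ s/\W//g;

print "has_tab=$has_tab digits=$digits words=$words\n";
```

The underscore survives the second substitution because \w counts it as a word character.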


Functions using REs: split


The split function “breaks up” a string according to a regular
expression and puts the pieces into an array. Often used in
reading databases or tabular output.

@fields = split(/\s/, $record);
..
# for parsing lines of the form a=b
($param, $value) = split(/=/, $line);

Example lines of the form a=b:

/sex="male"
/cell_line="HuS-L12"
/organism="H. Sapiens"
/dev_stage="embryo"
/gene="PCCX1"
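As a rough sketch, one of the qualifier lines above can be split on = and then tidied with two substitutions. Removing the leading slash and the quotes is an extra step assumed here for illustration; the slide only shows the split itself.

```perl
use strict;
use warnings;

my $line = '/gene="PCCX1"';   # one of the example lines, with plain quotes

# split on the = sign: the pieces go into $param and $value
my ($param, $value) = split(/=/, $line);

$param =~ s/^\///;   # strip the leading slash (the / must be escaped)
$value =~ s/"//g;    # strip the surrounding quotes

print "$param is $value\n";
```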


Functions using REs: split


# Example 2: Reading tabular output of blast (-m8)
while ($line = <>) {
    chomp($line);                  # remove newlines
    @fields = split(/\s/, $line);
    $score  = $fields[11];
    $length = $fields[3];
    $pcid   = $fields[2];
    ...
BE090210  gi|42406291|ref|NC_000007.10|NC_000007  98.40  500  6  2  23  521  107552639  107552141  0.0      848.4
BQ379587  gi|42406304|ref|NC_000017.8|NC_000017   97.55  204  5  0  24  227  40141778   40141575   1.4e-78  298.4
BQ379587  gi|42406304|ref|NC_000017.8|NC_000017   94.69  207  5  3  24  227  40115930   40115727   1.6e-60  238.4
BQ379587  gi|42406304|ref|NC_000017.8|NC_000017   95.00  200  6  2  24  221  20569436   20569239   2.6e-56  224.4
BQ379587  gi|42406304|ref|NC_000017.8|NC_000017   95.96  198  8  0  24  221  16936304   16936107   4.1e-55  220.4
BQ379587  gi|42406304|ref|NC_000017.8|NC_000017   94.00  200  8  2  24  221  18544194   18544391   4.5e-43  180.4
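The loop above can be tried on a single line without an input file. This sketch assumes the fields are separated by tabs and builds one line from the values of the second hit in the table; the field positions 2, 3 and 11 are the ones used in the slide's code.

```perl
use strict;
use warnings;

# One line of tabular blast output, assumed to be tab-separated.
my $line = join("\t",
    "BQ379587", "gi|42406304|ref|NC_000017.8|NC_000017",
    "97.55", "204", "5", "0", "24", "227",
    "40141778", "40141575", "1.4e-78", "298.4") . "\n";

chomp($line);                       # remove the newline
my @fields = split(/\t/, $line);    # one array element per column

# an array slice picks out several fields at once
my ($pcid, $length, $score) = @fields[2, 3, 11];

print "identity=$pcid length=$length score=$score\n";
```

Array indices start at 0, so $fields[11] is the twelfth and last column.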


join
The join function does the reverse of split: it sticks fragments together
using a string (strictly speaking, join is not an RE function).

# @DNA is an array of nucleotides
# more useful to have a string
@DNA = <DNAFILE>;
$DNA = join('', @DNA);
..

# create a tab-delimited record
# (double quotes are needed: '\t' in single quotes is a literal backslash and t)
$record = join("\t", $field1, $field2, $field3);
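A minimal sketch of both uses of join, with hypothetical data standing in for the file contents. It also shows why the separator for a tab-delimited record has to be written in double quotes.

```perl
use strict;
use warnings;

# Hypothetical lines, as they might be read from a sequence file.
my @lines = ("ACGT\n", "GGCC\n");
chomp @lines;                    # remove the newline from every element
my $DNA = join('', @lines);      # empty separator: pieces are simply concatenated

# Double quotes: \t is turned into a real tab character.
my $record = join("\t", "BE090210", "98.40", "500");

# Single quotes: \t stays as two characters, a backslash and a t.
my $wrong = join('\t', "a", "b");

print "$DNA\n$record\n$wrong\n";
```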


Some useful regular expressions..


# remove leading spaces
$line =~ s/^\s*//g;

# remove trailing spaces
$line =~ s/\s*$//g;

# remove line numbers from DNA sequence in GenBank
$dna =~ s/[0-9]//g;
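These substitutions can be tried on hypothetical input, as in the sketch below. The GenBank-style line is invented for the example, and the last step (removing the remaining spaces) is an addition that the slide does not show.

```perl
use strict;
use warnings;

# Hypothetical input line with spaces at both ends.
my $line = "   ACGT GGCC  \n";
$line =~ s/^\s*//;    # remove leading whitespace
$line =~ s/\s*$//;    # remove trailing whitespace, including the newline

# Hypothetical sequence line in GenBank style: a position number, then bases.
my $dna = "        1 gatcctccat atacaacggt";
$dna =~ s/[0-9]//g;   # remove the line number
$dna =~ s/\s//g;      # extra step: also remove the spaces between blocks

print "[$line]\n[$dna]\n";
```

Whitespace inside the line is kept by the first two substitutions, because the anchors restrict them to the ends of the string.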


Summary
• Text in Perl can be searched and manipulated with regular expressions (REs)
• By default REs are enclosed by two / , e.g. /word/
• Modifiers such as i or g can be added to change the default behaviour, e.g.
  /word/i searches for “word” regardless of case
• REs can be just text, but they can also contain metacharacters to be
  more or less specific. Examples include:
    • ^ and $ (anchors) to restrict the search to the ends of the text
    • char classes with [], e.g. [0-9], to allow a selection of chars
    • the full stop . which means any character (except newline)
• Characters which mustn’t be interpreted as metacharacters need to be
  escaped, e.g. \$, \^ , \[ , etc.
• Useful functions include split and join; useful operators include the
  binding operator =~ and the substitution operator s.
