
Introduction to Regular Expressions

Objectives:
- Use regular expressions to identify, replace, and modify text, word patterns, or characters.
- Become familiar with handling strings using regular expressions.
- Develop a strong background in the Linux/Unix commands that use regular expressions for string handling:
    o sed
    o awk
    o grep
- Expose students to the tools that use or support regular expressions.

Concepts
What are Regular Expressions?
A regular expression, often called a pattern in Perl, is a template that either matches or does not match a given string. That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don't. A pattern may match just one possible string, or just two or three, or a dozen, or a hundred, or an infinite number. Or it may match all strings except for one, or except for some, or except for an infinite number.

Regular expressions form a simple language because a pattern has just one task: to look at a string and say "it matches" or "it doesn't match". That's all it does.

One of the tools that use regular expressions is the Unix grep command, which prints the text lines that match a given pattern.

Example: If you wanted to see which lines in a given file mention flint and, somewhere later on the same line, stone, you might do something like this with the Unix grep command:

    $ grep 'flint.*stone' chapter*.txt
    chapter3.txt: a piece of flint, a stone which may be used to start a fire by striking
    chapter3.txt: found obsidian, flint, granite, and small stones of basaltic rock, which
    chapter9.txt: flintlock rifle in poor condition. The sandstone mantle held several
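
The same line filter can be sketched in Perl itself. This is only an illustration of the idea, not how the grep command is implemented; the sub name matching_lines and the sample lines are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that mention "flint" and,
# somewhere later on the same line, "stone".
sub matching_lines {
    my @lines = @_;
    return grep { /flint.*stone/ } @lines;
}

# Made-up sample input, not taken from any real chapter file.
my @sample = (
    "a piece of flint, a stone which may be used to start a fire",
    "a stone lay next to the flint",    # wrong order, so no match
    "flintlock rifle in poor condition. The sandstone mantle",
);

print "$_\n" for matching_lines(@sample);
```

Only the first and third sample lines are printed, because the pattern requires flint to come before stone.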

Using Simple Patterns


To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes ( / ):

    #!/usr/bin/perl
    $_ = qq{yabba dabba doo};
    if (/abba/) {
        print "It matched!\n";
    }

The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value. In this case, it's found more than once, but that does not make any difference. If it's found at all, it's a match; if it's not in there at all, it fails. Because the pattern match is generally being used to return a true or false value, it is almost always found in the conditional expression of if or while. All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the 11 characters of coke, a tab, and sprite.
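
As a minimal sketch of the backslash escape mentioned above, the following checks a string for coke, a tab, and sprite. The sub name has_coke_tab_sprite is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains "coke", a tab, then "sprite".
sub has_coke_tab_sprite {
    local $_ = shift;                  # match against $_, as in the text
    return /coke\tsprite/ ? 1 : 0;
}

print has_coke_tab_sprite("coke\tsprite"), "\n";   # prints 1 (tab between the words)
print has_coke_tab_sprite("coke sprite"),  "\n";   # prints 0 (a space is not a tab)
```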

Examples: Simple word matching


Example 1: "Hello World" is a simple double-quoted string. World is the regular expression, and the // enclosing /World/ tells Perl to search a string for a match. The operator =~ associates the string with the regexp match and produces a true value if the regexp matched, or false if it did not.
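
A minimal sketch of the kind of match Example 1 describes. The sub name mentions_world is a hypothetical helper, not part of the handout.

```perl
use strict;
use warnings;

# Hypothetical helper: =~ binds the string on the left to the regexp
# on the right; the result is true only if the regexp matches.
sub mentions_world {
    my ($string) = @_;
    return $string =~ /World/ ? 1 : 0;
}

if (mentions_world("Hello World")) {
    print "It matches\n";
} else {
    print "It doesn't match\n";
}
```

Matching is case-sensitive by default, so "Hello world" with a lowercase w would not match.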

Example 2: There are useful variations on this. The sense of the match can be reversed by using the !~ operator.
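
A minimal sketch of the reversed sense described in Example 2, using a hypothetical helper named lacks_world:

```perl
use strict;
use warnings;

# Hypothetical helper: !~ is true when the regexp does NOT match.
sub lacks_world {
    my ($string) = @_;
    return $string !~ /World/ ? 1 : 0;
}

print lacks_world("Hello World"), "\n";   # prints 0: the string contains World
print lacks_world("Hello Perl"),  "\n";   # prints 1: no World in the string
```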

Example 3: The default // delimiters for a match can be changed to arbitrary delimiters by putting an m in front.
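
A minimal sketch of alternative delimiters, as described in Example 3. The sub names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern written with different delimiters.
sub match_with_bang   { my ($s) = @_; return $s =~ m!World! ? 1 : 0; }
sub match_with_braces { my ($s) = @_; return $s =~ m{World} ? 1 : 0; }

# With braces as delimiters, the slashes in a path need no backslashes.
sub mentions_usr_bin  { my ($s) = @_; return $s =~ m{/usr/bin} ? 1 : 0; }

print match_with_bang("Hello World"),    "\n";   # prints 1
print match_with_braces("Hello World"),  "\n";   # prints 1
print mentions_usr_bin("/usr/bin/perl"), "\n";   # prints 1
```

Choosing a delimiter that does not occur in the pattern avoids having to escape it.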

Exercise 1: Which of the following regexps would match "Hello World"? Explain why each one does or does not match.

About Metacharacters
If patterns matched only simple literal strings, they would not be very useful. That is why there are a number of special characters, called metacharacters, that have special meanings in regular expressions. For example, the dot (.) is a wildcard character: it matches any single character except a newline (\n). So, the pattern /bet.y/ would match betty. Or it would match betsy, or bet=y, or any other string that has bet, followed by any one character except a newline, followed by y.

Below is the table of other metacharacters used in Perl.

    Metacharacter   Meaning
    \               Quote the next metacharacter
    ^               Match the beginning of the line
    $               Match the end of the line (or before newline at the end)
    |               Alternation
    ()              Grouping
    []              Bracketed character class
    .               Match exactly one character (except a newline), regardless of what the character is

A character class, a list of possible characters inside square brackets ( [ ] ), matches any single character from within the class. It matches just one single character, but that one character may be any of the ones listed.
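
The wildcard behavior of the dot in /bet.y/ can be checked with a short sketch. The sub name matches_bet_y is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string has "bet", then any one character
# other than a newline, then "y".
sub matches_bet_y {
    my ($string) = @_;
    return $string =~ /bet.y/ ? 1 : 0;
}

print matches_bet_y("betty"),   "\n";   # prints 1
print matches_bet_y("bet=y"),   "\n";   # prints 1
print matches_bet_y("bety"),    "\n";   # prints 0: no character between bet and y
print matches_bet_y("bet\ny"),  "\n";   # prints 0: the dot does not match a newline
```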

Quantifiers

    Quantifier   Meaning
    *            Match 0 or more times
    +            Match 1 or more times
    ?            Match 1 or 0 times
    {n}          Match exactly n times
    {n,}         Match at least n times
    {n,m}        Match at least n but not more than m times
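
A minimal sketch exercising each quantifier from the table. The helper is_match and the sample strings are made up for illustration; qr// is the standard Perl operator for building a pattern that can be passed around as a value.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $string matches the compiled pattern $re, else 0.
sub is_match {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

print is_match("gd",     qr/^go*d$/),    "\n";   # prints 1: * allows zero o's
print is_match("gd",     qr/^go+d$/),    "\n";   # prints 0: + needs at least one o
print is_match("color",  qr/^colou?r$/), "\n";   # prints 1: ? makes the u optional
print is_match("haaa",   qr/^ha{3}$/),   "\n";   # prints 1: exactly three a's
print is_match("ha",     qr/^ha{2,}$/),  "\n";   # prints 0: at least two a's needed
print is_match("haaaaa", qr/^ha{2,4}$/), "\n";   # prints 0: five a's is more than four
```

The ^ and $ anchors are used here so that each pattern has to account for the whole string.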

Examples: Metacharacters
Example 4: As mentioned, not all characters can be used as is in a match. Metacharacters are reserved for use in regexp notation. It is important to know that a metacharacter can be matched literally by putting a backslash before it.
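
A minimal sketch of escaping a metacharacter, as described in Example 4. The sub names and sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern with and without the backslash.
sub has_literal_plus { my ($s) = @_; return $s =~ /2\+2/ ? 1 : 0; }  # a 2, a literal +, a 2
sub has_repeated_two { my ($s) = @_; return $s =~ /2+2/  ? 1 : 0; }  # one or more 2s, then a 2

print has_literal_plus("2+2=4"), "\n";   # prints 1
print has_repeated_two("2+2=4"), "\n";   # prints 0: unescaped + is a quantifier
print has_repeated_two("222"),   "\n";   # prints 1
```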

Example 5: Escape sequences are used in double-quoted strings, and in fact regexps in Perl are mostly treated as double-quoted strings. This means that variables can be used in them as well.
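
A minimal sketch of a variable used inside a regexp, as described in Example 5. The sub names and the sample words are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: $word is interpolated into the pattern, so any
# metacharacters it contains keep their special meaning.
sub contains {
    my ($text, $word) = @_;
    return $text =~ /$word/ ? 1 : 0;
}

# Braces mark where the variable name ends; without them Perl would
# look for a variable called $prefixcat.
sub has_prefix_then_cat {
    my ($text, $prefix) = @_;
    return $text =~ /${prefix}cat/ ? 1 : 0;
}

print contains("cathouse", "house"),            "\n";   # prints 1
print has_prefix_then_cat("housecat", "house"), "\n";   # prints 1
print has_prefix_then_cat("cathouse", "house"), "\n";   # prints 0
```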

Example 6: If the regexp matches anywhere in the string, it is considered a match. Sometimes, we'd like to specify where in the string the regexp should try to match. To do this, we use the anchor metacharacters ^ and $.
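
A minimal sketch of the anchors described in Example 6. The sub names and the sample word are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers showing where each anchor pins the match.
sub starts_with_keeper { my ($s) = @_; return $s =~ /^keeper/  ? 1 : 0; }
sub ends_with_keeper   { my ($s) = @_; return $s =~ /keeper$/  ? 1 : 0; }
sub is_exactly_keeper  { my ($s) = @_; return $s =~ /^keeper$/ ? 1 : 0; }

print starts_with_keeper("housekeeper"),   "\n";   # prints 0: keeper is not at the start
print ends_with_keeper("housekeeper"),     "\n";   # prints 1
print ends_with_keeper("housekeeper\n"),   "\n";   # prints 1: $ also matches before a final newline
print is_exactly_keeper("keeper"),         "\n";   # prints 1
```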

Example 7:

Grouping in Patterns
As in mathematics, parentheses ( ( ) ) may be used for grouping. So, parentheses are also metacharacters. As an example, the pattern /fred+/ matches strings like freddddd, but strings like that do not show up often in real life. But the pattern /(fred)+/ matches strings like fredfredfred, which is more likely to be what you wanted.

The parentheses also give us a way to reuse part of the string directly in the match. We can use back references to refer to text that we matched in the parentheses. We denote a back reference as a backslash followed by a number, like \1, \2, and so on. The number denotes the parentheses group.

When we use the parentheses around the dot, we match any non-newline character. We can match again whichever character we matched in those parentheses by using the back reference \1:

    #!/usr/bin/perl
    $_ = "abba";    # sample input (assumed here): "bb" is a doubled character
    if (/(.)\1/) {
        print "It matched same character next to itself!\n";
    }
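
The difference between /fred+/ and /(fred)+/, and the back reference \1, can be checked with a minimal sketch. The sub names are hypothetical, and the ^ and $ anchors are added here so that each pattern must account for the whole string.

```perl
use strict;
use warnings;

# Hypothetical helpers: + applies to the d alone, or to the whole group.
sub is_fre_many_d   { my ($s) = @_; return $s =~ /^fred+$/   ? 1 : 0; }
sub is_many_fred    { my ($s) = @_; return $s =~ /^(fred)+$/ ? 1 : 0; }

# \1 must match the same text that the first group just matched.
sub has_doubled_char { my ($s) = @_; return $s =~ /(.)\1/ ? 1 : 0; }

print is_fre_many_d("freddddd"),    "\n";   # prints 1
print is_many_fred("freddddd"),     "\n";   # prints 0
print is_many_fred("fredfredfred"), "\n";   # prints 1
print has_doubled_char("abba"),     "\n";   # prints 1: bb
print has_doubled_char("abab"),     "\n";   # prints 0: no character next to itself
```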

Character Classes
A character class, a list of possible characters inside square brackets ( [ ] ), matches any single character from within the class. It matches just one single character, but that one character may be any of the ones listed. For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-), so that class may also be written as [a-cw-z].

Sometimes, it is easier to specify the characters left out, rather than the ones within the character class. A caret (^) at the start of the character class negates it. That is, [^def] will match any single character except one of those three.

Character Shortcuts

    Symbol   Meaning                As Bytes
    \d       Digit                  [0-9]
    \D       Nondigit               [^0-9]
    \s       Whitespace             [ \t\n\r\f]
    \S       Nonwhitespace          [^ \t\n\r\f]
    \w       Word character         [a-zA-Z0-9_]
    \W       Non-(word character)   [^a-zA-Z0-9_]
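
A minimal sketch using the character shortcuts. The sub names and sample strings are assumptions made for illustration, and the inputs are plain ASCII so the shortcuts behave like the byte classes in the table.

```perl
use strict;
use warnings;

# Hypothetical helpers built from the shortcuts in the table.
sub looks_like_time { my ($s) = @_; return $s =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0; }  # hh:mm:ss
sub is_single_word  { my ($s) = @_; return $s =~ /^\w+$/ ? 1 : 0; }            # word characters only
sub has_whitespace  { my ($s) = @_; return $s =~ /\s/ ? 1 : 0; }

print looks_like_time("12:30:45"),        "\n";   # prints 1
print looks_like_time("12:30"),           "\n";   # prints 0
print is_single_word("fred_42"),          "\n";   # prints 1
print is_single_word("fred flintstone"),  "\n";   # prints 0: the space is not a word character
print has_whitespace("fred flintstone"),  "\n";   # prints 1
```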

Example 8: Even though 'c' is the first character in the class, 'a' matches because the first character position in the string is the earliest point at which the regexp can match.

Example 9: The special character - acts as a range operator within character classes so that a contiguous set of characters can be written as a range.

Example 10: The special character ^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets.
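
The behavior described in Examples 8 to 10 can be sketched as follows. The sub names and sample strings are made up for illustration; $1 is the standard Perl variable holding the text captured by the first pair of parentheses.

```perl
use strict;
use warnings;

# Example 8 (sketch): return the character that [cab] actually matched.
sub first_of_cab {
    my ($s) = @_;
    return $s =~ /([cab])/ ? $1 : undef;
}

# Example 9 (sketch): a range inside a character class.
sub is_numbered_item { my ($s) = @_; return $s =~ /item[0-9]/ ? 1 : 0; }

# Example 10 (sketch): a negated class, any character but a, followed by "at".
sub has_non_a_then_at { my ($s) = @_; return $s =~ /[^a]at/ ? 1 : 0; }

print first_of_cab("abc"),        "\n";   # prints a: the earliest position wins
print is_numbered_item("item7"),  "\n";   # prints 1
print is_numbered_item("itemA"),  "\n";   # prints 0
print has_non_a_then_at("bat"),   "\n";   # prints 1
print has_non_a_then_at("aat"),   "\n";   # prints 0
```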

Option Modifiers
There are several option modifier letters, sometimes called flags, which may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.

Case-Insensitive Matching with /i
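By default, pattern matching is case-sensitive; appending /i makes the match ignore letter case. A minimal sketch, with hypothetical helper names and sample strings:

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern with and without /i.
sub says_yes_exact     { my ($s) = @_; return $s =~ /yes/  ? 1 : 0; }
sub says_yes_any_case  { my ($s) = @_; return $s =~ /yes/i ? 1 : 0; }

print says_yes_exact("YES"),     "\n";   # prints 0: case matters by default
print says_yes_any_case("YES"),  "\n";   # prints 1
print says_yes_any_case("Yes"),  "\n";   # prints 1
print says_yes_any_case("no"),   "\n";   # prints 0
```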

Matching Any Character with /s
If you might have newlines in your strings, and you want the dot to be able to match them, the /s modifier will do the job. It changes every dot in the pattern to act like the character class [\d\D] does, which is to match any character, even if it is a newline.
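
A minimal sketch of the effect of /s on a string that contains newlines. The sub names and the sample text are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern with and without /s.
sub barney_then_fred_one_line { my ($s) = @_; return $s =~ /Barney.*Fred/  ? 1 : 0; }
sub barney_then_fred_any_line { my ($s) = @_; return $s =~ /Barney.*Fred/s ? 1 : 0; }

my $text = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

print barney_then_fred_one_line($text), "\n";   # prints 0: the dot stops at each newline
print barney_then_fred_any_line($text), "\n";   # prints 1: with /s the dot crosses newlines
```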

Adding Whitespace with /x
/x is used to add arbitrary whitespace to a pattern to make it easier to read.

    /-?\d+\.?\d*/            # what is this doing?
    / -? \d+ \.? \d* /x      # better to read and understand
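
With /x, a pattern can also be spread over several lines, and a # inside the pattern starts a comment. The following is a minimal sketch showing that the compact and the spaced-out forms accept the same strings; the sub names and the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the compact form from the text.
sub has_number_compact {
    my ($s) = @_;
    return $s =~ /-?\d+\.?\d*/ ? 1 : 0;
}

# Hypothetical helper: the same pattern, laid out with /x.
sub has_number_spaced {
    my ($s) = @_;
    return $s =~ /
        -?      # an optional minus sign
        \d+     # one or more digits before the decimal point
        \.?     # an optional decimal point
        \d*     # optional digits after the decimal point
    /x ? 1 : 0;
}

for my $input ("-12.5", "42", "abc") {
    print "$input: ", has_number_compact($input), " ", has_number_spaced($input), "\n";
}
```

Because /x makes whitespace in the pattern insignificant, a literal space would have to be written in some other way, for example as \s or with a backslash in front of it.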

Exercises:
1. Make a program that prints each line of its input that mentions fred. It should not do anything for other lines of input. Does it match if your input string is Fred, Frederick, or Alfred? Make a small text file with a few lines mentioning "fred flintstone" and his friends, then use that file as input to this program.
2. Modify the program in item 1 to allow Fred to match as well. Does it match now if your input string is Fred, Frederick, or Alfred? (Please feel free to add lines with these names to the text file.)
3. Make a program that prints each line of its input that contains a period ( . ), ignoring other lines of input. Try it on the small text file from the previous exercise: does it notice Mr. Slate?
4. Make a program that prints each line that has a word that is capitalized but not ALL capitalized. Does it match Fred but neither fred nor FRED?
5. Write a program that prints out any input line that mentions both "wilma" and "fred".
