
Regular Expressions & Finite State Automata
Lecture 1

What is a Regular Expression
Notation for specifying a set of strings
Used for search
Corpus: text(s) to search through / learn from
Used to define a (formal) language

Creating a Regular Expression
Perl notation uses / / around regexes
Expressions composed of:

Category            Symbols                  Example        Example Matches
Literal Characters  (none)                   /the/          the, other (not The)
Character Sets      . [ ] \d \D \w \W \s \S  /[a-zA-Z]/     A, a, t, S, Z, ab
Disjunction         |                        /(T|t)he/      The, the
Boundaries          \b \B ^ $                /\bthe\b/      the, the. (not other)
Quantifiers         * + ? {}                 /colou?r/      color, colour
Special Characters  \n \t \.                 /.+\.com/      Yahoo.com
Capturing           ( ) \1                   /(\d{5}).+\1/  Same zip twice
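
The categories above can be tried out with a minimal Python sketch. The test strings (including the zip codes) and the helper name `matches` are illustrative assumptions, not part of the slides; `re.search` is used because the slide examples are searches, not whole-string matches.

```python
import re

# Each entry: (pattern, test string). Test strings are illustrative only.
examples = [
    (r"the", "other"),                    # literal characters match inside longer words
    (r"the", "The"),                      # literals are case-sensitive: no match
    (r"[a-zA-Z]", "ab"),                  # a character set matches one character
    (r"(T|t)he", "The"),                  # grouped disjunction
    (r"\bthe\b", "the."),                 # \b is a word boundary
    (r"\bthe\b", "other"),                # no boundary before "the": no match
    (r"colou?r", "colour"),               # ? makes the preceding u optional
    (r".+\.com", "Yahoo.com"),            # \. is a literal dot
    (r"(\d{5}).+\1", "91711 and 91711"),  # \1 repeats the captured digits
    (r"(\d{5}).+\1", "91711 and 91710"),  # different second zip: no match
]

def matches(pattern, text):
    """True if the pattern is found anywhere in text (search, not full match)."""
    return re.search(pattern, text) is not None

for pattern, text in examples:
    print(pattern, "|", text, "|", matches(pattern, text))
```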

Creating a Regular Expression
Defining a regex involves iteratively improving:
Accuracy/Precision: minimizing false positives
e.g. /the/ -> /\bthe\b/
Coverage/Recall: minimizing false negatives
e.g. /the/ -> /(T|t)he/
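
One way to make the iteration concrete is to count false positives and false negatives on a small labeled sample. The sketch below assumes a hypothetical five-line corpus and a hypothetical helper `evaluate`; neither comes from the slides.

```python
import re

# Hypothetical mini-corpus: (text, does it really contain the word "the"?)
corpus = [
    ("the cat", True),
    ("The dog", True),
    ("other people", False),
    ("bathe daily", False),
    ("see the sea", True),
]

def evaluate(pattern):
    """Return (false_positives, false_negatives) for a pattern on the corpus."""
    fp = fn = 0
    for text, has_the in corpus:
        found = re.search(pattern, text) is not None
        if found and not has_the:
            fp += 1
        elif has_the and not found:
            fn += 1
    return fp, fn

# Each refinement removes one kind of error.
for pattern in [r"the", r"\bthe\b", r"\b(T|t)he\b"]:
    print(pattern, evaluate(pattern))
```

On this sample, adding \b removes the false positives and adding the (T|t) disjunction removes the remaining false negative.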

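Using Regular Expressions
Generally used to search or replace:
Perl:
$str = "other people";
if ($str =~ /the/) { print "match\n"; }   # true: "other" contains "the"

Java:
import java.util.regex.*;

Pattern r = Pattern.compile("\\d");   // backslash doubled inside a Java string
Matcher m = r.matcher("D0es th1s c0nta1n d1g1ts?");
if (m.find()) { System.out.println(m.group()); }   // prints the first digit found

Python:
import re
searchObj = re.search(r"the", "other people")   # match object, or None if not found
phone = "Tel: 209-867-5309"
re.sub(r"\d", "#", phone)   # returns "Tel: ###-###-####"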

References
Good tutorials and cheat sheets available online:
http://regexone.com/lesson
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
http://donovanh.com/pages/regex_list.html

The textbook also has a cheat sheet on its cover

ELIZA (1966)

Cascading regexes to simulate a Rogerian psychologist
Available online: http://nlp-addiction.com/eliza/
Embodiment of Searle's Chinese Room

ELIZA

s/I'm/YOU ARE/
s/(M|m)y/YOUR/

ELIZA

s/YOU ARE (depressed|sad)/I AM SORRY TO HEAR YOU ARE \1/
s/YOU ARE (depressed|sad)/WHY DO YOU THINK THAT YOU ARE \1/

ELIZA

s/\ball\b/IN WHAT WAY/
s/\balways\b/CAN YOU THINK OF A SPECIFIC EXAMPLE/
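
The cascade can be sketched in Python as an ordered list of substitutions, each applied to the output of the previous one. This is a minimal sketch under stated assumptions, not the original ELIZA: the function name `respond`, the \b anchors on the pronoun rules, and the `.*` wrappers that replace the whole utterance are additions for illustration.

```python
import re

# Rules are applied in order; each is (pattern, replacement).
# Assumptions for this sketch: \b anchors on the pronoun rules, and .*
# wrappers so that a keyword rule replaces the whole utterance.
RULES = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\b(?:M|m)y\b", "YOUR"),
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Apply every rule in sequence to the (possibly already rewritten) text."""
    text = utterance
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(respond("I'm depressed"))
print(respond("Men are all alike"))
```

The order matters: the pronoun rule must run first so that the later rule can see YOU ARE in its input.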

Finite State Automata

Finite State Automata (FSAs)
Regular expressions are a convenient way to describe an FSA:
Sheep language: /baa+!/

FSAs and their probabilistic cousins (Markov models) are used extensively in NLP.
Perfectly capture regular languages
Capture parts of natural languages: phonology, morphology, syntax.
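
As a quick check of what the sheep language contains, the regex can be tested in Python. The helper name `is_sheep` is an assumption for this sketch; `fullmatch` is used because membership in a language concerns the whole string, not a substring.

```python
import re

SHEEP = re.compile(r"baa+!")

def is_sheep(s):
    """True if the whole string is in the sheep language (fullmatch, not search)."""
    return SHEEP.fullmatch(s) is not None

for s in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(s, is_sheep(s))
```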

FSA representation

States are represented by circles
q0 or state with incoming arrow: start state
Double-circled states: final/accepting states
Directed links: transitions between states

Imagine a tape holding the input; try to match each symbol in turn to a transition.

Formal Representation
Specify the following:
Q = {q0, q1, ..., qN-1}: a finite set of N states
Σ: a finite input alphabet of symbols (symbols can have internal structure)
q0: the start state
F: the set of final states, F ⊆ Q
δ(q,i): a transition function that maps Q × Σ to Q
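
The five components map directly onto a small data structure. The sketch below is one possible encoding, with a hypothetical class name `FSA` and illustrative state names; the transition function is stored as a partial dictionary, so a missing entry means there is no transition.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: frozenset      # Q
    alphabet: frozenset    # Sigma
    start: str             # q0
    finals: frozenset      # F
    delta: dict            # transition function: (state, symbol) -> state

    def is_well_formed(self):
        """Check q0 is in Q, F is a subset of Q, and delta maps Q x Sigma into Q."""
        if self.start not in self.states or not self.finals <= self.states:
            return False
        return all(
            q in self.states and i in self.alphabet and target in self.states
            for (q, i), target in self.delta.items()
        )

# Sheep language /baa+!/; the state names are illustrative.
sheep = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)
print(sheep.is_well_formed())
```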

Transition Table
Convenient for computer representation, too: one row per State, one column per Input symbol.
D-Recognize
Deterministic: no choice points
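
A table-driven recognizer can be sketched as follows. The table is an assumed encoding of the sheep language /baa+!/ (states 0 to 4, with None for an empty cell), and the function name `d_recognize` is chosen for this sketch. Because the machine is deterministic, each step is a single table lookup with nothing to back up over.

```python
# Transition table for the sheep language /baa+!/ (assumed encoding).
# Rows are states 0..4; columns are the inputs b, a, !; None means "no transition".
COLUMNS = {"b": 0, "a": 1, "!": 2}
TABLE = [
    [1,    None, None],  # state 0
    [None, 2,    None],  # state 1
    [None, 3,    None],  # state 2
    [None, 3,    4],     # state 3
    [None, None, None],  # state 4 (accepting)
]
ACCEPTING = {4}

def d_recognize(tape):
    """Return True if the deterministic FSA accepts the whole tape."""
    state = 0
    for symbol in tape:
        column = COLUMNS.get(symbol)
        if column is None:
            return False          # symbol outside the alphabet
        state = TABLE[state][column]
        if state is None:
            return False          # empty table cell: reject
    return state in ACCEPTING     # accept only if the tape ends in a final state

print(d_recognize("baaa!"), d_recognize("ba!"))
```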

Generative Uses

Any model that recognizes a formal language (FSA, regex, CFG) can be used to generate valid strings.
Starting in q0, select random transitions until a final state is reached.
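
The random walk described above can be sketched in a few lines. The transition lists encode the sheep language as an assumption, and `generate` is a hypothetical name; the walk stops the first time it reaches a final state.

```python
import random

# Sheep-language FSA (assumed encoding): state -> list of (symbol, next_state).
TRANSITIONS = {
    0: [("b", 1)],
    1: [("a", 2)],
    2: [("a", 3)],
    3: [("a", 3), ("!", 4)],
    4: [],
}
FINAL = {4}

def generate(rng=random):
    """Random walk from state 0; stop on reaching a final state."""
    state, output = 0, []
    while state not in FINAL:
        symbol, state = rng.choice(TRANSITIONS[state])
        output.append(symbol)
    return "".join(output)

print(generate())
```

Every string produced this way is in the language, since the walk only follows transitions the recognizer would accept.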

Non-Deterministic FSAs
More than one transition may be possible for a particular state and input combination,
or the machine may use epsilon transitions, where no input characters are read.

Non-Deterministic FSAs
In an NFSA there exists at least one path through the machine for any string in the language defined by the machine.
Not all paths through the machine for an acceptable string lead to an accept state.
No paths through the machine lead to an accept state for a string not in the language.
Challenge: what to do if we make a wrong choice?

Resolving Non-Determinism
Backup: when we reach a choice point, mark the state and input position (search-state), then roll back if needed.
Look-Ahead: look at the following input symbols to try to choose the correct transition.
Parallelism: follow each of the transition options in parallel.
Convert: any NFSA can be converted to an equivalent deterministic FSA.
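
Of the strategies above, parallelism is the shortest to sketch: instead of one current state, track the set of all states the machine could be in. The NFSA below is an assumed non-deterministic encoding of the sheep language, and the names `epsilon_closure` and `recognize_parallel` are chosen for this sketch. It has no epsilon transitions, but the code handles them.

```python
# Non-deterministic sheep-language FSA (assumed encoding): a cell may hold
# several destination states. The "" key would mark an epsilon transition.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}
ACCEPTING = {4}
EPSILON = ""

def epsilon_closure(states):
    """All states reachable from `states` without reading input."""
    closure, frontier = set(states), list(states)
    while frontier:
        q = frontier.pop()
        for nxt in NFSA.get((q, EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                frontier.append(nxt)
    return closure

def recognize_parallel(tape):
    """Follow every transition option at once by tracking a set of states."""
    current = epsilon_closure({0})
    for symbol in tape:
        moved = set()
        for q in current:
            moved |= NFSA.get((q, symbol), set())
        current = epsilon_closure(moved)
    return bool(current & ACCEPTING)

print(recognize_parallel("baaaa!"), recognize_parallel("ba!"))
```

Precomputing these state sets ahead of time, rather than at recognition time, is the idea behind the Convert option.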

Backup
Need to modify transition table:
Add epsilon transition column
Allow multiple destination states for a given search-state, e.g. a table cell holding 2,3.

NFSA Search: BFS or DFS
Keep a stack (DFS) or queue (BFS) of search-states remaining to explore.
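
The agenda-based search can be sketched with a single deque that serves as either a stack or a queue. The NFSA encoding of the sheep language and the name `nd_recognize` are assumptions for this sketch; a search-state is a (state, input position) pair.

```python
from collections import deque

# Non-deterministic sheep-language FSA (assumed encoding).
NFSA = {
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],   # choice point
    (3, "!"): [4],
}
ACCEPTING = {4}
EPSILON = ""            # key for epsilon transitions (none in this machine)

def nd_recognize(tape, depth_first=True):
    """Search over (state, input position) pairs kept on an agenda."""
    agenda = deque([(0, 0)])
    seen = set()
    while agenda:
        # Popping from the right gives a stack (DFS); from the left, a queue (BFS).
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in ACCEPTING:
            return True
        # Epsilon moves keep the tape position; ordinary moves advance it.
        for nxt in NFSA.get((state, EPSILON), []):
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in NFSA.get((state, tape[pos]), []):
                agenda.append((nxt, pos + 1))
    return False

print(nd_recognize("baaa!"), nd_recognize("baaa!", depth_first=False))
```

The `seen` set keeps the search from revisiting a search-state, which also guards against loops through epsilon transitions.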

Computing Theory
You may recall from (or learn in) COMP 147:
The class of languages definable by regular expressions is the same as the class definable by FSAs. These are called regular languages.
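
For a single language, the equivalence can be spot-checked empirically. The sketch below compares a hand-built FSA for the sheep language with the regex /baa+!/ on every string up to a given length. It is a sanity check under assumed encodings and names (`fsa_accepts`, `agree_up_to`), not a proof.

```python
import re
from itertools import product

# Hand-built deterministic FSA for the sheep language (assumed encoding).
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
ACCEPTING = {4}

def fsa_accepts(s):
    """Run the FSA over s; a missing transition rejects."""
    state = 0
    for symbol in s:
        state = DELTA.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING

def agree_up_to(max_len, alphabet="ba!"):
    """Compare the FSA and the regex on every string up to max_len symbols."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if fsa_accepts(s) != (re.fullmatch(r"baa+!", s) is not None):
                return False
    return True

print(agree_up_to(6))
```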

Your Turn
Lab 1: Regular Expression Practice
Project 1: ELIZA reborn