
Applications of Regular Expressions

● Regular expressions are useful in a wide variety of text processing tasks, and more generally in string processing, where the data need not be textual. Common applications include data validation, data scraping (especially web scraping), data wrangling, simple parsing, the production of syntax highlighting systems, and many other tasks. 
● While regexps would be useful on Internet search engines, processing them across the entire database could consume excessive computer resources, depending on the complexity and design of the regex. 
● Fundamentally, regular expressions are used in search tools. 
● Text file search in Unix (tool: egrep). 
● This command searches for a text pattern in files and lists the names of the files containing that pattern. 
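A minimal sketch of such a search (the pattern and file names here are illustrative, not from the original notes): 

egrep -l 'error [0-9]+' *.log 

The -l option makes egrep print only the names of the files that contain a match, rather than the matching lines themselves. 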

Applications of Finite Automata


● A finite automaton is useful for designing text editors. 
● A finite automaton is highly useful for designing spell checkers. 
● A finite automaton is useful in sequential circuit design (as a transducer), including the design of combinational and sequential circuits using Mealy and Moore machines. 
● A finite automaton is used for designing the lexical analyzer of a compiler. 
● A finite automaton is used for recognizing patterns described by regular expressions, as in the sketch below. 
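A minimal sketch of pattern recognition with a finite automaton (the language and state numbering are our own illustrative choices): a table-driven DFA that accepts binary strings ending in "01". 

def dfa_accepts(s): 
    # States: 0 = start, 1 = just saw '0', 2 = just saw "01" (accepting) 
    transitions = { 
        (0, '0'): 1, (0, '1'): 0, 
        (1, '0'): 1, (1, '1'): 2, 
        (2, '0'): 1, (2, '1'): 0, 
    } 
    state = 0 
    for ch in s: 
        state = transitions[(state, ch)] 
    return state == 2 

print(dfa_accepts("1101"))  # True: ends in "01" 
print(dfa_accepts("1110"))  # False 

The same table-driven idea underlies search in text editors: the editor compiles the search pattern into an automaton and runs it over the buffer one character at a time. 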

 
Applications of Pushdown Automata (PDA)

● For designing the parsing phase of a compiler (syntax analysis). 
● For implementing stack applications. 
● For evaluating arithmetic expressions (see the sketch below). 
● For solving the Tower of Hanoi problem. 
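A minimal sketch of stack-based expression evaluation, the mechanism a PDA models (the postfix token format is an illustrative choice): 

def eval_postfix(tokens): 
    # Each operator pops its two operands and pushes the result, 
    # exactly the push/pop discipline of a pushdown automaton's stack. 
    stack = [] 
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, 
           '*': lambda a, b: a * b, '/': lambda a, b: a / b} 
    for tok in tokens: 
        if tok in ops: 
            b = stack.pop()   # right operand is on top 
            a = stack.pop() 
            stack.append(ops[tok](a, b)) 
        else: 
            stack.append(float(tok)) 
    return stack.pop() 

print(eval_postfix("2 3 4 * +".split()))  # 14.0, i.e. 2 + (3 * 4) 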

Applications of Context-Free Grammars (CFG)

● For defining programming languages. 
● For parsing programs by constructing syntax trees. 
● For the translation of programming languages. 
● For describing arithmetic expressions (see the example grammar below). 
● For the construction of compilers. 
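A small illustrative example (the nonterminal names are our own choice, not from the original notes): a context-free grammar for arithmetic expressions, written in BNF. 

<expr>   ::= <expr> + <term> | <expr> - <term> | <term> 
<term>   ::= <term> * <factor> | <term> / <factor> | <factor> 
<factor> ::= ( <expr> ) | <number> | <identifier> 

Because * and / appear one level deeper in the grammar than + and -, operator precedence is encoded directly in the grammar's structure, which is why a parser built from it produces the intended syntax tree. 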

Applications of the Turing Machine

Primitive Recursive Functions 

In computability theory, a primitive recursive function is, roughly speaking, a function that can be computed by a computer program whose loops are all "for" loops (that is, an upper bound on the number of iterations of every loop can be determined before entering the loop). Primitive recursive functions form a strict subset of those general recursive functions that are also total functions. 

The primitive recursive functions are among the number-theoretic functions, which are functions from the natural numbers (nonnegative integers) {0, 1, 2, ...} to the natural numbers. These functions take n arguments for some natural number n and are called n-ary. 
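A minimal sketch of the "for loops only" idea (our own illustration): addition and multiplication built from the successor operation using loops whose bounds are fixed before entry, mirroring how these functions are defined by primitive recursion. 

def add(m, n): 
    # add(m, n) = m + n, via n applications of successor 
    result = m 
    for _ in range(n):        # bound n is known before the loop starts 
        result = result + 1 
    return result 

def mul(m, n): 
    # mul(m, n) = m * n, via n bounded additions 
    result = 0 
    for _ in range(n):        # bound n is known before the loop starts 
        result = add(result, m) 
    return result 

print(add(3, 4), mul(3, 4))  # 7 12 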

Recursive and Recursively Enumerable Languages 

Recursively Enumerable (RE) or Type-0 Languages 

RE languages, or type-0 languages, are generated by type-0 grammars. An RE language can be accepted, or recognized, by a Turing machine, which means the machine will enter a final state for the strings of the language, and may or may not enter a rejecting state for strings that are not part of the language. That is, the TM can loop forever on strings that are not part of the language. RE languages are also called Turing-recognizable languages. 

Recursive Language (REC) 

 
 
A recursive language (a subset of RE) can be decided by a Turing machine, which means the machine will enter a final state for the strings of the language and a rejecting state for the strings that are not part of the language. For example, L = {a^n b^n c^n | n >= 1} is recursive because we can construct a Turing machine that moves to a final state if the string is of the form a^n b^n c^n and otherwise moves to a non-final state. So the TM always halts in this case. REC languages are also called Turing-decidable languages. The relationship between RE and REC languages is shown in Figure 1. 
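A minimal sketch of such a decider (our own illustration, scanning the string the way the informal TM argument does): 

def decides_anbncn(s): 
    # Accept exactly the strings a^n b^n c^n with n >= 1: 
    # the blocks must appear in order and have equal, nonzero lengths. 
    i = 0 
    while i < len(s) and s[i] == 'a': 
        i += 1 
    j = i 
    while j < len(s) and s[j] == 'b': 
        j += 1 
    k = j 
    while k < len(s) and s[k] == 'c': 
        k += 1 
    return k == len(s) and i >= 1 and i == j - i == k - j 

print(decides_anbncn("aabbcc"))  # True 
print(decides_anbncn("aabbc"))   # False 

The procedure halts on every input, which is exactly what makes the language decidable rather than merely recognizable. 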


 

Closure Properties of Recursive Languages 

● Union: If L1 and L2 are two recursive languages, their union L1 ∪ L2 will also be recursive, because if a TM halts for L1 and halts for L2, it will also halt for L1 ∪ L2. 

● Concatenation: If L1 and L2 are two recursive languages, their concatenation L1.L2 will also be recursive. For example: 

L1 = {a^n b^n c^n | n >= 0} 

L2 = {d^m e^m f^m | m >= 0} 

L3 = L1.L2 = {a^n b^n c^n d^m e^m f^m | m >= 0 and n >= 0} is also recursive. 

L1 says n a's are followed by n b's and then n c's. L2 says m d's are followed by m e's and then m f's. Their concatenation first matches the numbers of a's, b's and c's, and then matches the numbers of d's, e's and f's. So it can be decided by a TM. 
● Kleene Closure: If L1 is recursive, its Kleene closure L1* will also be recursive. For example: 

L1 = {a^n b^n c^n | n >= 0} 

L1* = {a^n b^n c^n | n >= 0}* is also recursive. 

● Intersection and complement: If L1 and L2 are two recursive languages, their intersection L1 ∩ L2 will also be recursive. For example: 

L1 = {a^n b^n c^n d^m | n >= 0 and m >= 0} 

L2 = {a^m b^n c^n d^n | n >= 0 and m >= 0} 

L3 = L1 ∩ L2 = {a^n b^n c^n d^n | n >= 0} will be recursive. 

L1 says n a's are followed by n b's, n c's, and then any number of d's. L2 says any number of a's is followed by n b's, n c's, and then n d's. Their intersection says n a's are followed by n b's, n c's, and then n d's. So it can be decided by a Turing machine, hence it is recursive. 

Similarly, the complement of a recursive language L1, which is ∑* - L1, will also be recursive. 

Note: As opposed to REC languages, RE languages are not closed under complementation, which means the complement of an RE language need not be RE. 

Universal Turing Machine 

● A Turing machine is said to be a universal Turing machine if it can accept: 
○ The input data, and 
○ An algorithm (description) for computing. 
● This is precisely what a general-purpose digital computer does: a digital computer accepts a program written in a high-level language. Thus, a general-purpose Turing machine will be called a universal Turing machine if it is powerful enough to simulate the behavior of any digital computer, including any other Turing machine. 
● More precisely, a universal Turing machine can simulate the behavior of an arbitrary Turing machine over any set of input symbols. Thus, it is possible to create a single machine that can be used to compute any computable sequence. If this machine is supplied with a tape on the beginning of which are written the quintuples of some computing machine M, separated by a special symbol, then the universal Turing machine U will compute the same strings as M. 
● The model of a universal Turing machine is considered a theoretical breakthrough that led to the concept of the stored-program computing device. 
● Designing a general-purpose Turing machine is a more complex task: once the transition function of a Turing machine is defined, the machine is restricted to carrying out one particular type of computation. 
● Digital computers, on the other hand, are general-purpose machines; a Turing machine cannot be considered equivalent to a general-purpose digital computer until it, too, can be reprogrammed. 
● By modifying our basic model of a Turing machine we can design a universal Turing machine. The modified Turing machine must have a large number of states for simulating even simple behavior. We modify our basic model by: 
○ Increasing the number of read/write heads 
○ Increasing the number of dimensions of the input tape 
○ Adding a special-purpose memory 
● All of the above modifications to the basic model of a Turing machine can at most speed up the machine's operation; they do not change what it can compute. 
● A number of arguments can be used to show that Turing machines are useful models of real computers. Anything that can be computed by a real computer can also be computed by a Turing machine. A Turing machine can, for example, simulate any type of function used in a programming language; recursion and parameter passing are typical examples. A Turing machine can also be used to simplify the statement of an algorithm. 
● A Turing machine is, however, not very good at modeling what can be computed within a given finite amount of time. Also, Turing machines are not designed to receive unbounded interactive input the way many real programs, such as word processors, operating systems, and other system software, do. 
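A minimal sketch of the universal-machine idea (the encoding of the transition table as a Python dict is our own illustrative choice): one fixed simulator that runs any single-tape Turing machine supplied to it as data. 

def run_tm(transitions, accept, tape, state='q0', blank='_', max_steps=10000): 
    # transitions maps (state, symbol) -> (new_state, written_symbol, 'L' or 'R') 
    cells = dict(enumerate(tape))          # sparse tape, blank elsewhere 
    head = 0 
    for _ in range(max_steps): 
        if state == accept: 
            return True 
        key = (state, cells.get(head, blank)) 
        if key not in transitions: 
            return False                   # no move defined: halt and reject 
        state, symbol, move = transitions[key] 
        cells[head] = symbol 
        head += 1 if move == 'R' else -1 
    raise RuntimeError("step limit hit; the simulated machine may loop forever") 

# Example machine (data, not code): accept binary strings with an even number of 1s. 
tm = { 
    ('q0', '0'): ('q0', '0', 'R'), ('q0', '1'): ('q1', '1', 'R'), 
    ('q1', '0'): ('q1', '0', 'R'), ('q1', '1'): ('q0', '1', 'R'), 
    ('q0', '_'): ('acc', '_', 'R'), 
} 
print(run_tm(tm, 'acc', "1001"))  # True: two 1s 
print(run_tm(tm, 'acc', "1011"))  # False: three 1s 

The step limit stands in for the fact that, on strings outside an RE language, a Turing machine may genuinely never halt. 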

 
Lexical Analyzer  



Lexical analysis is the first phase of a compiler; it is also known as scanning. It converts the high-level input program into a sequence of tokens. 

● Lexical analysis can be implemented with deterministic finite automata. 
● The output is a sequence of tokens that is sent to the parser for syntax analysis. 

What is a token? 

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language. 

 
 
Examples of tokens: 
● Type tokens (id, number, real, . . . ) 
● Punctuation tokens (IF, void, return, . . . ) 
● Alphabetic tokens (keywords) 

Keywords; examples: for, while, if, etc. 

Identifiers; examples: variable names, function names, etc. 

Operators; examples: '+', '++', '-', etc. 

Separators; examples: ',', ';', etc. 

Examples of non-tokens: 

● Comments, preprocessor directives, macros, blanks, tabs, newlines, etc. 

Lexeme: The sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme, e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";". 

How a Lexical Analyzer Functions 

1. Tokenization, i.e. dividing the program into valid tokens. 
2. Removing whitespace characters. 
3. Removing comments. 
4. Helping to generate error messages by providing the row number and column number of each error. 

The lexical analyzer identifies an error with the help of the automaton and the grammar of the given language on which it is based (such as C or C++), and gives the row number and column number of the error. Suppose we pass a statement through the lexical analyzer: 

a = b + c; It will generate a token sequence like this: 

id = id + id; where each id refers to its variable in the symbol table, which holds all of its details. 
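A minimal sketch of such a scanner in Python (the token names and patterns are our own illustrative choices, not from any particular compiler): 

import re 

TOKEN_SPEC = [ 
    ('NUMBER', r'\d+'), 
    ('ID',     r'[A-Za-z_]\w*'), 
    ('OP',     r'[+\-*/=]'), 
    ('PUNCT',  r'[;,(){}]'), 
    ('SKIP',   r'\s+'),          # whitespace is discarded, not a token 
] 
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC)) 

def tokenize(code): 
    # Yield (token_name, lexeme) pairs, skipping whitespace. 
    for m in MASTER.finditer(code): 
        if m.lastgroup != 'SKIP': 
            yield (m.lastgroup, m.group()) 

print(list(tokenize("a = b + c;"))) 
# [('ID', 'a'), ('OP', '='), ('ID', 'b'), ('OP', '+'), ('ID', 'c'), ('PUNCT', ';')] 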


For example, consider the program 

int main() 
{ 
    // 2 variables 
    int a, b; 
    a = 10; 
    return 0; 
} 

All the valid tokens are: 

 
 

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 

'a' '=' '10' ';' 'return' '0' ';' '}' 

Above are the valid tokens. 

You can observe that we have omitted comments. 

As another example, consider a printf statement such as printf("Hello");. 

There are 5 valid tokens in this printf statement: printf, (, the string literal, ), and ;. 

Text Editors and Search Using Regular Expressions 

Using a text editor and regular expressions 

What is a text editor? 

Unlike a word processor, a text editor only handles plain text (i.e. no graphics or fancy formatting). However, in return, text editors provide powerful tools for manipulating text, including the ability to use regular expressions, highlight differences between two or more files, and search-and-replace functionality on steroids. Since most data is available in or can be exported to plain text, and computer programs are written in plain text, a text editor is one of the most basic and useful applications for anyone dealing with massive amounts of data. Text is universal - unlike binary formats (try to open a WordPerfect 4.2 document), text documents are editable on any platform, and will still be readable in 30 years' time. 
First look 

This session will assume that you are using the TextWrangler editor on a Mac. The functionality of most text editors is very similar, and you should be able to follow along if you are using a different editor. We will take the lazy option of familiarizing you with TextWrangler by using the official tutorial. Open TextWrangler and take a few minutes to familiarize yourself with its anatomy - look at the menus, toolbars, icons, etc. Most of this should be pretty familiar to you from other programs. Open the file Lorem ipsum.txt in the Tutorial Examples/Lesson 3 folder. Now read the description of the toolbar on page 10 of the tutorial. 

Mini-exercise 

Open the Lorem ipsum.txt file and use the TextWrangler toolbar Text Options to adjust how the text is displayed. Then use the split bar to split the view into two windows. Now drag the split bar all the way to the top to get back to a single window. 

Finding and replacing 

You can read all about TextWrangler in the tutorial at your leisure at home, but most things such as cut, copy and paste and undo/redo work as expected. Since we have a lot to cover, we'll just skip ahead to lesson 7. Open the file Find and Replace Sample.txt in the Lesson 7 folder and work through the exercise on page 39. 

Next, open the Email Table.txt file in the Lesson 7 folder and work through the exercise on page 42 to learn how to do search and replace on invisible characters. 
Regular expressions 

We now come to the main part of this session, which is how to use regular expressions (or regexes) for text manipulation, described in pages 44 to 51 of the TextWrangler tutorial. You have already come across regular expressions in the Unix session - now you will see how to use them to manipulate text files. 

We'll continue with the cats and dogs file just to get comfortable with the basic elements of regular expressions. Open the Find dialog and check the Grep and Wrap around boxes. The Grep option tells TextWrangler that we are using regular expressions, and Wrap around that we want to do search and replace on the whole document regardless of where our cursor/insertion point is. If at any point you get confused over the next few paragraphs, look at the examples on pages 47-49 for concrete illustrations of how to use regular expressions. 

On page 45, there is a table of Wildcards. Put a period . in the Find box and click Next to see what matches. Keep clicking Next - you will see that the . matches every character, as advertised. Next try \s in the Find box and click Next. What does it match? Repeat for all the wildcards to get a good intuition of what each wildcard matches. 

Now do the same with the Class patterns. Feel free to edit the text to include new words or numbers if you are curious as to how they will be matched. Also experiment with creating your own class search patterns - for example, what would a class to match DNA nucleotide symbols look like? 

Now try the Repetition patterns *, *?, +, +? and ?. There is another repetition pattern {n,m} that means match at least n but not more than m of the previous character or pattern. If n is left out, we have {,m}, which means match no more than m characters or patterns. If m is left out, we have {n,}, which means match at least n occurrences. What is the difference between * and *?, or + and +?? When would you use the greedy and non-greedy versions? 
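If you want to check your intuition outside the editor, the same operators work in Python's re module (the sample strings are our own): 

import re 

s = "<p>cats</p><p>dogs</p>" 
print(re.findall(r'<.*>', s))    # greedy: ['<p>cats</p><p>dogs</p>'] 
print(re.findall(r'<.*?>', s))   # non-greedy: ['<p>', '</p>', '<p>', '</p>'] 
print(re.findall(r'a{2,3}', 'aaaaa'))  # bounded repetition: ['aaa', 'aa'] 

The greedy * grabs the longest possible match, while the lazy *? stops at the first position where the rest of the pattern can succeed. 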

Now figure out what the ^ and $ positional assertions and the alternation patterns mean. You can use parentheses to enclose alternations if you need to also match stuff before and/or after the alternation. 

Mini-exercise 

1. Construct a regular expression to find two or more consecutive vowels in the cats and dogs file. 
2. Use a regular expression pattern to delete all punctuation from the example. 
3. Use a positional assertion with the alternation pattern to find all cat, cats, dog or dogs that occur at the end of a sentence. You should find that the ends of sentences 1, 3, 5, 9 and 11 match. 

Subpatterns and replace patterns 

Regular expressions really begin to show their power when we use subpatterns and replace patterns. This can be a little tricky to wrap your head around when encountered for the first time, so we start with some simple contrived examples to give you some practice. 

Subpatterns are simply any regular expression enclosed in parentheses. For example, (.) is a subpattern that matches any character, just like . by itself. However, we can refer back to captured subpatterns, i.e. to whatever was matched inside the parentheses. For example, what does the regex (.)\1 find? See if you can guess, then use TextWrangler to check if you were correct. 

We can also use the captured subpatterns in the replace pattern. Here again, \1 refers to the first regex in parentheses, \2 to the second one, etc., and & refers to the full regular expression matched (not just an individual subpattern). As an example, what does putting (.)\1([aeiou]+) in the Find box and &\2\1& in the Replace box do? Test it out in TextWrangler to find out what word is changed and what it is changed to. 
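The same ideas can be tried in Python's re module (our own sample strings; Python writes the full match as \g<0> instead of &): 

import re 

# A backreference: (.)\1 matches any doubled character. 
print(re.search(r'(.)\1', 'letter').group())            # 'tt' 

# Captured groups reused in the replacement: swap two words. 
print(re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world'))  # 'world hello' 

# \g<0> plays the role of & (the entire match). 
print(re.sub(r'\w+', r'<\g<0>>', 'cats and dogs'))      # '<cats> <and> <dogs>' 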

Exercise 

Open the file Ch3observations.txt in the examples folder. It looks like this: 

13 January, 1752 at 13:53 -1.414 5.781 Found in tide pools 

17 March, 1961 at 03:46 14 3.6 Thirty specimens observed 

1 Oct., 2002 at 18:22 36.51 -3.4221 Genome sequenced to confirm 

20 July, 1863 at 12:02 1.74 133 Article in Harper's 

Write a regular expression to convert it into this: 


1752 Jan. 13 13 53 -1.414 5.781 

1961 Mar. 17 03 46 14 3.6 

2002 Oct. 1 18 22 36.51 -3.4221 

1863 Jul. 20 12 02 1.74 133 

Natural Language Processing 

Natural Language Processing (NLP) refers to the AI method of communicating with intelligent systems using a natural language such as English. 

Processing of natural language is required when you want an intelligent system like a robot to perform as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system, etc. 

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be − 

● Speech 
● Written Text 

Components of NLP 
There are two components of NLP, as given below − 

Natural Language Understanding (NLU) 

Understanding involves the following tasks − 

● Mapping the given input in natural language into useful representations. 


● Analyzing different aspects of the language. 
Natural Language Generation (NLG) 

It  is  the  process  of  producing  meaningful  phrases  and  sentences  in  the  form of 
natural language from some internal representation. 

It involves − 

● Text planning − It includes retrieving the relevant content from the knowledge base. 
● Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence. 
● Text Realization − It is mapping the sentence plan into sentence structure. 

NLU is harder than NLG. 

Difficulties in NLU 
NL has an extremely rich form and structure. 

It is very ambiguous. There can be different levels of ambiguity − 

● Lexical ambiguity − It is at a very primitive level, such as the word level. For example, should the word "board" be treated as a noun or a verb? 
● Syntax-level ambiguity − A sentence can be parsed in different ways. For example, "He lifted the beetle with red cap." − Did he use a cap to lift the beetle, or did he lift a beetle that had a red cap? 
● Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, "I am tired." − Exactly who is tired? 
● One input can have different meanings. 
● Many inputs can mean the same thing. 

NLP Terminology 
● Phonology − It is the study of organizing sound systematically. 
● Morphology − It is the study of the construction of words from primitive meaningful units. 
● Morpheme − It is a primitive unit of meaning in a language. 
● Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases. 
● Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences. 
● Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected. 
● Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence. 
● World knowledge − It includes general knowledge about the world. 

Steps in NLP 
There are five general steps − 

● Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words. 
● Syntactic Analysis (Parsing) − It involves analysis of the words in the sentence for grammar, and arranging the words in a manner that shows the relationships among them. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer. 
● Semantic Analysis − It draws the exact meaning, or the dictionary meaning, from the text. The text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream". 
● Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also brings about the meaning of the immediately succeeding sentence. 
● Pragmatic Analysis − During this step, what was said is re-interpreted to determine what was actually meant. It involves deriving those aspects of language which require real-world knowledge. 
Implementation Aspects of Syntactic Analysis 
There  are  a  number  of  algorithms  researchers  have  developed  for  syntactic 
analysis, but we consider only the following simple methods − 

● Context-Free Grammar 
● Top-Down Parser 

Let us see them in detail − 
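A minimal sketch of the top-down idea (the toy grammar, vocabulary and sentences are our own illustrative choices, not from the original notes): a recursive-descent parser in which each nonterminal of a context-free grammar becomes one function. 

def parse(tokens): 
    # Grammar: S -> NP VP ;  NP -> 'the' N ;  VP -> V NP 
    pos = 0 

    def expect(words): 
        nonlocal pos 
        if pos < len(tokens) and tokens[pos] in words: 
            pos += 1 
            return True 
        return False 

    def np():   # NP -> 'the' N 
        return expect({'the'}) and expect({'boy', 'school', 'dog'}) 

    def vp():   # VP -> V NP 
        return expect({'sees', 'likes'}) and np() 

    return np() and vp() and pos == len(tokens)   # S -> NP VP 

print(parse("the boy sees the dog".split()))   # True 
print(parse("sees the boy the dog".split()))   # False 

A full top-down parser additionally handles alternative productions and backtracking (or lookahead), but the overall structure - one function per nonterminal, descending from the start symbol - is exactly this. 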
