
Applications of Regular Expressions

● Regular expressions are useful in a wide variety of text processing tasks, and more generally in string processing, where the data need not be textual. Common applications include data validation, data scraping (especially web scraping), data wrangling, simple parsing, the production of syntax highlighting systems, and many other tasks. 
● While regexps would be useful on Internet search engines, processing them across the entire database could consume excessive computer resources, depending on the complexity and design of the regex. 
● Fundamentally, regular expressions are used in search tools. 
● Text file search in Unix (tool: egrep). 
● This command searches for a text pattern in files and lists the names of the files containing that pattern. 
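A minimal sketch of such a search (the pattern and file names here are illustrative, not from the original notes): 

egrep -l 'error [0-9]+' *.log 

The -l option makes egrep print only the names of the files that contain a match, rather than the matching lines themselves. 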

Applications of Finite Automata


● A finite automaton is useful for designing text editors. 
● A finite automaton is highly useful for designing spell checkers. 
● A finite automaton is useful in sequential circuit design (as a transducer), including the design of combinational and sequential circuits using Mealy and Moore machines. 
● A finite automaton is used for designing the lexical analyzer of a compiler. 
● A finite automaton is used for recognizing patterns described by regular expressions, as in the sketch below. 
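A minimal sketch of pattern recognition with a finite automaton (the language and state numbering are our own illustrative choices): a table-driven DFA that accepts binary strings ending in "01". 

def dfa_accepts(s): 
    # States: 0 = start, 1 = just saw '0', 2 = just saw "01" (accepting) 
    transitions = { 
        (0, '0'): 1, (0, '1'): 0, 
        (1, '0'): 1, (1, '1'): 2, 
        (2, '0'): 1, (2, '1'): 0, 
    } 
    state = 0 
    for ch in s: 
        state = transitions[(state, ch)] 
    return state == 2 

print(dfa_accepts("1101"))  # True: ends in "01" 
print(dfa_accepts("1110"))  # False 

The same table-driven idea underlies search in text editors: the editor compiles the search pattern into an automaton and runs it over the buffer one character at a time. 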

 
Applications of Pushdown Automata (PDA)

● For designing the parsing phase of a compiler (syntax analysis). 
● For implementing stack applications. 
● For evaluating arithmetic expressions (see the sketch below). 
● For solving the Tower of Hanoi problem. 
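A minimal sketch of stack-based expression evaluation, the mechanism a PDA models (the postfix token format is an illustrative choice): 

def eval_postfix(tokens): 
    # Each operator pops its two operands and pushes the result, 
    # exactly the push/pop discipline of a pushdown automaton's stack. 
    stack = [] 
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, 
           '*': lambda a, b: a * b, '/': lambda a, b: a / b} 
    for tok in tokens: 
        if tok in ops: 
            b = stack.pop()   # right operand is on top 
            a = stack.pop() 
            stack.append(ops[tok](a, b)) 
        else: 
            stack.append(float(tok)) 
    return stack.pop() 

print(eval_postfix("2 3 4 * +".split()))  # 14.0, i.e. 2 + (3 * 4) 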

Applications of Context-Free Grammars (CFG)

● For defining programming languages. 
● For parsing programs by constructing syntax trees. 
● For the translation of programming languages. 
● For describing arithmetic expressions (see the example grammar below). 
● For the construction of compilers. 
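A small illustrative example (the nonterminal names are our own choice, not from the original notes): a context-free grammar for arithmetic expressions, written in BNF. 

<expr>   ::= <expr> + <term> | <expr> - <term> | <term> 
<term>   ::= <term> * <factor> | <term> / <factor> | <factor> 
<factor> ::= ( <expr> ) | <number> | <identifier> 

Because * and / appear one level deeper in the grammar than + and -, operator precedence is encoded directly in the grammar's structure, which is why a parser built from it produces the intended syntax tree. 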

Applications of the Turing Machine

Primitive Recursive Functions 

In computability theory, a primitive recursive function is, roughly speaking, a function that can be computed by a computer program whose loops are all "for" loops (that is, an upper bound on the number of iterations of every loop can be determined before entering the loop). Primitive recursive functions form a strict subset of those general recursive functions that are also total functions. 

The primitive recursive functions are among the number-theoretic functions, which are functions from the natural numbers (nonnegative integers) {0, 1, 2, ...} to the natural numbers. These functions take n arguments for some natural number n and are called n-ary. 
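A minimal sketch of the "for loops only" idea (our own illustration): addition and multiplication built from the successor operation using loops whose bounds are fixed before entry, mirroring how these functions are defined by primitive recursion. 

def add(m, n): 
    # add(m, n) = m + n, via n applications of successor 
    result = m 
    for _ in range(n):        # bound n is known before the loop starts 
        result = result + 1 
    return result 

def mul(m, n): 
    # mul(m, n) = m * n, via n bounded additions 
    result = 0 
    for _ in range(n):        # bound n is known before the loop starts 
        result = add(result, m) 
    return result 

print(add(3, 4), mul(3, 4))  # 7 12 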

Recursive and Recursively Enumerable Languages 

Recursively Enumerable (RE) or Type-0 Languages 

RE languages, or type-0 languages, are generated by type-0 grammars. An RE language can be accepted, or recognized, by a Turing machine, which means the machine will enter a final state for the strings of the language, and may or may not enter a rejecting state for strings that are not part of the language. That is, the TM can loop forever on strings that are not part of the language. RE languages are also called Turing-recognizable languages. 

Recursive Language (REC) 

 
 
A recursive language (a subset of RE) can be decided by a Turing machine, which means the machine will enter a final state for the strings of the language and a rejecting state for the strings that are not part of the language. For example, L = {a^n b^n c^n | n >= 1} is recursive because we can construct a Turing machine that moves to a final state if the string is of the form a^n b^n c^n and otherwise moves to a non-final state. So the TM always halts in this case. REC languages are also called Turing-decidable languages. The relationship between RE and REC languages is shown in Figure 1. 
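A minimal sketch of such a decider (our own illustration, scanning the string the way the informal TM argument does): 

def decides_anbncn(s): 
    # Accept exactly the strings a^n b^n c^n with n >= 1: 
    # the blocks must appear in order and have equal, nonzero lengths. 
    i = 0 
    while i < len(s) and s[i] == 'a': 
        i += 1 
    j = i 
    while j < len(s) and s[j] == 'b': 
        j += 1 
    k = j 
    while k < len(s) and s[k] == 'c': 
        k += 1 
    return k == len(s) and i >= 1 and i == j - i == k - j 

print(decides_anbncn("aabbcc"))  # True 
print(decides_anbncn("aabbc"))   # False 

The procedure halts on every input, which is exactly what makes the language decidable rather than merely recognizable. 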


 

Closure Properties of Recursive Languages 

● Union: If L1 and L2 are two recursive languages, their union L1 ∪ L2 will also be recursive, because if a TM halts for L1 and halts for L2, it will also halt for L1 ∪ L2. 

● Concatenation: If L1 and L2 are two recursive languages, their concatenation L1.L2 will also be recursive. For example: 

L1 = {a^n b^n c^n | n >= 0} 

L2 = {d^m e^m f^m | m >= 0} 

L3 = L1.L2 = {a^n b^n c^n d^m e^m f^m | m >= 0 and n >= 0} is also recursive. 

L1 says n a's are followed by n b's and then n c's. L2 says m d's are followed by m e's and then m f's. Their concatenation first matches the numbers of a's, b's and c's, and then matches the numbers of d's, e's and f's. So it can be decided by a TM. 
● Kleene Closure: If L1 is recursive, its Kleene closure L1* will also be recursive. For example: 

L1 = {a^n b^n c^n | n >= 0} 

L1* = {a^n b^n c^n | n >= 0}* is also recursive. 

● Intersection and complement: If L1 and L2 are two recursive languages, their intersection L1 ∩ L2 will also be recursive. For example: 

L1 = {a^n b^n c^n d^m | n >= 0 and m >= 0} 

L2 = {a^m b^n c^n d^n | n >= 0 and m >= 0} 

L3 = L1 ∩ L2 = {a^n b^n c^n d^n | n >= 0} will be recursive. 

L1 says n a's are followed by n b's, n c's, and then any number of d's. L2 says any number of a's is followed by n b's, n c's, and then n d's. Their intersection says n a's are followed by n b's, n c's, and then n d's. So it can be decided by a Turing machine, hence it is recursive. 

Similarly, the complement of a recursive language L1, which is ∑* - L1, will also be recursive. 

Note: As opposed to REC languages, RE languages are not closed under complementation, which means the complement of an RE language need not be RE. 

Universal Turing Machine 

● A Turing machine is said to be a universal Turing machine if it can accept: 
○ The input data, and 
○ An algorithm (description) for computing. 
● This is precisely what a general-purpose digital computer does: a digital computer accepts a program written in a high-level language. Thus, a general-purpose Turing machine will be called a universal Turing machine if it is powerful enough to simulate the behavior of any digital computer, including any other Turing machine. 
● More precisely, a universal Turing machine can simulate the behavior of an arbitrary Turing machine over any set of input symbols. Thus, it is possible to create a single machine that can be used to compute any computable sequence. If this machine is supplied with a tape on the beginning of which are written the quintuples of some computing machine M, separated by a special symbol, then the universal Turing machine U will compute the same strings as M. 
● The model of a universal Turing machine is considered a theoretical breakthrough that led to the concept of the stored-program computing device. 
● Designing a general-purpose Turing machine is a more complex task: once the transition function of a Turing machine is defined, the machine is restricted to carrying out one particular type of computation. 
● Digital computers, on the other hand, are general-purpose machines; a Turing machine cannot be considered equivalent to a general-purpose digital computer until it, too, can be reprogrammed. 
● By modifying our basic model of a Turing machine we can design a universal Turing machine. The modified Turing machine must have a large number of states for simulating even simple behavior. We modify our basic model by: 
○ Increasing the number of read/write heads 
○ Increasing the number of dimensions of the input tape 
○ Adding a special-purpose memory 
● All of the above modifications to the basic model of a Turing machine can at most speed up the machine's operation; they do not change what it can compute. 
● A number of arguments can be used to show that Turing machines are useful models of real computers. Anything that can be computed by a real computer can also be computed by a Turing machine. A Turing machine can, for example, simulate any type of function used in a programming language; recursion and parameter passing are typical examples. A Turing machine can also be used to simplify the statement of an algorithm. 
● A Turing machine is, however, not very good at modeling what can be computed within a given finite amount of time. Also, Turing machines are not designed to receive unbounded interactive input the way many real programs, such as word processors, operating systems, and other system software, do. 
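A minimal sketch of the universal-machine idea (the encoding of the transition table as a Python dict is our own illustrative choice): one fixed simulator that runs any single-tape Turing machine supplied to it as data. 

def run_tm(transitions, accept, tape, state='q0', blank='_', max_steps=10000): 
    # transitions maps (state, symbol) -> (new_state, written_symbol, 'L' or 'R') 
    cells = dict(enumerate(tape))          # sparse tape, blank elsewhere 
    head = 0 
    for _ in range(max_steps): 
        if state == accept: 
            return True 
        key = (state, cells.get(head, blank)) 
        if key not in transitions: 
            return False                   # no move defined: halt and reject 
        state, symbol, move = transitions[key] 
        cells[head] = symbol 
        head += 1 if move == 'R' else -1 
    raise RuntimeError("step limit hit; the simulated machine may loop forever") 

# Example machine (data, not code): accept binary strings with an even number of 1s. 
tm = { 
    ('q0', '0'): ('q0', '0', 'R'), ('q0', '1'): ('q1', '1', 'R'), 
    ('q1', '0'): ('q1', '0', 'R'), ('q1', '1'): ('q0', '1', 'R'), 
    ('q0', '_'): ('acc', '_', 'R'), 
} 
print(run_tm(tm, 'acc', "1001"))  # True: two 1s 
print(run_tm(tm, 'acc', "1011"))  # False: three 1s 

The step limit stands in for the fact that, on strings outside an RE language, a Turing machine may genuinely never halt. 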

 
Lexical Analyzer  



Lexical analysis is the first phase of a compiler; it is also known as scanning. It converts the high-level input program into a sequence of tokens. 

● Lexical analysis can be implemented with deterministic finite automata. 
● The output is a sequence of tokens that is sent to the parser for syntax analysis. 

What is a token? 

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language. 

 
 
Examples of tokens: 
● Type tokens (id, number, real, . . . ) 
● Punctuation tokens (IF, void, return, . . . ) 
● Alphabetic tokens (keywords) 

Keywords; examples: for, while, if, etc. 

Identifiers; examples: variable names, function names, etc. 

Operators; examples: '+', '++', '-', etc. 

Separators; examples: ',', ';', etc. 

Examples of non-tokens: 

● Comments, preprocessor directives, macros, blanks, tabs, newlines, etc. 

Lexeme: The sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme, e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";". 

How a Lexical Analyzer Functions 

1. Tokenization, i.e. dividing the program into valid tokens. 
2. Removing whitespace characters. 
3. Removing comments. 
4. Helping to generate error messages by providing the row number and column number of each error. 

The lexical analyzer identifies an error with the help of the automaton and the grammar of the given language on which it is based (such as C or C++), and gives the row number and column number of the error. Suppose we pass a statement through the lexical analyzer: 

a = b + c; It will generate a token sequence like this: 

id = id + id; where each id refers to its variable in the symbol table, which holds all of its details. 
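A minimal sketch of such a scanner in Python (the token names and patterns are our own illustrative choices, not from any particular compiler): 

import re 

TOKEN_SPEC = [ 
    ('NUMBER', r'\d+'), 
    ('ID',     r'[A-Za-z_]\w*'), 
    ('OP',     r'[+\-*/=]'), 
    ('PUNCT',  r'[;,(){}]'), 
    ('SKIP',   r'\s+'),          # whitespace is discarded, not a token 
] 
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC)) 

def tokenize(code): 
    # Yield (token_name, lexeme) pairs, skipping whitespace. 
    for m in MASTER.finditer(code): 
        if m.lastgroup != 'SKIP': 
            yield (m.lastgroup, m.group()) 

print(list(tokenize("a = b + c;"))) 
# [('ID', 'a'), ('OP', '='), ('ID', 'b'), ('OP', '+'), ('ID', 'c'), ('PUNCT', ';')] 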


For example, consider the program 

int main() 
{ 
    // 2 variables 
    int a, b; 
    a = 10; 
    return 0; 
} 

All the valid tokens are: 

 
 

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 

'a' '=' '10' ';' 'return' '0' ';' '}' 

Above are the valid tokens. 

You can observe that we have omitted comments. 

As another example, consider a printf statement such as printf("Hello");. 

There are 5 valid tokens in this printf statement: printf, (, the string literal, ), and ;. 

Text Editors and Search Using Regular Expressions 

Using a text editor and regular expressions 

What is a text editor? 

Unlike a word processor, a text editor only handles plain text (i.e. no graphics or fancy formatting). However, in return, text editors provide powerful tools for manipulating text, including the ability to use regular expressions, highlight differences between two or more files, and search-and-replace functionality on steroids. Since most data is available in or can be exported to plain text, and computer programs are written in plain text, a text editor is one of the most basic and useful applications for anyone dealing with massive amounts of data. Text is universal - unlike binary formats (try to open a WordPerfect 4.2 document), text documents are editable on any platform, and will still be readable in 30 years' time. 
First look 

This session will assume that you are using the TextWrangler editor on a Mac. The functionality of most text editors is very similar, and you should be able to follow along if you are using a different editor. We will take the lazy option of familiarizing you with TextWrangler by using the official tutorial. Open TextWrangler and take a few minutes to familiarize yourself with its anatomy - look at the menus, toolbars, icons, etc. Most of this should be pretty familiar to you from other programs. Open the file Lorem ipsum.txt in the Tutorial Examples/Lesson 3 folder. Now read the description of the toolbar on page 10 of the tutorial. 

Mini-exercise 

Open the Lorem ipsum.txt file and use the TextWrangler toolbar Text Options to adjust how the text is displayed. Then use the split bar to split the view into two windows. Now drag the split bar all the way to the top to get back to a single window. 

Finding and replacing 

You can read all about TextWrangler in the tutorial at your leisure at home, but most things such as cut, copy and paste and undo/redo work as expected. Since we have a lot to cover, we'll just skip ahead to lesson 7. Open the file Find and Replace Sample.txt in the Lesson 7 folder and work through the exercise on page 39. 

Next, open the Email Table.txt file in the Lesson 7 folder and work through the exercise on page 42 to learn how to do search and replace on invisible characters. 
Regular expressions 

We now come to the main part of this session, which is how to use regular expressions (or regexes) for text manipulation, described in pages 44 to 51 of the TextWrangler tutorial. You have already come across regular expressions in the Unix session - now you will see how to use them to manipulate text files. 

We'll continue with the cats and dogs file just to get comfortable with the basic elements of regular expressions. Open the Find dialog and check the Grep and Wrap around boxes. The Grep option tells TextWrangler that we are using regular expressions, and Wrap around that we want to do search and replace on the whole document regardless of where our cursor/insertion point is. If at any point you get confused over the next few paragraphs, look at the examples on pages 47-49 for concrete illustrations of how to use regular expressions. 

On page 45, there is a table of Wildcards. Put a period . in the Find box and click Next to see what matches. Keep clicking Next - you will see that the . matches every character, as advertised. Next try \s in the Find box and click Next. What does it match? Repeat for all the wildcards to get a good intuition of what each wildcard matches. 

Now do the same with the Class patterns. Feel free to edit the text to include new words or numbers if you are curious as to how they will be matched. Also experiment with creating your own class search patterns - for example, what would a class to match DNA nucleotide symbols look like? 

Now try the Repetition patterns *, *?, +, +? and ?. There is another repetition pattern {n,m} that means match at least n but not more than m of the previous character or pattern. If n is left out, we have {,m}, which means match no more than m characters or patterns. If m is left out, we have {n,}, which means match at least n occurrences. What is the difference between * and *?, or + and +?? When would you use the greedy and non-greedy versions? 
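If you want to check your intuition outside the editor, the same operators work in Python's re module (the sample strings are our own): 

import re 

s = "<p>cats</p><p>dogs</p>" 
print(re.findall(r'<.*>', s))    # greedy: ['<p>cats</p><p>dogs</p>'] 
print(re.findall(r'<.*?>', s))   # non-greedy: ['<p>', '</p>', '<p>', '</p>'] 
print(re.findall(r'a{2,3}', 'aaaaa'))  # bounded repetition: ['aaa', 'aa'] 

The greedy * grabs the longest possible match, while the lazy *? stops at the first position where the rest of the pattern can succeed. 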

Now figure out what the ^ and $ positional assertions and the alternation patterns mean. You can use parentheses to enclose alternations if you need to also match stuff before and/or after the alternation. 

Mini-exercise 

1. Construct a regular expression to find two or more consecutive vowels in the cats and dogs file. 
2. Use a regular expression pattern to delete all punctuation from the example. 
3. Use a positional assertion with the alternation pattern to find all cat, cats, dog or dogs that occur at the end of a sentence. You should find that the ends of sentences 1, 3, 5, 9 and 11 match. 

Subpatterns and replace patterns 

Regular expressions really begin to show their power when we use subpatterns and replace patterns. This can be a little tricky to wrap your head around when encountered for the first time, so we start with some simple contrived examples to give you some practice. 

Subpatterns are simply any regular expression enclosed in parentheses. For example, (.) is a subpattern that matches any character, just like . by itself. However, we can refer back to captured subpatterns, i.e. to whatever was matched inside the parentheses. For example, what does the regex (.)\1 find? See if you can guess, then use TextWrangler to check if you were correct. 

We can also use the captured subpatterns in the replace pattern. Here again, \1 refers to the first regex in parentheses, \2 to the second one, etc., and & refers to the full regular expression matched (not just an individual subpattern). As an example, what does putting (.)\1([aeiou]+) in the Find box and &\2\1& in the Replace box do? Test it out in TextWrangler to find out what word is changed and what it is changed to. 
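The same ideas can be tried in Python's re module (our own sample strings; Python writes the full match as \g<0> instead of &): 

import re 

# A backreference: (.)\1 matches any doubled character. 
print(re.search(r'(.)\1', 'letter').group())            # 'tt' 

# Captured groups reused in the replacement: swap two words. 
print(re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world'))  # 'world hello' 

# \g<0> plays the role of & (the entire match). 
print(re.sub(r'\w+', r'<\g<0>>', 'cats and dogs'))      # '<cats> <and> <dogs>' 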

Exercise 

Open the file Ch3observations.txt in the examples folder. It looks like this: 

13 January, 1752 at 13:53 -1.414 5.781 Found in tide pools 

17 March, 1961 at 03:46 14 3.6 Thirty specimens observed 

1 Oct., 2002 at 18:22 36.51 -3.4221 Genome sequenced to confirm 

20 July, 1863 at 12:02 1.74 133 Article in Harper's 

Write a regular expression to convert it into this: 


1752 Jan. 13 13 53 -1.414 5.781 

1961 Mar. 17 03 46 14 3.6 

2002 Oct. 1 18 22 36.51 -3.4221 

1863 Jul. 20 12 02 1.74 133 

Natural Language Processing 

Natural Language Processing (NLP) refers to the AI method of communicating with intelligent systems using a natural language such as English. 

Processing of natural language is required when you want an intelligent system like a robot to perform as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system, etc. 

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be − 

● Speech 
● Written Text 

Components of NLP 
There are two components of NLP, as given below − 

Natural Language Understanding (NLU) 

Understanding involves the following tasks − 

● Mapping the given input in natural language into useful representations. 


● Analyzing different aspects of the language. 
Natural Language Generation (NLG) 

It  is  the  process  of  producing  meaningful  phrases  and  sentences  in  the  form of 
natural language from some internal representation. 

It involves − 

● Text planning − It includes retrieving the relevant content from the knowledge base. 
● Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence. 
● Text Realization − It is mapping the sentence plan into sentence structure. 

NLU is harder than NLG. 

Difficulties in NLU 
NL has an extremely rich form and structure. 

It is very ambiguous. There can be different levels of ambiguity − 

● Lexical ambiguity − It is at a very primitive level, such as the word level. For example, should the word "board" be treated as a noun or a verb? 
● Syntax-level ambiguity − A sentence can be parsed in different ways. For example, "He lifted the beetle with red cap." − Did he use a cap to lift the beetle, or did he lift a beetle that had a red cap? 
● Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, "I am tired." − Exactly who is tired? 
● One input can have different meanings. 
● Many inputs can mean the same thing. 

NLP Terminology 
● Phonology − It is the study of organizing sound systematically. 
● Morphology − It is the study of the construction of words from primitive meaningful units. 
● Morpheme − It is a primitive unit of meaning in a language. 
● Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases. 
● Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences. 
● Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected. 
● Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence. 
● World knowledge − It includes general knowledge about the world. 

Steps in NLP 
There are five general steps − 

● Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words. 
● Syntactic Analysis (Parsing) − It involves analysis of the words in the sentence for grammar, and arranging the words in a manner that shows the relationships among them. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer. 
● Semantic Analysis − It draws the exact meaning, or the dictionary meaning, from the text. The text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream". 
● Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also brings about the meaning of the immediately succeeding sentence. 
● Pragmatic Analysis − During this step, what was said is re-interpreted to determine what was actually meant. It involves deriving those aspects of language which require real-world knowledge. 
Implementation Aspects of Syntactic Analysis 
There  are  a  number  of  algorithms  researchers  have  developed  for  syntactic 
analysis, but we consider only the following simple methods − 

● Context-Free Grammar 
● Top-Down Parser 

Let us see them in detail − 
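A minimal sketch of the top-down idea (the toy grammar, vocabulary and sentences are our own illustrative choices, not from the original notes): a recursive-descent parser in which each nonterminal of a context-free grammar becomes one function. 

def parse(tokens): 
    # Grammar: S -> NP VP ;  NP -> 'the' N ;  VP -> V NP 
    pos = 0 

    def expect(words): 
        nonlocal pos 
        if pos < len(tokens) and tokens[pos] in words: 
            pos += 1 
            return True 
        return False 

    def np():   # NP -> 'the' N 
        return expect({'the'}) and expect({'boy', 'school', 'dog'}) 

    def vp():   # VP -> V NP 
        return expect({'sees', 'likes'}) and np() 

    return np() and vp() and pos == len(tokens)   # S -> NP VP 

print(parse("the boy sees the dog".split()))   # True 
print(parse("sees the boy the dog".split()))   # False 

A full top-down parser additionally handles alternative productions and backtracking (or lookahead), but the overall structure - one function per nonterminal, descending from the start symbol - is exactly this. 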
