
CS262 Unit 2: Lexical Analysis

Contents
Introduction · Specification · HTML · Really Quiz · Tags · Interpreting HTML Quiz · Taking HTML Apart · Taking HTML Apart Quiz · HTML Structure · Specifying Tokens · Specifying Tokens Quiz · Token Values Quiz · Quoted Strings Quiz · Whitespace · Whitespace Quiz · Lexical Analyzer · Lexical Analyzer Quiz · Ambiguity Quiz · Rule Order · Rule Order Quiz · String Snipping · Tracking Line Numbers · Crafting Input Quiz · Commented Html · Commented Html Quiz · HTML Comments · Token Counting Quiz · Five Factorial · Honor And Honour · Honor And Honour Quiz · Identifier · Identifier Quiz · Numbers Quiz · The End Of The Line · The End Of The Line Quiz · Wrap Up · Answers

Introduction
In the last unit, we learned about regular expressions and finite state machines. We saw that regular expressions are a concise notation that we can use to denote or match a number of strings. We learned how to use regular expressions in Python by importing the functions and data types from the regular expression library, re. We saw how we could turn regular expressions into finite state machines that accept the same language as our regular expressions. In this unit, we're going to combine those concepts to make a lexical analyzer, a program that reads in a web page or a bit of JavaScript and breaks it down into words, just like we might break an English sentence down into words. This is going to be a really important tool in our arsenal, and it's one of the first steps towards making a web browser.

Specification
First we need to learn how to specify important parts of HTML and JavaScript, and we're going to do this specification using regular expressions. Remember, the goal of the overall project is to start with a web page and then break it down into important words while largely ignoring white space or new line characters. Then we want to take those words and diagram them into a tree-like structure, and finally well interpret that tree to figure out what it means to get the result.

HTML
HTML stands for "hypertext markup language." You may have some previous experience with HTML, but that's not necessary for this course. HTML was invented by Tim Berners-Lee around 1990. For our purposes, HTML just tells a web browser how to display a webpage. In fact, HTML is not all that different from using symbols like stars or underscores to emphasise text that you're writing to someone else. Consider this line of HTML: I <b>really</b> like you. In HTML the special punctuation <b> means "let's do some bold now", and </b> means "I'm done with bold, let's go back to normal". These are called HTML 'tags'. The <b> is called the 'opening tag', while </b> is the 'closing tag'. When rendered in a web browser, this will appear as: I really like you.

Other HTML tags that are commonly used for emphasising text include underline, <u>underline</u>, and italics, <i>italic</i>.

Really Quiz
Which of these HTML fragments show the word really in bold?

George Orwell was <b>really</b> Eric Blair.
George Orwell was </b>really<b> Eric Blair.
George Orwell was <b> really</b> Eric Blair.
<b>George Orwell was really Eric Blair.</b>

Tags
We have seen the HTML <b> tag that modifies text by telling the browser to display it as bold. Another very common tag is the anchor tag <a>, which is used to add hyperlinks to web pages. In some ways, this is the defining characteristic of what it means to be a webpage. Here is a fragment of HTML that includes such an anchor tag. Click here <a href="www.google.com"> now! </a> It begins with <a, but unlike the relatively simple bold and underline tags, the anchor tag has an argument. This means pretty much the same thing it did when we were talking about functions in Python. Here, the argument or modifier for the anchor tag is href=. This stands for hypertext reference - the target of the link. After the href= we have a string that is a URL, a web address, in this case to Google. The text in the middle (now!) is often rendered in blue with an underline, although it doesn't have to be. The anchor tag ends with </a>. For this fragment of HTML, the words "Click here now!" will be drawn on the screen. The word now! will be a hyperlink.

Interpreting HTML Quiz


Here is a significantly more complicated fragment of HTML:
<a href="http://wikipedia.org"> Mary Wollstonecraft wrote <i> A Vindication of the Rights of Woman </i> </a>

Which of these words will be displayed on the screen by the web browser? href, Mary, wikipedia, i, Vindication, wrote

Taking HTML Apart


Now that we understand how HTML works, we want to separate out the HTML tags from the words that will be displayed on the screen. Breaking up words like this is actually a surprisingly common task in real life. This set of letters is inscribed on the Arch of Titus in Rome: SENATUSPOPULUSQUEROMANUS. Roman inscriptions like this were often written without spaces, and it requires a bit of domain knowledge to know how to break this up - SENATUS POPULUSQUE ROMANUS - "The Senate and People of Rome." Similarly, many written Asian languages don't explicitly include spaces or punctuation between the various characters or glyphs, and in text messaging, some domain knowledge is required to break I<3u up into "I love you". We will want to do the same thing for HTML. Given the fragment Wollstonecraft </a> wrote, we'll want to break it up into the words like "Wollstonecraft" and "wrote" that will appear on the screen, the special left angle bracket slash construct that tells us that we're starting an end tag, the special word in the middle that tells us which tag it was, and the closing right angle bracket:

Word: Wollstonecraft
Start of the closing tag: </
Another word: a
End of the closing tag: >
Another word: wrote

We need to do this in order to write our web browser. To interpret HTML and JavaScript, we're going to have to break sentences down into their component words so that we can figure out what's going on. This process is called lexical analysis. You'll be pleased to know that we're going to use regular expressions to solve this problem.

Taking HTML Apart Quiz


Let's tackle the problem in reverse. Here is a sequence of five elements:

start of tag: <
word: b
end of tag: >
word: salvador
word: dali

Select each of these HTML fragments that would decompose into this sequence of five elements.

</b> salvador dali
< b>salvador dali
<b>salvadordali
<b> salvador dali </b>
<b> salvador </b> dali
<b> salvador dali

HTML Structure
Since HTML is structured, we're going to want to break it up into words, punctuation and word-like elements. We will use the special term token to mean all of those. In general, a token is the smallest unit of the output from a lexical analysis. It can refer to a word, a string, numbers, or punctuation. In most cases, tokens do not refer to white space, which is just a formal way of referring to the spaces between words. We're going to focus on lexical analysis, a process whereby we break down a string, like a sentence or an utterance or a webpage, into a list of tokens. One string might contain many tokens in the same way that one sentence might contain many words. Here are six HTML tokens:

LANGLE        <
LANGLESLASH   </
RANGLE        >
EQUAL         =
STRING        "google.com"
WORD          Welcome!

The naming of tokens is a bit arbitrary. In general, though, tokens are given uppercase names to help us distinguish them from other words or variables. Here, LANGLE corresponds to an angle bracket facing left, <. LANGLESLASH is a < followed by a forward-slash, /. RANGLE is an angle bracket facing right, >. EQUAL is just '='. STRING is going to have double-quotes around it, and WORD is anything else. Now, it turns out that naming tokens is not entirely arbitrary. We're just going to go with these token names for now, but if you were designing a system from the ground up, you could rename them to be anything you like.

Specifying Tokens
We're going to use regular expressions to specify tokens. Regular expressions are very good at specifying sets of strings. Later on, we'll want to match a range of different tokens from web-pages or JavaScript. This is how we write out token definitions in Python.

def t_RANGLE(token):
    r'>'
    return token

The t_ tells us, and tells the Python interpreter, that we're declaring a token. The next letters are the name of the token. You could make this up yourself, but in homeworks we'll tell you what we want it to be. In a sense, token is a function of the text it matches. We'll look at this more a little later. Next, we have a regular expression corresponding to the token. In this case, for the right angle token, there's really only one string it can correspond to, so we've written out the regular expression that corresponds to that string, r'>'. We then return the text of the token unchanged. We could transform the token text before returning it, and we'll do that for more complicated tokens like numbers, where we may want to change the string '1.2' into the number 1.2, for example.

Specifying Tokens Quiz


Write the code for the LANGLESLASH token. Use the interpreter to define a procedure t_LANGLESLASH() that matches it.

Token Values Quiz


It's not enough to know that a string contains a token, just like it's not enough to know that a sentence contains a verb. We need to know which token it is. Formally, we refer to that as the value of the token. By default, the value of a token is the value of the string it matched. However, we can transform it. Here is a definition for a slightly more complicated token. A number consisting of one or more copies of the digits 0-9
def t_NUMBER(token):
    r'[0-9]+'
    token.value = int(token.value)
    return token

If the input text is "1368", what will the value of the token be? Check all that apply.

"1368"
1368
"1"
1111111

Quoted Strings Quiz


When reasoning about HTML, it's critical that we understand quoted strings. They come up in almost every anchor tag, and anchor tags are the essence of hypertext. <a href="www.google.com"> link!</a> They're the interlinks between documents, so we really need them. They rely on quoted strings, and that means we're going to need to understand quoted strings. It's just as well that we had plenty of practice with them in the last unit. Suppose that all the strings we care about start with a double quote, end with a double quote, and in the middle they can contain any number of characters except double quotes. Write a definition for the Python function t_STRING() and submit it via the interpreter.

Whitespace
We're using these token definitions in regular expressions to break down HTML and JavaScript into important words. As we've seen before, there can be lots of extra space between the various tokens. We really want to skip or pass over spaces (and possibly also newline characters and tabs - but more on that later). We do that using the same sort of token definitions as before. Here is a regular expression that matches a single space. In this case, instead of returning the token, we pass it by.

def t_WHITESPACE(token):
    r' '
    pass

Whitespace Quiz
We've seen how to do left and right angle bracket tokens. We've looked at strings before. Now let's tackle words, which are almost everything else on a web page. Let's say that a word is any number of characters except a left angle bracket, a right angle bracket, or a space. Submit a definition for the function t_WORD() using the interpreter. The function should return the value unchanged.

Lexical Analyzer
We've now seen a number of these token definitions, one for words, one for strings and so on. A lexical analyzer, or 'lexer', is a collection of token definitions. We specify what makes a word, what makes a string, what makes a number, what makes white space and then put it all together. The result is a lexer, something that splits a string into exactly the token definitions that you have given it. As an example, when we put these three rules together, they become a lexer:
def t_WHITESPACE(token):
    r' '
    pass

def t_WORD(token):
    r'[^ <>]+'
    return token

def t_NUMBER(token):
    r'[0-9]+'
    token.value = int(token.value)
    return token
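The same three rules can be mimicked without ply, using only Python's standard re module. The sketch below is not the course's implementation - it is a minimal stand-alone tokenizer, and it lists NUMBER ahead of WORD so that digits are preferred (as the next quiz assumes):

```python
import re

# Rule order matters: with '|' alternation, Python tries earlier
# alternatives first, so NUMBER wins over WORD for pure digits.
rules = [
    ('NUMBER',     r'[0-9]+'),
    ('WORD',       r'[^ <>]+'),
    ('WHITESPACE', r' '),
]
master = re.compile('|'.join('(?P<%s>%s)' % (name, pat) for name, pat in rules))

def tokenize(text):
    tokens = []
    for m in master.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == 'WHITESPACE':
            continue                 # like "pass": skip spaces entirely
        if kind == 'NUMBER':
            value = int(value)       # transform the value, as t_NUMBER does
        tokens.append(value)
    return tokens

print(tokenize('33 is less than 55'))  # [33, 'is', 'less', 'than', 55]
```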

Lexical Analyzer Quiz


Suppose we pass to the above lexical analyzer the input string: 33 is less than 55. Which of these 3 possible output lists could correspond to the values of the tokens extracted from this input string using these rules?

[33, 'is', 'less', 'than', 55]
[33, 'is', 'less', ' ', 'than', 55]
['33', 'is', 'less', 'than', '55']

Ambiguity Quiz
We're going to assume that NUMBER is preferred to WORD. Whenever we can match something as a number, we will match it as a number instead of a word. Which of these 4 inputs could break down into WORD, WORD, NUMBER using the rules that we've been going over so far?

grace hopper 1906
grace hopper 1 9 0 6
grace "hopper" 1906
grace hopper onenineohsix

Rule Order
As we saw in the last quiz, it's not entirely clear what we should do when our token definitions overlap. When two token definitions can match the same string, the behaviour of our lexical analyzer may be ambiguous. The seven-character sequence "hello" (including the double quotes) matches our definition for WORD, but it also matches our definition for STRING. We need to have definitive rules to specify which definition we prefer. We will use a very simple rule: in our implementation we will favour the token definition listed first. If we are making a lexical analyzer for HTML and JavaScript, ordering is of prime importance.
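The effect of rule order can be demonstrated with plain re, outside of ply. In this hypothetical sketch, first_match plays the role of the lexer's dispatch: it returns the name of the first listed rule that matches at the current position:

```python
import re

def first_match(rules, text):
    # Try each rule in the listed order; the first one to match wins.
    for name, pattern in rules:
        if re.match(pattern, text):
            return name
    return None

string_first = [('STRING', r'"[^"]*"'), ('WORD', r'[^ <>]+')]
word_first   = [('WORD', r'[^ <>]+'), ('STRING', r'"[^"]*"')]

# The same seven-character input is classified differently
# depending purely on which rule is listed first.
print(first_match(string_first, '"hello"'))  # STRING
print(first_match(word_first, '"hello"'))    # WORD
```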

Rule Order Quiz


Suppose we want the input string hello "world" to yield WORD STRING. Which rule must come last?
def t_WORD(token):
    r'[^ <>]+'
    return token

def t_STRING(token):
    r'"[^"]*"'
    return token

def t_WHITESPACE(token):
    r' '
    pass

String Snipping
You may have noticed a bit of redundancy in the way we handled quoted strings. We return the entire matched quote, including the double quotes at the ends. We don't actually need those double quotes; they are really just markers to tell us when the string starts and ends. Here is a token definition that removes the quotes before returning the string:

def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token

Now we will make a lexical analyzer in Python. The first thing we need to do is to import the lexical analyzer library that we are going to build on:
import ply.lex as lex

Next, we give a list of the tokens that we are going to define:


tokens = (
    'LANGLE',        # <
    'LANGLESLASH',   # </
    'RANGLE',        # >
    'EQUAL',         # =
    'STRING',        # "hello"
    'WORD',          # Welcome
)

t_ignore = ' '  # shortcut for whitespace

For now, we are just going to make use of these 6 tokens. The last line is a handy shortcut which means we don't need to use the WHITESPACE token. Using t_ignore (token ignore) will implicitly ignore everything that matches the given expression.

Next, we define our tokens:


def t_LANGLESLASH(token):
    r'</'
    return token

def t_LANGLE(token):
    r'<'
    return token

def t_RANGLE(token):
    r'>'
    return token

def t_EQUAL(token):
    r'='
    return token

def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token

def t_WORD(token):
    r'[^ <>]+'
    return token

Notice that we defined the LANGLESLASH token ahead of the LANGLE token. This is because we want the LANGLESLASH to be given priority. We can now use these token definitions to break up a webpage. To keep the results manageable, we will define a simple webpage to be:
webpage = "This is <b>my</b> webpage!"

Now we need to tell the lexical analysis library to use the token definitions we have given to make a lexical analyzer and break up strings:
htmllexer = lex.lex()

and which string to break up:


htmllexer.input(webpage)

Now, we just want to print out the elements in the list of tokens produced by the lexical analyzer:
while True:
    tok = htmllexer.token()
    if not tok:
        break
    print(tok)

The complete code now looks like this:


import ply.lex as lex

tokens = (
    'LANGLE',        # <
    'LANGLESLASH',   # </
    'RANGLE',        # >
    'EQUAL',         # =
    'STRING',        # "hello"
    'WORD',          # Welcome
)

t_ignore = ' '  # shortcut for whitespace

def t_LANGLESLASH(token):
    r'</'
    return token

def t_LANGLE(token):
    r'<'
    return token

def t_RANGLE(token):
    r'>'
    return token

def t_EQUAL(token):
    r'='
    return token

def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token

def t_WORD(token):
    r'[^ <>]+'
    return token

webpage = "This is <b>my</b> webpage!"

htmllexer = lex.lex()
htmllexer.input(webpage)

while True:
    tok = htmllexer.token()
    if not tok:
        break
    print(tok)

When we run this through the Python interpreter we get the result:
LexToken(WORD,'This',1,0)
LexToken(WORD,'is',1,5)
LexToken(LANGLE,'<',1,8)
LexToken(WORD,'b',1,9)
LexToken(RANGLE,'>',1,10)
LexToken(WORD,'my',1,11)
LexToken(LANGLESLASH,'</',1,13)
LexToken(WORD,'b',1,15)
LexToken(RANGLE,'>',1,16)
LexToken(WORD,'webpage!',1,18)

These numbers in the brackets refer to the line number and the character position within the input.

Tracking Line Numbers


As we have just seen, lexers often track line number information (and sometimes column number or character count information as well). This can be very handy information to keep track of. You have probably written an incorrect Python programme. Most of us have written lots! It is really nice if the interpreter tells you which line the mistake is on. Unfortunately, the lexer won't do this for us entirely automatically. It will keep track of character positions, but it won't keep track of lines unless we tell it how. Here is a token definition for newlines:
def t_NEWLINE(token):
    r'\n'
    token.lexer.lineno += 1
    pass

You may be unfamiliar with the definition used in the regular expression for newline, but \n is the string equivalent of pressing the Return or Enter key on your keyboard. We also need to update our token definition for WORD to ignore newlines:
def t_WORD(token):
    r'[^ <>\n]+'
    return token

When we add these to our code, the lexer will keep track of newline information for the webpage.
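Under the hood, the line number of any position can be recovered by counting the newline characters that precede it. The helper below is not part of ply; it is just a sketch of the bookkeeping that the t_NEWLINE rule performs incrementally:

```python
def line_of(text, pos):
    # Lines are numbered from 1; every '\n' before pos starts a new line.
    return text.count('\n', 0, pos) + 1

webpage = "This is\n<b>webpage!"
print(line_of(webpage, 0))  # 1: 'This' is on the first line
print(line_of(webpage, 8))  # 2: '<' comes after the newline
```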

Crafting Input Quiz


Define a variable called webpage that holds a string that causes our lexical analyzer to produce the exact output below:
LexToken(WORD,'This',1,0)
LexToken(WORD,'is',1,5)
LexToken(LANGLE,'<',2,11)
LexToken(WORD,'b',2,12)
LexToken(RANGLE,'>',2,13)
LexToken(WORD,'webpage!',2,14)

Commented Html
Just as we have to separate words in HTML and JavaScript based on white space, we also have to take into account comments. Comments in HTML serve the same purpose that they do in Python. They provide documentation or remove functionality. You can add a comment containing English text to explain what the HTML or JavaScript should be doing, or you could comment-out a function or a line to see how things behave without it. In HTML, comments are indicated like this: <!-- comments --> They look a little like tags.

Here is an HTML fragment: I think therefore <!-- this comment not printed --> I am! We're going to see how to implement this in our lexical analyzer, but remember that our lexical analyzer is just based on regular expressions. We could recognize these comments with another finite-state machine, and then all we need to do is merge the two finite-state machines together. If we could have one set of rules describing comments, and another set of rules describing all of our other tokens, we could just combine them together into one big machine. This machine might have too many states for us to be comfortable with, but it is entirely fine for a computer.

When we process a comment, the normal rules don't apply. Let's consider a very tricky comment example: Welcome to <b>my <!-- careful </b> --> webpage.</b> The question is, how will this render? Which of these words will be emboldened?

When something is in a comment, we ignore it entirely. It's as if it wasn't there. So even though this might look as if it's closing the bold tag, it isn't. The words "my" and "webpage" will both be rendered as bold. Welcome to my webpage. In practice, it's almost as if everything in the comments were entirely erased and had no impact on the final rendering of the webpage at all.

Commented Html Quiz


Here is another HTML fragment that includes a comment: Hello <!-- "world" <b> --> confusing </b> Which of the following HTML tokens could be found by our lexer, assuming that we've added the right rules for comments to our lexer? LANGLE LANGLESLASH RANGLE STRING WORD

HTML Comments
So let's add HTML style comments to our lexer. Just as we had to list all of the tokens, we're now going to have to list all the possible things we could be doing. We are either breaking down the input into tokens, or we're handling comments. These two possible choices are called lexer states:
states = (
    ('htmlcomment', 'exclusive'),
)

This particular syntax for specifying them is arbitrary. It's just a function of the library we're using. We are going to declare a new state called "htmlcomment" that is exclusive. If we're in the middle of processing an HTML comment, we can't be doing anything else. We won't be finding strings or words. Now we add our rule for entering this special mode for HTML comments.
def t_htmlcomment(token):
    r'<!--'
    token.lexer.begin('htmlcomment')

Now, when we see the beginning marker, <!--, we'll enter this special HTML comment state and won't find words or strings or numbers or tags. Instead we'll ignore everything. When we reach the end marker for an HTML comment, we want to jump back to the initial mode (whatever we were doing before):
def t_htmlcomment_end(token):
    r'-->'
    token.lexer.lineno += token.value.count('\n')
    token.lexer.begin('INITIAL')

By default, that mode is called "INITIAL" - all in capital letters. Notice that we've called the string.count function to count the number of newline characters that occur in the entire comment and add those to the current line number. This is because we're in our special HTML comment mode, so we aren't using any of our other rules - even the newline rule that's counting the line numbers for us. This is a little tricky, so don't worry if this doesn't quite make sense the first time. Now, we've said what to do when an HTML comment begins, and we've said what to do when it ends. However, any other character we see while in this special HTML comment mode isn't going to match one of those two rules. Anything that doesn't match one of our rules counts as an error. What we will do is to create a rule that says, in case of an error, skip over that single character in the input:
def t_htmlcomment_error(token):
    token.lexer.skip(1)

This is similar to writing "pass" except that it gathers up all of the text into one big value so that we can count the newlines later. Now if we submit the webpage: "hello <!-- comment --> all" to our lexer, we get the result with just two tokens, as we should expect:
LexToken(WORD,'hello',1,0)
LexToken(WORD,'all',1,23)
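The lexer-state approach above is what we will use in the course, but the same end result can be sketched with a single regular expression that erases comments before tokenization. This is a simplification, not the ply implementation (and on its own it loses the newline counting that the state-based version preserves):

```python
import re

def strip_html_comments(html):
    # Non-greedy .*? stops at the first '-->'; DOTALL lets comments span lines.
    return re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)

print(strip_html_comments('hello <!-- comment --> all'))
# The tricky example from above: the </b> inside the comment vanishes.
print(strip_html_comments('Welcome to <b>my <!-- careful </b> --> webpage.</b>'))
```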

Token Counting Quiz


Given our definition of HTML comments plus a rule for word tokens:
def t_WORD(token):
    r'[^ <>]+'
    return token

How many word tokens are we going to find in the following HTML input fragment? ob <!-- fuscation -->tuse angle

Five Factorial
Now that we've mastered lexing HTML, let's turn our attention to JavaScript - the other language we'll be considering in this class. We'll introduce you to JavaScript and its syntax by jumping straight into an example. Let's say we want to create a webpage that displays some welcome text and then computes five factorial (5 times 4 times 3 times 2 times 1) and displays it on the screen. In mathematics we would write 5! to denote five factorial. Now, this is exactly the sort of thing that a programming language like JavaScript could help us out with. Like Python, JavaScript can carry out computations. This lets us do work in the middle of a webpage. Let's write our first JavaScript script together:
<script type="text/javascript">
document.write("Hello world");
</script>

These three lines are a JavaScript program which might be embedded inside an HTML webpage. JavaScript programs always begin with the special script tag, and the script tag has an argument because there might be multiple types of tags out there in the universe. JavaScript's name for the print function is document.write, which we'll sometimes just abbreviate as write. But the semantics, or meaning, is largely the same. It's also worth noting that we've put parentheses around the argument to document.write, almost as if it were a mathematical function. We can also do that in Python. It's allowed, but often we don't bother. We ended the line with a semi-colon. The JavaScript program ends with the closing </script> tag. Let's use the power of JavaScript to compute five factorial. To do this, we'll make a recursive function called factorial that will compute the value:
<script type="text/javascript">
function factorial(n) {
    if (n == 0) {
        return 1;
    };
    return n * factorial(n-1);
}
document.write(factorial(5));
</script>

Let's walk through the JavaScript code together. The word function means we're declaring a function. This is similar to def in Python. This is followed by the function name and arguments, just like in Python. The punctuation in JavaScript is a little different from Python. Python would have a colon at the end of the line, but JavaScript requires an open curly brace. In this regard it's more like languages such as C or C++ or Java or C#.

Our factorial function is going to be recursive, and every recursive function needs a base case - a stopping condition. Our stopping condition is when n is 0, so we have an if statement to test for that. Once again this would look very similar in Python, except that we'd use a colon instead of an open curly brace. If n is 0, we return 1. Note that we have a semi-colon at the end of all our statements. Now, in Python we would use the tabbing to indicate that we've finished with the if statement. JavaScript doesn't use readable tabbing in that way. Instead, we explicitly close off the open curly brace (just like we'd close off a tag in HTML or close off parentheses once we start them) and complete the if statement with a semi-colon.

Next, we have a new return statement. This is our recursive function call calculating n times the factorial of n - 1, just like you'd expect to see in Python. We end the whole thing with a semi-colon. This is the end of our function definition, and we close off the open curly brace. Finally, we print out factorial of 5 using document.write. Our HTML/JavaScript for our webpage would now be:
<p>
Welcome to <b>Our Webpage</b>.
Five factorial (aka <i>5!</i>) is:
<script type="text/javascript">
function factorial(n) {
    if (n == 0) {
        return 1;
    };
    return n * factorial(n-1);
}
document.write(factorial(5));
</script>
</p>

And would be rendered as: Welcome to Our Webpage. Five factorial (aka 5!) is: 120
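For comparison, here is the same function written in Python, using a colon and indentation where JavaScript uses curly braces and semicolons:

```python
def factorial(n):
    # Base case: 0! is 1.
    if n == 0:
        return 1
    # Recursive case: n! = n * (n-1)!
    return n * factorial(n - 1)

print(factorial(5))  # 120
```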

Note: Wes was using JSFiddle to manipulate the JavaScript code. JSFiddle is a playground for web developers, a tool which may be used in many ways. One can use it as an online editor for snippets built from HTML, CSS and JavaScript. The code can then be shared with others, embedded on a blog, etc. Using this approach, JavaScript developers can very easily isolate bugs.

Honor And Honour


Many of the differences between JavaScript and Python are similar to the differences between American English and British English. Sometimes the spelling changes a bit, or the pronunciation changes a bit, but in general, they are mutually intelligible. JavaScript should look very similar to Python, but there are some differences. JavaScript is big on curly braces; Python is big on tabbing. JavaScript is big on semicolons; Python not so much. JavaScript really loves parentheses; Python can take them or leave them.

Honor And Honour Quiz


Here is a new JavaScript function called ironcrutchli:

function ironcrutchli(x) {
    if (x < 2) {
        return 0;
    }
    return (x + 1);
}

It takes an argument x, does some reasoning based on x, and returns a value as a result. What would the output from ironcrutchli be for each of these inputs? 0 1 2 9

Identifier
An identifier is a name for a program concept, such as a function or variable. In the previous example, ironcrutchli and x were identifiers. Identifiers are textual string descriptions that refer to program elements. So factorial, x and tmp are all identifiers. We often call them identifiers because they identify a particular value or storage location. We're going to allow our identifiers to start with either an uppercase or lowercase letter. They can contain any number of upper or lowercase letters and can also include numbers and underscores to aid readability. We don't allow identifiers to start with an underscore or a number.

Identifier Quiz
Write an identifier token rule for these sorts of JavaScript identifiers, as we have defined them.

Numbers Quiz
JavaScript also supports numbers. Just like in Python, they can be integers, decimal fractions or negative numbers. Write a token rule, t_NUMBER(), for numbers. Set token.value to a float.

The End Of The Line


In Python, we can write a comment that continues to the end of the line by prefacing it with the hash sign, #. JavaScript allows comments in a similar manner, but JavaScript comments begin with two slashes: //. Here is a rule for comments that go to the end of the line in JavaScript:
def t_eolcomment(token):
    r'//[^\n]*'
    pass

This will accept anything except a newline, and then discard it.
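As with HTML comments, the behaviour of this rule can be checked with plain re outside of ply. This stand-alone sketch deletes everything from // to the end of the line:

```python
import re

def strip_eol_comments(code):
    # [^\n]* stops before the newline, so line structure is preserved.
    return re.sub(r'//[^\n]*', '', code)

print(strip_eol_comments('return 1; // base case'))
print(strip_eol_comments('x = 1; // one\ny = 2; // two'))
```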

The End Of The Line Quiz


Which of these sequences would yield IDENTIFIER IDENTIFIER NUMBER, given the rules we have discussed for JavaScript?

irene joliot_curie 9.1897//1956
ralph emerson 1803
henry thoreau //1817.0
//marie curie 1867

Wrap Up
Just as English sentences can be broken up into words, HTML and JavaScript can be broken up into tokens. Some of the more complex tokens we have looked at include numbers, words and strings. Since they correspond to sets of strings, we use regular expressions to specify them. We introduced you to JavaScript, and showed you how to use regular expressions to specify core parts of HTML and JavaScript. The homework for this unit will give you a chance to expand on that knowledge. In the next unit, we will continue our progress towards an HTML and JavaScript interpreter. We will learn how to combine the tokens we have recovered into sentences which we can check using rules of syntax, just like the rules we have in natural languages.

Answers
Really Quiz
The first, third and fourth fragments show the word really in bold:

George Orwell was <b>really</b> Eric Blair.
George Orwell was <b> really</b> Eric Blair.
<b>George Orwell was really Eric Blair.</b>

In the second fragment the tags are reversed (</b>really<b>), so really is not bold.

Interpreting HTML Quiz


Mary, Vindication and wrote are displayed. href and wikipedia are part of the anchor tag's argument, and i is a tag name, so they are not displayed.

Taking HTML Apart Quiz


The second and sixth fragments decompose into this sequence of five elements:

< b>salvador dali
<b> salvador dali

The first begins with the start of a closing tag, </. The third yields only the single word salvadordali. The fourth and fifth contain extra tokens for the closing tag.

Specifying Tokens Quiz


def t_LANGLESLASH(token):
    r'</'
    return token

Token Values Quiz


1368 - the rule converts the matched string "1368" into the integer 1368 before returning the token.

Quoted Strings Quiz


def t_STRING(token):
    r'"[^"]*"'
    return token

Whitespace Quiz
def t_WORD(token):
    r'[^ <>]+'
    return token

Lexical Analyzer Quiz


[33, 'is', 'less', 'than', 55] and ['33', 'is', 'less', 'than', '55'] could both correspond, depending on whether NUMBER or WORD wins when both match. [33, 'is', 'less', ' ', 'than', 55] cannot, because the whitespace rule passes over spaces rather than returning them.

Ambiguity Quiz
Only grace hopper 1906 breaks down into WORD, WORD, NUMBER. grace hopper 1 9 0 6 yields four separate numbers, grace "hopper" 1906 contains a STRING, and grace hopper onenineohsix is three words.

Rule Order Quiz


t_WORD must come last. If t_WORD were tried before t_STRING, the quoted string would match as a WORD (with its quotes included) and we would get WORD WORD instead of WORD STRING.

Crafting Input Quiz


webpage = "This is \n  <b>webpage!"

Commented Html Quiz


LANGLESLASH, RANGLE and WORD. The text outside the comment is Hello confusing </b>, which yields WORD, WORD, LANGLESLASH, WORD and RANGLE. "world" and <b> are inside the comment, so STRING and LANGLE never appear.

Token Counting Quiz


3

Honor And Honour Quiz


ironcrutchli(0) = 0, ironcrutchli(1) = 0, ironcrutchli(2) = 3, ironcrutchli(9) = 10

Identifier Quiz
def t_IDENTIFIER(token):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    return token
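The rule's regular expression can be checked on its own with re.fullmatch; this quick test is not part of the course code, just a stand-alone sanity check:

```python
import re

identifier = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')

# Valid: starts with a letter; may continue with letters, digits, underscores.
for s in ['factorial', 'tmp_2', 'X9']:
    assert identifier.fullmatch(s) is not None

# Invalid: identifiers may not start with an underscore or a digit.
for s in ['_hidden', '9lives']:
    assert identifier.fullmatch(s) is None

print('all identifier checks passed')
```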

Numbers Quiz
def t_NUMBER(token):
    r'-?[0-9]+\.?[0-9]*'
    token.value = float(token.value)
    return token

The End Of The Line Quiz


The first two: irene joliot_curie 9.1897//1956 (the comment swallows 1956, leaving two identifiers and the number 9.1897) and ralph emerson 1803. In henry thoreau //1817.0 the number is commented out, and //marie curie 1867 is entirely a comment.
