A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for wanting to transform source code is to create an executable program.

The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). If the compiled program can run on a computer whose CPU or operating system is different from the one on which the compiler runs, the compiler is known as a cross-compiler. A program that translates from a low-level language to a higher-level one is a decompiler. A program that translates between high-level languages is usually called a language translator, source-to-source translator, or language converter. A language rewriter is usually a program that translates the form of expressions without a change of language.

A compiler is likely to perform many or all of the following operations: lexical analysis, preprocessing, parsing, semantic analysis (syntax-directed translation), code generation, and code optimization. Program faults caused by incorrect compiler behavior can be very difficult to track down and work around; therefore, compiler implementors invest significant effort to ensure the correctness of their software. The term compiler-compiler is sometimes used to refer to a parser generator, a tool often used to help create the lexer and parser.
Structure of a compiler
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler must 1) determine the correctness of the syntax of programs, 2) generate correct and efficient object code, 3) manage the run-time organization, and 4) format output according to assembler and/or linker conventions. A compiler consists of three main parts: the front end, the middle end, and the back end.

The front end checks whether the program is correctly written in terms of the programming language's syntax and semantics. Here legal and illegal programs are recognized, and errors, if any, are reported in a useful way. Type checking is also performed by collecting type information. The front end then generates an intermediate representation (IR) of the source code for processing by the middle end.

The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of a computation to a less frequently executed place (e.g., out of a loop), or specialization of a computation based on its context. The middle end generates another IR for the following back end. Most optimization effort is focused on this part.

The back end is responsible for translating the IR from the middle end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers to the program variables where possible. The back end also exploits the hardware by figuring out how to keep parallel execution units busy, fill delay slots, and so on. Although many optimization problems are NP-hard, heuristic techniques for them are well developed.
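The front end / middle end / back end split described above can be sketched as a toy pipeline. This is a minimal illustration, not a real compiler: the "source language" is assumed to be a single sum of integer literals, and all names are invented for the example.

```python
# Toy front end / middle end / back end, for a "language" consisting of
# one sum of integer literals (an illustrative assumption, not a real
# compiler architecture).

def front_end(source):
    """Check syntax and build a trivial IR: a list of integer operands."""
    ir = []
    for part in source.split('+'):
        part = part.strip()
        if not part.isdigit():
            raise SyntaxError(f"not an integer literal: {part!r}")
        ir.append(int(part))
    return ir

def middle_end(ir):
    """Optimize: constant-fold the whole sum into a single value."""
    return [sum(ir)]

def back_end(ir):
    """Emit trivial 'assembly' for the (folded) IR."""
    return [f"MOV #{value}, R0" for value in ir]

def compile_expr(source):
    return back_end(middle_end(front_end(source)))

print(compile_expr("1 + 2 + 3"))   # ['MOV #6, R0']
```

Because the middle end folds the constants, the back end emits one instruction instead of a chain of additions, mirroring how real optimizations shrink the generated code.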
Phases of a compiler:
The main task of the lexical analyzer is to read a stream of characters as input and produce a sequence of tokens (names, keywords, punctuation marks, etc.) for the syntax analyzer. It discards the white space and comments between the tokens and also keeps track of line numbers. <fig: 3.1 pp. 84>
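These duties can be sketched in a small hand-written lexer. The token names (ID, NUM, PUNCT) and the '#'-comment convention are illustrative assumptions, not part of any particular language.

```python
# Minimal hand-written lexer sketch: reads a character stream, emits
# (token, lexeme, line) triples, skips white space and '#' comments,
# and tracks line numbers. Token names are illustrative.

def lex(source):
    tokens = []
    line = 1
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == '\n':
            line += 1
            i += 1
        elif ch.isspace():
            i += 1
        elif ch == '#':                      # comment runs to end of line
            while i < len(source) and source[i] != '\n':
                i += 1
        elif ch.isalpha():
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            tokens.append(('ID', source[i:j], line))
            i = j
        elif ch.isdigit():
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(('NUM', source[i:j], line))
            i = j
        else:
            tokens.append(('PUNCT', ch, line))
            i += 1
    return tokens

print(lex("x = 1  # init\ny = x"))
```

Note that the comment and the blanks produce no tokens at all; only the line counter records that they were there.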
Tokens, Patterns, Lexemes
Specification of Tokens
  o Regular Expressions
  o Notational Shorthand
Finite Automata
  o Nondeterministic Finite Automata (NFA)
  o Deterministic Finite Automata (DFA)
  o Conversion of an NFA into a DFA
  o From a Regular Expression to an NFA
Example of tokens:
  o Type tokens (id, num, real, . . . )
  o Punctuation tokens (IF, void, return, . . . )
  o Alphabetic tokens (keywords)
Example of non-tokens:
  o Comments ( /* ... */ )
  o Preprocessor directives ( #include <...> )
  o Blanks, tabs, and newlines
Patterns
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
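The pattern letter (letter | digit)* translates directly into a regular expression. A sketch using Python's re module (assuming ASCII letters only, as in the classic Pascal rule):

```python
import re

# The identifier pattern id -> letter (letter | digit)* written as a
# Python regular expression (ASCII letters assumed, as in Pascal).
ID_PATTERN = re.compile(r'[A-Za-z][A-Za-z0-9]*')

def is_identifier(s):
    """True iff the whole string s matches the id pattern."""
    return ID_PATTERN.fullmatch(s) is not None

print(is_identifier("x1"))   # True
print(is_identifier("1x"))   # False: must start with a letter
```

Every string accepted by is_identifier belongs to the set of lexemes described by the id pattern.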
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, the pattern for the RELOP token contains six lexemes (=, <>, <, <=, >, >=), so the lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.
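The many-lexemes-to-one-token relationship can be sketched as follows; the function name and the (token, lexeme) pairing are illustrative. Two-character lexemes must be tried first so that, e.g., <= is not mis-read as < followed by =.

```python
# All six relational lexemes map to the single token RELOP. The lexer
# returns (token, lexeme), so the parser sees only RELOP while later
# phases can still recover which operator it was.
RELOP_LEXEMES = ['<>', '<=', '>=', '<', '>', '=']   # longest match first

def match_relop(source, pos=0):
    """Return (('RELOP', lexeme), next_pos), or (None, pos) on no match."""
    for lexeme in RELOP_LEXEMES:
        if source.startswith(lexeme, pos):
            return ('RELOP', lexeme), pos + len(lexeme)
    return None, pos

print(match_relop("<= y"))   # (('RELOP', '<='), 2)
print(match_relop("y"))      # (None, 0)
```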
Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings. Let L and M be two languages, where L = {dog, ba, na} and M = {house, ba}. Then:
Union: L ∪ M = {dog, ba, na, house}
Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
Exponentiation: L2 = LL
By definition: L0 = {ε} and L1 = L
The Kleene closure of a language L, denoted by L*, is "zero or more concatenations of" L.
L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .
For example, if L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, . . . }
The positive closure of a language L, denoted by L+, is "one or more concatenations of" L.
L+ = L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .
For example, if L = {a, b}, then L+ = {a, b, aa, ab, ba, bb, aaa, . . . }
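These operations on languages can be checked directly by modeling a language as a Python set of strings. The full closures are infinite, so the sketch computes only a finite approximation up to a given power; function names are illustrative.

```python
from itertools import product

# Languages modeled as sets of strings; '' plays the role of epsilon.

def concat(L, M):
    """Concatenation LM = {xy | x in L, y in M}."""
    return {x + y for x, y in product(L, M)}

def power(L, n):
    """Exponentiation: L^0 = {epsilon}, L^n = L^(n-1) L."""
    result = {''}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, n):
    """Finite approximation of L*: L^0 U L^1 U ... U L^n."""
    out = set()
    for k in range(n + 1):
        out |= power(L, k)
    return out

L = {'dog', 'ba', 'na'}
M = {'house', 'ba'}
print(L | M)                              # union
print(concat(L, M))                       # concatenation
print(sorted(closure_up_to({'a', 'b'}, 2)))
```

Dropping k = 0 from the union in closure_up_to gives the corresponding approximation of the positive closure L+.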
Code generation:
<fig: 9.1 - page 513> Since code generation is an undecidable problem (mathematically speaking), we must be content with heuristic techniques that generate "good" (not necessarily optimal) code. Code generation must do the following things:
1. Memory Management
Mapping names in the source program to addresses of data objects is done cooperatively by pass 1 (the front end) and pass 2 (the code generator). A name in a three-address statement (quadruple) refers to the address of its data object.
Local variables (local to functions or procedures) are stack-allocated in the activation record, while global variables are kept in a static area.
2. Instruction Selection
The nature of the instruction set of the target machine determines the difficulty of selection. Selection is "easy" if the instruction set is regular, that is, uniform and complete. Uniform: all operations support the same addressing forms (e.g., all three-address, or all single-address stack operations). Complete: any register can be used with any operation. If we do not care about the efficiency of the target program, instruction selection is straightforward. For example, for the three-address code:
a := b + c
d := a + e
inefficient assembly code is:
1. MOV b, R0
2. ADD c, R0
3. MOV R0, a
4. MOV a, R0
5. ADD e, R0
6. MOV R0, d
Here the fourth statement is redundant, and so is the third statement if 'a' is not subsequently used.
3. Register Allocation
Registers can be accessed faster than memory words, so frequently accessed variables should reside in registers (register allocation). Formally, there are two steps:
1. Register allocation (which variables?): a selection process in which we choose the set of variables that will reside in registers.
2. Register assignment (which register?): here we pick the specific register that each such variable will occupy.
Note that this is an NP-complete problem. Some of the issues that complicate register allocation:
1. Special uses of hardware: for example, some instructions require specific registers.
2. Software conventions: for example,
Register R6 (say) always holds the return address, and register R5 (say) holds the stack pointer. Similarly, we assign registers for branch-and-link, frames, heaps, etc.
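The allocation/assignment split above can be sketched with a simple greedy heuristic: put the k most frequently used variables in registers (allocation), then bind each chosen variable to a specific register (assignment). Real allocators use graph coloring; this frequency-based version is an illustrative assumption, not the standard algorithm.

```python
from collections import Counter

def allocate_registers(uses, k):
    """uses: variable names in order of use; k: registers available.

    Step 1 (allocation): choose which variables live in registers.
    Step 2 (assignment): bind each chosen variable to a register name.
    """
    freq = Counter(uses)
    in_registers = [var for var, _ in freq.most_common(k)]          # allocation
    return {var: f"R{i}" for i, var in enumerate(in_registers)}     # assignment

uses = ['a', 'b', 'a', 'c', 'a', 'b']
print(allocate_registers(uses, 2))   # {'a': 'R0', 'b': 'R1'}
```

With only two registers, the infrequently used 'c' stays in memory while 'a' and 'b' are register-resident.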
3. Choice of Evaluation Order
Changing the order of evaluation may produce more efficient code. Finding the best order is an NP-complete problem, but we can bypass this hindrance by generating code for the quadruples in the order in which they were produced by the intermediate code generator. For example, swapping
ADD x, y, T1
ADD a, b, T2
is legal because x, y and a, b are different (not dependent).
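A naive code generator for quadruples, processed in the order given, can be sketched as follows. Tracking what R0 currently holds lets it skip the redundant reload seen in the six-instruction listing for a := b + c; d := a + e. The quadruple layout and opcode names are illustrative assumptions.

```python
# Naive code generation for three-address statements (quadruples),
# in the order produced by the intermediate code generator.

def gen(quads):
    """quads: list of (dest, src1, op, src2) three-address statements."""
    code = []
    in_r0 = None                       # name whose value R0 currently holds
    for dest, src1, op, src2 in quads:
        if in_r0 != src1:              # skip the reload if src1 is in R0
            code.append(f"MOV {src1}, R0")
        code.append(f"{'ADD' if op == '+' else 'SUB'} {src2}, R0")
        code.append(f"MOV R0, {dest}")
        in_r0 = dest                   # after the store, R0 still holds dest
    return code

quads = [('a', 'b', '+', 'c'), ('d', 'a', '+', 'e')]
for line in gen(quads):
    print(line)
```

For the example above this emits five instructions instead of six, eliminating the redundant MOV a, R0.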
Typical Architecture
The target machine is assumed to have:
1. Byte-addressable memory with 4 bytes per word; each integer requires 2 bytes (16 bits).
2. n general-purpose registers, R0, R1, . . . , Rn-1.
3. Three-address instructions of the form: op source1, source2, destination (e.g., ADD A, B, C).
4. Two-address instructions of the form: mnemonic source, destination (e.g., MOV b, R0).

Addressing modes:

MODE                FORM     ADDRESS                       EXAMPLE            ADDED COST
Absolute            M        M                             ADD temp, R1       1
Register            R        R                             ADD R0, R1         0
Indexed             c(R)     c + contents(R)               ADD 100(R2), R1    1
Indirect register   *R       contents(R)                   ADD *R2, R1        0
Indirect indexed    *c(R)    contents(c + contents(R))     ADD *100(R2), R1   1
Literal             #c       constant c                    ADD #3, R1         1
Each instruction has a cost of 1 plus added costs for the source and destination:
cost of instruction = 1 + costs associated with the source and destination addressing modes.
This cost corresponds to the length (in words) of the instruction.
Examples
1. Move register to memory (R0 → M):
MOV R0, M
cost = 1 + 1 = 2
2. Indirect indexed mode:
MOV *4(R0), M
cost = 1 (instruction word) + 1 (indirect indexed source) + 1 (absolute destination) = 3
3. Indexed mode:
MOV 4(R0), M
cost = 1 + 1 + 1 = 3
4. Literal mode:
MOV #1, R0
cost = 1 + 1 = 2
5. Move memory to memory:
MOV m, m
cost = 1 + 1 + 1 = 3
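The cost rule can be checked mechanically with a small helper that adds 1 for the instruction word plus the added cost of each operand's addressing mode, per the table above. The mode names are spelled out for readability; they are an illustrative encoding of the table, not a standard API.

```python
# Added cost per addressing mode, from the table above.
ADDED_COST = {
    'register': 0,           # R
    'indirect_register': 0,  # *R
    'absolute': 1,           # M
    'indexed': 1,            # c(R)
    'indirect_indexed': 1,   # *c(R)
    'literal': 1,            # #c
}

def instruction_cost(source_mode, dest_mode):
    """Cost in words: 1 for the opcode + added cost of each operand."""
    return 1 + ADDED_COST[source_mode] + ADDED_COST[dest_mode]

print(instruction_cost('register', 'absolute'))          # MOV R0, M     -> 2
print(instruction_cost('indirect_indexed', 'absolute'))  # MOV *4(R0), M -> 3
print(instruction_cost('literal', 'register'))           # MOV #1, R0    -> 2
```

The printed values reproduce examples 1, 2, and 4 above.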