
Advanced Processor

B.E. Semester V (CE)




Advanced Processor
Introduction to 16-bit microprocessors
8086 architecture
Segments
Flags
8086 pin functions
Minimum and maximum mode operations
Memory banks
80286/386/486 register set
Data types
Overview of instruction set
Memory segmentation with descriptor tables including LDT and GDT
Privilege levels, Changing privilege levels
Paging including address translation
Page level protection
MMU cache
Virtual memory
Paging and segmentation
Multitasking with TSS
Context switching
Task scheduling
Extension and I/O permission
Managing interrupts with IDT

8086 Chipset
























[Figure: 8086 chipset diagram]

8086 Pin Diagram






[Figure: 8086 pin diagram — 40-pin DIP showing multiplexed address/data lines AD0–AD15, address/status lines A16/S4–A19/S6, BHE/S7, and the minimum-mode control pins (HOLD, HLDA, WR, M/IO, DT/R, DEN, ALE, INTA) with their maximum-mode alternatives]


8086 Block Diagram

[Figure: 8086 internal block diagram — the BIU (segment registers ES, CS, SS, DS; instruction pointer IP; 6-byte instruction stream queue; bus and system control) feeds the EU (general registers, ALU, operands, flags, EU control) over the internal bus, with the memory interface at the top]

Features of 8086 Microprocessor


The Intel 8086 was launched in 1978.
It was Intel's first 16-bit microprocessor.
It offered a major improvement in execution speed over the
8085.
It is available as 40-pin Dual-Inline-Package (DIP).
It is available in three versions:
8086 (5 MHz)
8086-2 (8 MHz)
8086-1 (10 MHz)
It consists of 29,000 transistors.


The 8086 has two functional units aimed to work in parallel : the
BIU (Bus Interface Unit) and the EU (Execution Unit).


Principle



The two units (BIU and EU) work in parallel: the BIU pre-fetches
instructions into a queue inside the interface unit. This queue is a
FIFO (First-In First-Out) buffer holding up to 6 pre-fetched
bytes, thus optimizing bus usage. To execute a jump
instruction the queue has to be flushed, since the pre-fetched
instructions must not be executed.
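The queue behaviour described above can be sketched in Python (a hypothetical model for illustration only; the class and method names are invented):

```python
from collections import deque

# Model of the 8086 prefetch behaviour: the BIU fills a 6-byte FIFO
# queue while the EU consumes bytes from its front; a taken jump
# flushes the queue because the prefetched bytes are no longer valid.
QUEUE_SIZE = 6

class PrefetchQueue:
    def __init__(self):
        self.queue = deque()

    def prefetch(self, byte):
        """BIU side: push a fetched byte if there is room."""
        if len(self.queue) < QUEUE_SIZE:
            self.queue.append(byte)
            return True
        return False  # queue full, BIU idles

    def next_byte(self):
        """EU side: consume the oldest prefetched byte (FIFO order)."""
        return self.queue.popleft() if self.queue else None

    def flush(self):
        """A taken jump invalidates everything already prefetched."""
        self.queue.clear()

q = PrefetchQueue()
for b in (0x90, 0xB8, 0x01):   # bytes streamed in by the BIU
    q.prefetch(b)
assert q.next_byte() == 0x90   # first in, first out
q.flush()                      # jump taken: discard stale bytes
assert q.next_byte() is None
```

Note that the model captures only the FIFO and flush-on-jump behaviour, not bus timing.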


The pre-fetch mechanism of the 8086 can make system
debugging quite complicated, since bus activity is not
directly related to execution unit activity. Pre-fetched
instructions are generally executed several memory cycles
later, or may never be executed at all (in the case of a jump).


Execution Unit



The EU is where the actual processing of data takes
place inside the 8086 MPU. It is here that the
arithmetic and logic unit (ALU) is located, along with
the registers used to manipulate data and store
intermediate results. The EU accepts instructions and
data that have been fetched by the BIU and then
processes the information. Data processed by the EU
can be transmitted to memory or peripheral
devices through the BIU. The EU has no direct connection
with the outside world and relies solely on the BIU to
feed it with instructions and data.


Inside Execution Unit



The EU is made up of two parts known as the
ALU and the general registers. It is here that
instructions are received, decoded, and
executed from the instruction queue portion of
BIU. The instructions are taken from the top of
the instruction queue on the first-in, first-out,
or FIFO, basis.


Inside Execution Unit: ALU



The ALU is the calculator part of the
execution unit. It consists of electronic
circuitry that performs arithmetic operations or
logical operations on the binary represented
electrical signals. The control system for the
execution unit can also be thought of as part of
ALU. It provides a path for the flow of
instructions into the ALU, the general
registers, and the flag register.



Inside Execution Unit: Flag Register



A flag is a flip-flop which indicates some
condition produced by the execution of an
instruction or controls certain operations of the
EU. The Flag Register is a special register
associated with the ALU. A 16-bit flag register
in the EU contains nine active flags.
Inside Execution Unit: Flag Register

[Figure 2.4: Lower word of the flag register — from bit 15 down to bit 0: NT, IOPL, OF, DF, IF, TF, SF, ZF, AF, PF, CF, with the remaining bit positions unused]



NT (nested task) and IOPL (input/output privilege level) are also held in this register. Most of the
instructions that require the use of the ALU affect the flags. Remember that the flags allow
ALU instructions to be followed by conditional instructions.
The content/operation of each flag is as follows:

CF: Contains carry out of MSB of result
PF: Indicates if result has even parity
AF: Contains carry out of bit 3 in AL
ZF: Indicates if result equals zero
SF: Indicates if result is negative
OF: Indicates that an overflow occurred in result
IF: Enables/disables interrupts
DF: Controls pointer updates during string operations
TF: Provides single-step capability for debugging
IOPL: Priority level of current task
NT: Indicates if current task is nested

The upper 16 bits of the flag register are used for protected-mode operation.
Inside Execution Unit: General Purpose Registers



General-purpose registers, 16-bit and 32-bit forms (bits 31–16 exist only in the 32-bit registers; bits 15–8 and 7–0 are the H and L halves):

AX (AH:AL) → EAX
BX (BH:BL) → EBX
CX (CH:CL) → ECX
DX (DH:DL) → EDX
SI → ESI
DI → EDI
BP → EBP
SP → ESP
Inside Execution Unit: General Purpose Registers





EU has eight general purpose registers labeled AH, AL, BH, BL, CH,
CL, DH and DL. These registers are a set of data registers, which are
used to hold intermediate results. The H represents the high- order or
most- significant byte and the L represents the low- order or least-
significant byte. Each of these registers may be used separately as 8-
bit storage areas or combined to form one 16-bit (one word) storage
area.
The acceptable register pairs are AH and AL, BH and BL, CH and CL,
and DH and DL. The AH-AL pair is referred to as the AX register, the
BH-BL pair is referred to as the BX register, the CH-CL pair is
referred to as the CX register, and the DH-DL pair is referred to as the
DX register.
The AL register is also called the Accumulator. For 16-bit
operations, AX is called the accumulator.
The 8086 register set is very similar to those of earlier generation
8080 and 8085 microprocessors. Many programs written for the 8080
and 8085 could easily be translated to run on the 8086.
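The pairing described above can be sketched as simple bit arithmetic (an illustrative model; the helper names are invented):

```python
# Model of how the 8-bit register halves combine into 16-bit registers:
# AH holds the high byte and AL the low byte of AX (likewise BX, CX, DX).
def make_word(high, low):
    """Combine an H and an L byte into one 16-bit register value."""
    return ((high & 0xFF) << 8) | (low & 0xFF)

def split_word(word):
    """Recover the H and L bytes from a 16-bit register value."""
    return (word >> 8) & 0xFF, word & 0xFF

ah, al = 0x12, 0x34
ax = make_word(ah, al)
assert ax == 0x1234
assert split_word(ax) == (0x12, 0x34)
```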
Inside Execution Unit: General Purpose Registers





Stack Pointer Register
A stack is a section of memory set aside to store addresses and data while a
subprogram is being executed. An entire 64 Kbyte segment is set aside as the stack in the
8086 MPU. The upper 16 bits of the starting address for this segment are kept in the
stack segment register. The Stack Pointer (SP) register contains the 16-bit offset from
the start of the segment to the memory location where a word was most recently
stored on the stack. The memory location where a word was most recently stored is
called the top of the stack.
Other pointer and Index Registers
In addition to the Stack Pointer register, SP, the EU contains a 16-bit base pointer
(BP) register. It also contains a 16-bit Source Index (SI) register and a 16-bit
destination index (DI) register. These three registers can be used for temporary
storage of data just as the general purpose registers. However, their main use is to
hold the 16-bit offset of a data word in one of the segments. That is, the pointer and
index registers are usually used to point to or index to an address in memory. When
used in this manner, these registers are address registers that designate a specific
location in the memory that may be frequently used by the program. The addresses
contained in these registers can be combined with information from the BIU to
physically locate the data in the memory.


Bus Interface Unit



The BIU is made up of the address generation and
bus-control unit, the instruction queue, and the
instruction pointer. It has the task of making sure that
the bus is used to its fullest capacity in order to
speed up operations. This function is carried out in two
ways. First, by fetching instructions before they
are needed by the execution unit and storing them in
the instruction queue, the 8086 MPU is able to
increase computing speed. Second, with the BIU taking care of
all bus-control functions, the EU is free to
concentrate on processing data and carrying out the
instructions. The instruction pointer contains the
location or address of the next instruction to be
executed.



Inside Bus Interface Unit: Bus Control



The bus-control unit performs the bus operations for
the MPU. It fetches and transmits instructions, data
and control signals between MPU and the other
devices of the system.


Inside Bus Interface Unit: Instruction Queue


The instruction queue is used as a temporary memory storage area for
data instructions that are to be executed by the MPU. The BIU,
through the bus-control unit, prefetches instructions and stores them in
the instruction queue. This allows the execution unit to perform its
calculations at maximum efficiency. Because the BIU and EU
essentially operate independently, the BIU concentrates on loading
instructions into the instruction queue. This usually takes more time to
do than the calculations performed by the execution unit. In effect, the
BIU and the EU work in parallel. The instruction queue is a first-in,
first-out (FIFO) memory. This means that the first instruction loaded
into the instruction queue by the bus-control unit will be the first
instruction to be used by the execution unit.


Inside Bus Interface Unit: Address Control


The address-control unit is used to generate the 20-bit
memory address that gives the physical or actual location of
the data or instruction in memory. This unit consists of the
instruction pointer, the segment registers, and the address
generator.
Inside Bus Interface Unit: Instruction Pointer



The Instruction Pointer (IP) is a 16- bit register that is used
to point to, or tell the MPU, the instruction to execute next.
Therefore, the instruction pointer is used to control the
sequence in which the program is executed. Each time the
execution unit accepts an instruction, the instruction pointer,
is incremented to point to the next instruction in the
program.
Inside Bus Interface Unit: Segment Registers



There are four segment registers. They are the code segment (CS), the data segment
(DS), the stack segment (SS), and the extra segment (ES). These registers are used
to define a logical memory space or memory segment that is set aside for a
particular function.


The CS register points to the current code segment. Instructions are fetched from
this segment. The DS register points to the current data segment. Program variables
and data are held in this area. The SS register points to the current stack segment,
stack operations are performed on locations in the SS segment. The ES register
points to the current extra segment, which is also used for data storage. Each
segment can be up to 64 kilobytes long. Each segment is made up of an
uninterrupted section of memory locations. Each segment can be addressed
separately using the base address that is contained in its segment register. The base
address is the starting address for that segment.
Inside Bus Interface Unit: Address Generator




The address-generator unit is used with the segment
registers to generate the 20-bit physical address required to
identify all possible memory addresses. The 20 address
lines give a maximum physical memory size of 2^20 address
locations, or 1,048,576 bytes of memory. But all the
registers in the MPU are only 16 bits wide. The physical
address is obtained by shifting the segment base value four
bit positions (one hexadecimal position) to the left and adding the
offset, or logical address, within the segment.
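The shift-and-add rule can be sketched directly (a minimal model of the BIU computation; the function name is invented):

```python
# Model of the 8086 BIU address computation: the 16-bit segment base is
# shifted left four bits (one hex digit) and the 16-bit offset is added,
# yielding a 20-bit physical address (the result wraps at 20 bits).
def physical_address(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF  # keep 20 bits

assert physical_address(0x1000, 0x0005) == 0x10005
# The reset vector discussed later: CS = FFFFH, IP = 0000H -> FFFF0H
assert physical_address(0xFFFF, 0x0000) == 0xFFFF0
```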

8086: Pin Functions

Vcc is on pin 40 and ground on pins
1 and 20. The 8086 requires a +5 V supply. The clock
input, labeled CLK, is on pin 19. Different
versions of the 8086 have maximum clock
frequencies ranging from 5 MHz to 10 MHz.
Pins 2 through 16 and pins 35
through 39 are used for the address bus. Pins 35
through 38 are used by multiplexing to provide
information or status about the MPU. The
status signals are labeled S3, S4, S5 and S6 as
shown. The data bus lines AD0 through AD15
are used at the start of the machine cycle to
send out addresses, and later in the machine
cycle they are used to send or receive data. The
8086 sends out a signal called address latch
enable or ALE on pin 25 to let external
circuitry know that an address is on the data
bus. The upper 4 bits of an address are sent on
the lines labeled A16/S3 through A19/S6.

8086: Pin Functions

Some of the control bus lines on a
microprocessor usually have mnemonics such
as RD, WR and M/ IO. Pin 32 of the 8086 is
labeled RD. A tri-state active-low output signal
on pin 32 indicates that the 8086 is reading data
from memory or from a port. Pin 29 has a label
WR next to it. However, pin 29 also has a label
LOCK next to it, because this pin has two
functions. The function of this pin and the
functions of the pins between 24 and 31 depend
on the mode in which the 8086 is operating.

8086: Pin Functions

The operating mode of the 8086 is determined by
the logic level applied to the MN / MX input on
pin 33. If pin 33 is asserted high, then the 8086
will function in minimum mode, and pins 24
through 31 will have functions shown in
parentheses next to the pins. If the MN / MX pin
is asserted low, then the 8086 is in maximum
mode. In this mode pins 24 through 31 will have
the functions described by the mnemonics next to
the pins in fig. 8. A tri-state active-low output
signal on pin 29 indicates that MPU has put valid
and stable data on the data bus. Pin 28 will
function as M / IO. The 8086 will assert this
signal high if it is reading from or writing to a
memory location, and it will assert a signal low if
it is reading from or writing to a port. In the
maximum mode the control bus signals (S0, S1,
S2) are sent out in encoded form on pins 26,27
and 28. An external bus controller device decodes
these signals to produce the control bus signals
required for a system, which has two or more
microprocessors sharing the same buses.

8086: Pin Functions
If pin 21, the RESET input, is made high, the 8086
will, no matter what it is doing, reset its DS, SS, ES,
IP and flag registers to all 0's. It will set its CS
register to FFFFH. When the RESET signal is removed
from pin 21, the 8086 will then fetch its next
instruction from physical address FFFF0H. This
address is produced in the 8086 Bus Interface Unit
(BIU) by shifting the FFFFH in the CS register 4 bits
left and adding the 0000H in the instruction pointer to
it. The first instruction to be executed after a
reset is therefore put at address FFFF0H.


8086 has two interrupt inputs, non-maskable interrupt
(NMI) input on pin 17 and the interrupt (INTR) input
on pin 18. An active-high on any one of these pins
will cause the 8086 to stop execution of its current
program and go execute a specified procedure. At the
end of the procedure it can return to executing the
interrupted program. The NMI cannot be ignored, or
masked, by the MPU. The INTR (interrupt request) is
maskable and can be made to be ignored by the MPU
through software control.

8086: Pin Functions
A tri-state active-low output signal on pin 26,
DEN (data enable), determines whether the data
buffer is enabled or disabled. A tri-state output
signal on pin 27, DT/R (data transmit/receive),
is used to control the direction of data flow. A
logic level 1 indicates data bits are being
transmitted from the MPU. A logic level 0
indicates that data bits are being received into
the MPU.


All microprocessors use an oscillator to
generate a master frequency clock to
synchronize or time operations. For the 8086
microprocessor the oscillator frequency, or
clock frequency is typically 5 MHz.


8086: Minimum and Maximum Modes


Minimum mode: The 8086 processor works in a single
processor environment. All control signals for memory and I/O
are generated by the microprocessor.
Maximum mode is designed to be used when a coprocessor
exists in the system. 8086 works in a multiprocessor
environment. Control signals for memory and I/O are generated
by an external BUS Controller.
Minimum mode operation is similar to that of the Intel 8085A
microprocessor, while maximum mode operation is new &
specially designed for the operation of the 8087 arithmetic
coprocessor.
8086: Memory Banks
[Figure: (a) Logical memory organization — a single 1M-byte address space from 00000H to FFFFFH; (b) Physical memory organization — high and low 512K-byte memory banks of the 8086 microprocessor]
8086: Memory Banks



The memory address space of the 8086-based microcomputers has different
logical and physical organizations.


Logically, memory is implemented as a single 1M × 8 memory array. The byte-wide
storage locations are assigned consecutive addresses over the range from
00000H through FFFFFH.


Physically, memory is implemented as two independent 512Kbyte banks: the low
(even) bank and the high (odd) bank. Data bytes associated with an even address
(00000H, 00002H, etc.) reside in the low bank, and those with odd addresses
(00001H, 00003H, etc.) reside in the high bank.


Address bits A1 through A19 select the storage location that is to be accessed.
They are applied to both banks in parallel. A0 and bank high enable (BHE) are
used as bank-select signals.


Each of the memory banks provides half of the 8086's 16-bit data bus. The lower
bank transfers bytes of data over data lines D0 through D7, while data transfers
for a high bank use D8 through D15.


(a) Even-address byte transfer
by 8086.
(b) Odd-address byte transfer by
8086.
(c) Even-address word transfer
by 8086.
(d) Odd-address word transfer
by 8086.
(a) Even-address byte transfer by 8086.








[Figure (a): even-address byte transfer — A0 low selects the low bank; the byte moves over D0–D7 while BHE stays high]
(b) Odd-address byte transfer by 8086.






[Figure (b): odd-address byte transfer — BHE low selects the high bank; the byte moves over D8–D15 while A0 stays high]
(c) Even-address word transfer by 8086.





































[Figure (c): even-address word transfer — A0 and BHE both low; both banks are accessed in a single bus cycle]
(d) Odd-address word transfer by 8086.


[Figure (d): odd-address word transfer — two bus cycles: the first reads the odd-addressed byte from the high bank, the second reads the byte at X+1 from the low bank]
80286 Architecture

Intel 80286 architecture

[Figure: Intel 80286 internal architecture — Address Unit (AU) with address latches and drivers (A23–A0, BHE#, M/IO#); Bus Unit (BU) with prefetcher, prefetch queue, processor extension interface (PEREQ, PEACK#), bus control (READY#, HOLD, S1#, S0#, COD/INTA#, LOCK#, HLDA), and data transceivers; Instruction Unit (IU) with instruction decoder and 3-instruction decoded queue; Execution Unit (EU); plus RESET, CLK, Vss, Vcc, and CAP pins]
80386 Architecture




[Figure: Intel 80386 internal architecture — Segmentation Unit (descriptor registers, limit and attribute PLA), Paging Unit (page cache, control and attribute PLA, adders, request prioritizer), Protection Test Unit, Bus Control (address drivers BE0#–BE3#, A2–A31; pipeline/bus size control M/IO#, D/C#, W/R#, LOCK#, ADS#, NA#, BS16#, READY#; multiplexers/transceivers D0–D31; ERROR#, BUSY#, RESET, HLDA), Execution Unit (register file, ALU, multiply/divide, barrel shifter, control ROM, status flags, decode and sequencing), Instruction Decoder (3-decoded instruction queue), and Instruction Prefetcher (16-byte code queue). PLA: Programmable Logic Array]
80486 Architecture

[Figure: Intel 80486DX2 internal architecture — 64-bit interunit transfer bus; core clock derived from the clock multiplier; 8-KByte cache with cache control (KEN#, FLUSH#, AHOLD, EADS#); paging unit with translation lookaside buffer; segmentation unit with descriptor registers; base architecture register file, ALU, and barrel shifter; FPU with floating-point register file; microinstruction sequencer with control ROM; prefetcher with 32-byte code queue; instruction decode (2 × 16 byte code stream); control and protection test unit; bus interface with address drivers (A31–A2, BE0#–BE3#), write buffers (4 × 32), data bus transceivers (D31–D0), bus control (ADS#, W/R#, D/C#, M/IO#, PCD, PWT, RDY#, LOCK#, PLOCK#, BOFF#, A20M#, HOLD, HLDA, RESET, SRESET, INTR, NMI, SMI#, SMIACT#, FERR#, IGNNE#, STPCLK#), burst bus control (BRDY#, BLAST#), bus size control (BS16#, BS8#), parity generation and control (DP3–DP0, PCHK#), and boundary scan control (TCK, TMS, TDI, TDO)]
80x86 Register Set

[Figure: 80x86 register set — general-purpose registers EAX (AH/AL = AX), EBX, ECX, EDX, ESI, EDI, EBP, ESP (32 bits); segment registers CS (code), SS (stack), DS, ES, FS, GS (16 bits); floating-point registers (80 bits) with FP control, status, and tag registers (16 bits); MMX registers MM0–MM7 (64 bits); XMM registers XMM0–XMM7 (128 bits) with the MXCSR control/status register (32 bits); instruction pointer EIP (32 bits, low word IP); flags register EFLAGS (32 bits, low word FLAGS)]




[Figure: EFLAGS register layout, bits 31–0 — ID (bit 21, identification flag), VIP (20, virtual interrupt pending), VIF (19, virtual interrupt flag), AC (18, alignment check), VM (17, virtual-8086 mode), RF (16, resume flag), NT (14, nested task flag), IOPL (13–12, I/O privilege level), OF (11, overflow), DF (10, direction), IF (9, interrupt enable), TF (8, trap flag), SF (7, sign), ZF (6, zero), AF (4, adjust), PF (2, parity), CF (0, carry); the remaining bits are reserved and set to 0]
EFLAG




CF (bit 0) Carry flag Set if the operation resulted in a carry out of the MSB; otherwise CF = 0.
PF (bit 2) Parity flag Set if the least-significant byte of the result contains an even number
of 1 bits; cleared otherwise.
AF (bit 4) Adjust flag Set if an arithmetic operation generates a carry or a borrow out of
bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD)
arithmetic.
ZF (bit 6) Zero flag Set if the result is zero; cleared otherwise.
SF (bit 7) Sign flag Set equal to the most-significant bit of the result, which is the sign bit
of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)
OF (bit 11) Overflow flag Set if the integer result is too large a positive number or too
small a negative number (excluding the sign bit) to fit in the destination operand; cleared
otherwise. This flag indicates an overflow condition for signed-integer (two's complement)
arithmetic.
Of these status flags, only the CF flag can be modified directly, using the STC, CLC, and
CMC instructions. Also the bit instructions (BT, BTS, BTR, and BTC) copy a specified bit
into the CF flag.
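A minimal sketch of how these status-flag definitions fall out of an 8-bit add (illustrative only; the helper name is invented, and two's-complement semantics are assumed):

```python
# Model of the status flags for an 8-bit add, mirroring the
# CF/PF/AF/ZF/SF/OF definitions above.
def add8_flags(a, b):
    result = (a + b) & 0xFF
    flags = {
        "CF": (a + b) > 0xFF,                   # carry out of the MSB
        "PF": bin(result).count("1") % 2 == 0,  # even number of 1 bits
        "AF": ((a & 0xF) + (b & 0xF)) > 0xF,    # carry out of bit 3
        "ZF": result == 0,
        "SF": bool(result & 0x80),              # copy of the sign bit
        # overflow: operands share a sign that the result does not
        "OF": bool((~(a ^ b) & (a ^ result)) & 0x80),
    }
    return result, flags

res, f = add8_flags(0x7F, 0x01)   # 127 + 1 overflows the signed range
assert res == 0x80 and f["OF"] and f["SF"] and not f["CF"]
res, f = add8_flags(0xFF, 0x01)   # 255 + 1 carries out of 8 bits
assert res == 0x00 and f["CF"] and f["ZF"] and not f["OF"]
```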
DF Flag - The direction flag (DF, located in bit 10 of the EFLAGS register) controls string
instructions (MOVS, CMPS, SCAS, LODS, and STOS). Setting the DF flag causes the string
instructions to auto-decrement (to process strings from high addresses to low addresses).
Clearing the DF flag causes the string instructions to auto-increment (process strings from
low addresses to high addresses).
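The DF behaviour can be sketched as a MOVS-style byte copy (an illustrative model, not the processor's microcode; names are invented):

```python
# Model of a REP MOVSB-style copy: DF clear -> SI/DI auto-increment,
# DF set -> auto-decrement (processing strings high-to-low).
def movs_bytes(memory, si, di, count, df):
    step = -1 if df else 1
    for _ in range(count):
        memory[di] = memory[si]
        si += step
        di += step
    return si, di

mem = {0: ord('a'), 1: ord('b'), 2: ord('c'), 10: 0, 11: 0, 12: 0}
si, di = movs_bytes(mem, 0, 10, 3, df=False)   # forward copy
assert (mem[10], mem[11], mem[12]) == (ord('a'), ord('b'), ord('c'))
assert (si, di) == (3, 13)                      # pointers incremented
```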
System Flags and IOPL Field The system flags and IOPL field in the EFLAGS register
control operating-system or executive operations. They should not be modified by application
programs. The functions of the system flags are as follows:
EFLAG




TF (bit 8) Trap flag Set to enable single-step mode for debugging; clear to disable single-step mode.
IF (bit 9) Interrupt enable flag Controls the response of the processor to maskable interrupt requests.
Set to respond to maskable interrupts; cleared to inhibit maskable interrupts.
IOPL (bits 12 and 13) - I/O privilege level field Indicates the I/O privilege level of the currently
running program or task. The current privilege level (CPL) of the currently running program or task must
be less than or equal to the I/O privilege level to access the I/O address space. This field can only be
modified by the POPF and IRET instructions when operating at a CPL of 0.
NT (bit 14) Nested task flag Controls the chaining of interrupted and called tasks. Set when the
current task is linked to the previously executed task; cleared when the current task is not linked to
another task.
RF (bit 16) Resume flag Controls the processor's response to debug exceptions.
VM (bit 17) Virtual-8086 mode flag Set to enable virtual-8086 mode; clear to return to protected
mode without virtual-8086 mode semantics.
AC (bit 18) Alignment check flag Set this flag and the AM bit in the CR0 register to enable alignment
checking of memory references; clear the AC flag and/or the AM bit to disable alignment checking.
VIF (bit 19) Virtual interrupt flag Virtual image of the IF flag. Used in conjunction with the VIP flag.
(To use this flag and the VIP flag the virtual mode extensions are enabled by setting the VME flag in
control register CR4.)
VIP (bit 20) Virtual interrupt pending flag Set to indicate that an interrupt is pending; clear when no
interrupt is pending. (Software sets and clears this flag; the processor only reads it.) Used in conjunction
with the VIF flag.
ID (bit 21) Identification flag The ability of a program to set or clear this flag indicates support for the
CPUID instruction.
x86 Register Set

Base Architecture Registers (or Application Register Set)
General Purpose Registers
Instruction Pointer
Flag Registers
Segment Registers
System Registers
Memory Management Registers
Control Registers
Floating Point Registers
Data Registers
Tag word
Status word
Control word
Instruction and data pointers
Debug Registers


System Registers


Memory Management Registers
GDTR (Global Descriptor Table Register)
IDTR (Interrupt Descriptor Table Register)
TR (Task Register)
LDTR (Local Descriptor Table Register)
Control Registers
CR0
CR1
CR2
CR3
CR4
Memory Management Registers


These registers specify the locations of the data structures which control segmented memory
management. The GDTR and IDTR can be loaded with instructions which get a 6-byte data
item from memory. The LDTR and TR can be loaded with instructions which take a 16-bit
segment selector as an operand. The remaining bytes of these registers are then loaded
automatically by the processor from the descriptor referenced by the operand. (More details
will be given later.)
Global Descriptor Table Register: GDTR

The contents of the global table register define a table in the processor's physical memory
address space called the Global Descriptor Table (GDT). This global descriptor table is one
important element of the processor's memory management system.


GDTR is a 48-bit register that is located inside the processor. The lower two bytes of this
register, which are identified as LIMIT, specify the size in bytes of the GDT. The decimal
value of LIMIT is one less than the actual size of the table. For instance, if LIMIT equals
00FFh the table is 256 bytes in length. Since LIMIT has 16 bits, the GDT can be up to
65,536 bytes long. The upper four bytes of the GDTR, which are labeled BASE, locate the
beginning of the GDT in physical memory. This 32-bit base address allows the table to be
positioned anywhere in the processor's address space.


The GDT provides a mechanism for defining the characteristics of the processor's global
memory address space. Global memory is a general system resource that is shared by many
or all software tasks. That is, storage locations in global memory are accessible by any task
that runs on the microprocessor. This table contains what are called system segment
descriptors. It is these descriptors that identify the characteristics of the segments of global
memory. For instance, a segment descriptor provides information about the size, starting
point, and access rights of a global memory segment. Each descriptor is eight bytes long,
thus our earlier example of a 256-byte table provides enough storage space for just 32
descriptors. Remember that the size of the global descriptor table can be expanded simply
by changing the value of LIMIT in the GDTR under software control. If the table is
increased to its maximum size of 65,536 bytes, it can hold up to 8,192 descriptors.
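The LIMIT arithmetic above can be sketched directly (illustrative helpers; the names are invented):

```python
# Model of the GDTR LIMIT arithmetic: LIMIT is one less than the table
# size in bytes, and each descriptor occupies eight bytes.
DESCRIPTOR_SIZE = 8

def gdt_size(limit):
    """Table size in bytes for a given LIMIT field."""
    return limit + 1

def gdt_capacity(limit):
    """Number of 8-byte descriptors the table can hold."""
    return gdt_size(limit) // DESCRIPTOR_SIZE

assert gdt_size(0x00FF) == 256        # LIMIT = 00FFH -> 256-byte table
assert gdt_capacity(0x00FF) == 32     # room for 32 descriptors
assert gdt_capacity(0xFFFF) == 8192   # maximum table: 8,192 descriptors
```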

Interrupt Descriptor Table Register: IDTR

Just like the global descriptor table register, the interrupt descriptor table
register (IDTR) defines a table in physical memory. However, this table
contains what are called interrupt descriptors, not segment descriptors. For this
reason it is known as the Interrupt Descriptor Table (IDT).This register and
table of descriptors provide the mechanism by which the microprocessor passes
program control to interrupt and exception routines.


Just like the GDTR, the IDTR is 48 bits in length. Again, the lower two bytes of
the register (LIMIT) define the table size. That is, the size of the table equals
LIMIT+1 bytes. Since two bytes define the size, the IDT can also be up to
65,536 bytes long. But the processor only supports up to 256 interrupts and
exceptions; therefore, the size of the IDT should not be set to support more than
256 interrupts. The upper four bytes of the IDTR (BASE) identify the starting
address of the IDT in physical memory. The descriptors used in the IDT
are called interrupt gates. These gates provide a means for passing
program control to the beginning of an interrupt service routine. Each gate is
eight bytes long and contains both attributes and a starting address for the
service routine.
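The corresponding IDT sizing can be sketched as follows (illustrative; the names are invented):

```python
# Model of the IDT sizing constraint: at most 256 interrupt gates of
# eight bytes each, so a full table needs 2,048 bytes (LIMIT = 07FFH).
GATE_SIZE = 8
MAX_VECTORS = 256

def idt_limit(vectors):
    """LIMIT value for a table holding the given number of gates."""
    assert vectors <= MAX_VECTORS, "processor supports at most 256 vectors"
    return vectors * GATE_SIZE - 1

assert idt_limit(256) == 0x07FF   # full table: 2,048 bytes
assert idt_limit(32) == 0x00FF    # first 32 vectors only
```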
Local Descriptor Table Register : LDTR

The Local Descriptor Table Register (LDTR) is also part of the processor's memory
management support mechanism. Each task can have access to its own private
descriptor table in addition to the global descriptor table.


This private table is called the local descriptor table (LDT) and defines a local
memory address space for use by the task. The LDT holds segment descriptors that
provide access to code and data in segments of memory that are reserved for the
current task. Since each task can have its own segment of local memory, the
protected-mode software system may contain many local descriptor tables. Whenever a
selector is loaded into the LDTR, the corresponding descriptor is transparently read
from global memory and loaded into the local descriptor table cache within the
processor. It is this descriptor that defines the local descriptor table.


Thus, every time a selector is loaded into the LDTR, a local descriptor table
descriptor is cached and a new LDT is activated.

Task Register: TR


The task register is one of the key elements in the protected-mode task
switching mechanism of the processor. This register holds a
16-bit index value called a selector. The initial selector must be loaded into
TR under software control. This starts the initial task. After this is done, the
selector is changed automatically whenever the processor executes an
instruction that performs a task switch.


TR is used to locate a descriptor in the global descriptor table. Notice that
when a selector is loaded into TR, the corresponding task state segment
(TSS) descriptor automatically gets read from memory and loaded into the on-
chip task descriptor cache. This descriptor defines a block of memory
called the task state segment (TSS). It does this by providing
the starting address (BASE) and the size (LIMIT) of the segment.
Every task has its own TSS. The TSS holds the information needed to
initiate the task, such as initial values for the user-accessible registers.
Control Registers
[Figure: control register layouts — CR4 flag bits (VME, PVI, TSD, DE, PSE, PAE,
MCE, PGE, PCE, OSFXSR, OSXMMEXCPT, VMXE, SMXE) in its low bits; CR3 holding the
page-directory base address together with the PCD and PWT bits; CR2 holding the
page-fault linear address; CR1 reserved]
Control Registers

Notice that the lower five bits of CR0 are system control flags. These bits make up what
are known as the machine status word (MSW). The most significant bit of CR0 and registers
CR2 and CR3 are used by the processor's paging mechanism. CR0 contains information about
the processor's protected-mode configuration and status.
The protected-mode enable/protection enable (PE) bit determines if the processor is in
the real or protected mode. At reset, PE is cleared. This enables the real mode of operation. To
enter the protected mode, we simply switch PE to 1 through software. Once in the protected
mode, the processor can be switched back to real mode under software control by clearing the PE
bit. It can also be returned to real mode by hardware reset.
The math present/Monitor coProcessor (MP) bit is set to 1 to indicate that a numeric
coprocessor is present in the microcomputer system. On the other hand, if the system is to be
configured so that a software emulator is used to perform numeric operations instead of a
coprocessor, the emulate (EM) bit is set to 1. Only one of these two bits can be set at a time.
Finally, the extension type (ET) is used to indicate whether an 80387DX or 80287
numeric coprocessor is in use. Logic 1 indicates that an 80387DX is installed.
The last bit in the MSW, task switched (TS), automatically gets set whenever the
processor switches from one task to another. It can be cleared under software control.
The protected-mode software architecture of the processor also supports paged memory
operation. Paging is turned on by switching the PG bit in CR0 to logic 1. Addressing of
physical memory is then implemented with an address translation mechanism that consists of a page
directory and page tables, both held in physical memory. Register CR3 holds a 20-bit
page directory base address that points to the beginning of the page directory. A page fault error
occurs during the page translation process if the page is not present in memory. In this case, the
processor saves the address at which the page fault occurred in register CR2. This address is
denoted as the page fault linear address.
Control Registers
CR0

PG - Paging: if 1, enable paging and use the CR3 register; else disable paging
CD - Cache disable: globally enables/disables the memory cache
NW - Not-write through: globally enables/disables write-back caching
AM - Alignment mask: alignment check enabled if AM is set, the AC flag (in the
EFLAGS register) is set, and the privilege level is 3
WP - Write protect: determines whether the CPU can write to pages marked read-only
NE - Numeric error: enables internal x87 floating-point error reporting when set;
else enables PC-style x87 error detection
ET - Extension type: on the 386, specified whether the external math coprocessor
was an 80287 or 80387
TS - Task switched: allows saving x87 task context only after an x87 instruction is
used after a task switch
EM - Emulation: if set, no x87 floating-point unit present; if clear, x87 FPU present
MP - Monitor co-processor: controls interaction of WAIT/FWAIT instructions with the TS flag in CR0
PE - Protected Mode Enable: if 1, system is in protected mode; else system is in real mode
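As a quick illustration, the flag layout above can be decoded with simple bit masks. A minimal sketch (bit positions are from the Intel manuals; the sample value is made up):

```python
# Bit positions of the CR0 flags listed above.
CR0_BITS = {
    "PE": 0, "MP": 1, "EM": 2, "TS": 3, "ET": 4, "NE": 5,
    "WP": 16, "AM": 18, "NW": 29, "CD": 30, "PG": 31,
}

def decode_cr0(cr0):
    """Return the names of the CR0 flags that are set in the given value."""
    return {name for name, bit in CR0_BITS.items() if cr0 & (1 << bit)}

# Example: protected mode with paging enabled (PE | PG).
sample = (1 << 0) | (1 << 31)
print(sorted(decode_cr0(sample)))   # ['PE', 'PG']
```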


CR1
Reserved


CR2
Contains a value called Page Fault Linear Address (PFLA). When a page fault occurs, the address the
program attempted to access is stored in the CR2 register.
Control Registers
CR4

Used in protected mode to control operations such as virtual-8086 support, enabling I/O breakpoints,
page size extension and machine check exceptions.
SMXE SMX Enable

VMXE VMX Enable
OSXMMEXCPT Operating System Support for Unmasked SIMD Floating-Point Exceptions If set,

enables unmasked SSE exceptions.

OSFXSR Operating system support for FXSAVE and FXRSTOR instructions If set, enables SSE
instructions and fast FPU save & restore
PCE Performance-Monitoring Counter enable If set, RDPMC can be executed at any
privilege level, else RDPMC can only be used in ring 0.
PGE Page Global Enable If set, address translations (PDE or PTE records) may be shared
between address spaces.
MCE Machine Check Exception If set, enables machine check interrupts to occur.
PAE Physical Address Extension If set, changes page table layout to translate 32-bit
virtual addresses into extended 36-bit physical addresses.
PSE Page Size Extensions If unset, page size is 4 KB, else page size is increased to 4 MB
(or with PAE set, 2 MB).
DE Debugging Extensions
TSD Time Stamp Disable If set, RDTSC instruction can only be executed when in ring 0,
otherwise RDTSC can be used at any privilege level.
PVI Protected-mode Virtual Interrupts If set, enables support for the virtual
interrupt flag (VIF) in protected mode.
VME Virtual 8086 Mode Extensions If set, enables support for the virtual interrupt flag
(VIF) in virtual-8086 mode.
Floating Point Unit (FPU) Registers
[Figure: x87 FPU register set — eight 80-bit data registers R0-R7, each holding a
sign bit, exponent, and significand, addressed as a stack relative to TOP]
Floating Point Unit (FPU) Registers


The x87 Floating-Point Unit (FPU) provides high-performance
floating-point processing capabilities for use in graphics
processing, scientific, engineering, and business applications. It
supports the floating-point, integer, and packed BCD integer data
types and the floating-point processing algorithms and exception
handling architecture.




x87 FPU Data Registers:
The x87 FPU data registers consist of eight 80-bit registers. Values
are stored in these registers in the double extended-precision
floating-point format. When floating-point, integer, or packed BCD
integer values are loaded from memory into any of the x87 FPU
data registers, the values are automatically converted into double
extended-precision floating-point format (if they are not already in
that format).
Floating Point Unit (FPU) Registers


The x87 FPU instructions treat the eight x87 FPU data registers as
a register stack. All addressing of the data registers is relative to
the register on the top of the stack. The register number of the
current top-of-stack register is stored in the TOP (stack TOP) field
in the x87 FPU status word. Load operations decrement TOP by
one and load a value into the new top-of-stack register, and store
operations store the value from the current TOP register in memory
and then increment TOP by one. (For the x87 FPU, a load
operation is equivalent to a push and a store operation is equivalent
to a pop.) Note that load and store operations are also available that
do not push and pop the stack.
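The TOP-relative stack behavior described above can be modeled as follows (a simplified sketch, ignoring the tag word and exception handling):

```python
class FPUStack:
    """Toy model of the eight x87 data registers addressed relative to TOP."""
    def __init__(self):
        self.regs = [0.0] * 8
        self.top = 0            # the TOP field of the status word

    def push(self, value):      # load (FLD-style): decrement TOP, then store
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):              # store (FSTP-style): read ST(0), then increment TOP
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):            # ST(i) is addressed relative to TOP
        return self.regs[(self.top + i) % 8]

fpu = FPUStack()
fpu.push(1.5)
fpu.push(2.5)
print(fpu.st(0), fpu.st(1))     # 2.5 1.5
print(fpu.pop())                # 2.5
```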
x87 FPU Status Word
[Figure: x87 FPU status word, bits 15-0]
B - FPU busy
C3, C2, C1, C0 - Condition code
TOP - Top-of-stack pointer
ES - Error summary status
SF - Stack fault
Exception flags:
PE - Precision
UE - Underflow
OE - Overflow
ZE - Zero divide
DE - Denormal operand
IE - Invalid operand
x87 FPU Control Word

[Figure: x87 FPU control word, bits 15-0]
IC - Infinity control
RC - Rounding control
PC - Precision control
Exception masks:
PM - Precision
UM - Underflow
OM - Overflow
ZM - Zero divide
DM - Denormal operand
IM - Invalid operand

Modes of operation: Real/Protected




The 80286 incorporated two modes: A backward
compatible 8086 operating mode called Real-Mode
and a secondary advanced mode called Protected-
Mode.
Protected-mode allowed the 80286 to exploit its 24-
bit address bus and thus access up to 16MB of
physical memory. Unfortunately, DOS applications
could not be easily ported to protected-mode since it
was incompatible with the 8086 implementation for
which DOS had been developed. This fact among
others made protected-mode unattractive to software
developers of the time.

Real Mode of operation




Real mode, also called real address mode, is an
operating mode of 80286 and later x86-compatible
CPUs. Real mode is characterized by a 20 bit
segmented memory address space (giving just over 1
MB of addressable memory) and unlimited direct
software access to all memory and I/O addresses and
peripheral hardware. Real mode provides NO support
for memory protection, multitasking, or code
privilege levels. 80186 CPUs and earlier, back to the
original 8086, have only one operational mode, which
is equivalent to real mode in later chips. All x86
CPUs in the 80286 series and later start in real mode
when reset.
Protected Mode of operation
In computing, protected mode, also called protected virtual address mode, is
an operational mode of x86-compatible central processing units (CPU). It
allows system software to utilize features such as virtual memory, paging,
safe multi-tasking, and other features designed to increase an operating
system's control over application software.
When a processor that supports x86 protected mode is powered on, it begins
executing instructions in real mode, in order to maintain backwards
compatibility with earlier x86 processors. Protected mode may only be
entered after the system software sets up several descriptor tables and
enables the Protection Enable (PE) bit in the Control Register 0 (CR0).
Protected mode was first added to the x86 architecture in 1982, with the
release of Intel's 80286 (286) processor, and later extended with the release
of the 80386 (386) in 1985. Due to the enhancements added by protected
mode, it has become widely adopted and has become the foundation for all
subsequent enhancements to the x86 architecture.
Protected Mode
When the processor is running in protected-mode, two mechanisms
are involved in the memory translation process: Segmentation and
Paging. Although working in tandem, these two mechanisms are
completely independent of each other. In fact, the paging unit can be
disabled by clearing a single bit in an internal processor register. In
this case, the linear addresses which are generated by the
segmentation unit pass transparently through the paging unit and
straight to the processor address bus.






Figure - Protected-mode address translation process
Segmentation

The memory is subdivided into parts called segments.
A segment can be from 1B to 4GB long.
Segment can start from any base address in memory.
Overlapping between segments is allowed.
Segmentation: Real Mode

In real mode, the 16-bit segment selector was interpreted as the most
significant 16 bits of a linear 20-bit address, with the remaining four bits
being all zeros. The segment selector is always added with a 16-bit offset to
yield a linear address. For instance, the segmented address 6EFh:1234h has a
segment selector of 6EFh, which corresponds to the 20-bit linear address
6EF0h. To this we add the offset, yielding the linear address 6EF0h + 1234h
= 8124h (cf. hexadecimal).



A single linear address can be mapped to many segmented addresses. For
instance, the linear address above (8124h) can have the segmented addresses
6EFh:1234h, 812h:4h and 0h:8124h (and many more). This could be
confusing to programmers accustomed to unique addressing schemes.



The effective 20-bit address space of real mode limited the addressable
memory to 2^20 bytes, or 1,048,576 bytes.
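The translation described above can be sketched in a few lines, reproducing the worked example from the text:

```python
def real_mode_linear(segment, offset):
    """Real-mode translation: the 16-bit segment is shifted left 4 bits
    and added to the 16-bit offset; the result wraps at 20 bits (8086
    behavior, since the chip has only 20 address lines)."""
    return ((segment << 4) + offset) & 0xFFFFF

# The example from the text: 6EFh:1234h -> 6EF0h + 1234h = 8124h
print(hex(real_mode_linear(0x6EF, 0x1234)))   # 0x8124
# The same linear address reached from other segment:offset pairs
print(hex(real_mode_linear(0x812, 0x4)))      # 0x8124
print(hex(real_mode_linear(0x0, 0x8124)))     # 0x8124
```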
Segmentation: Protected Mode

When 32-bit x86 processor is reset or powered up, it is initialized in real
mode (same as 8086) but allows access to 32-bit register set.
The default operand size in real mode is 16-bit.
However, the regular mode of operation of a 32-bit x86 architecture
processor is in protected virtual address mode or protected mode.
In protected mode, the 16-bit selector is used to specify an index in an OS
defined table. The table contains 32-bit base address of a given segment. The
physical address is formed by adding this base address to the offset. (See
figure on the next slide).
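A sketch of this lookup, with a made-up two-entry descriptor table (real descriptors are 8-byte structures carrying access rights as well; only base and limit are modeled here):

```python
# Hypothetical table: selector index -> (base, limit).
descriptor_table = {
    1: (0x00400000, 0xFFFF),
    2: (0x08000000, 0xFFFF),
}

def protected_mode_linear(index, offset):
    """Form a linear address by adding the segment base (found via the
    selector index) to the offset, after checking the segment limit."""
    base, limit = descriptor_table[index]
    if offset > limit:
        raise MemoryError("general protection fault: offset beyond limit")
    return base + offset

print(hex(protected_mode_linear(1, 0x1234)))   # 0x401234
```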
Segmentation: Protected Mode




[Figure: protected-mode segmentation — a selector indexes a descriptor table;
each descriptor supplies a base and limit; the base is added to the offset to
locate the segment (e.g. Segment 1, Segment 2) within the 4 GB linear address
space]

Paging

x86 architecture support paging in protected mode. It provides efficient mechanism for handling
virtual memory. Paging mechanism is optional. (Can be enabled or disabled by PG bit in CR0)



[Figure: two-level paging for 4 KB pages — bits 31-22 of the linear address
index the root page table (page directory), bits 21-12 index the user page
table, and bits 11-0 give the offset within the 4 KB page; each directory entry
points to a page table, and each table entry points to a page of data]

x86 32-bit linear virtual address for 4 KB pages
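The field split shown above can be expressed directly with shifts and masks:

```python
def split_linear_address(addr):
    """Split a 32-bit linear address into the three fields used by
    two-level paging with 4 KB pages."""
    directory = (addr >> 22) & 0x3FF   # bits 31..22: root page table offset
    table     = (addr >> 12) & 0x3FF   # bits 21..12: user page table offset
    offset    = addr & 0xFFF           # bits 11..0 : offset within the page
    return directory, table, offset

print(split_linear_address(0x00401A2B))   # (1, 1, 2603)
```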
The address at the output of segmentation mechanism is called the linear address
which is input to paging mechanism. The output from the paging mechanism is the
physical address. When paging is disabled, linear address is same as the physical
address. When both segmentation and paging are disabled, the virtual and the
physical addresses are identical.
Segmentation can be disabled by choosing the segment size equal to the size of the
whole physical memory, up to 4GB.


Each segment has segment descriptor (8B) associated with it. It contains (a) 32-bit
segment base linear address (b) 20-bit segment limit (c) Access rights byte (d)
Control bits.
Segment limit field (20 bit): The segment size does not have byte granularity for
all segment sizes.
Segments have byte granularity (i.e. segments may differ in size by a single byte)
for segment sizes up to 1 MB (2^20).
For segments above 1MB and up to 4GB, there is a page granularity (i.e. segment
sizes may differ by a page size, which is 4KB)



Segment Descriptor

[Figure: 8-byte segment descriptor — upper doubleword: Base 31:24, G, D/B, L,
AVL, Limit 19:16, P, DPL, S, Type, Base 23:16; lower doubleword: Base Address
15:00, Segment Limit 15:00]

AVL - Available for use by system software
BASE - Segment base address
D/B - Default operation size (0 = 16-bit segment; 1 = 32-bit segment)
DPL - Descriptor privilege level
G - Granularity
LIMIT - Segment limit
P - Segment present
S - Descriptor type (0 = system; 1 = code or data)
TYPE - Segment type
Segment Descriptors

The segment base and segment limit fields are not contiguous but distributed
in number of subfields.
Segment descriptors are stored in descriptor tables in memory. The
descriptor table defines all the segments which are used in the system.
As seen earlier, there are three types of descriptor tables.
(1) GDT (2) LDT (3) IDT
GDT contains descriptors that are possibly available to all the tasks in the
system.
LDT contains descriptors associated with a given task. Each task may have a
separate LDT. A segment can not be accessed by a task if its segment
descriptor does not exist in either the LDT or the GDT.
IDT contains descriptors that point to the location of up to 256 ISRs. IDT is
basically the interrupt vector table. Each interrupt vector is a descriptor.
These descriptor tables are variable-length memory arrays (8B to 64KB).
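Because the base and limit fields are scattered across the 8 bytes, decoding a descriptor means reassembling them. A sketch (field positions per the standard 32-bit descriptor layout; the sample descriptor is made up):

```python
def decode_descriptor(d):
    """Reassemble base, limit, and granularity from an 8-byte segment
    descriptor whose fields are distributed across several subfields."""
    limit = d[0] | (d[1] << 8) | ((d[6] & 0x0F) << 16)        # 20-bit limit
    base  = d[2] | (d[3] << 8) | (d[4] << 16) | (d[7] << 24)  # 32-bit base
    g     = (d[6] >> 7) & 1                                   # granularity bit
    if g:                           # page granularity: limit counts 4 KB units
        limit = (limit << 12) | 0xFFF
    return base, limit

# Hypothetical descriptor: base 0x00100000, limit 0xFFFFF, G=1 (4 GB span)
desc = bytes([0xFF, 0xFF, 0x00, 0x00, 0x10, 0x9A, 0xCF, 0x00])
base, limit = decode_descriptor(desc)
print(hex(base), hex(limit))    # 0x100000 0xffffffff
```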
Protection Mechanism


Protection mechanism was started with the introduction of protected mode
operation on the 16-bit i286 and subsequently expanded for a 32-bit systems
on i386, i486 and Pentium.
The x86 architecture has four levels of protection called privilege levels
(PL). They support a multitasking OS in isolating and protecting user programs
from each other and the OS from unauthorized access. They control the use of
privileged instructions, I/O instructions, and access to segments and segment
descriptors.
The x86 architecture offers an additional type of protection on page basis,
when paging is enabled.
Protection Mechanism
[Figure: CPU-enforced protection rings — PL = 0 (most privileged): kernel;
PL = 1: system services; PL = 2: OS extensions; PL = 3: applications; software
interfaces and a high-speed operating system interface cross the ring
boundaries]
Protection Mechanism


The PLs are numbered 0, 1, 2, 3. Level 0 is the most privileged level. Level
3 is the least privileged level and used for regular user applications. Level 2
is used for OS extensions, level 1 for system services and level 0 is used for
the kernel OS.
The x86 architecture controls access to both data and control between levels
of a task, according to following rules.
Data stored in a segment with PL=p can be accessed only by code
executing at a PL at least as privileged as p.
A code segment (a procedure) with PL=p can be called only by a task
executing at the same or a lower PL than p.
Keep in mind that the CPU privilege level has nothing to do with operating
system users. Whether you're root, Administrator, guest, or a regular user, it
does not matter. All user code runs in ring 3 and all kernel code runs in ring
0, regardless of the OS user on whose behalf the code operates.
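Reading "more privileged" as a numerically smaller PL, the two access rules above can be written as predicates (a sketch of the rules as stated, not of the full x86 check, which also involves RPL and conforming segments):

```python
# Privilege levels: 0 is most privileged, 3 is least privileged.

def can_access_data(cpl, p):
    """Rule 1: data in a segment with PL = p can be accessed only by code
    executing at a PL at least as privileged as p (i.e. CPL <= p)."""
    return cpl <= p

def can_call_code(cpl, p):
    """Rule 2: a code segment with PL = p can be called only by a task
    executing at the same or a less privileged PL (i.e. CPL >= p)."""
    return cpl >= p

print(can_access_data(0, 3))   # True  - kernel may touch user data
print(can_access_data(3, 0))   # False - user code may not touch kernel data
print(can_call_code(3, 0))     # True  - user code may call into ring 0 (via a gate)
print(can_call_code(0, 3))     # False - ring 0 does not call out to ring 3 code
```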
Task Management


The x86 architecture handles tasks in a multitasking environment.
A task can be defined as an instance of the execution of a program.
Rapid switching between tasks is a very important attribute of multitasking.
The x86 supports the task switching operation in hardware.
The task switch operation
Saves the entire state of the machine
Loads a new execution state
Performs protection checks
Begins execution of the new task
The task switch operation is invoked by executing an intersegment JMP and
CALL instruction, which refers to task state segment (TSS) or a task gate
descriptor in the GDT or LDT.
An INT instruction, exception, trap or external interrupt may also invoke the
task switch operation.
[Figure: example Global Descriptor Table — ring 3 CODE, DATA, and STACK
segments (each 0x1000 bytes starting at 0x3000, 0x4000, and 0x5000), a ring 0
CODE segment at 0x1000, a GATE descriptor at 0x2000 through which a far call is
redirected into another segment, and a TSS segment that serves as the backing
store for process state (registers, etc.) on a context switch; protection rings
ensure outer rings cannot see inner rings]
Call gate
Since the processor knows what segments of memory the currently running process can access, it can enforce
protection and ensure the process doesn't touch anything it is not supposed to. If it does go out of bounds, you receive a
segmentation fault, which most programmers are familiar with.


The interesting bit comes when you want to make calls into code that resides in another segment. To implement
a secure system, we can give segments a certain permission value. x86 does this with rings, where ring 0 is the highest
permission, ring 3 is the lowest, and inner rings can access outer rings but not vice-versa.
Like any good nightclub, once you're inside "club ring 0" you can do anything you want.


Consequently there's a bouncer on the door, in the form of a call gate. Call gates are intended to allow less
privileged code to call code with a higher privilege level. When ring 3 code wants to jump into ring 0 code, you have to go
through the call gate. If you're on the door list, the processor gets bounced to a certain offset of code within the ring 0
segment. This allows a whole hierarchy of segments and permissions between them. Call gates use a special selector value to
reference a descriptor accessed via the Global Descriptor Table or the Local Descriptor Table, which contains the information
needed for the call across privilege boundaries. This is similar to the mechanism used for interrupt gates.


The problem with this scheme is that it is slow. It takes a lot of effort to do all this checking, and many registers
need to be saved to get into the new code. And on the way back out, it all needs to be restored again.


How to use call gate:
Assuming a call gate has been set up already by the OS kernel, code simply does a CALL FAR with the
necessary segment selector. The processor will perform a number of checks to make sure the entry is valid and the code was
operating at sufficient privilege to use the gate. Assuming all checks pass, a new CS/EIP is loaded from the segment
descriptor, and continuation information is pushed onto the stack of the new privilege level (old SS, old ESP, old CS, old EIP
in that order). Parameters may also be copied from the old stack to the new stack if needed. The number of parameters to
copy is located in the call gate descriptor.


The kernel may return to the user space program by using a RET FAR instruction which pops the continuation
information off the stack and returns to the outer privilege level.
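The stack hand-off described above can be sketched as follows (a simplified model; on real hardware the copied parameters sit between the old stack pointer and the old CS/EIP, and the parameter count comes from the gate descriptor):

```python
def call_through_gate(new_stack, old_ss, old_esp, params, old_cs, old_eip):
    """Model the inner-stack setup on a call through a gate: continuation
    information for the outer privilege level is pushed onto the stack of
    the new privilege level so that RET FAR can find its way back out."""
    new_stack.append(old_ss)     # old SS
    new_stack.append(old_esp)    # old ESP
    new_stack.extend(params)     # parameters copied from the old stack
    new_stack.append(old_cs)     # old CS
    new_stack.append(old_eip)    # old EIP
    return new_stack

inner = call_through_gate([], old_ss=0x23, old_esp=0xBFFF0000,
                          params=[7, 8], old_cs=0x1B, old_eip=0x401000)
print(inner)   # old SS, old ESP, the two copied parameters, old CS, old EIP
```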
TSS & TR
Each task has a Task State Segment (TSS) associated with it.
The TSS is a special structure on x86-based computers which holds information about a task.
It is used by the operating system kernel for task management. Specifically, the following
information is stored in the TSS:
Processor register state
I/O Port permissions
Inner level stack pointers
Previous TSS link
The TSS may reside anywhere in memory. The current TSS is identified by a special CPU
register called the Task Register (TR). TR holds a segment selector that points to a
valid TSS segment descriptor, which resides in the GDT (a TSS descriptor may not
reside in the LDT).
TR may be loaded through the Load Task Register (LTR) instruction. LTR is a privileged
instruction and acts in a manner similar to other segment register loads. The task register has
two parts: a portion visible and accessible by the programmer and an invisible one that is
automatically loaded from the TSS descriptor.
The TSS may contain saved values of all the x86 registers. This is used for task switching.
The operating system may load the TSS with the values of the registers that the new task
needs and after executing a hardware task switch (such as with an IRET instruction) the x86
CPU will load the saved values from the TSS into the appropriate registers. Note that some
modern operating systems such as Windows and Linux do not use these fields in the TSS as
they implement software task switching.
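The software alternative mentioned above, where the kernel rather than the CPU's TSS mechanism saves and restores register state, can be sketched like this (register names and the dict-based task structure are illustrative, not any real kernel's API):

```python
def context_switch(current_task, next_task, live_registers):
    """Minimal model of a software task switch: snapshot the outgoing
    task's registers, then load the incoming task's saved state."""
    current_task["saved_regs"] = dict(live_registers)   # save outgoing state
    live_registers.clear()
    live_registers.update(next_task["saved_regs"])      # restore incoming state
    return next_task

task_a = {"saved_regs": {}}
task_b = {"saved_regs": {"EAX": 7, "EIP": 0x8048000}}
regs = {"EAX": 1, "EIP": 0x401000}

running = context_switch(task_a, task_b, regs)
print(regs["EAX"], hex(regs["EIP"]))    # 7 0x8048000
```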
[Figure 4.168: Intel486 microprocessor TSS and TSS registers — the TSS holds a
back link, the ESP0/SS0, ESP1/SS1, and ESP2/SS2 stack pointers for CPL 0-2,
CR3, EIP, EFLAGS, the general registers (EAX, ECX, EDX, EBX, ESP, EBP, ESI,
EDI), the segment selectors (ES, CS, SS, DS, FS, GS), the LDT selector, the
debug trap T bit, and the I/O permission bitmap offset (one bit per I/O port;
the bitmap may be truncated using the TSS limit); the task register TR holds a
selector for the TSS descriptor in the GDT (type 9 = available TSS, type B =
busy TSS)]
Memory Management Unit (MMU)




First of all, we need to introduce the basics of physical and virtual memory
addressing.
Physical memory space refers to the actual size of installed operating memory
plus the PCI address range.
Virtual memory is an imaginary space available to software tasks.
As a matter of fact, the virtual memory space is larger than or equal to the
physical one.
Every running task is allocated some virtual memory, which is mapped onto
physical memory in some way, so that several virtual addresses may refer to
the same physical address. Both virtual and physical memory spaces use pages
for addressing needs.
Processor functional units operate with virtual addresses, but cache
and operating memory controllers have to deal with physical
addresses.
Memory Management Unit (MMU)

A memory management unit (MMU) is a computer hardware component
responsible for handling accesses to memory requested by the CPU. Its
functions include translation of virtual addresses to physical addresses (i.e.,
virtual memory management), memory protection, cache control, bus
arbitration, and, in simpler computer architectures (especially 8-bit systems),
bank switching.
Modern MMUs typically divide the virtual address space (the range of
addresses used by the processor) into pages, each having a size which is a
power of 2, usually a few kilobytes, but they may be much larger. The
bottom n bits of the address (the offset within a page) are left unchanged.
The upper address bits are the (virtual) page number. The MMU normally
translates virtual page numbers to physical page numbers via an associative
cache called a Translation Look-aside Buffer (TLB). When the TLB lacks a
translation, a slower mechanism involving hardware-specific data structures
or software assistance is used. The data found in such data structures are
typically called page table entries (PTEs), and the data structure itself is
typically called a page table. The physical page number is combined with
the page offset to give the complete physical address.
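The lookup order described above — TLB first, page table on a miss, then cache the result — can be sketched like this (dicts stand in for the hardware structures; the sample mapping is made up):

```python
PAGE_SHIFT = 12          # 4 KB pages: the low 12 bits are the page offset

def translate(vaddr, tlb, page_table):
    """Translate a virtual address: try the TLB first, fall back to the
    slower page-table walk on a miss, and fill the TLB with the result."""
    vpn = vaddr >> PAGE_SHIFT                 # virtual page number
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)  # offset, left unchanged
    if vpn in tlb:                            # TLB hit: fast path
        pfn = tlb[vpn]
    else:                                     # TLB miss: walk the page table
        pfn = page_table[vpn]
        tlb[vpn] = pfn                        # cache the translation
    return (pfn << PAGE_SHIFT) | offset

tlb = {}
page_table = {0x00400: 0x12345}               # one hypothetical mapping
print(hex(translate(0x00400ABC, tlb, page_table)))   # 0x12345abc
```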
Memory Management Unit (MMU)








[Figure: MMU translation path — the CPU issues a virtual address, the MMU
translates the virtual page number (via the TLB or page tables) to a physical
page number, and the resulting physical address reaches the cache and memory
controllers]
Memory Management in Protected Mode



Memory management is a mechanism which provides
operating systems powerful capabilities such as
segmentation and paging.
Under protected-mode there are no longer fixed sized
segments equally spaced in memory, but instead, the
size and location of each segment is set in an associated
data structure called a Segment Descriptor. When
accessing memory, all memory references are relative to
the base address of their corresponding segment. This
makes relocation of program modules fairly easy since
there is no need for the operating system to perform
code fix-ups when it loads applications into memory.
Memory Management in Protected Mode
With paging enabled, the processor adds an extra level of
indirection to the memory translation process. By using
special look-up tables in memory, the processor fools each
application into thinking that it owns the entire 4GB address
space. Instead of serving as a physical address, an
application-generated address is used by the processor to
index one of its look-up tables. The corresponding entry in
the table contains the actual physical address, which is sent to
the processor address bus (this is a rather simplified
description of the process). The name "paging" was chosen
since this indirection mechanism is not applied to
individual bytes but rather to 4KB chunks (or pages) of
memory. Through the use of paging, operating systems can
create distinct address spaces for each running application,
thus simplifying memory access and preventing potential
conflicts.
Memory Management in Protected Mode


Virtual-memory allows applications to allocate
more memory than is physically available. This
is done by keeping memory pages partially in
RAM and partially on disk. When a program
tries to access an on-disk page, an exception is
generated (an exception is a processor-generated
interrupt signaling a critical event) and the
operating system reloads the page to allow the
faulting application to resume its execution.
Multitasking in Protected Mode



Multitasking refers to the ability of the operating system to
run multiple tasks concurrently. True multitasking can only
be achieved on a multiprocessor machine where each task is
scheduled for execution on a different processor.
Conventional operating systems such as Windows emulate
true multitasking by quickly switching between pending
tasks giving each a time-slice to execute.



When running in protected-mode, a task switch makes the
processor save the current context information (notably
register values) in a Task State Segment. When the original
task is rescheduled for execution, the processor uses the
saved information to set its internal registers, allowing the
original task to resume its execution.
Protection in Protected Mode



Real-mode does not include support for protection and therefore
cannot offer a secure and reliable execution environment. Buggy
and hostile applications can shake the operating system integrity
by overwriting various system data structures. When applied,
protection can guard against software bugs and help the operating
system in performing reliable multitasking. Protection checks are
made before any memory cycle is started; A protection violation
terminates the offending memory cycle and generates an
exception.


Numerous benefits can also be seen during the software
development process. Any illegal memory reference made by the
developed application can be blocked and analyzed by a debugger
while ensuring the stability of all other software development
tools. (compiler, profiler etc.)
Virtual Mode
The desire to allow execution of MS-DOS applications under the
control of a protected-mode environment (such as Windows) has led
to the inclusion of virtual-mode in all of Intel's 32-bit processors. When
the processor is running in virtual-mode, it behaves as if it were an 8086
equipped with protection, multitasking and paging support. Note that
virtual-mode is not an entirely new processor operating environment
(thank god) but instead a property which can be applied on a per-task
basis. A virtual-mode task can be executed along-side other tasks on the
system including those which were written to fully utilize protected-
mode features. Unfortunately, MS-DOS applications were not designed
to run under a multitasking environment and therefore assume full
ownership of the system. Such applications could bring the entire
system to a halt if, for instance, they clear the processor interrupt flag
(disabling hardware interrupts). To prevent such disruptions,
instructions that affect the state of the interrupt flag (such as CLI, STI,
POPF etc.) cause an exception when executed by a virtual-mode task.
An operating system piece of code known as the Virtual Machine
Monitor handles these exceptions and emulates the offending
instructions. This ensures a smooth fail-safe operation of both virtual-
mode and protected-mode tasks running on the system.
Debugging Support


When debugging applications, the 80386 comes to your aid by
providing a set of configurable debug registers. Setting a
breakpoint is done by updating one of the debug registers with the
desired memory address and specifying the type of processor
cycle which should trigger the breakpoint. When the breakpoint is
hit, an exception is generated and the debugger can gain control to
display information regarding the developed application and the
processor internal state.


The debugging support on the 80386 supersedes the old 8086
mechanism which required a modification to the instruction
stream in order to set a breakpoint inside application code.
Task Scheduling
Scheduling is a key concept in computer multitasking,
multiprocessing operating system and real-time operating system
designs. Scheduling refers to the way processes are assigned to
run on the available CPUs, since there are typically many more
processes running than there are available CPUs. This
assignment is carried out by software components known as the
scheduler and the dispatcher.
The scheduler is concerned mainly with:
CPU utilization - to keep the CPU as busy as possible.
Throughput - number of processes that complete their execution
per time unit.
Turnaround - total time between submission of a process and its
completion.
Waiting time - amount of time a process has been waiting in the
ready queue.
Response time - amount of time it takes from when a request was
submitted until the first response is produced.
Fairness - Equal CPU time to each thread.
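The turnaround and waiting-time metrics above can be computed for a simple first-come-first-served schedule. The sketch below is illustrative only; the arrival and burst times used in the test are made-up figures.

```c
/* Sketch: turnaround = completion - submission; waiting = time spent in
 * the ready queue before first getting the CPU (for FCFS these coincide
 * with start - arrival, since a process runs to completion once started). */

typedef struct { int arrival, burst; } proc_t;

/* Fill turnaround[i] and waiting[i] for an FCFS run over n processes
 * already sorted by arrival time; returns the total waiting time. */
static int fcfs_metrics(const proc_t p[], int n,
                        int turnaround[], int waiting[])
{
    int clock = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        if (clock < p[i].arrival)
            clock = p[i].arrival;             /* CPU idle until arrival */
        waiting[i]    = clock - p[i].arrival; /* time in the ready queue */
        clock        += p[i].burst;           /* run to completion */
        turnaround[i] = clock - p[i].arrival; /* completion - submission */
        total_wait   += waiting[i];
    }
    return total_wait;
}
```

Dividing the returned total by n gives the average waiting time the scheduler tries to minimize.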

Advanced Processors


Pentium

The Pentium family of processors, which has its roots in the Intel486™
processor,
uses the Intel486 instruction set (with a few additional instructions). The term
"Pentium processor" refers to a family of microprocessors that share a common
architecture and instruction set. The first Pentium processors (the P5 variety) were
introduced in 1993.
The Intel Pentium processor, like its predecessor the Intel486 microprocessor, is
fully software compatible with the installed base of over 100 million compatible
Intel architecture systems. In addition, the Intel Pentium processor provides new
levels of performance to new and existing software through a reimplementation of
the Intel 32-bit instruction set architecture using the latest, most advanced, design
techniques. Optimized, dual execution units provide one-clock execution for "core"
instructions, while advanced technology, such as superscalar architecture, branch
prediction, and execution pipelining, enables multiple instructions to execute in
parallel with high efficiency. Separate code and data caches combined with wide
128-bit and 256-bit internal data paths and a 64-bit, burstable, external bus allow
these performance levels to be sustained in cost-effective systems. The application
of this advanced technology in the Intel Pentium processor brings "state of the art"
performance and capability to existing Intel architecture software as well as new and
advanced applications.
Pentium

The Pentium processor has two primary operating modes and a "system management mode." The
operating mode determines which instructions and architectural features are accessible. These
modes are:
Protected Mode: In this mode all instructions and architectural features are available, providing
the highest performance and capability. This is the recommended mode that all new applications
and operating systems should target. Among the capabilities of protected mode is the ability to
directly execute "real-address mode" 8086 software in a protected, multi-tasking environment. This
feature is known as Virtual-8086 mode (or "V86 mode"). Virtual-8086 mode, however, is not
actually a processor mode; it is in fact an attribute which can be enabled for any task (with
appropriate software) while in protected mode.
Real-Address Mode (also called "real mode"): This mode provides the programming
environment of the Intel 8086 processor, with a few extensions (such as the ability to break out of
this mode). Reset initialization places the processor in real mode where, with a single instruction, it
can switch to protected mode.
System Management Mode: The Pentium microprocessor also provides support for System
Management Mode (SMM). SMM is a standard architectural feature of all new Intel
microprocessors, beginning with the Intel386 SL processor, which provides an operating-system
and application independent and transparent mechanism to implement system power management
and OEM differentiation features. SMM is entered through activation of an external interrupt pin
(SMI#), which switches the CPU to a separate address space while saving the entire context of the
CPU. SMM-specific code may then be executed transparently. The operation is reversed upon
returning.
Pentium

Advanced Features
o Superscalar Execution: The Intel486 processor can execute only one instruction at a time. With superscalar execution,
the Pentium processor can sometimes execute two instructions simultaneously.
o Pipeline Architecture: Like the Intel486 processor, the Pentium processor executes instructions in five stages. This
staging, or pipelining, allows the processor to overlap multiple instructions so that it takes less time to execute two
instructions in a row. Because of its superscalar architecture, the Pentium processor has two independent processor
pipelines.
o Branch Target Buffer: The Pentium processor fetches the branch target instruction before it executes the branch
instruction.
o Dual 8-KB On-Chip Caches: The Pentium processor has two separate 8-kilobyte (KB) caches on chip--one for
instructions and one for data--which allows the Pentium processor to fetch data and instructions from the cache
simultaneously.
o Write-Back Cache: When data is modified, only the data in the cache is changed. Memory data is changed only when
the Pentium processor replaces the modified data in the cache with a different set of data.
o 64-Bit Bus: With its 64-bit-wide external data bus (in contrast to the Intel486 processor's 32-bit- wide external bus) the
Pentium processor can handle up to twice the data load of the Intel486 processor at the same clock frequency.
o Instruction Optimization: The Pentium processor has been optimized to run critical instructions in fewer clock cycles
than the Intel486 processor.
o Floating-Point Optimization: The Pentium processor executes individual instructions faster through execution
pipelining, which allows multiple floating-point instructions to be executed at the same time.
o Pentium Extensions: The Pentium processor has few instruction set extensions beyond the Intel486 processors. The
Pentium processor also has a set of extensions for multiprocessor (MP) operation. This makes a computer with multiple
Pentium processors possible.
A Pentium system, with its wide, fast buses, advanced write-back cache/memory subsystem, and powerful processor,
will deliver more power for today's software applications, and also optimize the performance of advanced 32-bit
operating systems (such as Windows 95) and 32-bit software applications.
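The write-back cache behavior listed above can be sketched in a few lines of C. This is a one-line illustrative model, not a real Pentium cache: a store dirties the cached copy, and memory is touched only at replacement time.

```c
#include <stdint.h>

/* Sketch of the write-back policy: a store updates only the cache line
 * and marks it dirty; memory is written only when the line is replaced
 * with a different set of data. */

typedef struct {
    uint32_t tag;    /* which memory word the line currently holds */
    uint32_t data;
    int valid, dirty;
} cache_line_t;

/* A store hits the cache only; memory[] is now stale. */
static void cache_store(cache_line_t *l, uint32_t tag, uint32_t value)
{
    l->tag = tag;
    l->data = value;
    l->valid = 1;
    l->dirty = 1;
}

/* Replacement: only now does the modified data reach memory. */
static void cache_evict(cache_line_t *l, uint32_t memory[])
{
    if (l->valid && l->dirty)
        memory[l->tag] = l->data;   /* the deferred write-back */
    l->valid = 0;
    l->dirty = 0;
}
```

Compared to a write-through policy, the bus sees one memory write per eviction rather than one per store.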
Pentium-Pro












Pentium-Pro

The Pentium Pro is a sixth-generation x86 microprocessor developed
and manufactured by Intel and introduced in November 1995. It
introduced the P6 micro-architecture (sometimes referred to as i686).
While the Pentium and Pentium MMX had 3.1 and 4.5 million
transistors, respectively, the Pentium Pro contained 5.5 million
transistors. Later, it was reduced to a narrower role as a server and
high-end desktop processor and was used in supercomputers. The
Pentium Pro was capable of both dual- and quad-processor
configurations. The Pentium Pro was succeeded by the Pentium II
Xeon in 1998.
Pentium-Pro

The Pentium Pro differs from the Pentium in having an on-chip Level 2 cache of between 256KB and
1MB operating at the internal clock speed. The siting of the secondary cache on the chip, rather than on
the motherboard, enables signals to get between the two on a 64-bit data path, rather than the 32-bit path
of Pentium system buses.



An even bigger factor in the Pentium Pro's performance improvement is down to the combination of
technologies known as "dynamic execution". This includes branch prediction, data flow analysis and
speculative execution.



The Pentium Pro was also the first processor in the x86 family to employ super-pipelining, its pipeline
comprising 14 stages, divided into three sections. The in-order front-end section, which handles the
decoding and issuing of instructions, consists of eight stages.



The other, more critical distinction of the Pentium Pro is its handling of instructions. It takes the
Complex Instruction Set Computer (CISC) x86 instructions and converts them into internal Reduced
Instruction Set Computer (RISC) micro-ops.



There are drawbacks in using the RISC approach. The first is that converting instructions takes time,
even if measured in nanoseconds or microseconds. As a result, the Pentium Pro inevitably takes a performance
hit when processing instructions. A second drawback is that the out-of-order design can be particularly
affected by 16-bit code, resulting in stalls. These tend to be caused by partial register updates that occur
before full register reads and they can impose severe performance penalties of up to seven clock cycles.
Pentium-Pro

The Pentium Pro achieves performance approximately 50% higher than a Pentium of the same clock
speed. In addition to its new way of processing instructions, the Pentium Pro incorporates several other
technical advances that contribute to this increased performance:
Super pipelining: The Pentium Pro dramatically increases the number of execution steps, to 14, from the
Pentium's 5.
Integrated Level 2 Cache: The Pentium Pro features a dramatically higher-performance secondary
cache compared to all earlier processors. Instead of using motherboard-based cache running at the speed
of the memory bus, it uses an integrated level 2 cache with its own bus, running at full processor speed,
typically three times the speed that the cache runs at on the Pentium. The Pentium Pro's cache is also
non-blocking, which allows the processor to continue without waiting on a cache miss.
32-Bit Optimization: The Pentium Pro is optimized for running 32-bit code (which most modern
operating systems and applications use) and so gives a greater performance improvement over the
Pentium when using the latest software.
Wider Address Bus: The address bus on the Pentium Pro is widened to 36 bits, giving it a maximum
addressability of 64 GB of memory.
Greater Multiprocessing: Quad processor configurations are supported with the Pentium Pro compared
to only dual with the Pentium.
Out of Order Completion: Instructions flowing down the execution pipelines can complete out of order.
Superior Branch Prediction Unit: The branch target buffer is double the size of the Pentium's and its
accuracy is increased.
Register Renaming: This feature improves parallel performance of the pipelines.
Speculative Execution: The Pro uses speculative execution to reduce pipeline stall time in its RISC core.
Pentium-Pro
[Block diagram: Pentium Pro (P6) micro-architecture — bus interface unit with 256 kB L2 cache; instruction prefetch and x86 decode with branch target buffer (BTB) and microcode; register alias table; reservation station; reorder buffer and retirement register file; integer and floating-point execution units; address generation and load/store units; all feeding a shared instruction pool.]
Pentium-Pro



Processor                     Clock Speeds             Intro Date      Mfg. Process / Transistors         Cache                     Bus Speed       Typical Use
Intel Pentium Pro Processor   200 - 150 MHz            November 1995   0.6 and 0.35 micron / 5.5 million  256 kB, 512 kB, 1 MB L2   66 and 60 MHz   High-end desktops, workstations and servers
Intel Pentium Pro Processor   200, 180 MHz             January 1996    0.35 micron / 5.5 million          256 KB, 512 KB, 1 MB L2   66 MHz          High-end desktops, workstations and servers
Intel Pentium Pro Processor   200, 180, 166, 150 MHz   Nov. 1, 1995    0.6 micron / 5.5 million           256 KB, 512 KB L2         66 MHz          High-end desktops, workstations and servers
Pentium Processors with MMX Technology


















Pentium Processors with MMX Technology

Commercially introduced in January 1997, the MMX technology is an
extension of the Intel architecture that uses a single-instruction,
multiple-data execution model that allows several data elements to be
processed simultaneously. The Pentium with MMX technology is a 32-bit,
two-issue superscalar CISC general-purpose processor with support for
fixed-point and floating-point arithmetic. Applications that benefit from the MMX technology
are those that do many parallelizable computations using small integer
numbers. Examples of these kinds of applications are 2-D/3-D
graphics, image processing, virtual reality, audio synthesis and data
compression.
The MMX technology extends the Intel architecture by adding eight
64-bit registers and 57 instructions. The new registers are named
MM0 to MM7. Depending on which instructions we use, each register
may be interpreted as one 64-bit quadword, two packed 32-bit double
words, four packed 16-bit words, or eight packed 8-bit bytes.
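The packed interpretations above can be modeled in C with a union over a 64-bit value. This is an illustration of the data layout, not the MMX intrinsics API; paddb here mimics the packed byte-add instruction in its wrap-around form.

```c
#include <stdint.h>

/* One 64-bit MMX register viewed as a quadword, two dwords, four words,
 * or eight bytes -- the four interpretations listed above. */

typedef union {
    uint64_t q;       /* one 64-bit quadword      */
    uint32_t d[2];    /* two packed 32-bit dwords */
    uint16_t w[4];    /* four packed 16-bit words */
    uint8_t  b[8];    /* eight packed 8-bit bytes */
} mmx_reg_t;

/* PADDB-style packed add: eight independent byte sums in one operation
 * (wrap-around variant; MMX also offers saturating forms). */
static mmx_reg_t paddb(mmx_reg_t a, mmx_reg_t b)
{
    mmx_reg_t r;
    for (int i = 0; i < 8; i++)
        r.b[i] = (uint8_t)(a.b[i] + b.b[i]);
    return r;
}
```

A single such operation processes eight pixels or samples at once, which is exactly the parallelism that makes MMX attractive for graphics and audio workloads.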
Pentium Processors with MMX Technology

As in the original Pentium, the MMX Pentium provides both a fixed-
point integer data path that allows up to two operations to be
executed simultaneously, and a floating-point data path that allows
one operation to be performed at a time. In addition, the MMX
Pentium provides a new MMX data path that allows up to two MMX
operations to execute simultaneously, or up to one MMX operation
and one integer operation (in the integer data path) to execute
simultaneously. The integer data path includes two ALUs and
supports operations on 8-, 16-, and 32-bit integers. The integer data
path uses eight registers for operands and results, but imposes several
restrictions on which registers can be used by the various operations.
The floating-point data path supports 32-, 64-, and 80-bit operands
and is fully IEEE 754 and 854 compliant. The floating-point data path
uses eight dedicated floating-point registers that are organized as a
stack. In general, using the floating-point data path for DSP operations
yields faster implementations than using the integer data path.
Pentium Processors with MMX Technology



Processor                                       Clock Speeds    Intro Date   Mfg. Process / Transistors   Cache            Addressable Memory   Bus Speed       Typical Use
Intel Pentium Processor with MMX Technology     233 - 166 MHz   Oct-96       0.35 micron / 4.5 million    16 kB L1 Cache   4 GB                 66 MHz          High-performance desktops and servers
Intel Pentium Notebook Processor with MMX       300 - 200 MHz   Sep-97       0.25 micron / 4.5 million    16 kB L1 Cache   4 GB                 66 and 60 MHz   Mobile PCs and mini-notebooks
Technology
Hyper Threading






Hyper Threading

Hyper-threading is used to improve parallelization of computations (doing multiple
tasks at once) performed on PC microprocessors. For each processor core that is
physically present, the operating system addresses two virtual processors, and shares
the workload between them when possible. Hyper-threading requires not only that
the operating system support multiple processors, but also that it be specifically
optimized for HTT.
Hyper-Threading Technology gives a CPU the capability of thread-level
parallelism (TLP), greatly enhancing the use of its computing
resources. It is a form of Simultaneous Multi-Threading (SMT), which
enables a single processor to run multi-threaded tasks in parallel and
to run many programs simultaneously.
HT Technology allows you to:
Run demanding applications simultaneously while maintaining system
responsiveness
Keep systems more secure, efficient, and manageable while minimizing impact
on productivity
Provide headroom for future business growth and new solution capabilities
Hyper Threading

Hyper-threading works by duplicating certain sections of the processor (those that store the
architectural state) but not duplicating the main execution resources. This allows a hyper-
threading processor to appear as two "logical" processors to the host operating system,
allowing the operating system to schedule two threads or processes simultaneously. When
execution resources would not be used by the current task in a processor without hyper-
threading, and especially when the processor is stalled, a hyper-threading equipped processor
can use those execution resources to execute another scheduled task. (The processor may stall
due to a cache miss, branch misprediction, or data dependency.)
This technology is transparent to operating systems and programs. All that is required to take
advantage of hyper-threading is symmetric multiprocessing (SMP) support in the operating
system, as the logical processors appear as standard separate processors.
It is possible to optimize operating system behavior on multi-processor hyper-threading
capable systems. For example, consider an SMP system with two physical processors that are
both hyper-threaded (for a total of four logical processors). If the operating system's process
scheduler is unaware of hyper-threading it will treat all four processors as being the same. If
only two processes are eligible to run it might choose to schedule those processes on the two
logical processors that happen to belong to one of the physical processors; that processor
would become extremely busy while the other would be idle, leading to poorer performance
than is possible with better scheduling. This problem can be avoided by improving the
scheduler to treat logical processors differently from physical processors; in a sense, this is a
limited form of the scheduler changes that are required for NUMA systems.
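The scheduling pitfall above can be sketched in a few lines of C. The topology is an assumption for illustration: logical CPUs 0 and 1 are siblings on physical package 0, and 2 and 3 on package 1, as in the two-package example.

```c
/* Sketch: with two hyper-threaded packages (four logical CPUs), place
 * two runnable tasks on distinct physical packages rather than on two
 * sibling logical CPUs of the same package. */

static int package_of(int logical_cpu)
{
    return logical_cpu / 2;   /* two sibling logical CPUs per package */
}

/* Choose a logical CPU for the second task that avoids the package
 * already busy with the first; returns -1 if every package is in use. */
static int pick_second_cpu(int first_cpu, int n_logical)
{
    for (int cpu = 0; cpu < n_logical; cpu++)
        if (package_of(cpu) != package_of(first_cpu))
            return cpu;
    return -1;
}
```

A topology-unaware scheduler might instead return the sibling of first_cpu, leaving one package saturated and the other idle — the poor-performance case described above.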








Core 2 Duo
































Core 2 Duo




Integrated Heat Spreader (IHS):
The integrated metal heat spreader
conducts heat from the silicon chip
and protects it. The IHS serves as
contact for the heatsink and
provides more surface area leading
to better cooling.

Silicon chip (die): The die inside the
Intel Core 2 Duo processor is
143 mm² in size and utilizes a
whopping 291 million transistors.
The two cores are based on the
Intel Core Microarchitecture with
innovative features like Wide
Dynamic Execution, Advanced Digital
Media Boost, Smart Memory Access
and Intelligent Power Capability. The
Advanced Smart Cache has 4 MByte
capacity.

Substrate: The die is mounted
directly to the substrate, which
facilitates contact with the socket
on the motherboard. 775 contacts
deliver power, ground, address and
data signals.
Core 2 Duo

Dual-Core Processing: Two independent processor cores in one physical package run at the same
frequency, and share up to 6 MB of L2 cache as well as up to a 1333 MHz Front Side Bus, for truly
parallel computing.
Wide Dynamic Execution: Improves execution speed and efficiency, delivering more instructions per
clock cycle. Each core can complete up to four full instructions simultaneously.
Smart Memory Access: Optimizes the use of the data bandwidth from the memory subsystem to
accelerate out-of-order execution. A newly designed prediction mechanism reduces the time in-flight
instructions have to wait for data. New pre-fetch algorithms move data from system memory into fast L2
cache in advance of execution. These functions keep the pipeline full, improving instruction throughput
and performance. 45nm versions further improve this feature, with more efficient methods of loading and
storing data in main memory.
Advanced Smart Cache: The shared L2 cache is dynamically allocated to each processor core based on
workload. This efficient, dual-core optimized implementation increases the probability that each core can
access data from fast L2 cache, significantly reducing latency to frequently used data and improving
performance.
Advanced Digital Media Boost: Accelerates the execution of Streaming SIMD Extension (SSE) instructions to
significantly improve the performance on a broad range of applications, including video, audio, and
image processing, and multimedia, encryption, financial, engineering, and scientific applications. The
128-bit SSE instructions are now issued at a throughput rate of one per clock cycle effectively doubling
their speed of execution on a per clock basis over previous generation processors. 45nm versions include
a new Super Shuffle Engine, which improves existing SSE instructions while enabling significant gains
on the latest SSE4 instruction set. SSE4-optimized applications, such as video editing and encoding in
high-definition resolution, will see additional performance improvements.
Core 2 Duo

Virtualization Technology: VT allows one hardware platform to function as multiple
virtual platforms. For businesses, Intel VT offers improved manageability, limiting
downtime and maintaining worker productivity by isolating computing activities into separate
partitions.
Trusted Execution Technology (TXT): It provides hardware-based mechanisms to help
protect against software-based attacks and help protect the confidentiality and integrity of
data stored or created on the system. It does this by enabling a trusted environment where
applications can run within their own space, protected from all other software on the system.
Intel 64 Architecture: Enables the processor to access larger amounts of memory. With appropriate
64-bit supporting hardware and software, platforms based on an Intel processor supporting
Intel 64 architecture can allow the use of extended virtual and physical memory.
Execute Disable Bit: Provides enhanced virus protection when deployed with a supported
operating system. The Execute Disable Bit allows memory to be marked as executable or
non-executable, allowing the processor to raise an error to the operating system if malicious
code attempts to run in non-executable memory, thereby preventing the code from infecting
the system.
Intel Designed Thermal Solution for Boxed Processors: Includes a 4-pin connector for fan
speed control to help minimize the acoustic noise levels generated from running the fan at
higher speeds for thermal performance. Fan speed control technology is based on actual
CPU temperature and power usage.
RISC

Reduced instruction set computing, or RISC, is a CPU design strategy based on the insight
that simplified (as opposed to complex) instructions can provide higher performance if this
simplicity enables much faster execution of each instruction. A computer based on this
strategy is a reduced instruction set computer (also RISC).


Some aspects attributed to the first RISC-labeled designs around 1975 include the
observations that the memory-restricted compilers of the time were often unable to take
advantage of features intended to facilitate manual assembly coding, and that complex
addressing modes take many cycles to perform due to the required additional memory
accesses. It was argued that such functions would be better performed by sequences of
simpler instructions if this could yield implementations small enough to leave room for
many registers, reducing the number of slow memory accesses. In these simple designs,
most instructions are of uniform length and similar structure, arithmetic operations are
restricted to CPU registers and only separate load and store instructions access memory.
These properties enable a better balancing of pipeline stages than before, making RISC
pipelines significantly more efficient and allowing higher clock frequencies.
RISC

The circuitry that performs the actions defined by the microcode in many (but not all) CISC
processors is, in itself, a processor which in many ways is reminiscent in structure to very
early CPU designs. This gave rise to ideas to return to simpler processor designs in order to
make it more feasible to cope without (then relatively large and expensive) ROM tables, or
even without PLA structures, for sequencing and/or decoding. At the same time, simplicity
and regularity, would make it easier to implement overlapping processor stages (pipelining)
at the machine code level (i.e. the level seen by compilers). The first RISC-labeled processor
(IBM 801) was therefore a tightly pipelined machine originally intended to be used as an
internal microcode kernel, or engine, in a CISC design. At the time, pipelining at the
machine code level was already used in some high performance CISC computers, in order to
reduce the instruction cycle time, but it was fairly complicated to implement within the
limited component count and wiring complexity that was feasible at the time. (Microcode
execution, on the other hand, could be more or less pipelined, depending on the particular
design.)
RISC: Characteristics
Simple instruction set: In a RISC machine, the instruction set contains simple, basic
instructions, from which more complex instructions can be composed.
Same length instructions: Each instruction is the same length, so that it may be fetched in a
single operation.
1 machine-cycle instructions: Most instructions complete in one machine cycle, which
allows the processor to handle several instructions at the same time. This pipelining is a key
technique used to speed up RISC machines.





The advantages of RISC:
RISC: Advantages
Implementing a processor with a simplified instruction set design provides several advantages over
implementing a comparable CISC design:
Speed. Since a simplified instruction set allows for a pipelined, superscalar design, RISC
processors often achieve 2 to 4 times the performance of CISC processors using comparable semiconductor
technology and the same clock rates.
Simpler hardware. Because the instruction set of a RISC processor is so simple, it uses up
much less chip space; extra functions, such as memory management units or floating point arithmetic units,
can also be placed on the same chip. Smaller chips allow a semiconductor manufacturer to place more parts
on a single silicon wafer, which can lower the per-chip cost dramatically.
Shorter design cycle. Since RISC processors are simpler than corresponding CISC processors,
they can be designed more quickly, and can take advantage of other technological developments sooner
than corresponding CISC designs, leading to greater leaps in performance between generations.





The disadvantages of RISC
RISC: Disadvantages
The transition from a CISC design strategy to a RISC design strategy isn't without its problems. Software
engineers should be aware of the key issues which arise when moving code from a CISC processor to a
RISC processor.


Code Quality : The performance of a RISC processor depends greatly on the code that it is executing. If
the programmer (or compiler) does a poor job of instruction scheduling, the processor can spend quite a bit
of time stalling: waiting for the result of one instruction before it can proceed with a subsequent instruction.


Debugging : Unfortunately, instruction scheduling can make debugging difficult. If scheduling (and other
optimizations) are turned off, the machine-language instructions show a clear connection with their
corresponding lines of source. However, once instruction scheduling is turned on, the machine language
instructions for one line of source may appear in the middle of the instructions for another line of source
code.


Code expansion : Since CISC machines perform complex actions with a single instruction, where RISC
machines may require multiple instructions for the same action, code expansion can be a problem. Code
expansion refers to the increase in size that you get when you take a program that had been compiled for a
CISC machine and re-compile it for a RISC machine. The exact expansion depends primarily on the quality
of the compiler and the nature of the machine's instruction set.


System Design : Another problem that faces RISC machines is that they require very fast memory systems
to feed them instructions. RISC-based systems typically contain large memory caches, usually on the chip
itself. This is known as a first-level cache.
CISC
A complex instruction set computer (CISC), is a computer where single instructions can execute several
low-level series of operations (such as a load from memory, an arithmetic operation, and a memory store)
and/or are capable of multi-step operations or addressing modes within single instructions. This reduces the
number of instructions required to implement a given program, and allows the programmer to learn a small
but flexible set of instructions. Examples of CISC instruction set architectures are System/360 through
z/Architecture, PDP-11, VAX, Motorola 68k, and x86.


Before the RISC philosophy became prominent, many computer architects tried to bridge the so called
semantic gap, i.e. to design instruction sets that directly supported high-level programming constructs such as
procedure calls, loop control, and complex addressing modes, allowing data structure and array accesses to
be combined into single instructions. Instructions are also typically highly encoded in order to further
enhance the code density. The compact nature of such instruction sets results in smaller program sizes and
fewer (slow) main memory accesses, which at the time (early 1960s and onwards) resulted in a tremendous
savings on the cost of computer memory and disc storage, as well as faster execution. It also meant good
programming productivity even in assembly language, as high level languages such as Fortran or Algol were
not always available or appropriate.


In the 70's, analysis of high level languages indicated some complex machine language implementations and
it was determined that new instructions could improve performance. Some instructions were added that were
never intended to be used in assembly language but fit well with compiled high level languages. Compilers
were updated to take advantage of these instructions. The benefits of semantically rich instructions with
compact encodings can be seen in modern processors as well, particularly in the high performance segment
where caches are a central component. This is because these fast, but complex and expensive, memories are
inherently limited in size, making compact code beneficial.
CISC Philosophy
Philosophy 1: Use Microcode
The earliest processor designs used dedicated (hardwired) logic to decode and execute each
instruction in the processor's instruction set. This worked well for simple designs with few
registers, but made more complex architectures hard to build, as control path logic can be hard to
implement. So, designers switched tactics --- they built some simple logic to control the data
paths between the various elements of the processor, and used a simplified microcode instruction
set to control the data path logic. This type of implementation is known as a micro-programmed
implementation.


In a micro-programmed system, the main processor has some built-in memory (typically ROM)
which contains groups of microcode instructions which correspond with each machine-language
instruction. When a machine language instruction arrives at the central processor, the processor
executes the corresponding series of microcode instructions.


Because instructions could be retrieved up to 10 times faster from a local ROM than from main
memory, designers began to put as many instructions as possible into microcode. In fact, some
processors could be ordered with custom microcode which would replace frequently used but
slow routines in certain applications.
CISC Philosophy
Philosophy 2: Build "rich" instruction sets
One of the consequences of using a micro-programmed design is that designers could build more
functionality into each instruction. This not only cut down on the total number of instructions
required to implement a program, and therefore made more efficient use of a slow main memory,
but it also made the assembly-language programmer's life simpler.
Soon, designers were enhancing their instruction sets with instructions aimed specifically at the
assembly language programmer. Such enhancements included string manipulation operations,
special looping constructs, and special addressing modes for indexing through tables in memory.


For example:
ABCD Add Decimal with Extend
ADDA Add Address
ADDX Add with Extend
ASL Arithmetic Shift Left
CAS Compare and Swap Operands
NBCD Negate Decimal with Extend
EORI Logical Exclusive OR Immediate
TAS Test Operand and Set
CISC Philosophy
Philosophy 3: Build high-level instruction sets
Once designers started building programmer-friendly instruction sets, the logical next step was to
build instruction sets which map directly from high-level languages. Not only does this simplify
the compiler writer's task, but it also allows compilers to emit fewer instructions per line of
source code.
Modern CISC microprocessors, such as the 68000, implement several such instructions,
including instructions that create and remove stack frames in a single operation.


For example:
DBcc Test Condition, Decrement and Branch
ROXL Rotate with Extend Left
RTR Return and Restore Condition Codes
SBCD Subtract Decimal with Extend
SWAP Swap Register Words
CMP2 Compare Register against Upper and Lower Bounds
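The DBcc behaviour listed above (test condition, decrement, branch) can be modelled roughly as follows; this is a simplified sketch of the 68000 semantics, not a cycle-accurate one:

```python
def dbcc(condition_true, counter):
    """68000-style DBcc: if the condition is false, decrement the 16-bit
    counter and branch back (loop) until the counter rolls past zero.
    Returns (new_counter, branch_taken)."""
    if condition_true:
        return counter, False            # condition met: fall through
    counter = (counter - 1) & 0xFFFF     # 16-bit decrement
    return counter, counter != 0xFFFF    # branch unless counter hit -1

# With a counter of 3 and a condition that never holds,
# the loop body runs 4 times (counter + 1), as on the real chip.
count, iterations, taken = 3, 0, True
while taken:
    count, taken = dbcc(False, count)
    iterations += 1
print(iterations)  # 4
```

One instruction thus replaces the separate compare, decrement and conditional-branch instructions a compiler would otherwise emit at the bottom of every loop.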
CISC: Design Issue
While many designs achieved the aim of higher throughput at lower cost and also allowed
high-level language constructs to be expressed by fewer instructions, it was observed that
this was not always the case. For instance, low-end versions of complex architectures (i.e.
using less hardware) could lead to situations where it was possible to improve performance
by not using a complex instruction (such as a procedure call or enter instruction), but instead
using a sequence of simpler instructions.
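A toy cycle count makes the point concrete. The numbers below are invented for illustration; they mimic a low-end implementation where the microcoded procedure-entry instruction is slower than doing the same work by hand:

```python
# Hypothetical cycle costs on a low-end implementation of a complex ISA.
COMPLEX_ENTER = 20               # one complex procedure-entry instruction
SIMPLE_SEQUENCE = [4, 3, 3, 4]   # push frame ptr, copy SP, adjust SP, ...

complex_cost = COMPLEX_ENTER
simple_cost = sum(SIMPLE_SEQUENCE)
print(simple_cost < complex_cost)  # True: the simple sequence wins here
```

On such a machine a compiler that avoids the "convenient" complex instruction produces faster code, which is exactly the observation that motivated RISC.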


One reason for this was that architects (microcode writers) sometimes "over-designed"
assembler language instructions, i.e. including features which were not possible to
implement efficiently on the basic hardware available. These could, for instance, be "side
effects" (beyond the conventional flags), such as the setting of a register or memory location
that was perhaps seldom used; if this was done via ordinary (non-duplicated) internal buses, or
even the external bus, it would demand extra cycles every time, and thus be quite inefficient.


Even in balanced high performance designs, highly encoded and (relatively) high-level
instructions could be complicated to decode and execute efficiently within a limited
transistor budget. Such architectures therefore required a great deal of work on the part of
the processor designer in cases where a simpler, but (typically) slower, solution based on
decode tables and/or microcode sequencing was not appropriate. At a time when transistors
and other components were a limited resource, this also left fewer components and less area
for other types of performance optimizations.
CISC: Advantages & Disadvantages
The advantages of CISC.
At the time of their initial development, CISC machines used available technologies to optimize computer
performance.
Microprogramming is as easy as assembly language to implement, and much less expensive than
hardwiring a control unit.
The ease of microcoding new instructions allowed designers to make CISC machines upwardly
compatible: a new computer could run the same programs as earlier computers because the new
computer would contain a superset of the instructions of the earlier computers.
As each instruction became more capable, fewer instructions could be used to implement a given
task. This made more efficient use of the relatively slow main memory.
Because microprogram instruction sets can be written to match the constructs of high-level
languages, the compiler does not have to be as complicated.
The disadvantages of CISC
Still, designers soon realized that the CISC philosophy had its own problems, including:
Earlier generations of a processor family were generally contained as a subset in every new version -
-- so the instruction set and chip hardware became more complex with each generation of computers.
So that as many instructions as possible could be stored in memory with the least possible wasted
space, individual instructions could be of almost any length---this means that different instructions
will take different amounts of clock time to execute, slowing down the overall performance of the
machine.
Many specialized instructions aren't used frequently enough to justify their existence ---
approximately 20% of the available instructions are used in a typical program.
CISC instructions typically set the condition codes as a side effect of the instruction. Not only does
setting the condition codes take time, but programmers have to remember to examine the condition
code bits before a subsequent instruction changes them.
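That last pitfall can be sketched directly: every ALU operation rewrites the shared flags, so a test result must be consumed before the next operation destroys it. This is an illustrative model, not any real ISA:

```python
# Shared condition-code register, written as a side effect of every ALU op.
flags = {"zero": False, "negative": False}

def alu_sub(a, b):
    """Subtract and, as a side effect, set the shared condition codes."""
    result = a - b
    flags["zero"] = result == 0
    flags["negative"] = result < 0
    return result

alu_sub(5, 5)                      # a compare whose result we want to test
zero_after_compare = flags["zero"]  # must be saved or tested immediately...
alu_sub(7, 3)                      # ...because an unrelated op overwrites it
print(zero_after_compare, flags["zero"])  # True False
```

The programmer (or compiler) has to track which instruction last wrote the flags, which complicates both code generation and instruction reordering.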
RISC Vs. CISC


RISC                                          CISC
Single-word instructions                      Variable-length instructions
Fixed-field decoding                          Variable instruction formats
Load/store architecture                       Memory operands
Simple operations                             Complex operations
Hardwired control unit                        Micro-coded control unit
Large uniform register set (32+)              Small register set (2-16)
Few addressing modes (1-2)                    Many addressing modes (5-20)
No/minimal support for misaligned accesses    Support for misaligned accesses
Faster                                        Relatively slower
Instructions: typically < 100                 Instructions: typically > 200
Instruction formats: 1-2                      Instruction formats: 3+
Average cycles per instruction: ~1            Average cycles per instruction: 3-10
[Chip photo: Sun UltraSPARC II processor]
Sun's SPARC

[Chip photo: Sun SPARC processor]
Sun's SPARC

SPARC (Scalable Processor Architecture) was initiated by Sun Microsystems, Inc.
Scalability here refers to the wide spectrum of possible price/performance implementations, ranging from
microcomputers to supercomputers.
It follows the RISC design philosophy, stressing the importance of a relatively large CPU register file.



Sun's SPARC

The SPARC processor typically contains as many as 160 general-purpose registers. At any
point, only 32 of them are immediately visible to software: 8 are global registers (one of
which, g0, is hard-wired to zero, so only 7 are usable as registers) and the other 24 come
from the register stack. These 24 registers form what is called a register window, and at
function call/return this window is moved up and down the register stack. Each window
has 8 local registers and shares 8 registers with each of the adjacent windows. The shared
registers are used for passing function parameters and returning values, while the local
registers retain local values across function calls.
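The window overlap can be modelled as a sliding index into the register stack. The sizes (8 ins, 8 locals, 8 outs, 160-register file) are from the text; the flat indexing scheme below is a simplification of the real current-window-pointer mechanism:

```python
WINDOW_SLIDE = 16  # each call slides the window by 16 registers:
                   # 8 fresh locals + 8 fresh outs; the caller's
                   # 8 outs become the callee's 8 ins.

def window_registers(depth):
    """Return (ins, locals, outs) index ranges for a given call depth."""
    base = depth * WINDOW_SLIDE
    ins = range(base, base + 8)
    locs = range(base + 8, base + 16)
    outs = range(base + 16, base + 24)
    return ins, locs, outs

caller = window_registers(0)
callee = window_registers(1)  # one level deeper after a function call
# The caller's outs are exactly the callee's ins: parameters are passed
# without copying a single register.
print(list(caller[2]) == list(callee[0]))  # True
```

This overlap is why ordinary calls need no memory traffic for arguments; only when the call depth exceeds the number of physical windows must registers be spilled to the stack.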


The registers of a window are subdivided into three groups: ins, locals and outs. Ins hold the
parameters passed to the procedure by its caller. Locals hold the procedure's own local values.
Outs hold the parameters passed on to the called procedure.
