
IBM eServer pSeries

Section 2: The Technology

"Any sufficiently advanced technology will


have the appearance of magic."
Arthur C. Clarke

2003 IBM Corporation


Section Objectives
On completion of this unit you should be able to:
Describe the relationship between technology and solutions.
List key IBM technologies that are part of the POWER5 products.
Describe the functional benefits that these technologies provide.
Discuss the appropriate use of these technologies.


IBM and Technology


[Figure: layered stack - Science forms the foundation for Technology, which is built into Products, which are combined into Solutions]

Technology and innovation


Having technology available is a necessary first step.
Finding creative new ways to use the technology for the benefit of our clients is what innovation is about.
Solution design is an opportunity for innovative application of technology.

When technology won't fix the problem
When the technology is not related to the problem.
When the client has unreasonable expectations.

POWER5 Technology


POWER4 and POWER5 Cores


[Figure: POWER4 core and POWER5 core die comparison]

POWER5

Enhanced memory subsystem
Improved performance
Simultaneous Multi-Threading
Hardware support for Shared Processor Partitions (Micro-Partitioning)
Dynamic power management
Compatibility with existing POWER4 systems
Designed for entry and high-end servers
Enhanced reliability, availability, serviceability
[Figure: POWER5 chip - two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, enhanced distributed switch, GX+ bus, chip-chip/MCM-MCM/SMP links]

Enhanced memory subsystem

Larger L2 cache
  1.9 MB, 10-way set associative
Improved L3 cache design
  36 MB, 12-way set associative
  L3 on the processor side of the fabric
  Satisfies L2 cache misses more frequently
  Avoids traffic on the interchip fabric
Improved L1 cache design
  2-way set associative i-cache
  4-way set associative d-cache
  New replacement algorithm (LRU vs. FIFO)
On-chip L3 directory and memory controller
  L3 directory on the chip reduces off-chip delays after an L2 miss
  Reduced memory latencies
Improved pre-fetch algorithms
[Figure: POWER5 chip - SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, enhanced distributed switch]

Enhanced memory subsystem


POWER4 system structure: processors share an L2 cache; fabric controllers connect the L2 caches, with the L3 cache and memory controller on the far side of the fabric.
POWER5 system structure: the L3 cache moves to the processor side of the fabric, and the memory controller moves on chip.
Benefits: faster access to memory, reduced L3 latency, larger SMPs (64-way), number of chips cut in half.
[Figure: POWER4 vs. POWER5 system structures - processor / L2 cache / fabric controller / L3 cache / memory controller / memory interconnection]

Simultaneous Multi-Threading (SMT)


What is it?
Why would I want it?


POWER4 pipeline
Out-of-order processing

[Figure: POWER4 instruction pipeline - branch redirects; instruction fetch (IF, IC, BP); decode (D0-D3, Xfer); instruction crack and group formation (GD); then the branch, load/store, fixed-point, and floating-point pipelines (MP, ISS, RF, EX/EA/DC/F6, Fmt, WB, Xfer, CP), with interrupts and flushes]
POWER4 instruction pipeline legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.

Multi-threading evolution
Execution unit utilization is low in today's microprocessors.
~25% average execution unit utilization across a broad spectrum of environments.
[Figure: instruction streams flowing from memory through the i-cache to the execution units (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) across processor cycles - most slots idle]

Coarse-grained multi-threading
Two instruction streams, one thread executing at any instant.
Hardware swaps in the second thread when a long-latency event occurs.
A swap requires several cycles.
[Figure: execution units across processor cycles - the hardware swaps threads on long-latency events, with several idle cycles per swap]

Coarse-grained multi-threading (Cont.)


Processor (for example, RS64-IV) is able to store context for two threads.
Rapid switching between threads minimizes lost cycles due to I/O waits and cache misses.
Can yield ~20% improvement for OLTP workloads.
Coarse-grained multi-threading is only beneficial where the number of active threads exceeds 2x the number of CPUs:
  AIX must create a dummy thread if there are insufficient real threads.
  Unnecessary switches to dummy threads can degrade performance ~20%.
  Does not work with dynamic CPU deallocation.

Fine-grained multi-threading
Variant of coarse-grained multi-threading.
Threads execute in round-robin fashion.
A cycle remains unused when a thread encounters a long-latency event.
[Figure: execution units across processor cycles - threads alternate every cycle, but stall cycles remain idle]

POWER5 pipeline
Out-of-order processing

[Figure: POWER5 instruction pipeline - stage flow identical to the POWER4 pipeline: branch redirects; instruction fetch (IF, IC, BP); decode (D0-D3, Xfer); instruction crack and group formation (GD); then the branch, load/store, fixed-point, and floating-point pipelines, with interrupts and flushes]
POWER5 instruction pipeline legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.

Simultaneous multi-threading (SMT)


Reduction in unused execution units results in a 25-40% throughput boost, and sometimes more.
[Figure: two instruction streams interleaved onto the execution units in the same cycle - far fewer idle slots]

Simultaneous multi-threading (SMT)

(Cont.)
Each chip appears as a 4-way SMP to software.
Allows instructions from two threads to execute simultaneously.
Processor resources optimized for enhanced SMT performance.
No context switching, no dummy threads.
Hardware-, POWER Hypervisor-, or OS-controlled thread priority.
Dynamic feedback of shared resources allows for balanced thread execution.
Dynamic switching between single- and multi-threaded mode.

Dynamic resource balancing


Threads share many resources: Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on.
Higher performance is realized when resources are balanced across threads.
A tendency to drift toward extremes is accompanied by reduced performance.

Adjustable thread priority


Instances when unbalanced execution is desirable:
  No work for the opposite thread
  Thread waiting on a lock
  Software-determined non-uniform balance
  Power management
Control of the instruction decode rate: software/hardware controls eight priority levels for each thread.
[Chart: hardware thread priorities - instructions per cycle for thread 0 and thread 1 as the priority pair varies (0,7 / 2,7 / 4,7 / 6,7 / 7,7 / 7,6 / 7,4 / 7,2 / 7,0 / 1,1), including single-threaded operation and power save mode]

Single-threaded operation
Advantageous for execution-unit-limited applications, such as floating-point or fixed-point intensive workloads.
Execution-unit-limited applications provide minimal performance leverage for SMT.
The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread.
Determined dynamically on a per-processor basis.
[Figure: thread states - Null, Dormant, and Active, with transitions controlled by software or by hardware]

Micro-Partitioning


Micro-Partitioning overview
Mainframe-inspired technology.
Virtualized resources shared by multiple partitions.
Benefits:
  Finer-grained resource allocation
  More partitions (up to 254)
  Higher resource utilization
New partitioning model:
  POWER Hypervisor
  Virtual processors
  Fractional processor capacity partitions
  Operating system optimized for Micro-Partitioning exploitation
  Virtual I/O

Processor terminology
[Figure: processor terminology - logical (SMT) processors run on virtual processors; virtual processors map onto the shared processor pool or onto dedicated processors; entitled capacity is drawn from installed physical processors, which may also be inactive (CUoD) or deconfigured. Three partition examples: shared processor partition with SMT off, shared processor partition with SMT on, and dedicated processor partition with SMT off]

Shared processor partitions


Micro-Partitioning allows multiple partitions to share one physical processor.
Up to 10 partitions per physical processor.
Up to 254 partitions active at the same time.
Partition resource definition:
  Minimum, desired, and maximum values for each resource
  Processor capacity
  Virtual processors
  Capped or uncapped
  Capacity weight
Dedicated memory: minimum of 128 MB, then 16 MB increments.
Physical or virtual I/O resources.
[Figure: six LPARs sharing a pool of physical CPUs]

Understanding min/max/desired resource values


The desired value for a resource is given to a partition if enough of the resource is available.
If there is not enough of the resource to meet the desired value, a lower amount is allocated.
If there is not enough of the resource to meet the minimum value, the partition will not start.
The maximum value is only used as an upper limit for dynamic partitioning operations.
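As a rough sketch, the activation rule can be expressed as follows (names and structure are illustrative only, not actual Hypervisor or HMC code):

# Illustrative sketch of the min/desired/max activation rule described above.
def allocate_at_activation(available, minimum, desired):
    """Return processing units granted at activation, or None if the partition cannot start."""
    if available >= desired:
        return desired        # full desired amount is available
    if available >= minimum:
        return available      # start with the lower amount that is left
    return None               # below minimum: the partition does not start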


Partition capacity entitlement


Processing units: 1.0 processing unit represents one physical processor.
Entitled processor capacity:
  A commitment of capacity that is reserved for the partition
  The upper limit of processor utilization for capped partitions
Minimum requirement is 0.1 processing units.
Each virtual processor must be granted at least 1/10 of a processing unit of entitlement.
Shared processor capacity is always delivered in terms of whole physical processors.
[Figure: a processing capacity of 1 physical processor (1.0 processing units) divided into entitlements of 0.5 and 0.4 processing units]

Capped and uncapped partitions


Capped partition: not allowed to exceed its entitlement.
Uncapped partition: allowed to exceed its entitlement.
Capacity weight:
  Used for prioritizing uncapped partitions
  Value 0-255
  A value of 0 is referred to as a "soft cap"
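A minimal sketch of how spare pool capacity might be shared among uncapped partitions in proportion to weight (an illustrative assumption; the real dispatcher is more involved):

# Hypothetical illustration: distribute spare processing units among
# uncapped partitions proportionally to their capacity weight (0-255).
def share_spare(spare, uncapped):
    """uncapped: list of (name, weight); returns {name: extra processing units}."""
    total = sum(w for _, w in uncapped)
    if total == 0:                      # all weights 0: soft-capped, no extra capacity
        return {name: 0.0 for name, _ in uncapped}
    return {name: spare * w / total for name, w in uncapped}

# Example: 1.2 spare units, weights 128 and 64 -> 0.8 and 0.4 extra units.
print(share_spare(1.2, [("lparA", 128), ("lparB", 64)]))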


Partition capacity entitlement example


Shared pool has 2.0 processing units available.
LPARs activated in sequence:
Partition 1 activated:
  Min = 1.0, max = 2.0, desired = 1.5
  Starts with 1.5 allocated processing units
Partition 2 activated:
  Min = 1.0, max = 2.0, desired = 1.0
  Does not start (only 0.5 processing units remain, below its minimum)
Partition 3 activated:
  Min = 0.1, max = 1.0, desired = 0.8
  Starts with 0.5 allocated processing units
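Running this activation sequence through the allocate_at_activation() sketch shown earlier reproduces the slide's numbers:

# Worked example using allocate_at_activation() from the earlier sketch.
pool = 2.0
for name, minimum, desired in [("LPAR1", 1.0, 1.5), ("LPAR2", 1.0, 1.0), ("LPAR3", 0.1, 0.8)]:
    grant = allocate_at_activation(pool, minimum, desired)
    if grant is None:
        print(f"{name}: does not start (pool={pool})")
    else:
        pool -= grant
        print(f"{name}: starts with {grant} processing units")
# LPAR1: starts with 1.5; LPAR2: does not start (0.5 < min 1.0);
# LPAR3: starts with 0.5 (desired 0.8 not available, min 0.1 met).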


Understanding capacity allocation - an example
A workload is run under different configurations.
The size of the shared pool (number of physical processors) is fixed at 16.
The capacity entitlement (CE) for the partition is fixed at 9.5.
No other partitions are active.


Uncapped - 16 virtual processors
[Chart: Uncapped (16 PPs / 16 VPs / 9.5 CE) - processing units consumed over elapsed time (minutes)]
16 virtual processors.
Uncapped: can use all available resources.
The workload requires 26 minutes to complete.

Uncapped - 12 virtual processors
[Chart: Uncapped (16 PPs / 12 VPs / 9.5 CE) - processing units consumed over elapsed time (minutes)]
12 virtual processors.
Even though the partition is uncapped, it can only use 12 processing units.
The workload now requires 27 minutes to complete.

Capped
[Chart: Capped (16 PPs / 12 VPs / 9.5 CE) - processing units consumed over elapsed time (minutes)]
The partition is now capped, and resource utilization is limited to the capacity entitlement of 9.5.
Capping limits the amount of time each virtual processor is scheduled.
The workload now requires 28 minutes to complete.

Dynamic partitioning operations


Add, move, or remove processor capacity:
  Remove, move, or add entitled shared processor capacity
  Change between capped and uncapped processing
  Change the weight of an uncapped partition
  Add and remove virtual processors (provided each virtual processor retains at least 0.1 processing units of entitlement)
Add, move, or remove memory:
  16 MB logical memory blocks
Add, move, or remove physical I/O adapter slots.
Add or remove virtual I/O adapter slots.
The min/max values defined for an LPAR set the bounds within which DLPAR can work.

Dynamic LPAR
Standard on all new systems.
Move resources between live partitions.
[Figure: four partitions on one system - Part#1 production AIX 5L, Part#2 legacy AIX, Part#3 test/dev AIX 5L, Part#4 file/print Linux - running above the Hypervisor and managed from an HMC]

Firmware

POWER Hypervisor


POWER Hypervisor strategy


New Hypervisor for POWER5 systems:
  Further convergence with iSeries
  But the brands will retain unique value propositions
  Reduced development effort
  Faster time to market
New capabilities on pSeries servers:
  Shared processor partitions
  Virtual I/O
New capability on iSeries servers:
  Can run AIX 5L

POWER Hypervisor component sourcing


[Figure: POWER Hypervisor component sourcing from the pSeries and iSeries heritage - H-Call interface, location codes, nucleus (SLIC), load from flash, bus recovery, drawer and slot/tower concurrent maintenance, message passing, NVRAM, dump, shared processor LPAR, Capacity on Demand, 255 partitions, partition on demand, I/O configuration, virtual I/O, virtual Ethernet - with FSP, SCSI IOA, LAN IOA, and VLAN IOA hardware, managed from HSC/HMC]

POWER Hypervisor functions


New, active functions:
  Dynamic Micro-Partitioning
  Shared processor pool
  Virtual I/O
  Virtual LAN
Same functions as the POWER4 Hypervisor:
  Dynamic LPAR
  Capacity Upgrade on Demand (client capacity growth, planned vs. actual)
The machine is always in LPAR mode, even with all resources dedicated to one OS.
[Figure: four POWER5 chips (CPU 0-3) forming the shared processor pool - each with two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, and enhanced distributed switch, connected by chip-chip/MCM-MCM/SMP links - hosting dynamic Micro-Partitioning, dynamic LPAR, and virtual I/O to disk and LAN]

POWER Hypervisor implementation


Design enhancements to the previous POWER4 implementation enable the sharing of processors by multiple partitions:
  Hypervisor decrementer (HDECR)
  New Processor Utilization Resource Register (PURR)
  Refined virtual processor objects that do not include physical characteristics of the processor
  New Hypervisor calls

POWER Hypervisor processor dispatch

The Hypervisor manages a set of processors on the machine (the shared processor pool).
POWER5 generates a 10 ms dispatch window.
  Minimum allocation is 1 ms per physical processor.
Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window:
  ms/VP = CE * 10 / VPs
The partition entitlement is evenly distributed among the online virtual processors.
A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.
Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
[Figure: the POWER Hypervisor's processor dispatch distributing the virtual processor capacity entitlement of six shared processor partitions across four POWER5 chips (CPU 0-3) in the shared processor pool]
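A quick worked example of the dispatch-window formula (plain arithmetic, nothing beyond the slide's own formula assumed):

# ms per virtual processor in each 10 ms POWER5 dispatch window:
#   ms/VP = CE * 10 / VPs
def ms_per_vp(ce, vps):
    return ce * 10.0 / vps

print(ms_per_vp(0.8, 2))   # CE=0.8 over 2 VPs -> 4.0 ms per VP per window
print(ms_per_vp(0.5, 5))   # CE=0.5 over 5 VPs -> 1.0 ms each, exactly
                           # the 1 ms minimum allocation noted above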


Dispatching and interrupt latencies


Virtual processors have dispatch latency: the time between a virtual processor becoming runnable and actually being dispatched.
Timers have latency issues as well.
External interrupts have latency issues as well.

Shared processor pool


Processors not associated with dedicated processor partitions.
No fixed relationship between virtual processors and physical processors.
The POWER Hypervisor attempts to use the same physical processor for a given virtual processor (its "home node") - affinity scheduling.
[Figure: the POWER Hypervisor's processor dispatch mapping the virtual processor capacity entitlement of six shared processor partitions onto four POWER5 chips (CPU 0-3) in the shared processor pool]

Affinity scheduling
When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using:
  The same physical processor as before, or
  The same chip, or
  The same MCM
When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that:
  Has affinity for it, or
  Has no affinity to any processor, or
  Is uncapped
Similar to AIX affinity scheduling.
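A compact sketch of these two preference orders (illustrative only; the data structures are invented and the real dispatcher considers more state):

# Illustrative sketch of the two affinity preference orders above.
def place_vp(vp, idle_procs):
    """Pick an idle physical processor for a runnable VP, best affinity first."""
    for level in ("proc", "chip", "mcm"):           # same processor, then chip, then MCM
        for p in idle_procs:
            if p[level] == vp["last"][level]:
                return p
    return idle_procs[0] if idle_procs else None    # no affinity match: any idle processor

def pick_vp(proc, runnable):
    """An idle processor picks the next VP: affinity, then unplaced, then uncapped."""
    for test in (lambda v: v["last"]["proc"] == proc["proc"],   # has affinity for it
                 lambda v: v["last"]["proc"] is None,           # has affinity to no-one
                 lambda v: not v["capped"]):                    # is uncapped
        for v in runnable:
            if test(v):
                return v
    return None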


Operating system support


Micro-Partitioning-capable operating systems need to be modified to cede a virtual processor when they have no runnable work.
  Failure to do this wastes CPU resources; for example, a partition spends its CE waiting for I/O.
  Ceding results in better utilization of the pool.
They may also confer the remainder of their timeslice to another VP; for example, to a VP holding a lock.
Ceded or conferred VPs can be redispatched if they become runnable again during the same dispatch interval.
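Conceptually, the OS behaves like the following sketch. The h_cede()/h_confer() names stand in for the Hypervisor calls the slide describes; this is illustrative pseudocode, not AIX source:

# Conceptual sketch of an OS dispatch decision on a Micro-Partitioned system.
def h_cede(vp):
    print(f"{vp['name']}: no runnable work, ceding cycles back to the pool")

def h_confer(vp, holder):
    print(f"{vp['name']}: conferring remaining timeslice to {holder}")

def dispatch_once(vp):
    thread = vp["run_queue"].pop(0) if vp["run_queue"] else None
    if thread is None:
        h_cede(vp)                            # idle: give entitlement back
    elif thread.get("blocked_on"):
        h_confer(vp, thread["blocked_on"])    # donate cycles to the lock-holding VP
    else:
        print(f"{vp['name']}: running {thread['name']}")

dispatch_once({"name": "vp0", "run_queue": []})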


Example
[Figure: two 10 ms POWER Hypervisor dispatch interval passes across two physical processors - the VPs of LPAR1, LPAR2, and LPAR3 are time-sliced in turn, with idle gaps once each capped partition has consumed its entitlement]
LPAR1: capacity entitlement = 0.8 processing units; virtual processors = 2 (capped)
LPAR2: capacity entitlement = 0.2 processing units; virtual processors = 1 (capped)
LPAR3: capacity entitlement = 0.6 processing units; virtual processors = 3 (capped)


POWER Hypervisor and virtual I/O


I/O operations without dedicating resources to an individual partition.
The POWER Hypervisor's virtual I/O related operations:
  Provide control and configuration structures for the virtual adapter images required by the logical partitions
  Allow partitions controlled and secure access to physical I/O adapters in a different partition
  The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition
I/O types supported: SCSI, Ethernet, and serial console.

Performance monitoring and accounting


CPU utilization is measured against CE.
  An uncapped partition receiving more than its CE will record 100% but will actually be using more.
SMT:
  Thread priorities compound the variable execution rate.
  There are twice as many logical CPUs.
For accounting, intervals may be incorrectly allocated; new hardware support is required.
The Processor Utilization Resource Register (PURR) records the actual clock ticks spent executing a partition.
  Used by performance commands (for example, new flags) and accounting modules.
  Third-party tools will need to be modified.
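As an illustration of the idea (the formula is assumed from the PURR description above, not taken from IBM documentation), per-thread utilization can be derived from PURR deltas relative to the timebase:

# Illustrative PURR-based utilization: a thread's share of the processor is
# the fraction of timebase ticks by which its PURR advanced.
def thread_utilization(purr_delta, timebase_delta):
    return purr_delta / timebase_delta

# The two SMT threads of a core split the timebase between them: thread 0's
# PURR grew by 6e6 ticks and thread 1's by 4e6, over 1e7 timebase ticks.
print(thread_utilization(6e6, 1e7))   # 0.6 -> 60% of the core for thread 0
print(thread_utilization(4e6, 1e7))   # 0.4 -> 40% for thread 1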


Virtual I/O Server


Virtual I/O Server

Provides an operating environment for virtual I/O administration.
Virtual I/O Server administration: a restricted, scriptable command line user interface (CLI).
Minimum hardware requirements:
  POWER5 VIO-capable machine
  Hardware Management Console
  Storage adapter
  Physical disk
  Ethernet adapter
  At least 128 MB of memory
Capabilities of the Virtual I/O Server:
  Ethernet adapter sharing
  Virtual SCSI disk
Virtual I/O Server Version 1.1 is supported for selected configurations, which include specific models of EMC, HDS, and STK disk subsystems attached using Fibre Channel.
Interacts with AIX and Linux partitions.

Virtual I/O Server (Cont.)


Installation CD is provided when the Advanced POWER Virtualization feature is ordered.
Configuration approaches for a highly available Virtual I/O Server:
  LVM mirroring
  Multipath I/O
  EtherChannel
  A second Virtual I/O Server instance in another partition

Virtual SCSI
Allows sharing of storage devices:
  Vital for shared processor partitions
  Overcomes the potential limit of adapter slots due to Micro-Partitioning
  Allows the creation of logical partitions without the need for additional physical resources
Allows attachment of previously unsupported storage solutions.

VSCSI server and client architecture overview


Virtual SCSI is based on a client/server relationship.
The virtual I/O resources are assigned using an HMC.
Virtual SCSI enables sharing of adapters as well as disk devices.
Dynamic LPAR operations are allowed.
Dynamic mapping between physical and virtual resources on the Virtual I/O Server.
[Figure: a Virtual I/O Server partition exporting logical volumes 1 and 2 (on hdisks backed by a physical SCSI/FC adapter and disk) through VSCSI server adapters and the POWER Hypervisor to VSCSI client adapters in AIX and Linux client partitions]

Virtual devices
Are defined as logical volumes (LVs) in the I/O Server partition; normal LV rules apply.
Appear as real devices (hdisks) in the hosted partition.
Can be manipulated using the Logical Volume Manager just like an ordinary physical disk.
Can be used as a boot device and as a NIM target.
Can be shared by multiple clients.
[Figure: an LV on an hdisk in the Virtual I/O Server partition, exported through a VSCSI server adapter and the POWER Hypervisor to a VSCSI client adapter, where it appears as a virtual disk (hdisk) under the client partition's LVM]

SCSI RDMA and Logical Remote Direct Memory Access


SCSI transport protocols define the rules for exchanging information between SCSI initiators and targets.
Virtual SCSI uses the SCSI RDMA Protocol (SRP): SCSI initiators and targets have the ability to directly transfer information between their respective address spaces.
SCSI requests and responses are sent using the virtual SCSI adapters.
The actual data transfer, however, is done using the Logical Redirected DMA protocol.
[Figure: the VSCSI device driver (initiator) in an AIX client partition and the VSCSI device driver (target), device mapping, and physical adapter device driver in the Virtual I/O Server partition, connected through the POWER Hypervisor's Reliable Command/Response Transport and Logical Remote Direct Memory Access to the physical adapter and the client's data buffer]

Virtual SCSI security


Only the owning partition has access to its data.
Data is copied directly from the PCI adapter to the client's memory.

Performance considerations
Virtual SCSI I/O takes about twice as many processor cycles as locally attached disk I/O (evenly distributed between the client partition and the Virtual I/O Server):
  The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request.
  For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice.
If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.
If not constrained by CPU performance, dedicated-partition throughput is comparable to doing local I/O.
Because there is no caching in memory on the server I/O partition, its memory requirements should be modest.

Limitations
The hosting partition must be available before the hosted partition boots.
Virtual SCSI supports FC, parallel SCSI, and SCSI RAID.
Maximum of 65535 virtual slots in the I/O Server partition.
Maximum of 256 virtual slots on a single partition.
All mandatory SCSI commands are supported; not all optional SCSI commands are supported.

Implementation guideline
Partitions with high performance and disk I/O requirements are not recommended candidates for VSCSI.
Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume. Good candidates include:
  Boot disks for the operating system
  Web servers, which typically cache a lot of data

LVM mirroring
This configuration protects virtual disks in a client partition against failure of:
  One physical disk
  One physical adapter
  One Virtual I/O Server
Many possibilities exist to exploit this great function!
[Figure: a client partition LVM-mirroring two virtual disks, each served by a different Virtual I/O Server partition through its own VSCSI server adapter, physical SCSI adapter, and physical disk]

Multipath I/O
This configuration protects virtual disks in a client partition against:
  Failure of one physical FC adapter in one I/O Server
  Failure of one Virtual I/O Server
The physical disk is assigned as a whole to the client partition.
Many possibilities exist to exploit this great function!
[Figure: two Virtual I/O Server partitions, each with a physical FC adapter connected through a SAN switch to the same ESS physical disk, presenting the disk (as an hdisk) through VSCSI server adapters to two VSCSI client adapters multipathed under the client partition's LVM]

Virtual LAN overview


Virtual network segments on top of physical switch devices.
All nodes in a VLAN can communicate without any L3 routing or inter-VLAN bridging.
VLANs provide:
  Increased LAN security
  Flexible network deployment over traditional network devices
VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation:
  VLAN ID tagging of Ethernet frames
  VLAN ID-restricted switch ports
[Figure: VLAN 1 and VLAN 2 spanning switches A, B, and C, with nodes A-1, A-2, B-1 through B-3, C-1, and C-2]

Virtual Ethernet
Enables inter-partition communication: in-memory, point-to-point connections.
Physical network adapters are not needed.
Similar to high-bandwidth Ethernet connections.
Supports multiple protocols (IPv4, IPv6, and ICMP).
No Advanced POWER Virtualization feature required; you need only:
  POWER5 systems
  AIX 5L V5.3 or an appropriate Linux level
  A Hardware Management Console (HMC)

Virtual Ethernet connections


VLAN technology implementation: partitions can only access data directed to them.
Virtual Ethernet switch provided by the POWER Hypervisor.
Virtual LAN adapters appear to the OS as physical adapters.
The MAC address is generated by the HMC.
1-3 Gb/s transmission speed.
Support for large MTUs (~64 KB) on AIX.
Up to 256 virtual Ethernet adapters, with up to 18 VLANs each.
Bootable device support for NIM OS installations.
[Figure: AIX, AIX, and Linux partitions, each with a virtual Ethernet adapter attached to the virtual Ethernet switch in the POWER Hypervisor]

Virtual Ethernet switch


Based on the IEEE 802.1Q VLAN standard (OSI layer 2):
  Optional virtual LAN ID (VID)
  4094 virtual LANs supported
  Up to 18 VIDs per virtual LAN port
Switch configuration through the HMC.

How it works
When a frame is sent from a virtual Ethernet adapter to its virtual VLAN switch port:
1. The POWER Hypervisor caches the source MAC against the sending port.
2. If the frame has no IEEE VLAN header, one is inserted using the port's configured VLAN ID; if it has one, the port must be allowed to use that VLAN ID, or the packet is dropped.
3. If the destination MAC is in the table and the VLAN number matches, the frame is delivered to the associated switch port.
4. Otherwise, if a trunk adapter is defined, the frame is passed to the trunk adapter; if not, the packet is dropped.
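The same decision flow as a runnable sketch (the data structures are invented for illustration):

# Illustrative model of the virtual switch decision flow above.
def switch_frame(port, frame, mac_table, trunk=None):
    """Forward one frame arriving from a virtual Ethernet adapter's port."""
    mac_table[frame["src"]] = port                   # 1. cache source MAC
    if frame.get("vlan") is None:
        frame["vlan"] = port["pvid"]                 # 2. untagged: insert port VLAN ID
    elif frame["vlan"] not in port["allowed"]:
        return "drop"                                #    tagged, but port not allowed
    dest = mac_table.get(frame["dst"])
    if dest is not None and frame["vlan"] in dest["allowed"]:
        return f"deliver to {dest['name']}"          # 3. known MAC on matching VLAN
    return "pass to trunk adapter" if trunk else "drop"   # 4. bridge outward or drop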


Performance considerations
Throughput per 0.1 entitlement:
  Throughput scales nearly linearly with the allocated capacity entitlement.
[Chart: virtual Ethernet TCP_STREAM throughput per 0.1 entitlement (Mb/s) for CPU entitlements of 0.1 to 0.8 and MTU sizes 1500, 9000, and 65394]
Virtual LAN vs. Gigabit Ethernet throughput:
  The virtual Ethernet adapter has higher raw throughput at all MTU sizes.
  The in-memory copy is more efficient at larger MTUs.
[Chart: TCP_STREAM throughput (Mb/s) of VLAN vs. Gb Ethernet at MTU 1500, 9000, and 65394, simplex and duplex]

Limitations
Virtual Ethernet can be used in both shared and dedicated processor partitions, provided the appropriate OS levels are installed.
A mixture of virtual Ethernet connections, real network adapters, or both is permitted within a partition.
Virtual Ethernet can only connect partitions within a single system.
A system's processor load is increased when using virtual Ethernet.

Implementation guideline
Know your environment and the network traffic.
Choose a high MTU size where it makes sense for the network traffic in the virtual LAN.
Use MTU size 65394 if you expect a large amount of data to be copied inside your virtual LAN.
Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394.
Do not turn off SMT.
No dedicated CPUs are required for virtual Ethernet performance.

Connecting Virtual Ethernet to external networks

Routing: the partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server.
[Figure: two systems, each with an AIX routing partition owning the physical adapter (1.1.1.100/3.1.1.1 and 2.1.1.100/4.1.1.1) and AIX/Linux partitions on internal virtual Ethernet subnets (3.1.1.x and 4.1.1.x), connected through an IP router (1.1.1.1 / 2.1.1.1) between IP subnets 1.1.1.x and 2.1.1.x, with external AIX (1.1.1.10) and Linux (2.1.1.10) servers]

Shared Ethernet Adapter


Connects internal and external VLANs using one physical adapter.
SEA is a new service that acts as a layer-2 network switch: it securely bridges network traffic from a virtual Ethernet adapter to a real network adapter.
The SEA service runs in the Virtual I/O Server partition:
  Advanced POWER Virtualization feature required
  At least one physical Ethernet adapter required
No physical I/O slot or network adapter is required in the client partition.

Shared Ethernet Adapter (Cont.)


Virtual Ethernet MAC addresses are visible to outside systems.
Broadcast/multicast is supported.
ARP (Address Resolution Protocol) and NDP (Neighbor Discovery Protocol) can work across a shared Ethernet.
One SEA can be shared by multiple VLANs, and multiple subnets can connect using a single adapter on the Virtual I/O Server.
The virtual Ethernet adapter configured into the Shared Ethernet Adapter must have the trunk flag set: the trunk virtual Ethernet adapter enables a layer-2 bridge to a physical adapter.
IP fragmentation is performed, or an ICMP "packet too big" message is sent, when the Shared Ethernet Adapter receives IP (or IPv6) packets that are larger than the MTU of the adapter that the packet is forwarded through.

Virtual Ethernet and Shared Ethernet Adapter security


VLAN (virtual local area network) tagging follows the description in the IEEE 802.1Q standard.
The implementation of this VLAN standard ensures that the partitions have no access to foreign data.
Only the network adapters (virtual or physical) that are connected to a port (virtual or physical) that belongs to the same VLAN can receive frames with that specific VLAN ID.

Performance considerations
Virtual I/O Server performance:
  Adapters stream data at media speed if the Virtual I/O Server has enough capacity entitlement.
  CPU utilization per gigabit of throughput is higher with a Shared Ethernet Adapter.
[Charts: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilization (%CPU/Gb) at MTU 1500 and 9000, simplex and duplex]

Limitations
System processors are used for all communication functions, leading to a significant amount of system processor load.
One of the virtual adapters in the SEA on the Virtual I/O Server must be defined as the default adapter with a default PVID.
Up to 16 virtual Ethernet adapters, with 18 VLANs on each, can be shared on a single physical network adapter.
Shared Ethernet Adapter requires:
  The POWER Hypervisor component of POWER5 systems
  AIX 5L Version 5.3 or an appropriate Linux level

Implementation guideline
Know your environment and the network traffic.
Use a dedicated network adapter if you expect heavy network traffic between the virtual Ethernet and local networks.
If possible, use dedicated CPUs for the Virtual I/O Server.
Choose an MTU size of 9000 if this makes sense for your network traffic.
Don't use Shared Ethernet Adapter functionality for latency-critical applications.
With MTU size 1500, you need about 1 CPU per Gigabit Ethernet adapter streaming at media speed (see the sizing sketch below).
With MTU size 9000, two Gigabit Ethernet adapters can stream at media speed per CPU.
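Those two rules of thumb can be folded into a small sizing helper (an illustration of the slide's guideline only, not an official sizing tool):

# Rough VIOS CPU sizing from the slide's rules of thumb:
#   MTU 1500 -> ~1.0 CPU per Gigabit Ethernet adapter at media speed
#   MTU 9000 -> ~0.5 CPU per adapter (2 adapters per CPU)
CPUS_PER_ADAPTER = {1500: 1.0, 9000: 0.5}

def vios_cpus_needed(adapters, mtu):
    return adapters * CPUS_PER_ADAPTER[mtu]

print(vios_cpus_needed(3, 1500))   # 3 adapters at MTU 1500 -> ~3.0 CPUs
print(vios_cpus_needed(3, 9000))   # 3 adapters at MTU 9000 -> ~1.5 CPUs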


Shared Ethernet Adapter configuration


The Virtual I/O Server is configured with at least one physical Ethernet adapter.
One Shared Ethernet Adapter can be shared by multiple VLANs.
Multiple subnets can connect using a single adapter on the Virtual I/O Server.
[Figure: a Virtual I/O Server bridging VLAN 1 and VLAN 2 through Shared Ethernet Adapter ent0 onto one physical adapter; AIX (VLAN 1, 10.1.1.11) and Linux (VLAN 2, 10.1.2.11) partitions on the virtual Ethernet switch; external AIX (10.1.1.14) and Linux (10.1.2.15) servers on VLAN 1 and VLAN 2]

Multiple Shared Ethernet Adapter configuration


Maximizing throughput by using several Shared Ethernet Adapters.
More queues, more performance.
[Figure: a Virtual I/O Server with Shared Ethernet Adapters ent0 and ent1, each bridging one VLAN (VLAN 1: 10.1.1.11, VLAN 2: 10.1.2.11) onto its own physical adapter; external AIX (10.1.1.14) and Linux (10.1.2.15) servers on VLAN 1 and VLAN 2]

Multipath routing with dead gateway detection


This configuration protects your access to the external network against:
  Failure of one physical network adapter in one I/O Server
  Failure of one Virtual I/O Server
  Failure of one gateway
[Figure: an AIX partition using multipath routing with dead gateway detection - default route to 9.3.5.10 via 9.3.5.12, and default route to 9.3.5.20 via 9.3.5.22 - over VLAN 1 (9.3.5.11) and VLAN 2 (9.3.5.21), bridged by Shared Ethernet Adapters in two Virtual I/O Servers (9.3.5.12 and 9.3.5.22), each with its own physical adapter and gateway (9.3.5.10 and 9.3.5.20) to the external network]

Shared Ethernet Adapter commands


Virtual I/O Server commands:
  lsdev -type adapter: lists all the virtual and physical adapters. Choose the virtual Ethernet adapter to map to the physical Ethernet adapter, and make sure the physical and virtual interfaces are unconfigured (down or detached).
  mkvdev: maps the physical adapter to the virtual adapter, creates a layer-2 bridge, and defines the default virtual adapter with its default VLAN ID. It creates a new Ethernet interface (for example, ent5).
  mktcpip: used for TCP/IP configuration on the new Ethernet interface (for example, ent5).
Client partition commands:
  No new commands are needed; the typical TCP/IP configuration is done on the virtual Ethernet interface that is defined in the client partition profile on the HMC.
Virtual SCSI commands


Virtual I/O Server commands:
To map an LV:
  mkvg: creates the volume group, in which a new LV is created using the mklv command.
  lsdev: shows the virtual SCSI server adapters that could be used for mapping with the LV.
  mkvdev: maps the virtual SCSI server adapter to the LV.
  lsmap -all: shows the mapping information.
To map a physical disk:
  lsdev: shows the virtual SCSI server adapters that could be used for mapping with a physical disk.
  mkvdev: maps the virtual SCSI server adapter to a physical disk.
  lsmap -all: shows the mapping information.
Client partition commands:
  No new commands needed; the typical device configuration uses the cfgmgr command.

Section Review Questions


1. Any technology improvement will boost performance of any client solution.
  a. True
  b. False
2. The application of technology in a creative way to solve clients' business problems is one definition of innovation.
  a. True
  b. False

Section Review Questions


3. Clients' satisfaction with your solution can be enhanced by which of the following?
  a. Setting expectations appropriately.
  b. Applying technology appropriately.
  c. Communicating the benefits of the technology to the client.
  d. All of the above.

Section Review Questions


4. Which of the following are available with the POWER5 architecture?
  a. Simultaneous Multi-Threading.
  b. Micro-Partitioning.
  c. Dynamic power management.
  d. All of the above.

Section Review Questions


5. Simultaneous Multi-Threading is the same as hyperthreading; IBM just gave it a different name.
  a. True.
  b. False.

Section Review Questions


6. In order to bridge network traffic between the virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter.
  a. True.
  b. False.

Review Question Answers


1. b
2. a
3. d
4. d
5. b
6. a

Unit Summary
You should now be able to:
Describe the relationship between technology and solutions.
List key IBM technologies that are part of the POWER5 products.
Describe the functional benefits that these technologies provide.
Discuss the appropriate use of these technologies.


Reference
You may find more information here:
IBM eServer pSeries AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading, white paper
Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940
IBM eServer p5 Virtualization Performance Considerations, SG24-5768