
SER2343BE

Extreme Performance Series:
vSphere Compute & Memory Schedulers

Haoqiang Zheng
Principal Engineer, VMware, Inc.

#SER2343BE
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda

1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
CPU Scheduler Overview
• Goals
– High CPU utilization, high application throughput
– Ensure fairness (shares, reservation, limit)
CPU Scheduler Overview
• When?
– An idle PCPU has new runnable worlds (wakeup: VM power-on, etc.)
– The running world voluntarily yields the CPU (wait: idle/non-idle)
– The running world involuntarily gives up the CPU (preemption: higher priority / fair share reached)

• What?
– The world in the ready queue with the least (consumed CPU time / fair share)

• Where?
– Balance load across PCPUs
– Preserve cache state, minimize migration cost
– Avoid HT/LLC contention between sibling vCPUs
– Close to worlds with frequent communication patterns
Scheduling Through the Lens of esxtop
A Command Line Tool for Performance Monitoring

• For real-time monitoring
– Just type esxtop into the ESXi shell / terminal

• For batch-mode collection (a concrete example follows below)
– esxtop -b -a -d $DELAY -n $SAMPLES > $FILE_NAME.csv
– Tools: perfmon, Excel

• A few changes in 6.5
– Processor turbo or frequency scaling efficiency
– More intuitive accounting of %SYS
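As a concrete batch-mode invocation (illustrative values, not from the session): capture every counter at a 10-second interval for one hour, then open the CSV in perfmon or Excel.

  # -b batch mode, -a all counters, -d sampling interval in seconds, -n number of samples
  esxtop -b -a -d 10 -n 360 > scheduler_capture.csv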
Peeking into a Virtual Machine Using esxtop

[esxtop screenshot: the worlds belonging to a single VM]

A virtual machine consists of more than vCPU worlds.
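One way to reproduce a view like this (a usage sketch; key bindings as I recall them for 6.x esxtop, so verify on your build): in the interactive CPU panel, worlds are rolled up per VM by default, and expanding a VM's group shows its vCPU worlds next to helper worlds such as the vmx and mks worlds.

  esxtop    # press 'c' for the CPU panel, then 'e' to expand a group and enter the VM's GID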
CPU Scheduler Accounting: %USED vs. %RUN

[Diagram: %USED vs. %RUN (UTIL) over a 1-second interval]

– Over the same 1 second of scheduled time, a world can achieve different amounts of work
– RUN (UTIL) is based on wall-clock time (TSC)
– USED reflects frequency scaling (power, turbo) and hyper-threading contention
CPU Scheduler: Throughput Gain due to Hyperthreading
SPEC CPU2006 in VMs (Haswell)
HT: workload + idle vs. workload + workload

[Bar chart: normalized throughput gain, baseline vs. HT, for perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, and h264ref]
CPU Scheduler: Slowdown due to Hyperthreading
SPEC CPU2006 in VMs (Haswell)
HT: workload + idle vs. workload + workload

[Bar chart: runtime in minutes, baseline vs. HT, for the same benchmarks; each benchmark runs roughly 1.5x–1.9x longer when its hyper-twin is busy]
CPU Scheduler: Hyperthreading and %USED time
• Improves throughput
• Each vCPU might run slower with contention from its hyper-twin
– 2 cores vs. 2 HTs on a single core

• Hyperthreading-aware scheduling
– The same %RUN translates into different amounts of %USED
• 100% %RUN may only give 70% %USED in case of contention

• HT is enabled by default for the extra throughput
• Be aware of HT contention. ☺
CPU Scheduler Accounting: Breakdown

%USED = %RUN + %SYS - %OVRLP - E

[Timeline t0–t8 with segments W, A, B, C, D, E:]
– %WAIT (W): the world is waiting
– %RDY: time in the ready queue
– %RUN: actual execution
– %OVRLP: scheduling cost while the world is interrupted
– %SYS += D if the interrupt is serviced for this VM
– E: efficiency loss from power management, hyper-threading, etc.
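A rough worked example of the formula above (illustrative numbers, not measurements from the session): a vCPU with %RUN = 100, %SYS = 6 charged to it, %OVRLP = 4, and an efficiency loss of E = 30 from a busy hyper-twin reports %USED = 100 + 6 - 4 - 30 = 72, even though %RUN still shows 100.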
CPU Scheduler Accounting: %SYS from Kernel Contexts

vSphere 6.0
[esxtop screenshot]

vSphere 6.5 (NEW!)
[esxtop screenshot]
CPU Scheduler Accounting: Group vs. World

[esxtop screenshot, annotated "128 vCPUs!!"]

Group (VM) stats aggregate world stats.
%RDY Impact on Throughput (Java Workload)

[Chart: normalized throughput (bops) vs. %RDY from 0 to 20; throughput drops by roughly 15% across the range]

%RDY affects throughput.
%RDY Impact on Latency (Redis Workload)

[Chart: 99.99th-percentile latency (msec) vs. %RDY for a "spiky" and a "flat" competing workload, with annotations of about -7 ms and -4 ms]

Latency depends on the competing workloads.
CPU Scheduler Co-scheduling
• *NOT* gang-scheduling
– Allows a subset of vCPUs to run simultaneously

• Co-stop a leading vCPU if it advances too far ahead
– Efficient in a consolidated setup
– IDLE time will not cause co-stop!

• High co-stop (%CSTP)?
– Watch out for the vCPUs' (WAIT - WAIT_IDLE), i.e. %VMWAIT in esxtop
– A vCPU can block due to I/O (e.g. to a snapshot) or host-level memory swapping
Agenda

1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
Memory Management Overview
• Goals
– Allow memory over-commitment
– Handle transient memory pressure well

• Terminology
[Diagram: total memory size splits into allocated memory and free memory; allocated memory splits into active memory and idle memory]
Memory Management Overview
• Reclaim memory if consumed > entitled
– Entitlement: shares, limit, reservation, active estimation
– Page sharing > Ballooning > Compression > Host swapping
• Breaks host large pages

• Page sharing vs. large pages
– Using large pages for both guest and ESXi improves performance by 10–30%
– Page sharing avoids ballooning and swapping
– vSphere 6.0 breaks large pages earlier and increases page sharing (clear state)
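A quick way to see which of these reclamation techniques is active on a host (a sketch; the field names are the 6.x esxtop memory-panel columns as I recall them, so verify them on your build):

  esxtop    # press 'm' for the memory panel
  # host header: the free-memory "state" (high / clear / soft / hard / low) indicates pressure
  # per-VM columns: MCTLSZ (balloon size), SWCUR (swapped), ZIP/s and UNZIP/s (compression activity)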
Transient Memory Pressure Example
• Six 4GB Swingbench VMs (VM-4, 5, 6 are idle) in a 16GB host

[Chart: operations per minute over time for VM1–VM3; ΔVM1 = 0%, ΔVM2 = 0%]

[Chart: balloon, swap used, compressed, and shared size (GB) over time]
Constant Memory Pressure Example
• All six VMs run Swingbench workloads

[Chart: operations per minute over time for VM1–VM6; ΔVM1 = -16%, ΔVM2 = -21%]

[Chart: swap-in rate (KB per second) over time]
General Principles
• Two types of memory overcommitment
– "Configured" memory overcommitment: SUM (memory size of all VMs) / host memory size
– "Active" memory overcommitment: SUM (mem.active of all VMs) / host memory size

• Performance impact
– "Active" memory overcommitment ≈ 1 → high likelihood of performance degradation!
• Some active memory is not in physical RAM
– "Configured" memory overcommitment > 1 → zero or negligible impact
• Most reclaimed memory is free/idle guest memory

• Aim for high consolidation while keeping "active" memory overcommitment down
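A worked example of the two ratios (hypothetical numbers): ten 16 GB VMs on a 128 GB host give a configured overcommitment of 160 / 128 = 1.25. If those VMs together keep only about 60 GB active, the active overcommitment is 60 / 128 ≈ 0.47, so reclamation falls mostly on free/idle pages and the performance impact stays negligible.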
Agenda

1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
NUMA
• Non-Uniform Memory Access system architecture
– Each node consists of CPU cores, memory, and possibly devices
– Access time can be 30% ~ 200% longer across nodes

• NUMA node vs. sockets
– Multiple NUMA nodes per socket (Cluster-on-Die)
– Multiple sockets per NUMA node (less common)
[Diagram: NUMA node 0 and NUMA node 1]

• Small VMs are scheduled on a single physical NUMA node
– 100% local memory accesses
– "Fixes" scale-out apps that don't scale up well
– Consider sizing databases to fit
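To see where the scheduler has placed each VM relative to the host's NUMA nodes (a sketch; the column names are the 6.x esxtop NUMA-statistics fields as I recall them, so confirm on your build):

  esxtop    # press 'm' for the memory panel, then 'f' to add the NUMA statistics fields
  # per-VM columns: NHN (NUMA home node), NMIG (NUMA migrations), N%L (% of memory that is node-local)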
NUMA Scheduler: Overview
• Goal
– Balance VMs across different NUMA nodes
– Properly expose a virtualized NUMA topology to guest VMs for best performance

• Initial placement and rebalancing
– Initial placement based on CPU/memory load + round-robin
– Every 2 seconds, try an incremental move (1 or 2 VMs)
– To improve load balance / memory locality / relation sharing / fairness

• vNUMA
– For wide VMs (#vCPUs > #cores per NUMA node)
– Expose a virtual NUMA topology to improve memory locality and guest scheduling
NUMA Scheduler: Load Balancing
• Initial placement
– Initial placement based on CPU/memory load + round-robin

• Periodic rebalancing algorithm
– Every 2 seconds, try an incremental move (1 or 2 VMs)
– To improve load balance / memory locality / relation sharing / fairness
NUMA Rebalancing In Action (TPCx-V)

[Four panels (Group-1 to Group-4): PCPU number over time for vm2–vm13, showing the rebalancer moving VMs between NUMA nodes]
NUMA Scheduler: Impact of Cluster-on-Die (Haswell)
• Cluster-on-Die
– Breaks each socket into 2 NUMA domains

• Performance considerations
– Lower LLC hit latency and local memory latency
– Higher local memory bandwidth
– Not applicable for node-interleaving

[Bar chart: SPECjbb2015 normalized throughput gain, default vs. CoD, for 36Vx1, 18Vx1, and 9Vx2 VM configurations]
NUMA Scheduler: vNUMA
• vNUMA
– Useful for wide VMs (#vCPUs > #cores per NUMA node)
– Expose a virtual NUMA topology to improve memory locality and guest scheduling

• Example: a 10-vCPU VM maps directly to 2 pNUMA nodes (out of 4)
[Diagram: four pNUMA nodes, each with cores C0–C5; the VM's vCPUs span two of them]
NUMA Scheduler: vNUMA Example

[Screenshot]
NUMA Scheduler: vNUMA vs. vSocket (NEW!)
• New policy (vSphere 6.5)
– The number of vSockets and the number of vNUMA nodes are partially decoupled for optimal sizing
– ESXi will always try to pick the optimal vNUMA topology when possible
• vNUMA = N x vSocket, or vice versa
• Blog post: "Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb"

[Diagram: 4 vSockets, 8 vCPUs, 2 vNUMA nodes]
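A worked illustration of the decoupling (hypothetical host, consistent with the new policy above and the old policy shown in the backup slides): a 16-vCPU VM configured with 2 cores per socket (8 vSockets) on a host whose NUMA nodes have 10 cores each. The old policy would expose 8 vNUMA nodes of 2 vCPUs; the 6.5 policy can still present the optimal 2 vNUMA nodes of 8 vCPUs, independent of the vSocket count.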
NUMA Scheduler: Virtual CPU Topology Example

[Screenshot]

• numactl --hardware
• lstopo -s
• coreinfo -n -s
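For instance, inside a Linux guest that was given two vNUMA nodes, the first lines of numactl --hardware look roughly like this (illustrative output; formatting varies with the numactl version):

  numactl --hardware
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4
  node 1 cpus: 5 6 7 8 9
  ...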
Agenda

1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
Host Configuration: Power Management Policy

Impact on Throughput of a Java Workload

[Bar chart: normalized performance under the High Perf and Balanced (P-states + C-states) power policies, on IvyBridge and Haswell]
VM Sizing: #vCPUs
• Cost of over-sizing
– Small CPU overhead per vCPU from periodic timers, etc.
– May hurt performance due to process migrations (e.g. Redis, up to ~40%)
• Note: no co-scheduling overhead from idle vCPUs

• Cost of under-sizing
– Internal CPU contention
– Check CPU usage and the processor queue length (quick checks below)
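Two quick in-guest checks for under-sizing (illustrative commands; the thresholds depend on the workload):

  # Linux: the 'r' column of vmstat is the run-queue length;
  # a value persistently well above the vCPU count suggests under-sizing
  vmstat 5
  # Windows: watch the "System\Processor Queue Length" counter in perfmon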
VM Sizing: vRAM
• Cost of over-sizing
– Some apps/OSes treat unused memory as cache
• e.g. SuperFetch
• Increases active memory
• May suffer from memory reclamation

• Cost of under-sizing
– Guest-level paging
Summary
• Be aware of the difference between per-VM %RDY and per-world %RDY
• Pay attention to ready time (%RDY) if tail latency matters
– Consider the high latency-sensitivity option
• Size VMs based on the number of physical cores instead of hyperthreads

• In vSphere 6.5 and beyond
– Some %SYS → %RUN
– No more hassle between vSocket and vNUMA

• Avoid active memory overcommitment
• Watch out for under-sizing a VM
• Power policy matters!
Extreme Performance Series – Las Vegas
• SER2724BU Performance Best Practices
• SER2723BU Benchmarking 101
• SER2343BU vSphere Compute & Memory Schedulers
• SER1504BU vCenter Performance Deep Dive
• SER2734BU Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU Predictive DRS – Performance & Best Practices
• SER1494BU Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE Performance Best Practices
• SER2343BE vSphere Compute & Memory Schedulers
• SER1504BE vCenter Performance Deep Dive
• SER2849BE Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
Don't miss these popular Extreme Performance labs:

• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
– Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.

• HOL-1804-02-CHG: vSphere Challenge Lab
– Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.

Scan this QR code to access a short survey and provide us direct feedback.

Alternatively: www.vmware.com/go/perf

Thank you!
Backup slides
Q&A
NUMA Scheduler: vNUMA vs. vSocket
• Old policy (vSphere 6.0 and before)
– Default: coresPerSocket == 1
• Number of vSockets == Number of vCPUs

[Diagram: 8 vSockets, 8 vCPUs, 2 vNUMA nodes]
NUMA Scheduler: vNUMA vs. vSocket
• Old policy (vSphere 6.0 and before)
– Default: coresPerSocket == 1
• Number of vSockets == Number of vCPUs
– Manual: coresPerSocket != 1 (say 2)
• Number of vSockets == Number of vCPUs / coresPerSocket
• Number of vNUMA nodes == number of vSockets, regardless of host configuration

[Diagram: 4 vSockets, 8 vCPUs, 4 vNUMA nodes]
