Larry Woodman, Senior Consulting Engineer, RHEL/VM
Bill Gray, Principal Performance Engineer
Red Hat
June 13, 2013
Part I
  Red Hat Enterprise Linux "tuned" profiles, top benchmark results
  Scalability: CFS Scheduler tunables / Cgroups
  Hugepages: Transparent Hugepages, 2MB/1GB
  Non Uniform Memory Access (NUMA) and NUMAD
Part II
  Network Performance and Latency-performance
  Disk and Filesystem IO - Throughput-performance
  System Performance/Tools: perf, tuna, systemtap
  Q&A
Scale Up
Scale Out
1000s nodes
www.spec.org
www.tpc.org
TPC-H: 3 of top 6 categories
TPC-C
Top virtualization w/ IBM
STAC FSI workloads - www.stacresearch.com
SAP Sales and Distribution - www.sap.com/campaigns/benchmark
[Chart: percentage axis (0% to 100%) by year (2011, 2012, 2013) for SPEC benchmarks cpu2006, virt_sc2010, jEnterprise2010, virt_sc2013, jbb2013]
SPEC is a registered trademark of the Standard Performance Evaluation Corporation. For more information about SPEC and its benchmarks see www.spec.org
[Chart: percentage axis (0% to 100%) by year (2011, 2012, 2013) for TPC-C and TPC-H]
For more information about the TPC and its benchmarks see www.tpc.org.
05/28/13
[Table: tuned profile comparison; only fragments survive. Row labels: Filesystem Barriers, CPU Governor, Disk Read-ahead, Disable THP, Disable C-States. Cell values: latency-performance, 30, deadline (x4), Off, On, ondemand, Yes (x2)]
https://access.redhat.com/site/solutions/369093
[Diagram: Socket 1 and Socket 2; Core 1; Thread 0 and Thread 1 per core]
/proc/sys/kernel/sched_*
Red Hat Enterprise Linux 6 tuned-adm will increase quantum on par with Red Hat Enterprise Linux 5

echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns
  Minimal preemption granularity for CPU bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).

echo 15000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
  The wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is 5000000 (ns).
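A minimal sketch of applying the two settings above from a script. The helper name set_sched_tunable and its optional directory argument are hypothetical; the directory argument exists only so the function can be tried against a scratch directory instead of /proc/sys/kernel. Writing the real files requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the old value of a scheduler tunable, then write a new one.
# $1 = tunable name, $2 = new value in ns, $3 = directory (assumed default: /proc/sys/kernel)
set_sched_tunable() {
    tunable_file="${3:-/proc/sys/kernel}/$1"
    if [ ! -w "$tunable_file" ]; then
        echo "cannot write $tunable_file (missing, or not root?)" >&2
        return 1
    fi
    echo "$1: $(cat "$tunable_file") -> $2"
    echo "$2" > "$tunable_file"
}

# Values from the slide (commented out so that sourcing this file changes nothing):
#   set_sched_tunable sched_min_granularity_ns    10000000
#   set_sched_tunable sched_wakeup_granularity_ns 15000000
```

Printing the previous value first makes it easy to put the setting back if the workload does not benefit.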
Load Balancing
Scheduler tries to keep all CPUs busy by moving tasks from overloaded CPUs to idle CPUs
Detect using "perf stat", look for excessive "migrations"
/proc/sys/kernel/sched_migration_cost
  Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns). If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.
Rule of thumb - increase by 2-10x to reduce load balancing
Increase by 10x on large systems when many CGROUPs are actively used (ex: RHEV/KVM/RHOS)
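The rule of thumb above is simple arithmetic on the current value. A small sketch, assuming an integer factor; the function name scaled_migration_cost and the ./workload placeholder are hypothetical.

```shell
#!/bin/sh
# Scale a sched_migration_cost value (ns) by an integer factor.
# $1 = current value in ns, $2 = factor (the rule of thumb suggests 2 to 10)
scaled_migration_cost() {
    echo $(( $1 * $2 ))
}

# With the default of 500000 ns and the 10x large-system suggestion:
#   scaled_migration_cost 500000 10      # prints 5000000
#
# To compare migration counts before and after a change (requires perf and root;
# ./workload is a placeholder for the program under test):
#   perf stat -e cpu-migrations ./workload
#   scaled_migration_cost "$(cat /proc/sys/kernel/sched_migration_cost)" 10 \
#       > /proc/sys/kernel/sched_migration_cost
#   perf stat -e cpu-migrations ./workload
```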
Sched_Migration Cost
RHEL6.3 Effect of sched_migration cost on fork/exit
Intel Westmere EP 24cpu/12core, 24 GB mem
[Chart: usec/call (0.00 to 250.00) with a percent axis (0.00% to 140.00%) for exit_10, exit_100, exit_1000, fork_10, fork_100, fork_1000]
fork() behavior
sched_child_runs_first
  Controls whether parent or child runs first
  Default is 0: parent continues before children run.
  Default is different than RHEL5
2MB Hugepages
  Reserve/free via
  Used via hugetlbfs
GB Hugepages (1GB)
  Reserved at boot time/no freeing
  Used via hugetlbfs
Transparent Hugepages
  On by default via boot args or /sys
  Used for anonymous memory
1GB Hugepages
Boot arguments - default_hugepagesz=1GB hugepagesz=1GB hugepages=8
# cat /proc/meminfo | more
HugePages_Total: 8
HugePages_Free: 8
HugePages_Rsvd: 0
HugePages_Surp: 0
# mount -t hugetlbfs none /mnt
# ./mmapwrite /mnt/junk 33
writing 2097152 pages of random junk to file /mnt/junk
wrote 8589934592 bytes to file /mnt/junk
# cat /proc/meminfo | more
HugePages_Total: 8
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
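The before/after comparison of /proc/meminfo above can be reduced to a single number. A minimal sketch; the function name hugepages_in_use is hypothetical, and the file argument is there so the function can also be run on a saved copy of /proc/meminfo.

```shell
#!/bin/sh
# Print how many huge pages are in use (HugePages_Total - HugePages_Free).
# $1 = meminfo-style file (assumed default: /proc/meminfo)
hugepages_in_use() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^HugePages_Free:/  { unused = $2 }
         END { print total - unused }' "${1:-/proc/meminfo}"
}
```

Run before and after a hugetlbfs write such as the mmapwrite test, the value should move from 0 toward the number of pages reserved at boot.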
Transparent Hugepages
Boot argument: transparent_hugepage=never
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
[root@dhcp-100-19-50 code]# time ./memory 15 0
real 0m12.434s
user 0m0.936s
sys 0m11.416s
# cat /proc/meminfo
MemTotal: 16331124 kB
AnonHugePages: 0 kB

Boot argument: transparent_hugepage=always (enabled by default)
# echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# time ./memory 15GB
real 0m7.024s
user 0m0.073s
sys 0m6.847s
# cat /proc/meminfo
MemTotal: 16331124 kB
AnonHugePages: 15590528 kB
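Before timing a workload it helps to confirm which THP mode is actually active. A minimal sketch, assuming the sysfs file marks the active choice with square brackets (for example "[always] never"); both function names are hypothetical, and the second path in the loop is the upstream location, added here as an assumption for systems other than RHEL 6.

```shell
#!/bin/sh
# Print the active mode from a THP "enabled" string such as "[always] never".
thp_active_mode() {
    echo "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Print whichever THP control file exists on this system.
thp_enabled_file() {
    for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
             /sys/kernel/mm/transparent_hugepage/enabled; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Usage: f=$(thp_enabled_file) && thp_active_mode "$(cat "$f")"
```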
[Chart: sun_hotspot %gain (bops), axis 0.0% to 12.0%; data labels 12.6% and 9.1%]
Use 2M x86_64 page vs 4K page - RHEL6, static use of hugepages. Static pages wired-down
Automatically use huge pages for all anonymous memory. Daemon to gather free dynamically
Memory Zones
32-bit
  Highmem Zone: up to 64 GB (PAE)
  Normal Zone: 896 MB or 3968 MB
  DMA Zone: 0 to 16 MB
64-bit
  Normal Zone: up to end of RAM
  DMA32 Zone: up to 4 GB
  DMA Zone: 0 to 16 MB
Separate page-lists for anonymous and pagecache
Prevents mixing of anonymous and file-backed pages on active and inactive LRU lists
Eliminates long pauses when all CPUs enter direct reclaim during memory exhaustion
Prevents swapping when copying very large files
Prevents swapping of database cache during backup.
[Diagram: anonLRU and fileLRU, each with ACTIVE and INACTIVE lists; page aging and reclaiming]
What is NUMA?
Non Uniform Memory Access
A result of making bigger systems more scalable by distributing system memory near individual CPUs....
All multi-socket x86_64 server systems are NUMA
Most servers have 1 NUMA node / socket
Recent AMD systems have 2 NUMA nodes / socket
Else OS will see only 1 NUMA node!!!
[Diagrams: NUMA nodes (Node 0 to Node 3), each with its own RAM, Core 0 to Core 3, and a shared L3 cache]
Memory zones (DMA & Normal zones)
CPUs
IO/DMA capacity
Interrupt processing
Page reclamation kernel thread (kswapd#)
Lots of other kernel threads
[Diagram: Node 0 and Node 1, each with a Normal Zone]
Zone_reclaim_mode
Controls NUMA specific memory allocation policy
When set and node memory is exhausted:
  Reclaim memory from local node rather than allocating from next node
  Slower allocation, higher NUMA hit ratio
When clear and node memory is exhausted:
  Allocate from all nodes before reclaiming memory
  Faster allocation, higher NUMA miss ratio
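A small sketch that restates this trade-off for whatever value is currently configured. It assumes the standard /proc/sys/vm/zone_reclaim_mode file and treats any nonzero value as "set"; the function name describe_zone_reclaim is hypothetical.

```shell
#!/bin/sh
# Describe a zone_reclaim_mode value using the trade-off from the slide.
# $1 = value (assumed default: contents of /proc/sys/vm/zone_reclaim_mode)
describe_zone_reclaim() {
    mode=${1:-$(cat /proc/sys/vm/zone_reclaim_mode)}
    if [ "$mode" -ne 0 ]; then
        echo "set: reclaim from local node first (slower allocation, higher NUMA hit ratio)"
    else
        echo "clear: allocate from all nodes before reclaiming (faster allocation, higher NUMA miss ratio)"
    fi
}

# Usage: describe_zone_reclaim        # reports the running system's setting
```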
[Chart: bops (200000 to 1400000) for inst1, inst2, inst3, inst4; data label 5.5x]
The Linux system scheduler is very good at maintaining responsiveness and optimizing for CPU utilization
Tries to use idle CPUs, regardless of where process memory is located.... Using remote memory degrades performance!
Red Hat is working with the upstream community to increase NUMA awareness of the scheduler and to implement automatic NUMA balancing.
Remote memory latency matters most for long-running, significant processes, e.g. HPTC, VMs, etc.
Rewritten for Red Hat Enterprise Linux 6.4 to show per-node system and process memory information
100% compatible with prior version by default, displaying /sys...node<n>/numastat memory allocation statistics
Any command options invoke new functionality
  -m for per-node system memory info
  <pattern> for per-node process memory info
See numastat(8)
PID               Node 0  Node 1  Node 2  Node 3
----------------  ------  ------  ------  ------
10587 (qemu-kvm)    1216    4022    4028    1456
10629 (qemu-kvm)    2108      56     473    8077
10671 (qemu-kvm)    4096    3470    3036     110
10713 (qemu-kvm)    4043    3498    2135    1055
----------------  ------  ------  ------  ------
Total              11462   11045    9672   10698

PID               Node 0  Node 1  Node 2  Node 3  Total
----------------  ------  ------  ------  ------  -----
10587 (qemu-kvm)       0   10723       5       0  10728
10629 (qemu-kvm)       0       0       5   10717  10722
10671 (qemu-kvm)       0       0   10726       0  10726
10713 (qemu-kvm)   10733       0       5       0  10738
----------------  ------  ------  ------  ------  -----
Total              10733   10723   10740   10717  42913
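One way to read per-process tables like these is to ask what share of each process's memory sits on its single largest node. A minimal sketch, assuming rows of the form "PID name node-columns Total" with the row total in the last column; the function name numa_locality is hypothetical.

```shell
#!/bin/sh
# For each process row (PID, name, one column per node, Total), print the PID and
# the percentage of its memory held on its largest node.
# Assumption: the last column of each row is the row total.
numa_locality() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 {
             max = 0
             for (i = 3; i < NF; i++) if ($i + 0 > max) max = $i + 0
             if ($NF > 0) printf "%s %.1f\n", $1, 100 * max / $NF
         }' "$@"
}

# Usage: numa_locality saved_table.txt    (or pipe the table on stdin)
```

Header, separator, and Total rows are skipped because their first field is not a number. A process spread evenly over four nodes scores near 25, while one held almost entirely on one node scores near 100.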
Don't assign more memory than can be used
Don't make guest unnecessarily wide
For best NUMA affinity and performance, the number of guest VCPUs should be <= number of physical cores per node, and guest memory < available memory per node
Guests that span nodes should consider SLIT
Research NUMA topology of each system
Make a resource plan for each system
Bind both CPUs and Memory
Might also consider devices and IRQs
"numactl -N <nodes> -m <nodes> <workload>"
Edit xml: <numatune> <memory mode="strict" nodeset="1-2"/> </numatune>
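The point of "bind both CPUs and memory" is that the same node list goes to both numactl options. A small sketch that builds such a command line; the function name numa_bind_cmd and the ./workload placeholder are hypothetical.

```shell
#!/bin/sh
# Print a numactl command line that binds a workload's CPUs (-N) and memory (-m)
# to the same node list. $1 = node list (e.g. "1-2"); remaining args = workload.
numa_bind_cmd() {
    nodes=$1
    shift
    echo "numactl -N $nodes -m $nodes $*"
}

# Usage (prints the command; prefix with eval or paste it to run):
#   numa_bind_cmd 1 ./workload
```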
Control Group (Cgroups) for CPU/Memory/Network/Disk
Benefit: guarantee Quality of Service & dynamic resource allocation
Ideal for managing any multi-application environment
New Red Hat Enterprise Linux 6.4 user-level daemon to automatically improve out of the box NUMA system performance, and to balance NUMA usage in dynamic workload environments
Was tech-preview in Red Hat Enterprise Linux 6.3
Improves NUMA performance for some workloads
Not enabled by default
See numad(8)
Numad Picker
[Diagram: Node 0 to Node 3 before numad and after numad, showing the placement of Process 37, Process 29, Process 19, and Process 61]
Numad - aligning memory and threads in nodes. Reduces memory latency, improves determinism
numad usage
Multiple applications running on the same server
Multiple instances of the same application
Multiple virtual guests
numad is most likely to have a positive effect when processes can be localized in a fractional subset of the system's NUMA nodes. If the entire system is dedicated to a large in-memory database application, for example -- especially if memory accesses will likely remain unpredictable -- numad will probably not improve performance. Similarly, very high bandwidth applications -- that really need all the system memory controllers -- will likely not benefit from localization
Default is "-i 5:15"
Increasing the max interval will decrease overhead -- but will also decrease responsiveness to changing loads.
-u <n> to specify target utilization percent
Default is "-u 85"
Increase the utilization target to more fully utilize the entire resources on each node
Decrease the utilization target to maintain more per-node resource margin for bursty loads
[Chart: BOPs vs. WHs; axis ticks 50 to 90]
-w <CPUs>:<MBs> for node suggestions
Output is a recommended node list, e.g. "1-2,4"
Can be used regardless of whether numad is running as a daemon
Will take a couple seconds if not running
Used by libvirt for optional VM auto placement
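A script consuming the recommended node list usually needs the individual node numbers. A minimal sketch, assuming the list is a comma- or space-separated mix of single nodes and ranges such as "1-2,4"; the function name expand_node_list and the request size in the usage comment are hypothetical.

```shell
#!/bin/sh
# Expand a node list such as "1-2,4" into individual node numbers: "1 2 4".
expand_node_list() {
    out=""
    for part in $(echo "$1" | tr ',' ' '); do
        first=${part%-*}      # text before the dash (whole item if no dash)
        last=${part#*-}       # text after the dash (whole item if no dash)
        n=$first
        while [ "$n" -le "$last" ]; do
            out="$out $n"
            n=$((n + 1))
        done
    done
    echo "${out# }"
}

# Usage, asking numad for a placement for 4 CPUs and 8192 MB (illustrative sizes):
#   expand_node_list "$(numad -w 4:8192)"
```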
[Chart: BOPS (500000 to 2500000) vs. Warehouses (10 to 80)]
[Chart: axis 0 to 1400000 vs. Users (20U, 40U, 80U); secondary axis 0 to 35]
numad future
Shipping in Red Hat Enterprise Linux 6.4
Potential future improvements:
  Device and IRQ affinity
  Related process hints
Future TBD pending upstream kernel efforts
Perhaps complementary NUMA management roles, as systems will continue to grow in size and complexity
Summary / Questions
"TUNED" tool - adjusts system parameters to match environments - throughput/latency.
Transparent Huge Pages - auto select large pages for anonymous memory, static hugepages for shared mem
Non-uniform Memory Access (NUMA)
  numastat enhancements
  numactl for manual control
  numad daemon for auto placement
(...Come back for part 2...)
cgroups Architecture
# cat /etc/cgconfig.conf
mount {
    cpuset = /cgroup/cpuset;
    cpu = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio = /cgroup/blkio;
}
# ls -l /cgroup
drwxr-xr-x 2 root root 0 Jun 21 13:33 blkio
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpu
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuacct
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuset
drwxr-xr-x 3 root root 0 Jun 21 13:33 devices
drwxr-xr-x 3 root root 0 Jun 21 13:33 freezer
drwxr-xr-x 3 root root 0 Jun 21 13:33 memory
drwxr-xr-x 2 root root 0 Jun 21 13:33 net_cls
Cgroup how-to
1GB/2CPU subset of a 16GB/8CPU system
# numactl --hardware
# mount -t cgroup xxx /cgroups
# mkdir -p /cgroups/test
# cd /cgroups/test
# echo 1 > cpuset.mems
# echo 2-3 > cpuset.cpus
# echo 1G > memory.limit_in_bytes
# echo $$ > tasks
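The same steps can be wrapped in a function so the node, CPU list, and memory limit become parameters. A minimal sketch, assuming the cgroup hierarchy is already mounted and the caller is root; the function name setup_test_cgroup is hypothetical. On a plain directory it only creates ordinary files, which is enough to check the logic.

```shell
#!/bin/sh
# Create a cgroup directory and write the same four settings as the how-to.
# $1 = cgroup directory, $2 = memory node(s), $3 = CPU list,
# $4 = memory limit, $5 = PID to move into the group
setup_test_cgroup() {
    cg=$1
    mkdir -p "$cg" || return 1
    # cpuset.mems and cpuset.cpus are written before tasks, as in the how-to.
    echo "$2" > "$cg/cpuset.mems" &&
    echo "$3" > "$cg/cpuset.cpus" &&
    echo "$4" > "$cg/memory.limit_in_bytes" &&
    echo "$5" > "$cg/tasks"
}

# Example matching the how-to: setup_test_cgroup /cgroups/test 1 2-3 1G $$
```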
cgroups
[root@dhcp-100-19-50 ~]# forkmany 20MB 100procs &
[root@dhcp-100-19-50 ~]# top -d 5
top - 12:24:13 up 1:36, 4 users, load average: 22.70, 5.32, 1.79
Tasks: 315 total, ... 0 stopped, 0 zombie
Cpu0 :  0.0%us, ...  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1 :  0.0%us, ...  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3 : 89.6%us, 10.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.2%si,  0.0%st
Cpu4 :  0.4%us,  0.6%sy, ...  0.0%hi,  0.2%si,  0.0%st
Cpu5 :  0.4%us,  0.0%sy, ...  0.0%hi,  0.4%si,  0.0%st
Cpu6 :  0.0%us,  0.0%sy, ...  0.0%hi,  0.0%si,  0.0%st
Cpu7 :  0.0%us,  0.0%sy, ...  0.0%hi,  0.2%si,  0.0%st
Mem:
Swap:
# /common/lwoodman/code/memory 4
faulting took 1.616062s
touching took 0.364937s
# numastat
                 node0     node1
numa_hit       2700423    439550
numa_miss        23459   2134520
local_node     2700299    423934
other_node       23583   2150136
incorrect bindings
# echo 1 > cpuset.mems
# echo 0-3 > cpuset.cpus
# numastat
                 node0     node1
numa_hit       1623318    434106
numa_miss        23459   1082458
local_node     1623194    418490
other_node       23583   1098074
# /common/lwoodman/code/memory 4
faulting took 1.976627s
touching took 0.454322s
# numastat
                 node0     node1
numa_hit       1623341    434147
numa_miss        23459   2133738
local_node     1623217    418531
other_node       23583   2149354
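Comparing two numastat snapshots by eye is error-prone, so the change in a single counter can be computed instead. A minimal sketch, assuming each snapshot was saved to a file in the default numastat layout (counter name followed by one column per node); the function name numastat_delta and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Print (after - before) for one counter on one node.
# $1 = counter name (e.g. numa_miss), $2 = node number (0 for node0, 1 for node1),
# $3 = file holding the "before" numastat output, $4 = file holding the "after" output
numastat_delta() {
    awk -v key="$1" -v col="$(( $2 + 2 ))" '
        $1 == key {
            if (NR == FNR) before = $col      # first file: remember the old value
            else print $col - before          # second file: print the difference
        }
    ' "$3" "$4"
}

# Usage:
#   numastat > before.txt; ./memory 4; numastat > after.txt
#   numastat_delta numa_miss 1 before.txt after.txt
```

A large jump in numa_miss on one node during the run is the signature of memory being allocated away from the CPUs doing the work.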