ABSTRACT
This presentation updates AIX/VMM (Virtual Memory Management) and LVM/JFS2
storage IO performance concepts and tactics for the day-to-day Power/AIX system
administrator. It explains the meaning of the numbers offered by AIX commands
(vmstat, iostat, mpstat, sar, etc.) to monitor and analyze the AIX VMM and storage
IO performance and capacity of a given Power7/AIX LPAR.
These tactics are further illustrated in Part II: Updated Real-world Case Histories -- How to Monitor and Analyze the VMM and Storage I/O Statistics of a Power/AIX LPAR.
Part II: Updated Real-world Case Histories -- How to Monitor and Analyze
the VMM and Storage I/O Statistics of a Power/AIX LPAR
ABSTRACT
These updated case-histories further illustrate the content presented in Part I:
Updated Concepts and Tactics -- How to Monitor and Analyze the VMM and Storage
I/O Statistics of a Power/AIX LPAR.
This presentation includes suggested ranges and ratios of AIX statistics to guide VMM
and storage IO performance and capacity analysis.
Each case is founded on a different real-world customer configuration and workload
that manifests characteristically in the AIX performance statistics -- as performing:
intensely in bursts, with hangs and releases, AIX:lrud constrained, AIX-buffer
constrained, freely unconstrained, inode-lock contended, consistently light,
atomic & synchronous, virtually nil IO workload, long avg-waits, perfectly ideal, long
avg-servs, mostly rawIO, etc.
[vmstat samples, 02:00:03 through 02:00:47 at 2-second intervals; the data columns for this chart were not recoverable]
Poor ratio of pages freed to pages examined (fr:sr ratio) in vmstat -s output
$ uptime ; vmstat -s
02:17PM
Given sustained fr:sr ratios: 1:1.1/blue 1:3/green 1:5/warning 1:10/red
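As a minimal sketch of how these thresholds can be checked, the cumulative fr:sr ratio can be derived from the "pages examined" and "pages freed" counters of vmstat -s output. The two counter lines below are illustrative stand-ins, not real LPAR data, and the exact label text varies by AIX level:

```shell
# Sketch: derive the cumulative fr:sr ratio from vmstat -s style counters.
# The two sample lines are illustrative stand-ins for real vmstat -s output.
ratio=$(printf '%s\n' \
    '  40000000 pages examined by the clock' \
    '   4000000 pages freed by the clock' |
  awk '/pages examined/ {sr=$1} /pages freed/ {fr=$1}
       END {printf "1:%.1f", sr/fr}')
echo "$ratio"    # a sustained 1:10 sits at the red threshold
```

In practice the two counters would be sampled twice and differenced, since vmstat -s reports totals since boot.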
kthr      memory            page                                       faults              cpu          time
----- -------------- ----------------------------------------- ------------------- ------------- --------
 r  b  p     avm   fre    fi   fo  pi  po    fr     sr    in     sy    cs  us sy id wa  hr mi se
 1  1  0 3986577  2652  1944  797   0   0  1536  12803   880   2377  4459  10  4 55 31  14:17:58
 2  2  0 3986576  2553  1863  757   0   0  2557  37067   852   4053  4446  11  4 55 30  14:18:00
 2  1  0 3986574  2206  1959  799   0   0  2559  37499  1009   2523  4559  10  6 53 31  14:18:02
 0  3  0 3986573  2597  2044  843   0   0  3069  42804   912   2377  4553  11  4 55 30  14:18:04
 1  2  0 3986571  2511  1870  754   0   0  2559 167438   804   2203  4247  10  4 56 30  14:18:06
 0  2  0 3986571  2197  1944  787   0   0  2560 102054   814   2310  4063  10  4 56 30  14:18:08
 0  2  0 3986570  2872  1960  792   0   0  3070  42557   889   4148  4532  11  4 54 30  14:18:10
 1  2  0 3986569  3752  1876  764   0   0  3070  65622   933   2363  4834  10  5 53 32  14:18:12
 1  2  0 3986568  3864  1787  730   0   0  2559  49907   880   2135  4617   9  4 53 33  14:18:14
 1  1  0 3986567  2634  1915  767   0   0  2047  30676   785   2774  3948  10  4 55 31  14:18:16
 0  3  0 3986567  2523  1890  759   0   0  2552  27693   877   2646  4443  10  4 55 32  14:18:18
 1  2  0 3986573  2040  2008  810   0   0  2557  23419   928   5155  4671  12  4 54 30  14:18:20
 1  2  0 3986572  1962  1878  761   0   0  2554  52663   905   2525  4795  10  4 56 29  14:18:22
 2  2  0 3986587  2652  1960  798   3   0  3071  14081  1030  11377  7789  13  9 51 27  14:18:24
 2  2  0 3986570  2363  1938  781   0   0  2558  30570   836   3004  5732  10  5 56 29  14:18:26
 2  1  0 3986734  2056  1884  762   1   0  2557  32017   888  31414  6058  15 11 47 26  14:18:28
 2  0  0 3986617  1933  1920  779   2   0  2558  15377   933  22108  5545  15  9 48 28  14:18:30
 1  0  0 3986612  2463  2008  826   0   0  3069  25129  1192   2823  5935  11  9 52 28  14:18:32
 1  2  0 3986586  3073  1988  810   0   0  3064  15116   816   2732  4430  10  4 56 30  14:18:34
 0  1  0 3986587  3402  1719  685   0   0  2555  24262   799   3395  4429   9  4 58 29  14:18:36
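As a sketch, each interval of this vmstat output can be banded against the suggested fr:sr thresholds (1:3 green, 1:5 warning, 1:10 red). The column layout assumed here is r b p avm fre fi fo pi po fr sr in sy cs us sy id wa time, so fr is field 10 and sr is field 11; the two sample rows are taken from the table above:

```shell
# Sketch: band each vmstat -Iwt interval by its fr:sr ratio (fr=$10, sr=$11).
flags=$(printf '%s\n' \
  '1 1 0 3986577 2652 1944 797 0 0 1536 12803 880 2377 4459 10 4 55 31 14:17:58' \
  '1 2 0 3986571 2511 1870 754 0 0 2559 167438 804 2203 4247 10 4 56 30 14:18:06' |
  awk '{ r = $11 / $10
         band = (r >= 10) ? "red" : (r >= 5) ? "warning" : (r >= 3) ? "green" : "blue"
         printf "%s fr:sr=1:%.1f %s\n", $NF, r, band }')
echo "$flags"
```

Run against a captured vmstat -Iwt textfile, a filter like this makes sustained red intervals (such as the 1:65 spike at 14:18:06) stand out immediately.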
12
These case-histories are founded on a mundane AIX command script (see appendix).
Each sanitized textfile illustrates a common indicated performance issue, if not several.
Remedies will be offered, but not illustrated; most remedies are surprisingly simple.
Except for inexplicably poor performance, there are typically no other apparent issues.
In other words, Recognition is notably more problematic than Resolution.
Common causes include:
- A simple lack of CPU and/or gbRAM for the given workload, i.e. poor Tuning-by-Hardware
- Improperly implemented tactics, i.e. AIX VMM parameter values that are far out-of-whack
- Simply continuing to use old technologies when better technologies are free and available
- Implementing tactics without understanding their purpose, appropriateness or compromise
13
Of course, this tactic assumes all other dependencies are properly/sufficiently tuned.
14
From: http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.cmds%2Fdoc%2Faixcmds3%2Fmount.htm
15
Question:
Is there a way to scan|steal|free Sequential IO, and only buffer-cache Random IO (for JFS2
rehits) without suffering the kernel processing overhead of lrud fr:sr ?
16
Answer:
Yes. Use mount -o rbr to mount JFS2 RDBMS data filesystems.
rbr replaces lrud fr:sr by immediately freeing only the memory used to convey
Sequential Reads to the RDBMS (thus rbr for release-behind-read).
Unlike lrud, rbr is selective: It does no scanning of the buffer cache !!!
rbr only works when sequential reading of a file in this file system is detected.
Thereafter, only the real memory pages used by the file will be released once the
pages are copied to internal buffers. These internal buffers can be the RDBMS itself.
Result:
Sequential Reads of a mount -o rbr JFS2 filesystem are not buffer-cached.
This also means the Random Reads are buffer-cached for read-rehits.
The kernel processing overhead of lrud fr:sr is substantially reduced.
Overall SAN IO performance/throughput is noticeably improved with The Tractor.
Copyright IBM Corporation 2012
17
18
The first priority should be to preclude any pagingspace-pageouts. Thus, a write-expedient pagingspace is only needed
if you have any unavoidable pagingspace-pageout activity. Ultimately, if we must suffer any pagingspace-pageouts,
we want them to write-out to the pagingspace as quickly as possible (thus my term: write-expedient).
So, for the sake of prudence, we should always create a write-expedient pagingspace. The listed traits below are
optimal for write-expediency; include as many as you can (but always apply the key tuning tactic below):
Create the pagingspace_vg using FC-SAN storage LUNs (ideally RAID5 LUNs on SSD, FC or SAS technology disk drives;
not on SATA disk drives, which are slower and employ RAID6, nor on any local/internal SAS disks)
The total size of the pagingspace in pagingspace_vg should match the size of installed LPAR gbRAM
Assign 3-to-8 LUN/hdisks to pagingspace_vg and size each LUN to be an even fraction of installed gbRAM. For instance, if
the LPAR has 18gbRAM, then assign three 6gb LUN/hdisks to pagingspace_vg
Configure one AIX:LVM:VG:lv (logical volume) for each LUN/hdisk in pagingspace_vg; do not deploy PP-striping
(because it messes-up discrete hdisk IO monitoring) - just map one hdisk to one lv
The key tuning tactic: With root-user privileges, use AIX:lvmo to set pagingspace_vg:pv_pbuf_count=2048.
This will ensure pagingspace_vg:total_vg_pbufs will equal [<VGLUNcount> * pv_pbuf_count].
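The sizing arithmetic behind this tactic can be sketched as below, using the example from the text (an 18gbRAM LPAR with three LUN/hdisks and pv_pbuf_count=2048). The variable names are illustrative shell variables, not AIX tunables, except for pv_pbuf_count:

```shell
# Sketch of the write-expedient pagingspace sizing arithmetic.
lpar_gb_ram=18
lun_count=3
pv_pbuf_count=2048

lun_gb=$((lpar_gb_ram / lun_count))            # each LUN: an even fraction of installed RAM
total_vg_pbufs=$((lun_count * pv_pbuf_count))  # what lvmo should then report for the VG

echo "each LUN: ${lun_gb}gb; total_vg_pbufs: ${total_vg_pbufs}"
```

On the LPAR itself, the resulting total_vg_pbufs would be verified with lvmo -a -v pagingspace_vg after setting pv_pbuf_count.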
19
kthr
Number of kernel threads in various queues averaged per second over the sampling
interval. The kthr columns are as follows:
r
Average number of kernel threads that are runnable, which includes threads that
are running and threads that are waiting for the CPU. If this number is greater
than the number of CPUs, then there is at least one thread waiting for a CPU
and the more threads there are waiting for CPUs, the greater the likelihood of a
performance impact.
b
Average number of kernel threads in the VMM wait queue per second. This
includes threads that are waiting on filesystem I/O or threads that are blocked
on a shared resource, e.g. an inode-lock.
p
For vmstat -I, the number of threads waiting on I/Os to raw devices per second.
Threads waiting on I/Os to filesystems are not included here.
20
memory
Provides information about the real and virtual memory.
avm
The Active Virtual Memory, avm, column represents the number of active virtual
memory pages present at the time the vmstat sample was collected. It is the
sum-total of all computational memory including content paged-out to the
pagingspace. The avm statistics do not include file pages.
fre
The fre column shows the average number of free memory pages. A page is a 4
KB area of real memory. The system maintains a buffer of memory pages, called
the free list, that will be readily accessible when the VMM needs space. The
minimum number of pages that the VMM keeps on the free list is determined by
the minfree parameter of the vmo command.
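A quick sketch of the fre-versus-minfree comparison described above, using a low fre sample from the table (14:18:30) and an assumed per-mempool minfree of 960; on a real LPAR both values would come from vmstat and vmo -L, and LPARs with multiple memory pools make the comparison less direct:

```shell
# Sketch: compare a sampled fre value against an assumed minfree setting.
fre=1933        # low fre sample from the vmstat table (14:18:30)
minfree=960     # assumed vmo minfree value (per memory pool)
if [ "$fre" -le "$minfree" ]; then state="at/below minfree"; else state="above minfree"; fi
echo "fre=$fre minfree=$minfree -> $state"
```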
21
22
Page (continued)
Information about page faults and paging activity. These are averaged over the interval and given in units per
second.
pi
The pi column details the number of pages paged in from paging space. Paging space is the part of virtual
memory that resides on disk. It is used as an overflow when memory is over committed. Paging space
consists of logical volumes dedicated to the storage of working set pages that have been stolen from
real memory. When a stolen page is referenced by the process, a page fault occurs, and the page must
be read into memory from paging space.
Due to the variety of configurations of hardware, software and applications, there is no absolute number to
look out for. This field is important as a key indicator of paging-space activity. If a page-in occurs, there
must have been a previous page-out for that page. It is also likely in a memory-constrained environment
that each page-in will force a different page to be stolen and, therefore, paged out.
po
The po column shows the number (rate) of pages paged out to paging space. Whenever a page of working
storage is stolen, it is written to paging space, if it does not yet reside in paging space or if it was
modified. If not referenced again, it will remain on the paging device until the process terminates or
disclaims the space. Subsequent references to addresses contained within the faulted-out pages results
in page faults, and the pages are paged in individually by the system. When a process terminates
normally, any paging space allocated to that process is freed. If the system is reading in a significant
number of persistent pages, you might see an increase in po without corresponding increases in pi. This
does not necessarily indicate thrashing, but may warrant investigation into data-access patterns of the
applications.
23
page (continued)
Information about page faults and paging activity. These are averaged over the interval and
given in units per second.
fr
Number of pages that were freed per second by the page-replacement algorithm during the
interval. As the VMM page-replacement routine scans the Page Frame Table, or PFT, it
uses criteria to select which pages are to be stolen to replenish the free list of available
memory frames. The criteria include both kinds of pages, working (computational) and file
(persistent) pages. Just because a page has been freed, it does not mean that any I/O has
taken place. For example, if a persistent storage (file) page has not been modified, it will
not be written back to the disk. If I/O is not necessary, minimal system resources are
required to free a page.
sr
Number of pages that were examined per second by the page-replacement algorithm during
the interval. The page-replacement algorithm might have to scan many page frames
before it can steal enough to satisfy the page-replacement thresholds. The higher the sr
value compared to the fr value, the harder it is for the page-replacement algorithm to find
eligible pages to steal.
24
faults
Information about process control, such as trap and interrupt rate. The faults columns are as follows:
in
Number of device interrupts per second observed in the interval.
sy
The number of system calls per second observed in the interval. Resources are available to user
processes through well-defined system calls. These calls instruct the kernel to perform
operations for the calling process and exchange data between the kernel and the process.
Because workloads and applications vary widely, and different calls perform different functions, it
is impossible to define how many system calls per second are too many. But typically, when the
sy column rises above 10000 calls per second on a uniprocessor, further investigation is called
for (on an SMP system the number is 10000 calls per second per processor). One reason could
be "polling" subroutines like the select() subroutine. For this column, it is advisable to have a
baseline measurement that gives a count for a normal sy value.
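The per-processor rule of thumb above can be sketched as a simple check. The sy_rate value is the 14:18:28 spike from the sample vmstat table; ncpu=2 is an assumed processor count (on AIX it would come from e.g. lparstat or mpstat output):

```shell
# Sketch: apply the ~10000 syscalls/sec-per-processor rule of thumb.
sy_rate=31414   # sy spike at 14:18:28 in the sample vmstat output
ncpu=2          # assumed processor count for this sketch
threshold=$((10000 * ncpu))
if [ "$sy_rate" -gt "$threshold" ]; then verdict="investigate"; else verdict="ok"; fi
echo "sy=$sy_rate threshold=$threshold -> $verdict"
```

As the text notes, the verdict is only meaningful relative to a baseline for the workload.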
cs
Number of context switches per second observed in the interval. The physical CPU resource is
subdivided into logical time slices of 10 milliseconds each. Assuming a thread is scheduled for
execution, it will run until its time slice expires, until it is preempted, or until it voluntarily gives up
control of the CPU. When another thread is given control of the CPU, the context or working
environment of the previous thread must be saved and the context of the current thread must be
loaded. The operating system has a very efficient context switching procedure, so each switch is
inexpensive in terms of resources. Any significant increase in context switches, such as when cs
is a lot higher than the disk I/O and network packet rate, should be cause for further
investigation.
25
cpu
Percentage breakdown of CPU time usage during the interval. The cpu columns are as follows:
us
The us column shows the percent of CPU time spent in user mode. A UNIX process can execute in either user
mode or system (kernel) mode. When in user mode, a process executes within its application code and does
not require kernel resources to perform computations, manage memory, or set variables.
sy
The sy column details the percentage of time the CPU was executing a process in system mode. This includes
CPU resource consumed by kernel processes (kprocs) and others that need access to kernel resources. If a
process needs kernel resources, it must execute a system call and is thereby switched to system mode to
make that resource available. For example, reading or writing of a file requires kernel resources to open the file,
seek a specific location, and read or write data, unless memory mapped files are used.
id
The id column shows the percentage of time which the CPU is idle, or waiting, without pending local disk I/O. If
there are no threads available for execution (the run queue is empty), the system dispatches a thread called
wait, which is also known as the idle kproc. On an SMP system, one wait thread per processor can be
dispatched. The report generated by the ps command (with the -k or -g 0 option) identifies this as kproc or wait.
If the ps report shows a high aggregate time for this thread, it means there were significant periods of time
when no other thread was ready to run or waiting to be executed on the CPU. The system was therefore mostly
idle and waiting for new tasks.
wa
The wa column details the percentage of time the CPU was idle with pending local disk I/O and NFS-mounted
disks. If there is at least one outstanding I/O to a disk when wait is running, the time is classified as waiting for
I/O. Unless asynchronous I/O is being used by the process, an I/O request to disk causes the calling process to
block (or sleep) until the request has been completed. Once an I/O request for a process completes, it is placed
on the run queue. If the I/Os were completing faster, more CPU time could be used.
A wa value over 25 percent could indicate that the disk subsystem might not be balanced properly, or it might be
the result of a disk-intensive workload.
26
By default, file pages can be cached in real memory for file systems. The caching can be disabled using the Direct
I/O or Concurrent I/O mount options; also, the Release-Behind mount options can be used to quickly
discard file pages from memory after they have been copied to the application's I/O buffers, if the read-ahead and write-behind benefits of cached file systems are needed.
JFS2 default mount -- AIX uses file caching as the default method of file access. However, file caching
consumes more CPU and significant system memory because of data duplication. The file buffer cache
can improve I/O performance for workloads with a high cache-hit ratio. And file system readahead can
help database applications that do a lot of table scans for tables that are much larger than the database
buffer cache.
Raw I/O -- Database applications traditionally use raw logical volumes instead of the file system for
performance reasons. Writes to a raw device bypass the caching, logging, and inode locks that are
associated with the file system; data gets transferred directly from the application buffer cache to the disk.
If an application is update-intensive with small I/O requests, then a raw device setup for database data and
logging can help performance and reduce the usage of memory resources.
27
Exercise & experiment with the JFS2 Direct I/O and Concurrent I/O mount options
Direct I/O -- DIO is similar to rawIO except it is supported under a file system. DIO bypasses the file
system buffer cache, which reduces CPU overhead and makes more memory available to others (that is, to
the database instance). DIO has a similar performance benefit as rawIO but is easier to maintain for the
purposes of system administration. DIO is provided for applications that need to bypass the buffering of
memory within the file system cache. For instance, some technical workloads never reuse data because of
the sequential nature of their data access. This lack of data reuse results in a poor buffer cache hit rate,
which means that these workloads are good candidates for DIO.
Concurrent I/O -- CIO supports concurrent file access to files. In addition to bypassing the file cache, it
also bypasses the inode lock that allows multiple threads to perform reads and writes simultaneously on a
shared file. CIO is designed for relational database applications, most of which will operate under CIO
without any modification. Applications that do not enforce serialization for access to shared files should not
use CIO. Applications that issue a large amount of reads usually will not benefit from CIO either.
28
Release-behind-read and release-behind-write allow the file system to release the file pages from file system
buffer cache as soon as an application has read or written the file pages. This feature helps the
performance when an application performs a great deal of sequential reads or writes. Most often, these file
pages will not be reassessed after they are accessed.
Without this option, the memory will still be occupied with no benefit of reuse, which causes paging eventually
after a long run. When writing a large file without using release-behind, writes will go very fast as long as
pages are available on the free list. When the number of pages drops to minfree, VMM uses its LRU
algorithm to find candidate pages for eviction.
This feature can be configured on a file system basis. When using the mount command, enable release-behind by specifying one of the three flags below:
The release-behind sequential read flag (rbr)
The release-behind sequential write flag (rbw)
The release-behind sequential read and write flag (rbrw)
A trade-off of using the release-behind mechanism is that the application can experience an increase in CPU
utilization for the same read or write throughput rate (as compared to not using release-behind). This is
because of the work required to free the pages, which is normally handled at a later time by the LRU
daemon. Also note that all sequential IO file page accesses result in disk I/O because sequential IO file
data is not cached by VMM. However, applications (especially long-running applications) with the release-behind mechanism applied are still likely to perform more optimally and with greater stability.
29
(page 1)
#!/bin/ksh -x
#
#================================================================================
date
uname -a
id
oslevel -s
lparstat -i
uptime
vmstat -s
vmstat -v
vmstat -Iwt 1 80
ps -ekf | grep -v egrep | egrep "syncd|lrud|nfsd|biod|wait|getty|xmwlm"
30
(page 2)
ipcs -bm
lsps -a
lsps -s
lssrad -av
mount
df -k
cat /etc/filesystems
cat /etc/xtab
showmount
prtconf
ps -el | wc
ps -elmo THREAD | wc
ps -kl | wc
ps -klmo THREAD | wc
nfsstat
##### BEGIN rootuser-privileges section
vmo -L      # requires root-user to execute; makes no changes
ioo -L      # requires root-user to execute; makes no changes
no -L       # requires root-user to execute; makes no changes
nfso -L     # requires root-user to execute; makes no changes
schedo -L   # requires root-user to execute; makes no changes
raso -L     # requires root-user to execute; makes no changes
lvmo -L
for VG in `lsvg`
do
lvmo -a -v $VG
echo
done
31
(page 3)
uptime
sar -a 2 40   # requires root-user to execute; makes no changes
sar -b 2 40   # requires root-user to execute; makes no changes
sar -c 2 40   # requires root-user to execute; makes no changes
sar -k 2 40   # requires root-user to execute; makes no changes
sar -d 2 40   # requires root-user to execute; makes no changes
##### END rootuser-privileges section
aioo -a
lsdev
lscfg
lsconf
uptime
vmstat -Iwt 1 80
uptime
mpstat -w 2 40
uptime
mpstat -dw 2 40
uptime
mpstat -i 2 40
mpstat -w 2 40
vmstat -f
vmstat -i
nfso -a
lspv
for VG in `lsvg`
do
lsvg $VG ; echo
lsvg -p $VG ; echo ; echo ; echo
done
echo "\n\n============== ps -ef ==============================================================="
ps -ef
echo "\n\n============== ps -kf ==============================================================="
ps -kf
32
(page 4)
0.0  0  0.0"  0" | grep -v "tm_act"
uptime
vmstat -s ; vmstat -v
vmstat -Iwt 1 80 ; date ; id ; uname -a
33
Session Evaluations: ibmtechu.com/vp
Prizes will be drawn from Evals
34
35
IBM Power Systems Technical University 2011
October 10-14 | Fontainebleau Miami Beach | Miami, FL
Thank you
Trademarks
The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.
Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark nor does it mean that the product is not
actively marketed or is not significant within its relevant market.
Those trademarks followed by ® are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.
37
4-Apr-13
Disclaimers
No part of this document may be reproduced or transmitted in any form without written permission from IBM
Corporation.
Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change
without notice. This information could include technical inaccuracies or typographical errors. IBM may make
improvements and/or changes in the product(s) and/or program(s) at any time without notice. Any statements
regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals
and objectives only.
The performance data contained herein was obtained in a controlled, isolated environment. Actual results that may be
obtained in other operating environments may vary significantly. While IBM has reviewed each item for accuracy in a
specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customer
experiences described herein are based upon information and opinions provided by the customer. The same results
may not be obtained by every user.
Reference in this document to IBM products, programs, or services does not imply that IBM intends to make such
products, programs or services available in all countries in which IBM operates or does business. Any reference to
an IBM Program Product in this document is not intended to state or imply that only that program product may be
used. Any functionally equivalent program, that does not infringe IBM's intellectual property rights, may be used
instead. It is the user's responsibility to evaluate and verify the operation on any non-IBM product, program or
service.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR INFRINGEMENT. IBM shall have no responsibility to update this information. IBM
products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement,
Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided. IBM
is not responsible for the performance or interoperability of any non-IBM products discussed herein.
38
Disclaimers (Continued)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those
products.
The providing of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
USA
IBM customers are responsible for ensuring their own compliance with legal requirements. It is the customer's sole
responsibility to obtain advice of competent legal counsel as to the identification and interpretation of any relevant
laws and regulatory requirements that may affect the customer's business and any actions the customer may need
to take to comply with such laws.
IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is
in compliance with any law.
The information contained in this documentation is provided for informational purposes only. While efforts were made
to verify the completeness and accuracy of the information provided, it is provided as is without warranty of any
kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related
to, this documentation or any other documentation. Nothing contained in this documentation is intended to, nor
shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors), or altering
the terms and conditions of the applicable license agreement governing the use of IBM software.
39