INTRODUCTION
Let us consider a scenario. At 2:30 PM, your phone rings, and your support personnel at the Command
Center tell you that the users are experiencing slow performance on your database. At the same time, your
developers call you and ask why GlancePlus shows a 100% I/O bottleneck warning. Did the I/O
bottleneck cause the application slowdown? Or did the application slowdown cause I/O to be the
bottleneck? What can you do to help them? Maybe your real problem is related to logical I/O and not
physical I/O. If the culprit is physical I/O, how and what can you tune to improve your physical I/O
throughput and obtain optimal performance? This paper is written to answer those questions and share my
experience and knowledge on tuning physical I/O for better performance.
STRATEGIES FOR TUNING PHYSICAL I/O PERFORMANCE
The most frequently cited methodology for physical I/O tuning is SAME which stands for Stripe and Mirror
Everything (Loaiza, 2001). The SAME methodology has four rules. First, stripe all files across all disks using
a one-megabyte stripe size. Second, mirror the data against disk failure for high availability. Third, use outer
tracks of the disk for frequently accessed data. Fourth, use partitions rather than disks to subset data. The
SAME paradigm has provided useful guidelines in our tuning practice for both configuring and laying out
data on the disk subsystems. However, tuning I/O for optimal performance requires more than just SAME,
because striping indiscriminately can introduce I/O hot spots and create capacity growth problems in the
future (Adams, 2001a). Because the size of individual disks keeps increasing, the SAME methodology tends
to waste disk space. SAME can also complicate the configuration of application
failover in an Oracle Parallel Server environment. In addition, the SAME methodology is too simplistic for
database I/O tuning in that there are many factors influencing I/O service time from an application and
Oracle process point of view. In this section, I will discuss the strategies that every DBA needs to know when
tuning I/O. These strategies are:
• Understanding the hardware/OS environment
• Understanding the applications
• Creating optimal physical database layout
• Using appropriate database parameters
• Creating optimal database objects
• Collecting and maintaining I/O statistics
First, you need to know the maximum I/O size in your system. On HP-UX, the kernel parameter MAXPHYS
controls this size; it is set to 256 KB on HP-UX 11.0. If a disk driver sees an I/O request larger than this size,
the driver breaks the request into chunks of MAXPHYS size. In addition, if you are using cooked files (rather
than raw partitions), you need to know the chunk size of the file system buffer. This is determined when your
System Administrator creates the file systems, and it is often set to 4 KB or 8 KB. It is recommended that the
Oracle database block size match the chunk size. However, you should not set the database block size larger
than the chunk size, because a single I/O would then be broken into pieces, increasing the number of I/O
operations.
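As a quick sanity check on the Oracle side (the file system chunk size itself has to come from your System
Administrator), you can confirm the current database block size with a simple query against v$parameter:

select name, value
from v$parameter
where name = 'db_block_size';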
Second, you should know how many spindles and controllers you have, not just how much storage capacity
you need for your database. If you have a 130 GB database, but it resides on only two 73 GB disks with one
controller and it also has more than 200 users doing a lot of I/Os, then your database will have I/O
performance problems. To make matters worse, in the name of fully utilizing storage capacity, you may be
surprised to find that your database files share spindles and disks with files belonging to other databases.
Therefore, each DBA needs to communicate with his/her management, business users, and System
Administrators about performance requirements as well as capacity requirements. DBAs need to specify the
number, speed, and size of the disks in their project requirements for new business initiatives.
Third, you should avoid RAID-5 (or RAID-S if you are using EMC). With RAID-5, data protection is
enforced by Exclusive Or (XOR) Boolean operations. Parity information for each data segment across
physical volumes is written to a separate physical disk. Therefore, RAID-5 uses 25% of the storage capacity
as overhead rather than 50% as in RAID-1 to achieve data protection (three units of data, one unit of parity).
With the EMC RAID-S implementation, the Symmetrix also generally addresses a RAID group of four
volumes, likewise cutting the overhead of data protection to 25%. Additionally, RAID-S on the Symmetrix
allows the XOR calculation to take place at the disk level instead of in the directors or global cache, saving
director/cache cycles and greatly reducing the cost of each I/O. However, in either case, the write
performance penalty is still substantial because of the parity calculation. Therefore, in any environment that
requires high I/O performance, it is important to use RAID-1, which simply mirrors all the disks.
Fourth, you should use raw devices rather than file systems (cooked files) when write activities are high. I/O
to raw devices is much faster than I/O to file systems because it bypasses the Unix buffer cache. If you must
use file systems, for example for the archive log destination, you should choose a Journaled File System
(JFS) such as the Veritas File System (VxFS). VxFS is an extent-based file system, not a block-based file
system like a regular Unix file system. It also has a direct I/O option, which allows for large I/O operations
and multipass writes, providing more efficient access to the underlying disks. However, it
is still subject to inode contention. Using raw devices also allows you to enable asynchronous I/O on your
system.
Fifth, you should use asynchronous I/O and check its configuration. Enabling asynchronous I/O involves
creating the /dev/async character device and configuring the asynchronous driver in the Unix kernel. In
addition, the HP-UX parameter max_async_ports needs to be set to the maximum number of processes
allowed to use asynchronous I/O. If max_async_ports is reached, subsequent processes will use
synchronous I/O. You can use lsof|grep async to see whether your database writer process is using
asynchronous I/O.
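On the Oracle side, you can also verify that the instance is configured to request asynchronous I/O. A simple
sketch (parameter names can vary slightly by Oracle release; disk_asynch_io and dbwr_io_slaves are the
common ones):

select name, value
from v$parameter
where name in ('disk_asynch_io', 'dbwr_io_slaves');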
Sixth, you should use a Logical Volume Manager, such as HP LVM, to construct Logical Volumes across
several disks. Striping across disks improves I/O performance and allows load balancing across disks and
controllers. You can use vgdisplay, lvdisplay and pvdisplay to view volume group, logical volume, and
physical volume information, respectively.
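Once the logical volumes are in place, you can also check from inside the database whether the I/O load is
reasonably balanced across the data files (and therefore across the underlying volumes). A simple query for
this purpose, assuming your data files map cleanly onto separate logical volumes:

select d.name, f.phyrds, f.phywrts, f.readtim, f.writetim
from v$filestat f, v$datafile d
where f.file# = d.file#
order by f.phyrds + f.phywrts desc;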
Finally, you should know the unique configuration of the I/O subsystems that each vendor provides, such as
EMC. Ignorance of those unique configurations will cause you to miss good tuning opportunities. I will
briefly discuss the EMC configuration and its meta volume and CacheStorm™ features. For further reading,
please refer to the EMC web site, EMC training materials, and/or Pearce (2001).
GENERAL EMC SYMMETRIX ARCHITECTURE
Figure 1 shows the general EMC Symmetrix architecture. The front-end directors usually include Fibre
and/or SCSI Channel adapters. They connect hosts and move buffered I/Os to and from the Symmetrix
Cache. These directors each contain two microprocessors that provide high I/O throughput. All read and
write data goes through the Symmetrix cache before reaching the physical disks. Symmetrix cache uses Least
Recently Used (LRU) algorithms and Prefetch algorithms to determine data access patterns and maintain data
availability. Based on data access patterns, the prefetch algorithms move data blocks from disks to cache in
anticipation of reads, avoiding read misses. The back-end disk directors interface between the cache and
physical disk storage, moving all read data from disk to cache, performing prefetch operations, and destaging write data to
disk storage. Disk directors adjust track counts up to 12 tracks for sequential I/O. Each disk director also
contains two microprocessors to handle all I/O operations.
Figure 2 shows the structure of hyper volumes, the basic unit of EMC Symmetrix storage. Each hyper
volume is a logical split of a physical disk. For example, a 36GB disk can be divided into four 9GB hyper
volumes. The host views a hyper volume as a physical disk. The maximum size of a hyper volume is 16 GB.
An EMC physical disk can contain a maximum of 32 hyper volumes. An EMC frame can contain a
maximum of 4096 hyper volumes.
There are a number of advantages to using EMC Symmetrix. The first one is a cache configuration that allows
Symmetrix reads and writes to occur at memory speed rather than disk speed. Similar to the Oracle buffer
cache, the Least Recently Used (LRU) algorithm is used to ensure that only pages of data that have been used
recently are kept in cache. With EMC microcode level 5568, the maximum global cache size increases from
32 GB to 64 GB; this increase results in a higher cache hit ratio and a greater performance gain. In addition,
the Symmetrix Enginuity Quality of Service (QoS) operating environment allows customers to assign the
quantity of cache, along with its LRU, to different database applications/LUNs. The latest CacheStorm™
technology can partition the Global Cache Director into a maximum of 16 separately addressable cache
regions and allocate different amounts of cache to groups of disks. This technology can reduce the probability
of cache contention for any one region to 6%.
The second advantage is EMC PowerPath implementation. PowerPath is an EMC host-based software
offering that allows a maximum of 32 I/O paths from the host to Symmetrix channel directors, automatically
balancing I/O requests among these paths. Without PowerPath, a host accesses each disk resource via a
single I/O path between host and I/O adapters (e.g., SCSI). Significant imbalance among the paths results in
sporadic application slowdown. PowerPath automatically rebalances the I/O requests across all available
paths to the adapters. In addition, PowerPath enhances high availability in standalone or cluster
environments. PowerPath directs I/O to an alternative path at the time of path failure, thus preventing server
failures in a standalone environment and node failovers in a clustered environment.
The third advantage of EMC Symmetrix is its meta volume addressing (Rarich, 2001). Symmetrix allows the
concatenation of hyper volumes up to 4 terabytes. Figure 3 shows a meta volume of 32 GB with 4 hyper
volumes each of 8 GB size. With microcode 5265 or higher, Symmetrix allows striping across the hyper
volumes that comprise a meta volume. The minimum stripe size on Symmetrix meta volumes is 960 K. The
host regards each meta volume as a single disk. Thus, you can use LVM to create logical volume groups
spanning multiple meta volumes for better I/O distribution. For example, you can create one meta volume
of 32 GB with eight hyper volumes each of 4 GB size and then use LVM to create a 128 GB volume group
across 4 different meta volumes. As a result, any data file created in this volume group will have 32 spindles,
spread across the entire Symmetrix back end, to support its I/O operations. This will greatly reduce the chances
of I/O hot spots. Using meta volumes also automatically allows for better I/O queuing on Symmetrix hyper
volumes.
Obtaining all the information requires a collaborative effort from DBAs and developers, and sometimes
involves business analysts and data modelers. How to obtain all the information is beyond the scope of this
article. However, for successful tuning, DBAs should maintain a repository of the access path and access
pattern for each table. A form such as the following one makes proactive maintenance and tuning more
productive and efficient.
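A minimal sketch of such a repository, implemented as a simple table (the column names here are illustrative
only, not a prescribed format):

create table table_access_repository (
    owner           varchar2(30),
    table_name      varchar2(30),
    access_path     varchar2(40),  -- e.g. full table scan, index range scan
    main_index_used varchar2(30),
    read_pattern    varchar2(20),  -- sequential or scattered
    peak_period     varchar2(30),  -- e.g. month-end batch, 9-11 AM OLTP
    last_reviewed   date
);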
If you have not done any of the proactive steps, the most commonly used reactive method is to find the
queries that have high physical reads. The following query from Steve Adams' web site can give you some
hints on which queries are worth the tuning effort (assuming TIMED_STATISTICS = TRUE in all examples)
(Adams, 2001b).
select
    substr(to_char(s.pct, '99.00'), 2) || '%' load,
    s.executions executes,
    p.sql_text
from
    (
        select
            address,
            disk_reads,
            executions,
            pct,
            rank() over (order by disk_reads desc) ranking
        from
            (
                select
                    address,
                    disk_reads,
                    executions,
                    100 * ratio_to_report(disk_reads) over () pct
                from
                    sys.v$sql
                where
                    command_type != 47
            )
        where
            disk_reads > 50 * executions
    ) s,
    sys.v$sqltext p
where
    s.ranking <= 5 and
    p.address = s.address
order by
    1, s.address, p.piece;
Oracle also offers a powerful tool, STATSPACK, for SQL tuning. Since you can set the snapshot interval,
STATSPACK is especially useful for monitoring and identifying SQL statements with high physical disk
reads during the problematic period. To do this, you have to set the snapshot level to 5 or higher.
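For example, assuming the PERFSTAT schema is installed, a level-5 snapshot can be taken manually (or
scheduled through DBMS_JOB) as follows:

execute statspack.snap(i_snap_level => 5);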
However, for real-time troubleshooting, you have to use SQL tracing at a sufficiently high trace level. For
example, you can ask your user to issue the following command before his/her actual statement (Oracle, 2001).
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 12';
With level 12, the trace file will include both bind variable and wait event information. If you want to trace
other sessions, you can use the following methods.
1. Find sid, serial#, and use DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION (SID,
SERIAL#,TRUE). This will not provide wait and bind variable information though.
2. Find sid, serial#, and use DBMS_SUPPORT.START_TRACE_IN_SESSION (<SID>,
<SERIAL#>, waits=>TRUE, binds=>TRUE). You can also issue
DBMS_SUPPORT.STOP_TRACE_IN_SESSION(<SID>, <SERIAL#>) to stop tracing.
3. If you know the Unix process id, you can find the spid from v$process by using the following query.
SELECT P.SPID FROM V$PROCESS P, V$SESSION S
WHERE S.PROCESS = <Unix Process Id> AND P.ADDR = S.PADDR;
Then, use ORADEBUG to set trace for the session.
SVRMGRL> CONNECT INTERNAL
SVRMGRL> ORADEBUG SETOSPID <Process Id>
SVRMGRL> ORADEBUG EVENT 10046 TRACE NAME CONTEXT FOREVER,LEVEL 12
Example: The following listing shows a segment of trace information produced by a level-12 trace.
WAIT #1: nam='db file sequential read' ela= 4 p1=261 p2=233717 p3=1
WAIT #1: nam='db file sequential read' ela= 7 p1=261 p2=233715 p3=1
WAIT #1: nam='db file sequential read' ela= 1 p1=261 p2=233605 p3=1
WAIT #1: nam='db file sequential read' ela= 7 p1=261 p2=233676 p3=1
WAIT #1: nam='db file sequential read' ela= 4 p1=261 p2=1272 p3=1
WAIT #1: nam='db file sequential read' ela= 1 p1=261 p2=233666 p3=1
WAIT #1: nam='db file sequential read' ela= 8 p1=261 p2=1200 p3=1
WAIT #1: nam='db file sequential read' ela= 3 p1=261 p2=1264 p3=1
WAIT #1: nam='db file sequential read' ela= 0 p1=261 p2=233734 p3=1
WAIT #1: nam='db file sequential read' ela= 0 p1=261 p2=1292 p3=1
WAIT #1: nam='db file sequential read' ela= 3 p1=261 p2=233650 p3=1
…
The trace file will reveal what the process has been waiting on. For I/O performance tuning, you look for the
wait events 'db file scattered read' and 'db file sequential read'. Check how long, for
which files, and for which blocks the process has been waiting. If the I/O-related waits are too long, you
may have an I/O hot spot on those files and need to redistribute the I/O load or rebuild the tables or
indexes. You may also consider rewriting the SQL statement to change its access path and see whether
you can avoid the I/O contention.
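For these two wait events, p1 is the file number and p2 is the block number. As a sketch (using the file and
block numbers from the first wait line above), a query like the following maps a waited-on block to its
owning segment:

select owner, segment_name, segment_type
from dba_extents
where file_id = 261
and 233717 between block_id and block_id + blocks - 1;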
9. Place redo log members on equally fast disks, because the faster members always have to wait for the
slower ones. Try to use the outer section of the disks (e.g., the first hyper volume of each disk in an EMC
Symmetrix) for the redo logs.
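To review where the members of each redo log group currently reside (a quick check before any relocation):

select group#, member
from v$logfile
order by group#;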
where v$datafile.file#=tab.file#
and v$datafile.ts#=tab.ts#
and v$datafile.file#=a.file#
and ((readtim)*10/(phyrds)) > 40
and (phyrds) > 0
and object_id=tab.obj#
union
select owner, object_name, object_type, v$datafile.name name,
((readtim)*10/phyrds) read_time
from v$datafile, tab$ tab,dba_objects, v$filestat a
where v$datafile.file#=tab.file#
and v$datafile.ts#=tab.ts#
and object_id=tab.obj#
and v$datafile.file#=a.file#
and ((readtim)*10/(phyrds)) > 100
and (phyrds) > 0;
Alternatively, if you know which logical volume/data file has I/O contention from the data collected
by Glance or sar, then you can use the following SQL to identify what objects are in that data file.
set linesize 132
col owner format a9
col segment_name format a35
col name format a40
break on name nodup skip 2
set pages 1000
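-- One possible form of such a query (substitute the data file name of interest):
select v.name, e.owner, e.segment_name, sum(e.bytes) total_bytes
from dba_extents e, v$datafile v
where e.file_id = v.file#
and v.name = '<data_file_name>'
group by v.name, e.owner, e.segment_name
order by v.name, total_bytes desc;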
Likewise, if you know the tables or indexes accessed by a certain application, you can use the following SQL
to identify which data files they are using. Combined with PerfView, Glance, and sar data, you can then
determine how the application can be tuned better by relieving I/O contention.
select v.name,e.owner, e.segment_name, e.bytes, e.tablespace_name
from dba_extents e, v$datafile v
where e.file_id = v.file#
and e.segment_name in ('<segment_name>')
and e.owner ='<owner_name>';
A drawback of v$filestat is that its information is cumulative from instance startup and is difficult to use
for more fine-grained analyses over a period of time, such as hourly analysis, for mission-critical databases.
Oracle UTLBSTAT/UTLESTAT provides an opportunity to conduct such time-based analyses of overall
database performance, including I/O analyses. You can set up a cron job to run BSTAT/ESTAT
periodically to collect statistics at the tablespace and data file levels. In my tuning practice, I changed the SQL in
ESTAT to the following, so that the output is more readable.
select file_name, table_space,
phys_blks_rd blks_read, phys_rd_time read_time,
phys_blks_wr blks_wrt, phys_wrt_tim write_time,
((phys_rd_time+phys_wrt_tim)*10)/
decode((phys_blks_rd+phys_blks_wr),0,0.0001,
(phys_blks_rd+phys_blks_wr)) "Access Time"
from stats$files
order by 7;
ACKNOWLEDGEMENT
I would like to express my appreciation to the following individuals for giving me support and help. Without
their efforts, I would not have been able to complete this paper.
1. Michael Erwin, Practice Manager of Oracle, who has shared with me his knowledge on how to tackle
I/O problems as well as how to tune overall system performance.
2. Bob Ritko, Jim Rayhorn, Ray Spillman, Mark Bortle, and Prasad Sangle, our Unix Administrators, who
have supported me in my tuning efforts and have given me tutoring on hardware and OS concepts.
3. David Dalton, Senior System Engineer of EMC, who answered my questions regarding EMC Symmetrix,
Symmetrix Manager, and EMC DB Tuner.
4. Ching-Yin Fang, Technical Consultant of HP, who helped me to understand HP tools and supported me
in solving I/O-related issues.
5. Prasad Kaggallu and Raju Kotini, my managers, for their managerial support.
6. Steve Adams, Mark Bonanno, Mike Craig, Scott Myers, Stan Nickel, Rich Niemiec, and Bob Ritko for
their technical reviews and comments on the earlier versions of this paper.
7. Stephanie Caswell Schuckers, George Trapp, Mike Henry, and John Atkins, my graduate advisors at the
Lane Department of CSEE, West Virginia University, who taught me fairness, kindness, and hard work, as well
as knowledge. This paper is dedicated to you.
REFERENCES
Adams, S. (2001a). The Seven Deadly Sins Just Got Worse. Oracle OpenWorld Proceedings.
Adams, S. (2001b). www.ixora.com.au
Alomari, A. (1999). Oracle8 & Unix Performance Tuning. Prentice Hall.
Aronoff, E., Loney, K., & Sonawalla, N. (1997). Advanced Oracle Tuning and Administration. Osborne.
Himatsingka, B., & Loaiza, J. (1998). How to Stop Defragmenting and Start Living: The Definitive Word on
Fragmentation. Oracle Corporation White Paper.
Loaiza, J. (2001). Optimal Storage Configuration Made Easy. technet.oracle.com.
Loney, K. (1998). Oracle8 DBA Handbook. Osborne.
Millsap, C. V. (1995). The OFA Standard – Oracle for Open Systems. Oracle Corporation White Paper.
Niemiec, R. J. (1999). Oracle Performance Tuning: Tips & Techniques. Osborne.
Oracle Corporation. (2001). SQL*Trace – Notes for Application Support Analysis. Metalink Note 77343.1.
Pearce, B. (2001). Opening the Black Box: A DBA’s View of the EMC Symmetrix. IOUG Live-2001
Conference Proceedings.
Rarich, T. (2001). Meta Volumes and Striping. EMC Engineering White Paper.
Vengurlekar, N. (1998). Database Writer and Buffer Management. Oracle Corporation White Paper.