
Case Study: Using Real-Time Diagnostic Tools to Diagnose Intermittent

Database Hangs

Author: Carl Davis, Consulting Technical Advisor – Center of Expertise (COE), Oracle USA

Skill Level Rating for this Case Study: Intermediate

About Oracle Case Studies


Oracle Case Studies are intended as learning tools and as a means of sharing information
or knowledge related to a complex event, process, procedure, or series of related events.
Each case study is written based upon the experience of the writer or writers.

Each Case Study contains a skill level rating. The rating indicates the level of experience
the reader should have with the subject matter covered in the case study.
Ratings are:

• Expert: significant experience with the subject matter


• Intermediate: some experience with the subject matter
• Beginner: little experience with the subject matter

Case Study Abstract


The purpose of this case study is to show how to deploy diagnostic tools from the Center
of Expertise (COE), such as LTOM and show_sessions, to diagnose complex performance
problems in real time. This case study focuses primarily on the Pre-Analysis phase, which
is where the majority of the work was done. Using real-time diagnostic tools to extract the
necessary trace information is the most difficult part of this case study; once the
diagnostic trace files were collected, analyzing them was quite simple.

Intermittent performance problems happen without warning, last for a short duration, and
are extremely difficult to diagnose. Traditional means of diagnosing these kinds of
problems usually involve iterative attempts to capture the necessary data, resulting in
very long engagements between the customer and support. Frequently these problems never
truly get resolved, as customers choose to upgrade in the hope that the problem will go
away. This case study deals with one of the most difficult performance problems to
diagnose: an intermittent database hang.
Performance problems can be divided into two categories: hangs and slowdowns. A
true database hang, sometimes called a database freeze, is the most severe type of
performance problem. In this case, existing database connections become non-responsive
and any new connections to the database are impossible. Code execution has either halted,
become stuck in a tight loop, or is proceeding at an extremely slow rate, causing the user
to perceive the hang as indefinite. A true database hang also prevents customers or
support analysts from obtaining diagnostic data, as database connectivity is not possible.
Fortunately, these types of hangs are rare.

Far more common is the database slowdown. A database slowdown differs from a true
database hang in that database connections are still possible, especially when connecting
as the SYS user. Database activity proceeds slowly, even to the point where the user may
consider the database completely hung, but the execution of code is still proceeding.
Slowdowns, by definition, do not severely limit the ability of the customer or support
analyst to obtain at least some diagnostic data, as database connectivity is still possible.

A further distinction can be made between a true hang and an intermittent hang. A true
hang remains in the frozen state indefinitely; an intermittent hang eventually frees
itself. Diagnosing either kind of hang requires tools outside of Oracle in order to
collect diagnostic traces. Operating system debuggers such as GDB can be used to obtain
systemstate information and, in some cases, hanganalyze trace files. If the database is
experiencing a true hang, the user can take the time to use GDB or a similar debugger,
since the database will remain in the frozen state while diagnostics are collected. An
intermittent database hang, however, may not last long enough for the time-consuming
approach of using an operating system debugger.

By using LTOM’s manual data recorder and automatic hang detector it is possible to
detect the hang and issue external commands that collect diagnostic traces using operating
system utilities like GDB or custom utilities like COE’s show_sessions program.
Our task here is to show how to deploy real-time tools to diagnose complex performance
problems. Using the Oracle Diagnostic Methodology (ODM), we will step through the data
collection, analysis, and resolution phases.

Case History
The customer had been suffering from intermittent database hangs for over six months.
Oracle Support had been involved, but collecting the necessary diagnostic trace files had
proved impossible due to the short duration of the hang (3-5 minutes). Collecting
diagnostic traces was made even more difficult because the problem would occur without
warning.

The customer was able to collect statspack snapshots. Again these proved problematic, as
30-minute snapshots encapsulating the hang did not provide enough detail to determine
what was causing it. Using statspack snapshots to diagnose intermittent performance
problems presents its own challenges: any performance spike that occurs during a static
statspack snapshot will be averaged out over the entire snapshot interval. In our case we
had a 3-5 minute hang averaged out over a 30-minute snapshot interval.

Pre-Analysis Work

Detail

Step 1) Problem Verification:

As with any problem, the first step is to identify and clarify what problem needs to be
solved and to verify its existence. To accomplish this we used LTOM's manual data
recorder to collect information from the Oracle database together with operating system
metrics. This presented us with an integrated picture of what was happening on both the
database and the operating system before, during, and after the hang.

We deployed LTOM and first set up the manual recorder to collect data in 3-second
snapshots. When the database hung we were able to clearly determine the nature of the
problem and confirm that the hang was the result of an Oracle resource issue and not
something external to Oracle. The following LTOM snapshot, taken at 11:14:27, showed
normal activity with no Oracle sessions waiting and adequate operating system resources
available prior to the database hang:

---------------SNAPSHOT# 4751
Mon Feb 14 11:14:27 PST 2005

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 28 94723088 26511624 1026 5418 412 4 4 0 0 20 0 0 20 7242 49559 9029 16 9 75

SID PID SPID %CPU TCPU PROGRAM USERNAME EVENT SEQ SECS P1 P2 P3

The next LTOM snapshot taken 3 seconds later at 11:14:30 captured the system just prior
to the total database hang. Virtually all database sessions were waiting on the same
library cache latch (latch #106).

---------------SNAPSHOT# 4752
Mon Feb 14 11:14:30 PST 2005

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 28 94723088 26511624 1026 5418 412 4 4 0 0 20 0 0 20 7242 49559 9029 16 9 75

SID PID SPID %CPU TCPU PROGRAM USERNAME EVENT SEQ SECS P1 P2 P3
8 9 21471 * * QMN0 null latch free 2952 0 43487260472 106 5
19 20 21804 * * TNS DEV latch free 62783 0 43487260472 106 2
63 601 19039 * * TNS KS4029 latch free 6313 0 43487260472 106 3
74 447 20696 * * TNS KG5770 latch free 1201 0 43487260472 106 3
79 660 21950 * * TNS AB0320 latch free 5442 0 43487260472 106 4

81 443 3430 * * TNS null latch free 54 2 43487260472 106 53
82 717 16672 * * TNS JL3855 latch free 6661 0 43487260472 106 5
95 664 6718 * * TNS RP5398 latch free 32354 0 43487260472 106 5
98 749 20654 * * TNS PF5036 latch free 609 0 43487260472 106 1
104 740 23674 * * TNS RF4846 latch free 1483 0 43487260472 106 3
108 166 24606 * * TNS NM227 latch free 11903 0 43487260472 106 5
109 518 12420 * * TNS AS5032 latch free 1839 0 43487260472 106 2
116 220 8864 * * TNS AL4171 latch free 34631 0 43487260472 106 3
120 547 24207 * * TNS FJ356 latch free 7192 0 43487260472 106 1
126 382 13454 * * TNS BM249 latch free 5902 0 43487260472 106 4
129 306 21359 * * TNS DI3891 latch free 2787 0 43487260472 106 3
130 665 24166 * * TNS EW3761 latch free 2078 0 43487260472 106 4
131 530 3773 * * TNS null latch free 51 2 43487260472 106 50



817 813 5656 * * TNS null latch free 28 2 43487260472 106 27
818 814 5774 * * TNS null latch free 27 0 43487260472 106 26
819 815 5775 * * TNS null latch free 27 2 43487260472 106 26
820 816 5971 * * TNS null latch free 25 0 43487260472 106 24
821 817 6007 * * TNS null latch free 24 0 43487260472 106 23
822 818 6127 * * TNS null latch free 23 2 43487260472 106 22
823 819 6146 * * TNS null latch free 23 0 43487260472 106 22

When the wait event is latch free, the latch name can be retrieved from v$latch using the
P2 value from the wait event data (for latch free, P2 is the latch number).

The query to retrieve the latch name would be as follows:

SELECT name FROM v$latch WHERE latch# = 106;
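
In this case the query identifies the library cache latch, consistent with what the LTOM
snapshots were already showing. The session below is an illustration of the expected
result, reconstructed from the latch number identified above rather than captured from the
original system:

SQL> SELECT name FROM v$latch WHERE latch# = 106;

NAME
--------------------------------------------------
library cache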

The next LTOM snapshot occurred 2 minutes and 25 seconds later, at 11:16:52. Here we
see that the system had returned to a normal state, as the database hang had ended. Our
database connection was itself hung between snapshots 4752 and 4753, which is why the next
snapshot occurred 2 minutes and 25 seconds later instead of 3 seconds later as expected.

---------------SNAPSHOT# 4753
Mon Feb 14 11:16:52 PST 2005

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 25 94723012 26513624 1026 5418 412 4 4 0 0 20 0 0 20 7242 49559 9029 16 10 76

SID PID SPID %CPU TCPU PROGRAM USERNAME EVENT SEQ SECS P1 P2 P3

We had, at this point, successfully verified that the customer's problem was a database
hang. We also had some high-level indication of what could be causing it: it was apparent
that we had severe library cache latch contention. In particular, some process was
holding the library cache parent latch for an extremely long time (in excess of 2 minutes).
We also knew from reviewing the LTOM vmstat data that the hang had nothing to do with an
operating system resource such as CPU or memory.

Step 2) Dig deeper to extract additional diagnostic information from the hung
database.

Once the problem had been verified we needed to continue collecting diagnostic traces to
determine its cause, i.e., what was causing the library cache latch contention. The data
collection from LTOM proved that the hang was due to processes waiting for the parent
library cache latch. The next challenge was to determine which process was holding the
latch and why it was being held for so long, causing all other processes to wait.

Efforts to use GDB, the operating system debugger, to attach to a process and take a
systemstate dump proved unsuccessful. The customer could not get the information because
GDB would also hang: even though it appeared to attach to the process and the command to
generate the systemstate appeared to work, no trace file was ever produced. This approach
was problematic in any case, because even if it had succeeded the systemstate could not
have completed before the hang ended. The hang would last 3-5 minutes, and it normally
took over 5 minutes to generate a systemstate on the customer's production database even
when the database had little activity. Hanganalyze was not a possibility because there is
no way to call hanganalyze through GDB in Oracle version 8.
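
For reference, the GDB approach that was attempted typically looks like the sketch below.
The exact steps vary by platform and Oracle version (see Note 273324.1 in the References);
ksudss is the internal Oracle function commonly called from a debugger to request a
systemstate dump, and <ospid> stands for the operating system process id of an Oracle
server process:

$ gdb $ORACLE_HOME/bin/oracle <ospid>
(gdb) call ksudss(10)      <-- request a level 10 systemstate dump; the trace file is
                               written to the user or background dump destination
(gdb) detach
(gdb) quit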

Because we could not gather further diagnostic data during the hang, COE created a
program called show_sessions, which was used instead of GDB to gather data directly
from the SGA. Show_sessions attaches to the SGA and reads information contained in
Oracle's internal data structures, similar to the way a systemstate dump works.
Show_sessions was able to gather comprehensive process and session data that would
normally be obtained from systemstate, hanganalyze, or queries against v$session,
v$process, v$session_wait, v$sql, etc., but that was not available because the hang
prevented database access.

We again deployed LTOM, this time using LTOM's automatic hang detector to detect the hang
and make a call to the show_sessions program. LTOM's automatic hang detector not only
detects a hang but also allows the user to specify an optional file to run when the hang
is detected. We configured LTOM to call the show_sessions program when the next hang
occurred.

From reviewing the output of show_sessions we could determine that the process holding
the latch was the SMON process. We then reconfigured LTOM to call both show_sessions and
the Unix utility pstack. We created a shell script to call both programs 3 times with a
30-second delay between calls (a sketch of such a script appears below). This gave us
multiple samples to compare across the time the hang was occurring. The pstack utility
produces a hexadecimal stack trace with a list of the function calls the process was
executing at the time the pstack command was issued. We waited for the next hang and
collected the information from LTOM, show_sessions, and the pstack command run against
the blocking process (SMON).
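
A minimal sketch of such a wrapper script is shown below. The script name, arguments, and
the way show_sessions is invoked are illustrative assumptions; the three samples and the
30-second delay between them match what was actually used:

#!/bin/sh
# collect_hang.sh <ospid>  -- hypothetical script name; <ospid> is the operating
#                             system pid of the suspected blocking process (here, SMON)
OSPID=$1
for i in 1 2 3
do
   ./show_sessions > show_sessions_$i.out 2>&1      # session/process data read from the SGA
   pstack $OSPID   > pstack_${OSPID}_$i.out 2>&1    # stack of the blocking process
   sleep 30                                         # 30-second delay between samples
done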

To call pstack, issue the following command, where ospid is the operating system process
id of the target process:

$ pstack ospid

Finally we had all the diagnostic trace information necessary to analyze and solve the
problem.

Analysis

Summary

Now that all the required diagnostic traces had been obtained, we could determine the
cause of the database hang. We took the following actions:

1. Reviewed the output from show_sessions. Found the process holding the parent
   library cache latch and the SQL that this process was currently executing.
2. Reviewed the pstack of the blocking process. This showed the underlying Oracle
   code functions that the blocking process was executing during the hang.
3. Reviewed the bug database to see if this was a known bug.
4. Identified effective solutions.
5. Delivered the best solution.

Detailed Analysis

1. Review the output from show_sessions to find which process was holding the library
cache latch.

*** Process (0xa01e272e0) Serial: 1 OSPid: 28710


HOLDING LATCH: 0x380014980
*** Latch: 106 (0x380014980) Level: 5 Gets: 1110 Misses: 0 ImmediateGets: 0
ImmediateMisses: 0
Sleeps: 0 SpinGets: 0 Sleeps1: 0 Sleeps2: 0 Sleeps3: 0

*** Session (5): a0269e608 User: SYS PID: 28710 blocker: 0


SQL Addr: 9e2a85958
SQL: delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4 or
remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is
null)and(subname=:6 or subname is null and :6 is null)

pSQL Addr: 9e2a85958


pSQL: delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4 or
remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is
null)and(subname=:6 or subname is null and :6 is null)

Session Waits:
Seq: 632 Event#: 94 P1: 0x1 (1) P2: 0x51dd (20957) P3: 0x1 (1) Time: 3

Here we could see that the process at address 0xa01e272e0 was holding latch #106.
This is what was preventing all the other processes from acquiring the latch and
causing the database to hang. Other relevant information captured by show_sessions
was the ospid (28710), which we could map back to the SMON process by querying
v$process and v$session. We could also see the actual SQL that the process was
executing:

delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4
or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is
null)and(subname=:6 or subname is null and :6 is null)
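
For reference, a query along the following lines (run once the database was responsive
again) maps an operating system pid back to its Oracle session; the join of
v$session.paddr to v$process.addr is the standard way to relate the two views, and 28710
is the ospid reported by show_sessions:

SELECT s.sid, s.serial#, s.program, p.spid
  FROM v$session s, v$process p
 WHERE s.paddr = p.addr
   AND p.spid = '28710';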

2. Review the process stack from the blocking process (SMON).

28710: ora_smon_live
0000000100fdb414 kglhdde (101e3c3e0, 9bf904bf8, a0001c1b8, 0, 9cce56774, 2000) + 114
0000000100fdb990 kglhdunp2 (1f, a200a2bc8, 3e8, 101ca1f30, 0, 1f0) + 2b0
0000000100fdb624 kglhdunp (101e3c3e0, 25, 0, 70000, 1, 7ffffffc) + 1a4
0000000100fcd9a4 kglobf0 (9cce56458, 0, 1, 1, 3c4, 0) + 1c4
0000000100fcbb08 kglhdiv (a2009f178, 9ef2dc040, 10000, ffffffffffffff98, 3c4, e) + 2c8
0000000100fd45b8 kglpndl (380003710, a2009f118, e, 101e3d748, ffffffff7fffc320, e) + a58
000000010041e54c kssdct (a1f6cb5c8, 24, 1, 0, 101e3ede8, 0) + 18c
000000010034dfac ktcrcm (0, 0, 0, 0, 0, 0) + aec
00000001005dd1e0 kqlclo (a03b86d24, 0, 3, 20000000, 0, 0) + a80
000000010023b46c ktmmon (0, 380005880, 6, a01e272e0, 3800050b8, 0) + 188c
000000010042330c ksbrdp (0, 101e401e0, 0, 0, 100909238, 100909204) + 2ec
000000010090939c opirip (32, 0, 0, 0, 0, 0) + 31c
000000010015f720 opidrv (32, 0, 0, 6c6f6700, 0, 0) + 6a0
0000000100149e10 sou2o (ffffffff7fffe890, 32, 0, 0, 101e7c5c8, 100134e0c) + 10
0000000100134f28 main (1, ffffffff7fffeab8, ffffffff7fffeac8, 0, 0, 100000000) + 128
0000000100134ddc _start (0, 0, 0, 0, 0, 0) + 17c

Comparing the three pstack samples of the SMON process revealed that SMON was stuck
executing the same code during the hang: all three samples showed an identical stack,
meaning the same code path was executing (or stuck in execution) throughout the period
in which we collected the pstacks.
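
With the sample files produced by a wrapper script like the one sketched earlier (the
file names below are the hypothetical ones used in that sketch), this comparison can be
as simple as:

$ diff pstack_28710_1.out pstack_28710_2.out     (no output means the stacks are identical)
$ diff pstack_28710_1.out pstack_28710_3.out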

3. Now that we knew which session was holding the latch (from step 1) and the SQL that
   was being executed, we could search the bug database to see if this was a known
   bug. The stack trace of SMON clearly matched the bug signature of bug 2791662.
   (NOTE: This bug is not viewable by customers.) This bug causes database
   hangs/freezes when a process (in our case SMON) holds a library cache latch while
   executing a drop statement and invalidating the read-only objects (cursors)
   dependent on the object being dropped. Where there is a large number of dependent
   read-only objects, the latch can be held for a very long time. This was the cause
   of the customer's intermittent hang.

4. A patch existed to correct the problem associated with bug 2791662. A less desirable
   workaround would have been to issue the drop command on the underlying object during
   off-hours.

5. Clearly the best solution was to apply the patch to fix the bug. The customer applied
the patch and the problem was fixed.

Conclusion
The use of a structured methodology (ODM) leads to reduced resolution times. Problem
identification and verification is an extremely important step and, unfortunately, is too
often overlooked. In this case different diagnostic tools were available, but before the
right diagnostic tool could be selected the problem needed to be identified and verified.
The technical differences between a hang and a slowdown can appear insignificant to the
end user, but the distinction is very important when it comes to selecting the most
appropriate diagnostic trace and tool. Using statspack, for example, to determine the root
cause of a hang was far less effective than collecting systemstate or hanganalyze dumps.

The need to deploy the right data collection tool early on cannot be overemphasized. The
customer had tried for months to collect data during the hang but was unsuccessful. Had
LTOM/show_sessions been deployed at the beginning of this engagement, a solution could
have been obtained in days rather than months.

References
The Oracle Center of Expertise offers a suite of tools to assist customers in resolving
performance issues. These tools include:

Note 352363.1: LTOM - The On-Board Monitor User Guide
Note 301137.1: OS Watcher User Guide
Note 362094.1: HANGFG User Guide
Note 362791.1: STACKX User Guide
Show_sessions: This tool is not currently available for download. However, if you are
interested in more information about it, please contact the author of this case study,
Carl Davis.

Other documents relevant to this study include:

Note 312789.1: What is the Oracle Diagnostic Methodology (ODM)?
Note 215858.1: Interpreting HANGANALYZE trace files to diagnose hanging and
performance problems
Note 273324.1: Using HP-UX Debugger GDB To Produce System State Dump
