AIX Commands and Tools For DB2 Troubleshooting AIX Solaris HPUX UNIX Linux System Storage Administration KSH - Perl Scripting

AIX commands and tools for DB2 troubleshooting AIX Solaris HPUX ...
http://www.aixmind.com/?p=1257
AIX Solaris HPUX UNIX Linux system storage administration ksh/perl scripting
UNIX administrator blog tips experience command log knowledge base

May 8, 2010 11:44 pm
AIX commands and tools for DB2 troubleshooting

by admin in AIX Introduction There are many scenarios where the troubleshooting of DB2 issues can involve and benefit from gathering operating system level data and analyzing it to understand the issues further. This article discusses a number of problems you may face with your database including CPU usage problems, orphan processes, database corruption, memory leaks, hangs and unresponsive application. Here the author tried to explain some AIX utilities and commands to help you understand and resolve each of these troublesome issues. The data you collect from running these commands can be sent to the IBM Technical Support Team when opening a problem management request (PMR) in order to expedite the PMR support process. The end of each section of this article discusses the documents you should gather to send to the Technical Support Team. While this article gives troubleshooting tips to use as a guideline, you should contact the IBM Technical Support Team for official advice about these problems. Monitor CPU usage In working with your database, you might notice a certain DB2 process consuming a high amount of CPU space. This section describes some AIX utilities and commands which you can use either to analyse the issue yourself or to gather data before submitting a PMR to IBM Technical Support: Through ps Command: A ps command reveals the current status of an active process. You can use
ps -auxw | sort r +3 |head 10
to sort and get a list of the top 10 highest CPU consuming processes. Listing 1 shows the ps output: Listing 1. Sample ps output
root@mavrickit $ ps auxw|sort -r +3|head -10 USER scot dpf scot PID 1658958 1036486 1822932 %CPU 0.1 0.0 0.0 %MEM SZ RSS TTY STAT STIME TIME COMMAND
9.0 218016 214804 - A 1.0 14376 14068 1.0 12196 11608 - A - A
Sep 13 38:16 db2agent (idle) 0 Sep 17 Sep 12 3:10 db2hmon 0 6:41 db2hmon 0
1 of 19
02/29/12 1:05 AM
dpf dpf dpf dpf dpf db2inst1
1011760 1532116 786672 1077470 1269798 454756
0.0 0.0 0.0 0.0 0.0 0.0
0.0 9264 0.0 9264 0.0 9264 0.0 9264 0.0 9248 0.0 9012
9060 9020 8984 8968 9044 7120
- A - A - A - A - A - A
Sep 17 Sep 17 Sep 17 Sep 17 Sep 17 Jul 19
3:03 db2hmon 3 3:04 db2hmon 2 3:02 db2hmon 5 3:03 db2hmon 1 2:50 db2hmon 4 0:52 db2sysc 0
Through topas Command When executing a ps -ef command, you see the CPU usage of a certain process. You can also use the topas command to get further details. Similar to the ps command, a topas command retrieves selected statistics about the activity on the local system. Listing 2 is a sample topas output that shows a DB2 process consuming 33.3% CPU. You can use the topas output to get specific information such as the process id, the CPU usage and the instance owner who started the process. It is normal to see several db2sysc processes for a single instance owner. DB2 processes are renamed depending on the utility being used to list process information: Listing 2. Sample topas output
Name db2sysc db2sysc test db2sysc db2sysc db2sysc sharon lrud db2sysc db2sysc topas db2sysc db2sysc db2sysc PID 105428 38994 14480 36348 116978 120548 30318 9030 130252 130936 120598 62248 83970 113870 CPU% 33.3 14.0 1.4 0.8 0.5 0.5 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.2 PgSp Owner 11.7 udbtest 11.9 udbtest 0.0 root 1.6 udbtest 1.6 udbtest 1.5 udbtest 0.5 root 0.0 root 1.6 udbtest 1.6 udbtest 3.0 udbtest 1.6 udbtest 1.6 udbtest 1.7 root
Through vmstat Command The vmstat command can be used to monitor CPU utilization; you can get details on the amount of user CPU utilization as well as system CPU usage. Listing 3 shows the output from a vmstat command: Listing 3. Sample vmstat output
kthr memory page faults cpu
2 of 19
02/29/12 1:05 AM
----- ----------- ------------------------ ------------ ----------r b avm fre re pi po 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fr sr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cy 0 0 0 0 0 0 0 0 0 0 in 4009 4237 3955 4101 4525 sy cs us 62 67 69 68 63 72 64 66 sy id 38 33 31 31 37 28 36 34 40 27 0 0 0 0 0 0 0 0 0 0 wa 0 0 0 0 0 0 0 0 0 0
32 3 1673185 44373 0 24 0 1673442 44296 0 30 3 1678417 39478 0 33 1 1677126 40816 0 28 0 1678606 39183 0 35 1 1676959 40793 0 23 0 1671318 46504 0 30 0 1677740 40178 0 30 1 1683402 34425 0 0 0 1684160 33808 0
60051 9744 63775 9214 70833 8457 68745 8336 75183 8708
4085 70195 9271 4780 4326 4419 4186 68416 9360 58747 9201
76528 10042 60 72187 9661 73
When reading a vmstat output, as above, you can ignore the first line. The important columns to look at are us, sy, id and wa. Whereas id: Time spent idle. wa: Time spent waiting for I/O. us: Time spent running non-kernel code. (user time) sy: Time spent running kernel code. (system time) In Listing 3, the system is hitting an average of 65% user CPU usage and 35% system CPU usage. Pi and Po values are equal to 0, thus there are no paging issues. The wa column shows there does not seem to be any I/O issues. Listing 4 shows the wa (waiting on I/O) to be unusually high and this indicates there might be I/O bottlenecks on the system which in turn causes the CPU usage to be inefficient. You can check errpt -a output to see if there are any reported issues with the media or I/O on the system. Listing 4. Sample vmstat output showing I/O issues
Kthr memory page faults cpu
----- ----------- ------------------------ ------------ ----------r b 2 8 avm fre re pi po 0 0 0 fr 929 0 0 sr 1689 0 0 cy 0 0 0 in 998 sy 6066 cs us sy id wa 3 2 1 76 16 0 0 98 99
495803 3344 0 0
1832 4 1326 0 1289 0
0 30 495807 3340 0 0 0 30 495807 3340 0 0
1093 4697 1055 2291
3 of 19
02/29/12 1:05 AM
Storage Virtualization Get the free How to Guide: Storage for the Virtual Era.
www.HDS.com
Know to Migrate from Unix Reduce dependency on Complex Systems & minimize Distruptions
www.Dell.com/unixtolinux
Red Hat Linux Training RHCE, RHCSS Training & Exam Best Red Hat training in Hyderabad
www.cossindia.net
Buy Intel Xeon Processors Dedicated Servers with E3-series. Most affordable CPU. From $200!
www.CARI.net
Datacenter in India Ctrls Colocation Servers & Racks Host @ 99.995% Uptime SLA. Read more!
Ctrls.in/Rack-Space-Hosting
0 30 495807 3676 0 2 0 29 495807 3292 0 1 1 29 495810 3226 0 1 4 24 495810 3247 0 3 4 25 495810 3247 0 0 2 26 495810 3262 0 2 3 22 495810 3245 0 1
0 3 0 0 0 0 0
376 2266 5427 6830 3969 5558 4084
656 3219 7572 10018 6752 9587 7547
0 0 0 0 0 0 0
1128 6803 1921 8089
2210 1 2528 14
2 4
0 0
97 82 52 53 61 42 30
3175 16788 4257 37 2483 10691 2498 40 1900 14037 1960 33 2162 10629 2695 50 1894 10866 1970 53
11 0 7 5 8 0 1 0
17 0
Through iostat Command An iostat command quickly tells you if your system has a disk I/O-bound performance problem. Listing 5 is an example of an iostat command output: Listing 5. Sample iostat output
System configuration: lcpu=4 disk=331 tty: tin tout avg-cpu: % user % sys % idle % iowait
4 of 19
02/29/12 1:05 AM
0.0 Disks: hdisk119 hdisk115
724.0 % tm_act 100.0 100.0 100.0
17.9 Kbps 5159.2 5129.6 10288.8
12.3 tps
0.0
69.7 Kb_read 394.4 393.0 790.8 Kb_wrtn 1560 1656 3216 24236 23992 48228
hdiskpower26
%tm_act : Reports back the percentage of time that the physical disk was active or the
total time of disk requests. Kbps : Reports back the amount of data transferred to the drive in kilobytes. tps : Reports back the number of transfers-per-second issued to the physical disk. Kb_read : Reports back the total data (kilobytes) from your measured interval that is read from the physical volumes. Kb_wrtn : Reports back the amount of data (kilobytes) from your measured interval that is written to the physical volumes. To check if you are experiencing resource contention, you can focus on the %tm_act value from the above output. An increase in this value, especially more than 40%, implies that processes are waiting for I/O to complete, and you have an I/O issue on your hands. Checking which hard disk has higher disk activity percentage and whether DB2 uses those hard disks gives you a better idea if these two factors are related. What to collect You should collect the following information before opening a PMR with IBM Technical Support: db2support.zip of high cpu process of high cpu process Technical support might also send you the db2service.perf1 script which basically collects data repeatedly over a period of time. The output of the script needs to be bundled and sent back to the support team for their further analysis. Troubleshoot orphan processes There are scenarios when, even after doing a db2stop, you notice (by doing a ps -ef | grep DB2) certain DB2 processes such as the db2fmp process still running and consuming resources. If there was a case of abnormal shutdown, it is advised to do a ipclean after the instance has been stopped. Doing a db2stop should inherently shutdown all DB2 related processes; however, if an application using those processes was abnormally terminated, this might cause related DB2 processes to become orphan processes. Orphan DB2 processes are those which are not attached or linked to any other DB2 processes. Abnormal termination of an application includes shutting it down by doing a Ctrl+C, closing the KSH session or killing it with a -9 option. One way of confirming that the process is orphaned, is to try and match the process ID (PID) of the orphaned process from the ps -ef output with the Coordinator column of the db2 list applications show detail output. If the PID cannot be found in the db2 list apps output, then it is an orphan process. For example, if you issue a db2 list applications show detail command, you get this output:
5 of 19
02/29/12 1:05 AM
Listing 6. Sample list applications output

CONNECT Auth Id Application Name Appl. Application Id Seq# Number of Coordinating DB Coordinator Status Status Change Time DB Name DB Path Handle Agents partition number pid/thread JDE test.exe 2079 AC1C5C38.G80D.011F44162421 0001 1 0 2068646 UOW Waiting 04/04/2006 09:25:17.036230 PTPROD /db2pd/otprod/ptprod/otprod/NODE0000/SQL00001/ --NOTICE PID 2068646. This is the PID on the local server. Part of the ps -ef output from the server: ps -ef |grep 2068646 otprod 2068646 483566 0 09:06:28 - 0:59 db2agent (PTPROD) 0
This output shows the process with PID of 2068646 is not an orphaned process and is still attached to a DB2 process. In order to avoid orphan processes, you may want to do the following: Make normal, clean exits at the client side so that DB2 is aware and can clean up resources on the server. Tweak values of TCPKEEPIDLE time to a number less than the default, and tune the DB2CHECKCLIENTINTERVAL and KEEPALIVE values. What to collect If you do notice orphan processes and wish to investigate this issue, you should collect the following information before opening a PMR with IBM Technical Support: - grep db2 output -db2support.zip with -c option - A callstack of the process that is collected using dbx, db2pd -stack or kill -36 <pid>. The dbx command is a popular command line debugger used in both Solaris and AIX systems. The dbx output is helpful and can be run as follows: Listing 7. The dbx command
dbx -a <PID> At the dbx prompt type th th info where th current 1 where th current 2 where --- Displays all threads for the process --- Displays additional info about the threads --- Get stack trace for thread 1 --- Makes t1 current --- Displays stack for thread 1 --- Makes thread 2 current --- Displays stack for thread 2.
6 of 19
02/29/12 1:05 AM
... continue for all threads of the process detach --- Detach from process
dbx -a <PID of orphan process>
Detect database corruption You can start to investigate whether the database is corrupted if a user complains of not being able to access certain database objects or is unable to connect to a specific database partition. The following section highlights some of the errors that are logged by DB2 and how you can ensure that there are no operating system (OS) level issues affecting or causing DB2 database corruption. You might notice errors similar to the one in Listing 8 being logged in the db2diag.log: Listing 8. Corruption errors
RETCODE : ZRC=0x87040001=-2029780991=SQLD_BADPAGE "Bad Data Page" DIA8500C A data file error has occurred, record id is "". Or RETCODE: ZRC=0x86020019=-2046689255=SQLB_CSUM "Bad Page, Checksum Error" DIA8426C A invalid page checksum was found for page "".
SoftLayer Official Site More Custom Hosting From A Trusted Source. 24x7 Support. Chat Now.
SoftLayer.com/Asia-Pacific
Or 2007-07-09-11.29.45.696176+120 I16992C16377 LEVEL: Severe PID : 68098 TID : 1 PROC : db2agent (sample) INSTANCE: instest NODE : 000 DB : sample APPHDL : 0-635 APPID: *LOCAL.instest.070709082609 FUNCTION: DB2 UDB, buffer pool services, sqlbcres, probe:20 MESSAGE : Important: CBIT Error DATA #1 : Hexdump, 4096 bytes
These errors are logged when DB2 tries to access data in a container and there is some form of corruption. In such an instance when DB2 cannot access the data, the database might be marked as bad. You can narrow down where there might be possible corruption. In the db2diag.log, look for messages similar to the following: Listing 9. Corruption errors showing database object details
2006-04-15-03.15.37.271601-360 I235258C487 LEVEL: Error PID : 152482 TID : 1 PROC : db2reorg (SAMPLE) 0
7 of 19
02/29/12 1:05 AM
INSTANCE: instest NODE : 000 DB : SAMPLE APPHDL : 0-68 APPID: *LOCAL.SAMPLE.060415091532 FUNCTION: DB2 UDB, buffer pool services, sqlbrdpg, probe:1146 DATA #1 : String, 124 bytes Obj={pool:5;obj:517;type:0} State=x27 Parent={5;517}, EM=55456, PP0=55488 Page=55520 Cont=0 Offset=55552 BlkSize=12 BadPage
The above errors indicate corruption has occurred in tablespace:5 and tableid:517. To check which table this refers to, execute the following SQL query: Listing 10. Query to find a table with corruption
db2 "select tabname, tbspace from syscat.tables where tbspaceid = 5 and tableid = 517"
On the Operating System (OS) level, the most common causes for corruption are either hardware issues or file system corruption. For example, in the db2diag.log if you see the database being marked damaged with a ECORRUPT (89) error as follows : Listing 11. Sample file system-related corruption errors
2007-05-22-13.45.52.268785-240 E20501C453 LEVEL: Error (OS) PID : 1646696 TID : 1 PROC : db2agent (SAMPLE) 0 INSTANCE: tprod NODE : 000 DB : SAMPLE APPHDL : 0-32 APPID: GA260B45.M505.012BC2174219 FUNCTION: DB2 UDB, oper system services, sqloopenp, probe:80 CALLED : OS, -, unspecified_system_function OSERR : ECORRUPT (89) "Invalid file system control data detected."
You can check the following Review the errpt -a output and look for hardware I/O or disk-related messages. Listing 12 is an example of an errpt -a output which shows a file system corruption: Listing 12. Sample errpt output
LABEL: J2_FSCK_REQUIRED IDENTIFIER: B6DB68E0 Date/Time: Thu Jun 7 20:59:49 DFT 2007 Sequence Number: 139206 Machine Id: 000BA256D600 Node Id: cmab
8 of 19
02/29/12 1:05 AM
Class: O Type: INFO Resource Name: SYSJ2 Description FILE SYSTEM RECOVERY REQUIRED Probable Causes INVALID FILE SYSTEM CONTROL DATA DETECTED Recommended Actions PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY OBTAIN DUMP CHECK ERROR LOG FOR ADDITIONAL RELATED ENTRIES Detail Data ERROR CODE 0000 0005 JFS2 MAJOR/MINOR DEVICE NUMBER 0032 0004 CALLER 0028 8EC8 CALLER 0025 D5E4 CALLER 002B 4AC8
2. Run the fsck command on the file system where the container resides to be sure that it is sound. fsck interactively checks and repairs any file system malfunction. From the pSeries and AIX Information Center we can find the following examples of using the fsck command. Listing 13. The fsck command
To check all the default file systems enter: fsck This form of the fsck command asks you for permission before making any changes to a file system. To check the file system /dev/hd1, enter: fsck /dev/hd1 This checks the unmounted file system located on the /dev/hd1 device.
What to collect
9 of 19
02/29/12 1:05 AM
You should collect the following information before opening a PMR with IBM Technical Support: 1. errpt -a 2. db2support.zip 3. fsck results Debug memory leaks It is important to distinguish, if possible, between a memory leak and a system-wide performance degradation due to increased demands for memory. So initially it is pertinent to check that nothing has changed in the environment that could explain increased memory usage. The rest of this section discusses how to use AIX Operating System techniques to spot, track and debug those leaks. The article does not discuss detailed DB2 tools and techniques, although there is some mention where necessary. What is a memory leak? A particular kind of unintentional memory consumption by a computer program where the program fails to release memory when no longer needed. This condition is normally the result of a bug in a program that prevents it from freeing up memory that it no longer needs. The term is meant as a humorous misnomer, since memory is not physically lost from the computer. Rather, memory is allocated to a program, and that program subsequently loses the ability to access it due to program logic flaws. Specifically, it is a bug in the code whereby malloc() memory allocation calls are not met by corresponding free() memory calls. No corresponding free() system calls lead to unfreed blocks. Typically this is a slow process and occurs over days or weeks particularly if the process is left active as is often the case. Some leaks are not even detectable, particularly if the application terminates and its processes are destroyed. Lisitng 14 is an example of a C code snippet that demonstrates memory leak. In this instance, memory was available and pointed to by the variable s, but it was not saved. After this function returns, the pointer is destroyed and the allocated memory becomes unreachable, but it remains allocated. Listing 14. Sample c code
#include <stdio.h> #include <stdlib.h> void f(void) { void* s; s = malloc(50); /* get memory */ return; /* * Memory was available and pointed to by s, but not saved. * After this function returns, the pointer is destroyed, * and the allocated memory becomes unreachable. * * To "fix" this code, either the f() function itself * needs to add "free(s)" somewhere or the s needs /* memory leak - see note below */
10 of 19
02/29/12 1:05 AM
* to be returned from the f() and the caller of f() needs * to do the free(). */ } int main(void) { /* this is an infinite loop calling the above function */ while (1) f(); /* Malloc will return NULL sooner or later, due to lack of memory */ return 0; }
How to spot, track and debug memory leaks To begin with, you should call IBM if you suspect a DB2 process is leaking memory. But how do you know that you are experiencing this situation? This section discusses some of the options. The first option is to use the ps utility. The ps utility can be used to quickly and simply determine if a process is leaking. This example demonstrates how a particular process is growing in size: Listing 15. Sample ps aux output showing the process growing in size
ps aux: 1st iteration: USER COMMAND db2inst1 db2logmgr 2nd iteration: db2inst1 db2logmgr 3rd iteration: db2inst1 db2logmgr 4th iteration: db2inst1 db2logmgr 5th iteration: db2inst1 db2logmgr 225284 0.1 0.0 20312 19128 - A 11:26:06 10:37 225284 0.1 0.0 20116 18932 - A 11:26:06 10:36 225284 0.1 0.0 19908 18724 - A 11:26:06 10:36 225284 0.1 0.0 19696 18512 - A 11:26:06 10:34 225284 0.2 0.0 19468 18280 - A 11:26:06 10:34 PID %CPU %MEM SZ RSS TTY STAT STIME TIME
11 of 19
02/29/12 1:05 AM
ps -kelf: 1st iteration: F S STIME UID TTY PID TIME CMD 225284 254158 0 60 20 580e59400 18466 PPID C PRI NI ADDR SZ WCHAN
40001 A db2inst1 11:26:06
- 10:34 db2logmgr (***) 0
2nd iteration: 40001 A db2inst1 11:26:06 225284 254158 1 60 20 580e59400 18696
- 10:34 db2logmgr (***) 0
3rd iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 18900
- 10:36 db2logmgr (***) 0
4th iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 20106
- 10:36 db2logmgr (***) 0
5th iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 20312
- 10:37 db2logmgr (***) 0
Server Statistics Easy to use server statistics and monitoring tool with notifications. www.status2k.com Server & Network Racks 300,000 Racks Exported Globally Seismic, Network, OEM, 19" Racks www.chevalgrp.com Unlimited Hosting @ $1.5 Unlimited space, bandwidth, emails 99% up-time , 24/7 local support photonhosting.com
The SZ and RSS values in the ps aux output are the 2 key columns to focus on when trying to spot a potential memory leak. As you can see, the values in bold are increasing. It is not sufficient, however, to determine root cause and more debugging is certainly required. Again, please raise this issue with IBM Technical Support, but what follows are some likely problem determination steps IBM will take. Debug using procmap and gencore As root: 1. 2. 3. 4.
procmap <db2logmgr pid>> procmap.1 ps aux > ps_aux.1 ps -kelf > ps_kelf.1 gencore <db2logmgr pid> <file> and sleep for a period of time, then
procmap <db2logmgr pid> > procmap.2
1. ps aux > ps_aux.2 2. ps -kelf > ps_kelf.2 3. gencore <db2logmgr pid> < file>
12 of 19
02/29/12 1:05 AM
Then repeat these steps again for another 2 or 3 iterations. Please note, on 64 bit AIX, the gencore creates very large files. Regardless of the word size, fullcore needs to be enabled. The following commands can be used to check that the environment is set up correctly: Listing 16. The lsattr command
lsattr -El sys0| grep -i core fullcore true Enable full CORE dump True
And the limits for the instance owner needs to be set appropriately too. You may well be asked to enable MALLOC_DEBUG and export this to the DB2 environment. What follows is an example of this: To start DB2 memory debugging for the next time the instance is started, run: db2set DB2MEMDBG=FFDC . > To start malloc debugging for the next time the instance is started, run: export MALLOCDEBUG log:extended stack_depth 12. And append MALLOCDEBUG to the DB2 registry variable DB2ENVLIST: > db2set DB2ENVLIST MALLOCDEBUG. Then stop and restart DB2. Once the core files have been created, you can use snapcore to bundle the core files and libraries into pax file. An example of snapcore is as follows: Listing 17. Sample snapcore
snapcore /home/db2inst1/sqllib/db2dump/c123456/core /home/db2inst1/sqllib/adm/db2sysc
This creates a file with a *.pax extension in /tmp/snapcore by default. The core file is useless without the executable that cored, in this case it was db2sysc not db2logmgr, which was seen to be growing, because that is a process not an executable. DB2 support is then able to interrogate the core to track the DB2 malloc() allocations against free() calls. Recover from hangs What is a hang A hang occurs when a process has not moved forward or changed after a period of time. This can happen if a thread or process reaches a point in its execution where it can go no further and is waiting for a response. It also occurs when the process is in a very tight loop and never completes the function. The first step is to identify if what you are experiencing is a hang or a severe degradation. Then you need to understand what is affected, or the scope. Some simple questions can help a lot: Why do you think it has hung? Are all DB2 commands hanging? How long has the command been running for? How long does it normally run for Then to access the scope: Are OS commands hanging too? If the answer to this is yes, then you need get assistance from the AIX support team.
13 of 19
02/29/12 1:05 AM
Are db2 connect statements affected? Can SQL be issued over existing connections? If in a DPF environment, can you issue commands against other partitions? Can you issue commands against other databases? Recovery Remember, please collect the stacks before you recover. Once you have the stacks the only choice you have is to issue db2_kill. Then check for any processes and IPCs shared memory, message queues and semaphores left lying around after the kill. You may have to remove any you find manually. You could also try ipclean to remove these resources. If the IPCs are not cleared out by ipclean or ipcrm and the processes are removed by kill -9, then the process is most likely hung in the kernel and you need to call AIX support. Once it has come down, restart with db2start and then do a restart db command. What to collect The single most important piece of information to collect is a stack trace of the process that is believed to be hung. IBM DB2 support cannot debug a hang without this, and the stack trace must be collected prior to recovering DB2. If this is not done, you may have another outage in the future. There will be pressure to restart DB2, but you must resist. The system must be in a hung state in order to diagnose the root cause of the problem and do the necessary debugging. A restart clears the situation and you have lost the window of opportunity to make the necessary changes. More seriously, you cannot provide any confidence that it wont recur. Thus, you need to resist the pressure to restart DB2 until you have collected all the diagnostics. The following table describes good probelm determination (PD) and data caputre versus bad PD and data capture. Note that the best PD and data caputre requires the fewest steps and has a better change of success in determining root cause. Poor PD and data capture: Occurrence Detection Recovery FFDC on (requires restart) Restart (outage #2) Schedule outage, hopefully problem does not reoccur before Occurrence (outage #3) Detection Data Collection Recovery Diagnosis (clock ticking) Better PD and data capture: Occurrence (outage #1) Detection Recovery FFDC on Occurrence (outage #2) Detection Data Collection Recovery
14 of 19
02/29/12 1:05 AM
Diagnosis (clock ticking) Good PD and data capture: Occurrence (outage #1) Detection Data Collection Recovery Diagnosis (clock ticking) Stack traces A stack trace is a snapshot of the function calls at a particular point in time. So multiple stack traces, a few minutes apart, provide a sense of motion. There are a variety of ways to collect stack traces; the following lists are, in my opinion, the most reliable:
Procstack <pid of hung process> >> pid.pstack.out
This is an AIX utility that just dumps the stack to a file. In this instance, I am appending the file because it is run again later and I do not want to have to re-write it.
Kill -36 <pid>
This command does not kill the process, but it sends a signal to dump its stack. This actually creates a fullyformatted trap file to the DIAGPATH area of DB2. Because it gives more information than procstack and the way it works internally, it is generally more expensive, particularly if there are hundreds of processes, which is often the case. The main focus of this article is to discuss AIX operating system tools to debug DB2. No discussion of hang problem determination is complete without mentioning db2pd, so the following invocations can be used to generate stacks traces:
db2pd -stacks (This generates stack dumps again all PID) db2pd -stack <pid> (This generates a stack dump for the PID specified)
The trap file is created in the DIAGPATH area. Listing 18 shows an example of its usage: Listing 18. db2pd -stacks usage
1. -stacks $ db2pd -stacks Attempting to dump all stack traces for instance. See current DIAGPATH for trapfiles. 2. -stack <pid> $ db2pd -stack 1454326 Attempting to dump stack trace for pid 1454326. See current DIAGPATH for trapfile.
The DB2 support will ask you to tar and compress the DIAGPATH area. Most commonly they will ask you to run a db2support command which does it for you, providing the correct flags are used. However, if you use the OS method of procstack, you have to submit the output files. Truss The truss command can be used but is not as effective as a stack dump and is only likely to reveal anything if the processes is looping and can be reproduced. If the process is hung, only a stack dump can reveal how it got there.
15 of 19
02/29/12 1:05 AM
ps It is also a good idea to collect ps listings for all partitions, if applicable, before and after the stack dumps. If you collect the data manually the pseudo-code looks like this: Listing 19. procstack
Procstack Pid or PIDs Ps eafl >> Ps aux >> Sleep 120 Repeat for at least 3 iterations. Or: Kill -36 <pid> or PIDs Ps eafl Ps aux >> >> pseafl.out psaux.out pseafl.out psaux.out >> procstack.out
Sleep 120 Repeat for at least 3 iterations. NB: IBM DB2 support can provide a data collect script which automates this process.
Investigate unresponsive applications Sometimes applications are merely unresponsive, and you have to figure out why it is unresponsive and how to get it to respond. If you issue a force application and it does not respond, you may be left wondering what you can do. First of all, it is important to know that force makes no guarantees to force. It is simply a wrapper around an OS kill command. Without going into the architectural details of DB2, there are some situations which are dangerous to force. As such, the db2agent sets its priority level to be higher than that of the force. Under these circumstances, force does not work, and this is by design. The bottom line is, not every unresponsive application is caused by a bug. It is possible that the application is just doing something important and not responding to any additional commands until it completes its current task. Recovery Recovery almost certainly requires a db2stop,db2start as DB2 does not take kindly to key engine processes being killed. It tends to invoke panic and bring the instance down. I would asses the impact the rogue application is having and, if possible, leave it in situ until you can recycle. It may be holding locks that are contending with other users, for example, and this is adversely affecting the application, in which case you may have to take an outage to remove it. What to collect The debugging of an unresponsive application is treated in the same way as a hung, but clearly the scope is narrower. You need to collect the following elements to send to IBM Technical Support: - Iterative stack traces of the db2agent or DB2 process that is unresponsive.
16 of 19
02/29/12 1:05 AM
- ps listings and other items, like: db2level, dbm cfg, db cfg, db2diag.log and possibly an application snapshot. Conclusion Problem determination in DB2 is made simpler because of the tools and utilities available in AIX. Often it is necessary to use both AIX and DB2 tools and commands to figure out what the problem is. This article discusses some of the problems associated with troubleshooting in DB2 and has hopefully given you the tools you need to fix your database. tags: command, DB2, tool, Troubleshooting Be aware of IZ75570: 7047-185 DOES NOT BOOT AFTER UPDATE TO TL5 newer older AIX Interview Questions and Answers No Responses to "AIX commands and tools for DB2 troubleshooting". Add a comment? or Follow comments by RSS? Be the first and share your thoughts! Please leave a comment... Stay behind your words, use your real name or nickname. name e-mail site Please be nice and tolerant, don't offend. Thanks!
Home About RSS Feed Search for: February 2012 M T W T F S S 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Jan Tag cloud
filesystem VIOS lvm IBM command HACMP device network security mksysb performance linux TSM NIM paging-space powerha NIC Error Remove tcpip process backup VIO console AIX recovery
jfs2
VG LDAP
WPAR restore SSH debug sendmail Printing GPFS
HMC ODM NFS
Migration Boot
NPIV SEA login Oracle
Categories AIX (787)
17 of 19
02/29/12 1:05 AM
application (4) Cluster (24) Database (6) Fun (5) HPUX (6) linux (30) Misc (5) network (6) Perl (5) security (1) Shell Script (27) Solaris (28) Storage (35) TSM (20) UNIX (3) UNIX Interview questions and answers (3) Virtualization (20) Webspere (1) Blogcatalog
Recent Posts How to upload big AIX dump files AIX version of stat command Command to get File Timestamp Useful HMC key combintaions LANFREE timeouts and session processing Diagnose and change TSM server timeouts ANS9592E A SOAP TCP connection error has happened Remove/clear the Reservation lock of a Hitachi LUN AIX migration stop at importing root volume group with LED 0c46, went directory into single-user mode. Some tips and techniques for managing GPFS file systems List PureScale information Enable ssh on AIX to use PAM NPIV FAQs for PowerVM Virtual I/O Server Environment A regular user can ssh but is unable to sftp into an AIX machine Executing the command /usr/sbin/automount hangs How to determine the ulimit settings of a running process on Linux Search Engine Optimization and SEO Tools Search Engine Submission - AddMe Blogroll Themes UNIX unix.com Meta Log in RSS Feed
18 of 19
02/29/12 1:05 AM
Comments RSS AIX Solaris HPUX UNIX Linux system storage administration ksh/perl scripting 2010. "Life is Simple" is a simple theme for WordPress, powered by Chereshka.
19 of 19
02/29/12 1:05 AM

AIX Commands and Tools For DB2 Troubleshooting AIX Solaris HPUX UNIX Linux System Storage Administration KSH - Perl Scripting

Încărcat de

Informații document

Descriere originală:

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

AIX Commands and Tools For DB2 Troubleshooting AIX Solaris HPUX UNIX Linux System Storage Administration KSH - Perl Scripting

Încărcat de

Drepturi de autor:

Formate disponibile

AIX commands and tools for DB2 troubleshooting AIX Solaris HPUX ...

UNIX administrator blog tips experience command log knowledge base

AIX commands and tools for DB2 troubleshooting

9.0 218016 214804 - A 1.0 14376 14068 1.0 12196 11608 - A - A

dpf dpf dpf dpf dpf db2inst1

1011760 1532116 786672 1077470 1269798 454756

0.0 0.0 0.0 0.0 0.0 0.0

9060 9020 8984 8968 9044 7120

Sep 17 Sep 17 Sep 17 Sep 17 Sep 17 Jul 19

76528 10042 60 72187 9661 73

1832 4 1326 0 1289 0

0 30 495807 3340 0 0 0 30 495807 3340 0 0

1093 4697 1055 2291

376 2266 5427 6830 3969 5558 4084

656 3219 7572 10018 6752 9587 7547

1128 6803 1921 8089

0.0 Disks: hdisk119 hdisk115

724.0 % tm_act 100.0 100.0 100.0

17.9 Kbps 5159.2 5129.6 10288.8

Listing 6. Sample list applications output

dbx -a <PID of orphan process>

40001 A db2inst1 11:26:06

- 10:34 db2logmgr (***) 0

2nd iteration: 40001 A db2inst1 11:26:06 225284 254158 1 60 20 580e59400 18696

- 10:34 db2logmgr (***) 0

3rd iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 18900

- 10:36 db2logmgr (***) 0

4th iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 20106

- 10:36 db2logmgr (***) 0

5th iteration: 40001 A db2inst1 11:26:06 225284 254158 0 60 20 580e59400 20312

- 10:37 db2logmgr (***) 0

procmap <db2logmgr pid> > procmap.2

WPAR restore SSH debug sendmail Printing GPFS

HMC ODM NFS

NPIV SEA login Oracle

Categories AIX (787)

S-ar putea să vă placă și