Documente Academic
Documente Profesional
Documente Cultură
MS BRM01-209
500 Eldorado Blvd
Broomeld, Colorado 80021
U.S.A.
le and situate
properly as /kernel/genunix.
Inform the administrator of the
naming convention for the UNIX
le on Solaris. Explain to users
and managers as well.
Update the log les that are
maintained with the system,
including initials and timestamp.
1
1-26 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Exercise: Conducting Fault Analysis and Diagnosis
Exercise objective The objective of this exercise is to apply the
methods and steps for fault analysis and diagnosis.
Preparation
Your instructor will coordinate the creation of workgroup units for the
exercise. The instructors role is that of the customer or the user; you
can ask the instructor questions about the problem.
Tasks
As a group, solve Workshop #46 in Appendix D. Use the methods
discussed in this module to analyze and diagnose the problem.
1
Fault Analysis and Diagnosis 1-27
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Fault Analysis Worksheet Template
Analysis Phase
Initial Customer Description
ProblemStatement
Resources
ProblemDescription
Error Messages
Symptoms and
Conditions
Relevant Changes
Comparative
Facts
1
1-28 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Fault Analysis Worksheet Template
Diagnostic Phase
Test and Verication
Corrective Action
Likely Cause(s) Test(s) Result(s) Verication(s)
Final Repair Communication Documentation
1
Fault Analysis and Diagnosis 1-29
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Exercise: Conducting Fault Analysis and Diagnosis
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues,
or discoveries you had during the lab exercises.
G Experiences
G Interpretations
G Conclusions
G Applications
1
1-30 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Check Your Progress
Before continuing on to the next module, check that you are able to
accomplish or answer the following:
K Use an organized total system approach for fault analysis and
diagnosis
K Write accurate problem statements
K Describe a system problem in terms of error messages, symptoms,
relative comparisons, and technical conditions
K Identify and use commonly available resources to solve technical
problems
K Generate a list of likely causes on a per fault basis
K Design and implement tests to validate likely causes
K Communicate and document information gathered in fault
analysis
K Use the Fault Analysis Worksheet to gather and document facts
1
Fault Analysis and Diagnosis 1-31
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Think Beyond
Can you describe a way to improve the current fault analysis
procedure used in your company environment? What on-line
resources are available at your company site?
2-1
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Diagnostic Tools 2
Objectives
Upon completion of this module, you should be able to
G Differentiate watchdog resets, panics, and system hangs
G Differentiate hardware and software problems
G Provide examples of fatal and non-fatal error conditions
G Identify a comprehensive set of Solaris commands and utilities
which are useful in fault analysis
G Identify a comprehensive list of Solaris system les which contain
information that is useful in fault analysis
G Describe the syntax, function, and relevance of each command or
system le
G Use Solaris commands and les to determine system conguration
and status information
G Solve workshop problems using Solaris utilities and system les
2
2-2 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Relevance
Discussion This module presents a comprehensive set of fault
analysis tools which are readily available on the Solaris operating
system. The usefulness of each tool is explained, and examples that
demonstrate usage are provided. Expand the list with additional tools
that you know of, or combination uses of the tools presented here.
Before beginning Module 2, consider the following questions related to
your own experience with the Solaris operating system:
G What tools and utilities do you currently nd most useful?
M ___________________________________
M ___________________________________
M ___________________________________
M ___________________________________
G Are there operations you would like to be able to perform on a
Solaris platform but have not been able to?
G What tools are you most interested in learning?
G Can you describe a recent Solaris system problem and how it was
solved?
References
Additional resources The following references can provide
additional details on the topics discussed in this module:
G Solaris User and System Administration AnswerBooks
(http://docs.sun.com)
G Solaris man pages
G The SunSolve database
2
Diagnostic Tools 2-3
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Introduction to Hardware and Software Errors
The fundamental error detection mechanisms for all architectures are
basically the same. As architecture designs became more complex, the
error detection mechanisms became more sophisticated, specically in
the multiprocessor environment.
This module begins by classifying the fundamental error detection
mechanisms and providing examples of errors that represent each
major fault category.
2
2-4 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Introduction to Hardware and Software Errors
Error Categories
Error categories include
G Software errors
G Hardware-corrected errors
G Recoverable errors
G Fatal errors
G Critical errors
Error Reporting Mechanisms
Error reporting mechanisms include
G Bus errors
G Interrupts
G Resets
2
Diagnostic Tools 2-5
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Type of Errors
Software Errors
Errors that do not originate in the hardware are classied as software
errors. All such errors are detected by the processor and are reported.
Examples of software errors are programming errors or bugs in the
kernel code.
Hardware-Corrected Errors
For error-logging purposes, hardware-corrected errors are always
signaled by an interrupt. No recovery action is normally required. One
bit error from memory is corrected by the error checking and
correcting (ECC) logic. This is reported in the error log.
2
2-6 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Type of Errors
Recoverable Errors
Recoverable errors caused by hardware are usually signaled by a bus
error posted to the requesting device and a specied interrupt, which
could broadcast the error. Error recovery in such cases is normally
handled by the trap routines, while error logging is done by the
interrupt handler.
The contents of trap registers can be examined, using adb and the
OBP register commands, to determine the trap type. Software trap
types are dened in /usr/include/sys/trap.h; hardware trap types
are dened in machtrap.h which is located in
/usr/include/<version_number>/sys.
A non-essential device losing power or becoming inaccessible is an
example of a recoverable error.
Critical Errors
Critical errors require immediate attention, system shutdown, and
power-off. They are notied through a high-level broadcast interrupt if
at all possible. Types of critical errors include
G An alternating current/direct current (AC/DC) failure
G Temperature warning
G Fan failure
Fatal Errors
A fatal error is a hardware error in which proper system operation
cannot be guaranteed. All fatal errors initiate a system-watchdog reset.
Parity errors on backplanes are an example of a fatal error.
2
Diagnostic Tools 2-7
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Error Reporting Mechanisms
Bus Errors
Bus errors are one of the mechanisms for error reporting on the
system. Bus errors are issued to the processor when the processor
references a virtual or physical location that cannot be satised for
hardware reasons. Some typical bus errors that occur are
G Illegal address or internal hardware failure
G Instruction fetch or data load
G On an SBus, direct virtual memory access (DVMA) operations
G Synchronous/asynchronous data store
G Memory management unit (MMU) operations
2
2-8 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Error Reporting Mechanisms
Interrupts
Interrupts are another mechanism for error reporting. These are device
dependent and are issued to notify the CPU of external device
conditions that are asynchronous with the normal operation.
Interrupts indicate
G Device done or ready
G Error detected (the action taken depends on the device driver)
G Change in power status
2
Diagnostic Tools 2-9
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Watchdog Resets
A watchdog reset is one of the most difcult computer failures to
diagnose. It is an error mode that attempts to bring the system to a
well-known (deterministic) state. It represents a response by the system
to a fault condition which is deemed potentially dangerous.
As a safety precaution, the system goes immediately down without
taking a core dump. The absence of the core dump le makes the
watchdog reset condition more difcult to analyze than the case of a
panic. Often, only OBP register commands are available to discern the
cause of the fault. Systems with sun4u and sun4d architectures can
also use prtdiag -v.
CPUWatchdog Reset
A CPU watchdog reset is initiated on a single processor machine when
a trap condition occurs while traps are disabled and a register bit to
enable traps is not set. Because this condition is related to a register
setting, it is referred to as a CPU watchdog reset.
The CPU branches to a reserved physical address, and system goes
immediately down. The causes for CPU watchdog resets can be either
hardware or software, and usually affect only one CPU.
SystemWatchdog Reset
When a fatal error is detected on a multiprocessor machine, a system
watchdog reset is initiated. A system watchdog reset affects all CPUs
and I/O devices. Writes in progress may be lost, but the state of main
memory is not altered and continues to be refreshed after a system
watchdog reset. In most cases, the system watchdog reset condition is
hardware related.
2
2-10 Sun Systems Fault Analysis Workshop
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Differentiating Hardware and Software Faults
Observation of the following details assist in differentiating hardware
from software problems when panic or watchdog resets occur:
G Fault frequency Hardware problems tend to come on suddenly, be
more random, and become aggravated as time goes on, while
software problems tend to more predictable and often repeatable.
G History Traumatic events such as frequent power failures,
dropping hardware or mishandling can lead to hardware
problems.
2
Diagnostic Tools 2-11
Copyright 1999 Sun Microsystems, Inc. All Rights Reserved. Enterprise Services July 1999, Revision D.1
Differentiating Hardware and Software Faults
G Watchdog resets Usually, (not always) watchdog resets are
hardware related.
G Log les Related messages in /var/adm/messages are checked.
Sometimes hardware and software failures are prefaced by
messages written to the log le. If there are disk error messages in
the log le, and UNIX