
Solaris Troubleshooting: Local Boot Issues, Errors and Resolutions (for Legacy Solaris 2.6)

The following are some of the most common local boot errors and possible workarounds.

Local boot

1) Timeout waiting for ARP/RARP

Note: If this is an Enterprise Server, make sure that the key switch on the front panel is not in the diag position (turned to the right). The switch should be vertical, pointing straight up and down.

If this error occurs immediately after the boot command is issued (before the OS release is displayed), check diag-switch? by typing printenv diag-switch? at the ok prompt, and see whether it is set to true (see the note above regarding Enterprise Servers). When diag-switch? is true, the system attempts to boot from the diag-device rather than the boot-device, and by default the diag-device is set to net.

I have seen cases where diag-switch? is set to false and the system, after a reset, still attempts to boot from the diag-device. In this case it may be necessary to set the diag-device to the root disk (i.e. disk). This may indicate a stuck bit.

Another possibility is that the first boot-device is not being seen, so the system falls through to the second boot-device, which by default is net. Run a probe for that disk (the probe command depends on the system type).

If ARP/RARP timeouts appear after the kernel starts (i.e. the OS release is displayed on screen), some startup command is trying to go over the network. This can sometimes be debugged by placing a set -x in the /etc run-level scripts (e.g. rcS, rc2, rc3) for the suspected run levels, or by doing a verbose boot of the root drive with boot -v from the ok prompt.

2) Can't open kernel/unix

This can mean that the system can't open kernel/unix because the OS revision is wrong for that system (during a cdrom boot), or that the kernel is corrupt and can't be opened. Finally, I have seen boot-file set to kernel/unix but with an extra character, which makes the system unbootable even from cdrom. The solution in that case is to type set-default boot-file at the ok prompt.
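The checks in items 1 and 2 can be summarized as a small decision model. This is a simplified sketch, not an OBP or Solaris API: it assumes the NVRAM settings have been copied by hand from printenv output into a dict, and that `visible` is the set of device aliases a probe actually found. The function names `first_boot_device` and `boot_file_differs` are hypothetical.

```python
"""Simplified model of the boot-device checks described in items 1 and 2.

Assumptions (hypothetical, not an OBP or Solaris API): NVRAM settings are a
dict copied by hand from printenv output, and `visible` is the set of device
aliases that a probe actually found.
"""
from typing import Dict, Optional, Set


def first_boot_device(nvram: Dict[str, str], visible: Set[str]) -> Optional[str]:
    """Return the first device the system would try to boot from, or None."""
    if nvram.get("diag-switch?", "false") == "true":
        # diag-switch? true: boot from diag-device instead of boot-device.
        candidates = nvram.get("diag-device", "net").split()
    else:
        # Order assumed from the text: the root disk first, then net.
        candidates = nvram.get("boot-device", "disk net").split()
    for device in candidates:
        # A device the probe cannot see is skipped, so a missing root disk
        # falls through to the next entry (by default, net).
        if device in visible:
            return device
    return None


def boot_file_differs(nvram: Dict[str, str], defaults: Dict[str, str]) -> bool:
    """True if boot-file no longer matches its default (fix: set-default boot-file)."""
    return nvram.get("boot-file", "") != defaults.get("boot-file", "")


if __name__ == "__main__":
    settings = {"diag-switch?": "false", "boot-device": "disk net",
                "boot-file": "kernel/unixx"}
    # Root disk not seen by the probe: the system falls through to net.
    print(first_boot_device(settings, visible={"net"}))  # net
    print(boot_file_differs(settings, defaults={"boot-file": ""}))  # True
```

In this model, a result of net when you expected the root disk points either at diag-switch? being true or at the root disk not being visible to the probe, which are the two causes item 1 describes.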
3) Can't mount /usr

Possible causes: /usr is not mountable because of corruption or a bad superblock, or the /etc/vfstab entry is missing or points to the wrong block device or mount point.

4) Cannot create /var/adm/utmp or /var/adm/utmpx

First check (by booting from cdrom with the -s option) that /var can be mounted. (fsck may have to be run against /var if it is a separate partition.) This error almost always points to a corrupted kernel caused by a kernel patch that was added in multi-user mode. Some SRDBs recommend booting single user from cdrom (i.e. boot cdrom -s from the ok prompt); however, I have never seen this procedure work. What we have had to resort to is:

(1) Restore the system disk from backup (if available).
(2) Run an upgrade from the same OS release and revision. An upgrade is not a guaranteed fix, but it has been done successfully. Afterwards it may be necessary to repatch the system to the most current patch cluster revision and/or individual needed patches.
(3) As a last resort, back up all data files and reinstall.

5) Spawning too rapidly

This indicates corruption and can normally be corrected by booting single user from cdrom and running fsck on the raw logical device for the root disk. Example:

    fsck /dev/rdsk/c0t0d0s0

It may be necessary to run this command repeatedly until it runs cleanly. Then halt the system and attempt a boot from the root disk.

6) File just loaded does not appear to be executable

This can be caused by a missing or corrupt boot block. If the boot alias or boot path is not correct, or if the root drive is not known, it may be necessary to boot single user from cdrom and run format to look at the partition table and determine the boot partition. If the boot alias is correct, it may be necessary to install a boot block. First mount the root drive to verify that root (and its files and directories) are all there. If so, install a boot block with the following procedure:

    ok boot cdrom -s
    # cd /
    # fsck /dev/dsk/c0t3d0s0
    # mount /dev/dsk/c0t3d0s0 /a
    # cp /platform/`uname -i`/ufsboot /a/platform/`uname -i`/ufsboot
    # cd /
    # umount /a
    # /usr/sbin/installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk \
      /dev/rdsk/c0t3d0s0
    # halt

7) Can't find boot program

This is similar to the above error, but it indicates a missing boot program. It can be caused by missing root files/directories or, of course, by an incorrect boot alias/path.

8) Fast Data Access MMU Miss (or CPU panics)

This is a memory management error. It can be caused by bad memory or by corrupt software. There is a known issue with Sun Enterprise Systems with the 400MHz CPU with 8MB cache, as described below:

When trying to install Solaris 2.5.1 HW 11/97 or 2.6 HW 5/98 on an Ex000 server with a 400MHz/8MB CPU module, booting from cdrom or a network install server produces a Fast Data Access MMU Miss error or a panic with mutex_enter: bad mutex.

Solution (NOTE: this procedure requires downloading and applying patches, so the system must have a network connection):

1. Verify the OpenBoot PROM (OBP) version by typing .version at the ok prompt, or use the /usr/sbin/prtconf -V command at the UNIX prompt. If needed, upgrade to at least flash PROM version 3.2.21 using patch 103346-22 or greater.
2. ok setenv auto-boot? false
3. ok reset (usually not needed with 2.6)
4. ok limit-ecache-size
5. ok boot cdrom (at least 2.5.1 HW 11/97 or 2.6 HW 3/98)
6. Install the OS but do not allow auto-reboot!
7. # init 0
8. ok reset (usually not needed with 2.6)
9. ok limit-ecache-size
10. ok boot
11. Make sure you have a network connection, FTP to sunsolve.sun.com and get the latest kernel patch (minimum levels to support 400MHz/8MB cache listed):
12. Patches should be installed single-user, so bring the system down with init S.
13. Reboot.

9) CPU panic

This error can also be caused by hardware or software. In a single-CPU system it is easy to distinguish between the two by booting single user from cdrom: if the system boots, the problem is software related, because the kernel uses the same hardware whether it boots from cdrom or from a hard drive. In a system with multiple CPUs, however, it may be necessary to run prtdiag -v from the cdrom (while in single-user mode). If prtdiag -v doesn't point to the bad CPU, it will be necessary to do a boot kadb from the ok prompt. It may be necessary to get the kernel group involved to either analyze the kadb session or force a core dump. If the above test points to software, the next step is to run fsck on the raw logical device for the root disk (see item 5).

Solaris Troubleshooting: Determine cause of a system fault at the ok prompt

The following is useful information on how to determine the cause of a system crash at the ok prompt, using the OpenBoot commands. The Synchronous Fault Status Register (SFSR) provides information on exceptions (faults) issued by the Memory Management Unit (MMU). At the ok prompt, type:

    .sfsr

Look at the fault type (FT). This number is in hexadecimal format and means:

    0  No error
    1  Invalid address error
    2  Protection error
    3  Privilege violation error
    4  Translation error
    5  Access bus error (timeout)
    6  Internal error
    7  Reserved
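The FT table above is a simple lookup, sketched below in Python. This is a minimal illustration, assuming you have already read the FT field out of the .sfsr output by eye; it does not parse the raw register, and the name `describe_fault_type` is hypothetical.

```python
"""Map an SFSR fault type (FT) value, as read at the ok prompt, to its meaning.

Assumption: the FT field has already been picked out of the .sfsr output by
hand; this sketch only implements the lookup table from the text.
"""

FAULT_TYPES = {
    0x0: "No error",
    0x1: "Invalid address error",
    0x2: "Protection error",
    0x3: "Privilege violation error",
    0x4: "Translation error",
    0x5: "Access bus error (timeout)",
    0x6: "Internal error",
    0x7: "Reserved",
}


def describe_fault_type(ft: str) -> str:
    """Map a hexadecimal FT string such as "5" or "0x5" to its meaning."""
    value = int(ft, 16)  # base 16 also accepts an optional 0x prefix
    if value not in FAULT_TYPES:
        raise ValueError(f"FT value {ft!r} is outside the documented range 0-7")
    return FAULT_TYPES[value]


if __name__ == "__main__":
    for sample in ("0", "0x2", "5"):
        print(sample, "->", describe_fault_type(sample))
```

Rejecting values outside 0-7 keeps a misread field from being silently reported as a valid fault type.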
