The Linux SCSI developers don't necessarily maintain old revisions of the code due to space constraints. So, if you are not running the latest publicly released Linux kernel (note that many of the Linux distributions, such as MCC, SLS, and Yggdrasil, often lag one or even twenty patches behind this), chances are we will be unable to solve your problem. Before reporting a bug, please check whether it exists with the latest publicly available kernel.
If, after upgrading and reading this document thoroughly, you still believe that you have a bug, please mail a bug report to the SCSI channel of the mailing list, where it will be seen by many of the people who've contributed to the Linux SCSI drivers.
In your bug report, please provide as much information as possible regarding your hardware configuration, the exact text of all of the messages that Linux prints when it boots and when the error condition occurs, and where in the source code the error is. Use the procedures outlined in Capturing messages and Locating the source of a panic().
Failure to provide the maximum possible amount of information may result in misdiagnosis of your problem, or developers deciding that there are other more interesting problems to fix.
The bottom line is that if we can't reproduce your bug, and you can't point out to us what's broken, it won't get fixed.
If you are not running a kernel message logging system:
Ensure that the /proc filesystem is mounted:
grep proc /etc/mtab
If the /proc filesystem is not mounted, mount it:
mkdir /proc
chmod 755 /proc
mount -t proc /proc /proc
Copy the kernel revision and messages into a log file
cat /proc/version > /tmp/log
cat /proc/kmsg >> /tmp/log
Type Ctrl-C after a second or two.
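The manual steps above could also be wrapped in a small script. What follows is only a sketch in POSIX sh: the function names proc_is_mounted and capture_boot_log are made up for illustration, and the background cat plus kill stands in for typing the interrupt key by hand.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any package.

# proc_is_mounted MTAB
# Succeeds if MTAB (normally /etc/mtab) lists a filesystem of type "proc".
proc_is_mounted() {
    # mtab fields: device, mount point, filesystem type, options, ...
    awk '$3 == "proc" { found = 1 } END { exit !found }' "$1"
}

# capture_boot_log VERSION KMSG LOG [SECONDS]
# Writes VERSION, then whatever KMSG yields within SECONDS, to LOG.
capture_boot_log() {
    cat "$1" > "$3"
    # Reading /proc/kmsg blocks once the buffer is drained, so read in the
    # background and stop the reader after a short wait (default 2 seconds).
    cat "$2" >> "$3" &
    reader=$!
    sleep "${4:-2}"
    kill "$reader" 2>/dev/null || true
    wait "$reader" 2>/dev/null || true
}
```

Assuming root privileges, typical use would be `proc_is_mounted /etc/mtab || mount -t proc /proc /proc` followed by `capture_boot_log /proc/version /proc/kmsg /tmp/log`.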
If you are running some logger, you'll have to poke through the appropriate log files (/etc/syslog.conf should be of some use in locating them), or use dmesg.
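To find where a running logger puts kernel messages, the file destinations in /etc/syslog.conf can be listed mechanically. This sketch assumes the classic "selector action" layout, in which a file action is an absolute path, optionally prefixed with "-"; list_syslog_files is a made-up name.

```shell
#!/bin/sh
# list_syslog_files CONF  (hypothetical helper)
# Prints each distinct file destination named in a syslog.conf-style file.
list_syslog_files() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }    # skip comments and blank lines
        {
            action = $NF                   # the action is the last field
            sub(/^-/, "", action)          # drop the optional "-" prefix
            if (action ~ /^\//) print action
        }
    ' "$1" | sort -u
}
```

Running `list_syslog_files /etc/syslog.conf` prints one path per line. Device destinations such as a console device also begin with "/" and will appear in the output, so not every entry is necessarily a log file.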
If Linux is not yet bootstrapped, format a floppy diskette under DOS. Note that if your distribution mounts its root filesystem from a floppy rather than from a RAM disk, you'll have to format a diskette readable in the drive not being used to mount root, or use the distribution's RAM disk boot option.
Boot Linux off your distribution boot floppy, preferably in single user mode using a RAM disk as root.
mkdir /tmp/dos
Insert the diskette in a drive not being used to mount root, and mount it. For example,
mount -t msdos /dev/fd0 /tmp/dos
or
mount -t msdos /dev/fd1 /tmp/dos
Copy your log to it
cp /tmp/log /tmp/dos/log
Unmount the DOS floppy
umount /tmp/dos
And shut down Linux
shutdown
Reboot into DOS and, using your favorite communications software, include the log file in your trouble mail.
Like other unices, when a fatal error is encountered, Linux calls the kernel panic() function. Unlike other unices, Linux doesn't dump core to the swap or dump device and reboot automatically. Instead, a useful summary of state information is printed for the user to copy down manually. For example:
Unable to handle kernel NULL pointer dereference at virtual address c0000004
current->tss,cr3 = 00101000, %cr3 = 00101000
*pde = 00102027
*pte = 00000027
Oops: 0000
EIP: 0010:0019c905
EFLAGS: 00010002
eax: 0000000a ebx: 001cd0e8 ecx: 00000006 edx: 000003d5
esi: 001cd0a8 edi: 00000000 ebp: 00000000 esp: 001a18c0
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=001a09c8)
Stack: 0019c5c6 00000000 0019c5b2 00000000 0019c5a5 001cd0a8 00000002 00000000
001cd0e8 001cd0a8 00000000 001cdb38 001cdb00 00000000 001ce284 0019d001
001cd004 0000e800 fbfff000 0019d051 001cd0a8 00000000 001a29f4 00800000
Call Trace: 0019c5c6 0019c5b2 0018c5a5 0019d001 0019d051 00111508 00111502
0011e800 0011154d 00110f63 0010e2b3 0010ef55 0010ddb7
Code: 8b 57 04 52 68 d2 c5 19 00 e8 cd a0 f7 ff 83 c4 20 8b 4f 04
Aiee, killing interrupt handler
kfree of non-kmalloced memory: 001a29c0, next= 00000000, order=0
task[0] (swapper) killed: unable to recover
Kernel panic: Trying to free up swapper memory space
In swapper task - not syncing
Take the hexadecimal number on the EIP: line, in this case 19c905, and search through /usr/src/linux/zSystem.map for the highest number not larger than that address. For example,
0019a000 T _fix_pointers
0019c700 t _intr_scsi
0019d000 t _NCR53c7x0_intr
That tells you what function it's in. Recompile the source file which defines that function with debugging enabled, or the whole kernel if you prefer, by editing /usr/src/linux/Makefile and adding a "-g" to the CFLAGS definition.
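The System.map search described above is easy to get wrong by eye, so it can be scripted. The following is a sketch, assuming the map uses fixed-width lowercase hexadecimal addresses as in the excerpt; eip_from_oops and lookup_symbol are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helpers for the lookup described in the text.

# eip_from_oops FILE
# Prints the offset part of the first "EIP: segment:offset" line in FILE.
eip_from_oops() {
    sed -n 's/^EIP:[[:space:]]*[0-9a-fA-F]*:\([0-9a-fA-F]*\).*/\1/p' "$1" |
        head -n 1
}

# lookup_symbol EIP MAP
# Prints the MAP entry with the highest address not larger than EIP.
lookup_symbol() {
    # Normalize to eight lowercase hex digits, the width assumed for MAP.
    eip=$(printf '%08x' "0x${1#0x}")
    awk -v eip="$eip" '
        # Appending "" forces string comparison; for equal-width hex
        # strings, string order matches numeric order.
        ($1 "") <= (eip "") && ($1 "") > (best "") { best = $1; line = $0 }
        END { if (line != "") print line }
    ' "$2"
}
```

With a saved copy of the panic text in /tmp/log, `lookup_symbol "$(eip_from_oops /tmp/log)" /usr/src/linux/zSystem.map` would print the matching map line; for the excerpt above and EIP 19c905 that is the _intr_scsi entry.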
#
# standard CFLAGS
#
For example,
CFLAGS = -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe
becomes
CFLAGS = -g -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe
Rebuild the kernel, incrementally or by doing a
make clean
make
Make the kernel bootable by creating an entry in your /etc/lilo.conf for it
image = /usr/src/linux/zImage
label = experimental
and re-running LILO as root, or by creating a boot floppy
make zdisk
Reboot and record the new EIP for the error.
If you have script installed, you may want to start it, as it will log your debugging session to the typescript file.
Now, run
gdb /usr/src/linux/tools/zSystem
and enter
info line *<your EIP>
For example,
info line *0x19c905
To which GDB will respond something like
(gdb) info line *0x19c905
Line 2855 of "53c7,8xx.c" starts at address 0x19c905 <intr_scsi+641>
and ends at 0x19c913 <intr_scsi+655>.
Record this information. Then, enter
list <line number>
For example,
(gdb) list 2855
2850 /* printk("scsi%d : target %d lun %d unexpected disconnect\n",
2851 host->host_no, cmd->cmd->target, cmd->cmd->lun); */
2852 printk("host : 0x%x\n", (unsigned) host);
2853 printk("host->host_no : %d\n", host->host_no);
2854 printk("cmd : 0x%x\n", (unsigned) cmd);
2855 printk("cmd->cmd : 0x%x\n", (unsigned) cmd->cmd);
2856 printk("cmd->cmd->target : %d\n", cmd->cmd->target);
2857 if (cmd) {;
2858 abnormal_finished(cmd, DID_ERROR << 16);
2859 }
2860 hostdata->dsp = hostdata->script + hostdata->E_schedule /
2861 sizeof(long);
2862 hostdata->dsp_changed = 1;
2863 /* SCSI PARITY error */
2864 }
2865
2866 if (sstat0_sist0 & SSTAT0_PAR) {
2867 fatal = 1;
2868 if (cmd && cmd->cmd) {
2869 printk("scsi%d : target %d lun %d parity error.\n",
Record this information too, as it will provide context in case the developers' kernels differ from yours.
Obviously, quit will take you out of GDB.
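The two GDB queries can also be run non-interactively, which makes the output easy to save for the bug report. This is a sketch, assuming a GDB that supports the -batch and -x options and the `list *ADDRESS` form; gdb_commands_for and gdb_locate are names invented here.

```shell
#!/bin/sh
# Hypothetical helpers; not part of GDB or the kernel tree.

# gdb_commands_for EIP
# Prints the GDB commands that map EIP to a source line and list around it.
gdb_commands_for() {
    addr=${1#0x}
    printf 'info line *0x%s\n' "$addr"
    printf 'list *0x%s\n' "$addr"
}

# gdb_locate EIP IMAGE
# Runs those commands against IMAGE (a kernel built with -g) in batch mode.
gdb_locate() {
    cmds=$(mktemp) || return 1
    gdb_commands_for "$1" > "$cmds"
    gdb -batch -x "$cmds" "$2"
    status=$?
    rm -f "$cmds"
    return "$status"
}
```

For the example in this section, `gdb_locate 19c905 /usr/src/linux/tools/zSystem > /tmp/gdb.log` would save both responses to a file.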