The Linux SCSI HOWTO: Reporting Bugs

3. Reporting Bugs

The Linux SCSI developers don't necessarily maintain old revisions of the code due to space constraints. If you are not running the latest publicly released Linux kernel, chances are we will be unable to solve your problem. Note that many Linux distributions, such as MCC, SLS, and Yggdrasil, often lag one or even twenty patches behind the latest release. So, before reporting a bug, please check whether it still exists with the latest publicly available kernel.

If, after upgrading and reading this document thoroughly, you still believe that you have a bug, please mail a bug report to the SCSI channel of the mailing list, where it will be seen by many of the people who've contributed to the Linux SCSI drivers.

In your bug report, please provide as much information as possible regarding your hardware configuration, the exact text of all of the messages that Linux prints when it boots, when the error condition occurs, and where in the source code the error is. Use the procedures outlined in Capturing messages and Locating the source of a panic().

Failure to provide the maximum possible amount of information may result in misdiagnosis of your problem, or developers deciding that there are other more interesting problems to fix.

The bottom line is that if we can't reproduce your bug, and you can't point out to us what's broken, it won't get fixed.

3.1 Capturing messages

If you are not running a kernel message logging system:

Ensure that the /proc filesystem is mounted.


grep proc /etc/mtab

If the /proc filesystem is not mounted, mount it


mkdir /proc
chmod 755 /proc
mount -t proc /proc /proc
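The check-then-mount step can be sketched as a small shell function. This is only a sketch: the name proc_is_mounted is made up for illustration, and it assumes that /etc/mtab uses the usual layout, with the mount point in the second field and the filesystem type in the third. Unlike a plain grep, it will not be fooled by other entries that merely contain the string "proc".

```shell
# Hypothetical helper (not part of any distribution): succeed only if a
# filesystem of type "proc" is listed as mounted on /proc.
# Assumes the usual mtab layout: device, mount point, type, options, ...
proc_is_mounted() {
    mtab=${1:-/etc/mtab}
    awk '$2 == "/proc" && $3 == "proc" { found = 1 }
         END { exit !found }' "$mtab"
}

# Possible use, as root:
#   proc_is_mounted || mount -t proc /proc /proc
```

The mtab path is a parameter only so that the function can be tried against a copy of the file.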

Copy the kernel revision and messages into a log file


cat /proc/version > /tmp/log
cat /proc/kmsg >> /tmp/log

Type Ctrl-C after a second or two.
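Reading /proc/kmsg blocks while waiting for new messages, which is why the cat has to be interrupted by hand. If you would rather not time the interrupt yourself, the capture can be sketched as a function that stops the reader after a couple of seconds. The function name capture_kernel_log and its arguments are hypothetical, and the two-second default is simply taken from the advice above.

```shell
# Hypothetical helper: write the kernel version and the pending kernel
# messages to a log file, stopping the message reader after a short delay.
# Arguments (all optional): version file, message file, log file, seconds.
capture_kernel_log() {
    version_file=${1:-/proc/version}
    kmsg_file=${2:-/proc/kmsg}
    log_file=${3:-/tmp/log}
    seconds=${4:-2}

    cat "$version_file" > "$log_file" || return 1

    # Read the messages in the background; on a real /proc/kmsg this
    # cat blocks once it has caught up, so it has to be killed.
    cat "$kmsg_file" >> "$log_file" &
    cat_pid=$!

    sleep "$seconds"
    kill "$cat_pid" 2>/dev/null || true   # it may already have exited
    wait "$cat_pid" 2>/dev/null || true
    return 0
}

# Possible use, as root:
#   capture_kernel_log
```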

If you are running some logger, you'll have to poke through the appropriate log files (/etc/syslog.conf should be of some use in locating them), or use dmesg.

If Linux is not yet bootstrapped, format a floppy diskette under DOS. Note that if your distribution mounts root from a floppy rather than from a RAM disk, you'll have to format a diskette readable in the drive not being used to mount root, or use the distribution's ramdisk boot option.

Boot Linux off your distribution boot floppy, preferably in single user mode using a RAM disk as root.


mkdir /tmp/dos

Insert the diskette in a drive not being used to mount root, and mount it. For example,


mount -t msdos /dev/fd0 /tmp/dos

or


mount -t msdos /dev/fd1 /tmp/dos

Copy your log to it


cp /tmp/log /tmp/dos/log

Unmount the DOS floppy


umount /tmp/dos

And shut down Linux


shutdown

Reboot into DOS and, using your favorite communications software, include the log file in your trouble mail.
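The mount, copy, and unmount steps above can be collected into one function so that no step is forgotten. This is a sketch only: save_log_to_floppy, run, and the DRY_RUN variable are names invented here, and the defaults simply repeat the paths used above. With DRY_RUN=1 the commands are printed instead of executed, which lets you check them before running anything as root.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: copy a log file onto a DOS-formatted floppy.
# Arguments (all optional): floppy device, log file, mount point.
save_log_to_floppy() {
    device=${1:-/dev/fd0}
    log_file=${2:-/tmp/log}
    mount_point=${3:-/tmp/dos}

    # mkdir -p is used so that an existing mount point is not an error.
    run mkdir -p "$mount_point" &&
    run mount -t msdos "$device" "$mount_point" &&
    run cp "$log_file" "$mount_point/log" &&
    run umount "$mount_point"
}

# Possible use, as root, with the diskette in the second drive:
#   save_log_to_floppy /dev/fd1
```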

3.2 Locating the source of a panic()

Like other Unices, Linux calls the kernel panic() function when it encounters a fatal error. Unlike other Unices, Linux doesn't dump core to the swap or dump device and reboot automatically. Instead, it prints a useful summary of state information for the user to copy down manually. For example:


Unable to handle kernel NULL pointer dereference at virtual address c0000004
current->tss,cr3 = 00101000, %cr3 = 00101000
*pde = 00102027
*pte = 00000027
Oops: 0000
EIP:    0010:0019c905
EFLAGS: 00010002
eax: 0000000a   ebx: 001cd0e8   ecx: 00000006   edx: 000003d5
esi: 001cd0a8   edi: 00000000   ebp: 00000000   esp: 001a18c0
ds: 0018   es: 0018   fs: 002b   gs: 002b   ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=001a09c8)
Stack: 0019c5c6 00000000 0019c5b2 00000000 0019c5a5 001cd0a8 00000002 00000000
       001cd0e8 001cd0a8 00000000 001cdb38 001cdb00 00000000 001ce284 0019d001
       001cd004 0000e800 fbfff000 0019d051 001cd0a8 00000000 001a29f4 00800000
Call Trace: 0019c5c6 0019c5b2 0018c5a5 0019d001 0019d051 00111508 00111502
            0011e800 0011154d 00110f63 0010e2b3 0010ef55 0010ddb7
Code: 8b 57 04 52 68 d2 c5 19 00 e8 cd a0 f7 ff 83 c4 20 8b 4f 04
Aiee, killing interrupt handler
kfree of non-kmalloced memory: 001a29c0, next= 00000000, order=0
task[0] (swapper) killed: unable to recover
Kernel panic: Trying to free up swapper memory space
In swapper task - not syncing

Take the hexadecimal number on the EIP: line, in this case 19c905, and search through /usr/src/linux/zSystem.map for the highest number not larger than that address. For example,


0019a000 T _fix_pointers
0019c700 t _intr_scsi
0019d000 t _NCR53c7x0_intr
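Searching the map by eye is error-prone, so the lookup can be sketched with sed and awk. The function names extract_eip and lookup_symbol are invented for this example. The sketch assumes that the map lists each symbol as an 8-digit hexadecimal address, a type letter, and a name, as in the excerpt above, and that the panic message was saved to a file.

```shell
# Hypothetical helper: print the offset part of the "EIP:" line
# (the digits after the segment and colon) from a saved panic message.
extract_eip() {
    sed -n 's/^EIP:[[:space:]]*[0-9A-Fa-f]*:\([0-9A-Fa-f]*\).*$/\1/p' "$1"
}

# Hypothetical helper: print the highest map entry whose address is not
# larger than the given EIP. Arguments: EIP in hex, map file (optional).
lookup_symbol() {
    map_file=${2:-/usr/src/linux/zSystem.map}
    # Pad to 8 lowercase hex digits so that it can be compared as a string
    # with the fixed-width addresses in the map.
    eip=$(printf '%08x' "0x${1#0x}") || return 1
    awk -v eip="$eip" '
        {
            addr = tolower($1) ""
            if (length(addr) == 8 && addr <= (eip "") && addr >= best) {
                best = addr
                name = $3
            }
        }
        END {
            if (name == "")
                exit 1
            print best, name
        }
    ' "$map_file"
}

# Possible use:
#   lookup_symbol "$(extract_eip /tmp/log)"
```

Run against the three map lines above with an EIP of 19c905, lookup_symbol prints 0019c700 _intr_scsi.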

That tells you what function it's in. Recompile the source file which defines that function with debugging enabled, or the whole kernel if you prefer, by editing /usr/src/linux/Makefile and adding a "-g" to the CFLAGS definition.


#
# standard CFLAGS
#

For example,


CFLAGS = -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe

becomes


CFLAGS = -g -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe
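If you prefer not to edit the Makefile by hand, the same change can be sketched with sed. The function name add_debug_flag is made up, and the sketch assumes that the Makefile has a plain "CFLAGS =" line like the one shown above; assignments written differently are left alone. The original file is kept with a .orig suffix.

```shell
# Hypothetical helper: add -g to the "CFLAGS =" line of a Makefile,
# unless it is already there. Argument (optional): path to the Makefile.
add_debug_flag() {
    makefile=${1:-/usr/src/linux/Makefile}

    # Nothing to do if -g already appears as a separate flag.
    if grep -Eq '^CFLAGS[[:space:]]*=(.*[[:space:]])?-g([[:space:]]|$)' "$makefile"; then
        return 0
    fi

    # Keep a backup, then rewrite the file with -g as the first flag.
    cp "$makefile" "$makefile.orig" &&
    sed 's/^CFLAGS[[:space:]]*=[[:space:]]*/CFLAGS = -g /' "$makefile.orig" > "$makefile"
}

# Possible use, as the owner of the kernel source tree:
#   add_debug_flag /usr/src/linux/Makefile
```

Running it a second time leaves the file unchanged, so it is safe to call before every debugging build.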

Rebuild the kernel, incrementally or by doing a


make clean
make

Make the kernel bootable by creating an entry in your /etc/lilo.conf for it


image = /usr/src/linux/zImage
label = experimental

and re-running LILO as root, or by creating a boot floppy


make zImage

Reboot and record the new EIP for the error.

If you have script installed, you may want to start it, as it will log your debugging session to the typescript file.

Now, run


gdb /usr/src/linux/tools/zSystem

and enter


info line *<your EIP>

For example,


info line *0x19c905

To which GDB will respond something like


(gdb) info line *0x19c905
Line 2855 of "53c7,8xx.c" starts at address 0x19c905 <intr_scsi+641>
   and ends at 0x19c913 <intr_scsi+655>.

Record this information. Then, enter


list <line number>

For example,


(gdb) list 2855
2850    /*      printk("scsi%d : target %d lun %d unexpected disconnect\n",
2851                host->host_no, cmd->cmd->target, cmd->cmd->lun); */
2852            printk("host : 0x%x\n", (unsigned) host);
2853            printk("host->host_no : %d\n", host->host_no);
2854            printk("cmd : 0x%x\n", (unsigned) cmd);
2855            printk("cmd->cmd : 0x%x\n", (unsigned) cmd->cmd);
2856            printk("cmd->cmd->target : %d\n", cmd->cmd->target);
2857            if (cmd) {;
2858                abnormal_finished(cmd, DID_ERROR << 16);
2859            }
2860            hostdata->dsp = hostdata->script + hostdata->E_schedule / 
2861                sizeof(long);
2862            hostdata->dsp_changed = 1;
2863        /* SCSI PARITY error */
2864        } 
2865
2866        if (sstat0_sist0 & SSTAT0_PAR) {
2867            fatal = 1;
2868            if (cmd && cmd->cmd) {
2869                printk("scsi%d : target %d lun %d parity error.\n",

Record this information too, as it will provide context in case the developers' kernels differ from yours.

The quit command will take you out of GDB.
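As an alternative to logging an interactive session with script, the two GDB queries can be written to a command file and run in batch mode. This is a sketch under assumptions: write_gdb_commands is a made-up name, /tmp/gdb.cmds is an arbitrary path, and your GDB must accept the -batch and -x options. It uses list with an address (list *ADDRESS) so that the line number does not have to be copied from the first command's output.

```shell
# Hypothetical helper: write a GDB command file that reports the source
# line for an EIP and lists the code around it.
# Arguments: EIP in hex (with or without 0x), output file (optional).
write_gdb_commands() {
    eip=${1#0x}
    out=${2:-/tmp/gdb.cmds}
    {
        printf 'info line *0x%s\n' "$eip"
        printf 'list *0x%s\n' "$eip"
    } > "$out"
}

# Possible use:
#   write_gdb_commands 19c905 /tmp/gdb.cmds
#   gdb -batch -x /tmp/gdb.cmds /usr/src/linux/tools/zSystem > /tmp/gdb.log
```

The resulting /tmp/gdb.log can then be included in the bug report together with the captured kernel messages.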