Kernel Debugging 101

Posted by on March 30, 2012

A dark fog had been rolling in that night, and we had been setting up a new cluster of servers for our CI system. CentOS 6.2, LXC and random kernel panics were all there to lend a hand. The kernel panics were new to our party, having been absent at the previous cluster setup.

The first set of servers we installed had been running happily, however this new set were not. They would always kernel panic under the slightest load, and sometimes without much load at all.

A nice feature, which is enabled by default in CentOS 6, allows your kernel to dump a core file when it panics. When your system comes back, you can retrieve this core file, examine it and work out what happened. This is a short story on how we did it. The work we did is on CentOS but it can easily be applied to other Linux distributions too.

Basics and theory

When a kernel panics, it dumps a core file into /var/crash/, which can then be examined.
In your tool belt, you need:

crash is gdb-like. It uses gdb and lets you examine vmcore files from kernels.

Basic debugging

Start crash by pointing it at the vmlinux file installed by the debuginfo kernel, and the vmcore file:

sudo crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux 
/var/crash/<time>/vmcore

You will see output like this:

      KERNEL: /usr/lib/debug/lib/modules/<kernel>/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-03-28-23:51:01/vmcore
  [PARTIAL DUMP]
        CPUS: 16
        DATE: Wed Mar 28 23:50:56 2012
      UPTIME: 00:23:26
LOAD AVERAGE: 0.95, 1.45, 1.01
       TASKS: 986
    NODENAME: buildtest07
     RELEASE: 2.6.32-220.7.1.el6.x86_64
     VERSION: #1 SMP Wed Mar 7 00:52:02 GMT 2012
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 24 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: ffff880337eb9580  (1 of 16)
[THREAD_INFO: ffff880637d18000]
         CPU: 12
       STATE: TASK_RUNNING (PANIC)

This tells us some important bits of information:

If you now run:

crash> log

and jump to the end, you will see much longer output like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
PGD 10e0fe067 PUD 10e0b0067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-6/removable
CPU 12
Modules linked in: veth bridge stp llc e1000e serio_raw
                   i2c_i801 i2c_core sg iTCO_wdt
                   iTCO_vendor_support ioatdma dca i7core_edac
                   edac_core shpchp ext3 jbd mbcache sd_mod
                   crc_t10dif ahci dm_mirror dm_region_hash
                   dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted <kernel> #1 Supermicro X8DTT-H/X8DTT-H
RIP: 0010:[<ffffffff8142bb40>]  [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
RSP: 0018:ffff88034ac83dc0  EFLAGS: 00010246
RAX: 0000000000000060 RBX: ffff8805353896c0 RCX: 0000000000000000
RDX: ffff88053e8c3380 RSI: 0000000000000286 RDI: ffff8805353896c0
RBP: ffff88034ac83e10 R08: 00000000000000c3 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000015 R14: ffff88034ac93770 R15: ffff88034ac93784
FS:  0000000000000000(0000) GS:ffff88034ac80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 000000010e130000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff880637d18000,
task ffff880337eb9580)
Stack:
 ffffc90013e37000 ffff880334bdc868 ffff88034ac83df0 0000000000000000
<0> ffff880334bdc868 ffff88034ac93788 ffff88034ac93700 0000000000000015
<0> ffff88034ac93770 ffff88034ac93784 ffff88034ac83e60 ffffffff8142c25a
Call Trace:
 <IRQ>
 [<ffffffff8142c25a>] process_backlog+0x9a/0x100
 [<ffffffff814308d3>] net_rx_action+0x103/0x2f0
 [<ffffffff81072001>] __do_softirq+0xc1/0x1d0
 [<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c24c>] call_softirq+0x1c/0x30
 [<ffffffff8100de85>] do_softirq+0x65/0xa0
 [<ffffffff81071de5>] irq_exit+0x85/0x90
 [<ffffffff814f4dc5>] do_IRQ+0x75/0xf0
 [<ffffffff8100ba53>] ret_from_intr+0x0/0x11
 <EOI>
 [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
 [<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
 [<ffffffff813fa027>] cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009e06>] cpu_idle+0xb6/0x110
 [<ffffffff814e5ffc>] start_secondary+0x202/0x245
Code: 00 44 8b 1d cb be 79 00 45 85 db 0f 85 61 06 00 00 f6 83 b9
      00 00 00 10 0f 85 5d 04 00 00 4c 8b 63 20 4c 89 65 c8 49 8d
      44 24 60 <49> 39 44 24 60 74 44 4d 8b ac 24 00 04 00 00 4d
      85 ed 74 37 49
RIP  [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
 RSP <ffff88034ac83dc0>
CR2: 0000000000000060

At this level the most interesting thing to note is the line starting with BUG:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058

This tells us why the panic happened: A NULL pointer dereference.

Intermediate debugging

Now some more gnarly stuff. If you’re still here, you’re pretty damn brave.
There some cool things we can do in crash which might help us.
For example, we can look at the list of running processes:

   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
>     0      0   1  ffff880638628a80  RU   0.0       0      0  [swapper]
>     0      0   2  ffff880337d934c0  RU   0.0       0      0  [swapper]
>     0      0   3  ffff8806386294c0  RU   0.0       0      0  [swapper]
>     0      0   4  ffff880337dd3580  RU   0.0       0      0  [swapper]
>     0      0   5  ffff880637c84080  RU   0.0       0      0  [swapper]
>     0      0   6  ffff880337df1540  RU   0.0       0      0  [swapper]
>     0      0   7  ffff880637c84ac0  RU   0.0       0      0  [swapper]
>     0      0   8  ffff880337e33500  RU   0.0       0      0  [swapper]
      0      0   9  ffff880637c85500  RU   0.0       0      0  [swapper]
>     0      0  10  ffff880337e774c0  RU   0.0       0      0  [swapper]
>     0      0  11  ffff880637cf40c0  RU   0.0       0      0  [swapper]
>     0      0  12  ffff880337eb9580  RU   0.0       0      0  [swapper]
>     0      0  13  ffff880637cf4b00  RU   0.0       0      0  [swapper]
>     0      0  14  ffff880337ed7540  RU   0.0       0      0  [swapper]
>     0      0  15  ffff880637cf5540  RU   0.0       0      0  [swapper]
      1      0   8  ffff880638628040  IN   0.0   21364   1568  init
      2      0   8  ffff880337c714c0  IN   0.0       0      0  [kthreadd]
      3      2   0  ffff880337c70a80  IN   0.0       0      0  [migration/0]
      4      2   0  ffff880337c70040  IN   0.0       0      0  [ksoftirqd/0]
      5      2   0  ffff880337c99500  IN   0.0       0      0  [migration/0]
....

The list continues on for another 950 lines. The lines starting with a > indicate the processes currently active on a CPU. We are working on a 16 core system, and 14 of them are currently in kernel mode tending to CPU scheduling duties. (The other 2 further down were gmond and php.)

You can also change the context you’re currently in. For example, currently we’re looking at the CPU where the kernel panic happened. However, we can switch to another CPU if we wanted to using set -c <CPU> and examine things there. (We don’t want to do this yet.)

Speaking of our kernel panic! Let’s get back to that..
The code at the end of the previous section tells us a null pointer dereference was the cause of the crash, and it gives us some clues as to what happened:

IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0

Here’s the code that was being executed! __netif_receive_skb, this sounds like it might have something to do with receiving things on the network interface, if I had to guess! (and if I did, I’d be right – many things in the kernel are named to be obvious).

We could even grep through the kernel source to find this!

shell> ack -a __netif_receive_skb
net/core/dev.c
    2705:int __netif_receive_skb(struct sk_buff *skb)

Looking at line 2705 of net/core/dev.c shows us the method that was running when things broke down.

Advanced debugging

Are you really still reading this? You’ve just gone from brave, to foolish! There is no turning back! Here’s the rabbit hole:

There’s a lot more information here! Just because we know the method that was invoked doesn’t help us that much. The method could be 2 lines and the solution could be clear, but more likely the method will be 100 lines long and make many references to other things. So let’s start poking at the memory to see what specifically happened:

IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0

There are two each bits of information in this line:

  1. ffffffff8142bb40 is the memory address at which the problem occurred.
  2. 0x60 is the hex offset for the line in the memory that caused the problem. 0x60 is 96 in decimal – remember this!
    We can inspect the memory by doing this:

    crash> dis -rl ffffffff8142bb40
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2706
    0xffffffff8142bae0 <__netif_receive_skb>:       push   %rbp
    0xffffffff8142bae1 <__netif_receive_skb+1>:     mov    %rsp,%rbp
    0xffffffff8142bae4 <__netif_receive_skb+4>:     push   %r15
    0xffffffff8142bae6 <__netif_receive_skb+6>:     push   %r14
    0xffffffff8142bae8 <__netif_receive_skb+8>:     push   %r13
    0xffffffff8142baea <__netif_receive_skb+10>:    push   %r12
    0xffffffff8142baec <__netif_receive_skb+12>:    push   %rbx
    0xffffffff8142baed <__netif_receive_skb+13>:    sub    $0x28,%rsp
    0xffffffff8142baf1 <__netif_receive_skb+17>:    nopl   0x0(%rax,%rax,1)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2715
    0xffffffff8142baf6 <__netif_receive_skb+22>:    cmpq   $0x0,0x18(%rdi)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2706
    0xffffffff8142bafb <__netif_receive_skb+27>:    mov    %rdi,%rbx
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2715
    0xffffffff8142bafe <__netif_receive_skb+30>:    jne    0xffffffff8142bb16 <__netif_receive_skb+54>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../arch/x86/include/asm/atomic_64.h: 23
    0xffffffff8142bb00 <__netif_receive_skb+32>:    mov    0xbdcbae(%rip),%eax        # 0xffffffff820086b4
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 1364
    0xffffffff8142bb06 <__netif_receive_skb+38>:    test   %eax,%eax
    0xffffffff8142bb08 <__netif_receive_skb+40>:    jne    0xffffffff8142c008 <__netif_receive_skb+1320>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 1367
    0xffffffff8142bb0e <__netif_receive_skb+46>:    movq   $0x0,0x18(%rdi)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../include/trace/events/net.h: 68
    0xffffffff8142bb16 <__netif_receive_skb+54>:    mov    0x79becb(%rip),%r11d        # 0xffffffff81bc79e8
    0xffffffff8142bb1d <__netif_receive_skb+61>:    test   %r11d,%r11d
    0xffffffff8142bb20 <__netif_receive_skb+64>:    jne    0xffffffff8142c187 <__netif_receive_skb+1703>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2719
    0xffffffff8142bb26 <__netif_receive_skb+70>:    testb  $0x10,0xb9(%rbx)
    0xffffffff8142bb2d <__netif_receive_skb+77>:    jne    0xffffffff8142bf90 <__netif_receive_skb+1200>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../include/linux/netpoll.h: 86
    0xffffffff8142bb33 <__netif_receive_skb+83>:    mov    0x20(%rbx),%r12
    0xffffffff8142bb37 <__netif_receive_skb+87>:    mov    %r12,-0x38(%rbp)
    0xffffffff8142bb3b <__netif_receive_skb+91>:    lea    0x60(%r12),%rax
    0xffffffff8142bb40 <__netif_receive_skb+96>:    cmp    %rax,0x60(%r12)

WOW! Lots of output!

What you’re looking at is the full list of steps that took place, starting with the method that was invoked and ending with the fault.
If you look at the last line of the output, you should recognise two key things:

0xffffffff8142bb40 <__netif_receive_skb+96>:    cmp    %rax,0x60(%r12)
  1. The memory address is the one we peeked
  2. The number 96 is the decimal offset in the memory where the fault occurred.
    A few lines above this, we see a new filename and line number mentioned, include/linux/netpoll.h:86.
    Congratulations! This is the actual line which caused the NULL pointer dereference!

But what is it actually DOING?

cmp    %rax,0x60(%r12)

Let’s examine this bit. In assembly, cmp is called to compare two registers. %rax and %r12 are two CPU registers which we want to compare. If we got back to our log output, we see the registers:

RAX: 0000000000000060 RBX: ffff8805353896c0 RCX: 0000000000000000
RDX: ffff88053e8c3380 RSI: 0000000000000286 RDI: ffff8805353896c0
RBP: ffff88034ac83e10 R08: 00000000000000c3 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000

RAX has some data in it, but R12 is NULL!

Now take a look at code that was mentioned earlier:

 84 static inline int netpoll_receive_skb(struct sk_buff *skb)
 85 {
 86         if (!list_empty(&skb->dev->napi_list))
 87                 return netpoll_rx(skb);
 88         return 0;
 89 }

If you know linux networking internals, a few things can be understood from this:

  1. We’re dealing with New API (NAPI) code, based on the variable name napi_list
  2. NAPI is invoked when the system thinks there are a large number of interrupts being handled, and changing to RX POLLING would be more efficient
  3. sk_buff is the struct that holds data about linux sockets
  4. We’re checking if the list, &skb->dev->napi_list is empty, so we’re expecting it to be defined, and be of the right type.

What is probably going on here, is that expecting &skb->dev->napi_list to hold a list of incoming packets that need to be processed. Unfortunately, it turns out that something, most likely napi_list, is not set, causing the NULL pointed dereference.

Congratulations! You’ve found the problem. Finding what the cause of this is, however, will be done in a future tutorial.
In the meantime we have some options for workarounds:

  1. Try and test if napi_list is NULL just before the !list_empty check, and define it. This might have other unforeseen issues though – what if it causes a memory leak? It’s certainly not fixing the cause of the problem.
  2. Disable the use of NAPI by recompiling the network driver. This is a pretty simple operation and probably worth trying, especially if the system isn’t massively CPU bound and dropping packets.

We ended up going with option 2, as it was the fastest and most likely fix.

But we’ll still be sleeping with one eye open…

Posted by on March 30, 2012
Category: engineering, operations

6 Comments

I’ll add one more step. Go to bugzilla.redhat.com (the people who *really* provided all that debugging code) and search to see if it’s a known issue. Yeah, even though you ran into the bug on CentOS, that still works. I didn’t find anything, but then I didn’t know which driver to look for.

Turning off NAPI seems like a bit of a band-aid; while it might avoid this particular problem, it also sends you down the less-tested code paths in that driver so you could just be heading for a different crash. Another step, therefore, is to contact the NIC vendor to see if there’s a driver/firmware update available.

Turning off NAPI in this case is a complete band-aid – the underlying problem still exists and might even manifest itself in other ways. However, so far things are running well.

In this case, the NIC is an Intel 82574L. We’re running the latest driver but we will be talking to Intel about it also.

How many interrupts per second are these servers dealing with now that NAPI’s been disabled?

Also, i’m not sure what driver you’re using but the e1000 driver can have interrupt throttling disabled by using the InterruptThrottleRate option (see http://www.kernel.org/doc/Documentation/networking/e1000.txt). I imagine that this is tunable with most other drivers.

In the crash log output said
BUG: unable to handle kernel NULL pointer dereference at 0000000000000060

But you write
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058

Is it typo or I miss something?

You guys beat the things out of me. Your post is worth reading for any wannabe sysadmin like me.

Thank you for your excellent explanation. I was able to discover my kernel panics were being caused in dquot_initialize (disk quota management) and sure enough quotacheck confirmed my quota files were corrupted.