24 Jun 2007

Netdump

RHEL provides a crash dump facility called netdump (short for network crash dump). Traditionally, UNIX writes the kernel dump to the swap partition, so a classical crash dump facility first needs to recover the dump before the partition is reused as swap. Other crash dump facilities, such as diskdump, write the kernel dump to disk. However, great care must be taken not to overwrite important data on the file system. Netdump avoids that problem by sending the kernel dump over the network to a netdump server.

You might argue that the kernel never crashes. Unfortunately, that is not true. Kernel crashes can be caused by software or hardware bugs (Oops, BUG(), panic). The kernel then responds by dumping as much information as it can (processor state, stack trace and so on) to the console. This might be enough for an experienced kernel hacker to find out what went wrong, but some crashes require an analysis of the kernel's memory dump.

Netdump server:
1. Install the netdump-server package.
2. Set a password for the netdump user:

  # passwd netdump
  Changing password for user netdump.
  New UNIX password:

3. Make sure /var/crash has enough free space. Netdump writes there, and a kernel dump can take anywhere from 500 MiB up to several GiB, depending on the amount of memory in use on the client.
4. Start the netdump-server:

  # service netdump-server start
  Starting netdump server:                           [  OK  ]
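
Step 3 is easy to overlook, so it can be worth checking the free space on the server before the first crash. The following is a minimal sketch and not part of netdump itself: has_room is a hypothetical helper name, and the 2 GiB figure is only an example threshold.

```shell
# Hypothetical helper: succeeds if the file system holding DIR has at
# least NEED_KIB kibibytes available.
has_room() {
    dir=$1
    need_kib=$2
    # df -P prints one POSIX-format line per file system; with -k the
    # fourth column is the available space in KiB.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kib" ] && [ "$avail_kib" -ge "$need_kib" ]
}

# Example threshold (an assumption): room for a 2 GiB dump.
if [ -d /var/crash ] && ! has_room /var/crash $((2 * 1024 * 1024)); then
    echo "warning: /var/crash has less than 2 GiB free" >&2
fi
```

Pick the threshold from the amount of memory in the largest client you expect to dump.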

Netdump client:
1. Install the netdump package.
2. Edit /etc/sysconfig/netdump and add the netdump server:

  ...
  NETDUMPADDR=192.168.1.104
  ...

3. Propagate the shared secret to the server. This just copies the ssh public key to the crashdump server:

  # service netdump propagate
  netdump@192.168.1.104's password:

The above command just does:

  cat /etc/sysconfig/netdump_id_dsa.pub | \
  ssh -x netdump@$NETDUMPADDR cat '>>' /var/crash/.ssh/authorized_keys2

4. Restart netdump:

  # service netdump restart
  initializing netdump                              [  OK  ]
  initializing netconsole                           [  OK  ]

At the netdump server, a client directory is created for dump files:

 /var/crash/192.168.1.103-2007-06-24-09:44
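
Judging from the example above, the directory name is the client's IP address followed by a timestamp. Assuming that layout (an IPv4 address, then a hyphen, then the date and time), a small sketch for splitting such a name in a script could look like this. split_dump_dir is a hypothetical helper, not something netdump ships:

```shell
# Hypothetical helper: print "<client-ip> <timestamp>" for a dump directory.
# Assumes the name looks like 192.168.1.103-2007-06-24-09:44, i.e. an IPv4
# address (which contains no hyphen) followed by "-" and a timestamp.
split_dump_dir() {
    name=${1%/}          # drop a trailing slash, if any
    name=${name##*/}     # keep only the last path component
    ip=${name%%-*}       # everything before the first hyphen
    stamp=${name#*-}     # everything after the first hyphen
    printf '%s %s\n' "$ip" "$stamp"
}

split_dump_dir /var/crash/192.168.1.103-2007-06-24-09:44
# prints: 192.168.1.103 2007-06-24-09:44
```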

Time for some testing! Let's crash the client by using sysrq.

 # sysctl -w kernel/sysrq=1
 kernel.sysrq = 1
 # echo "c" > /proc/sysrq-trigger

The kernel now crashes, but right before it reboots, it sends the dump to the netdump server (UDP port 6666). At the end of the dump, a SysRq-t is performed, which dumps a list of the current tasks and their information.

Note! While the dump is in progress, interrupts are disabled. One consequence of this is that the keyboard is unresponsive.


At the server, two files are generated:

  # ls -lh /var/crash/192.168.1.103-2007-06-24-09\:44/
  -rw-------    1 netdump  netdump      1.3K Jun 24 09:44 log
  -rw-------    1 netdump  netdump      510M Jun 24 09:44 vmcore

You can now analyze the dump ("vmcore") using gdb, crash or similar to figure out what went wrong. Enjoy!
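
One common way to open a netdump vmcore on RHEL is the crash utility, which needs a vmlinux with debug symbols that matches the crashed kernel. The sketch below assumes the kernel-debuginfo package is installed and puts that file under /usr/lib/debug/lib/modules/<release>/vmlinux. debug_vmlinux is a hypothetical helper, and the paths may differ on your system.

```shell
# Hypothetical helper: print where kernel-debuginfo is assumed to put the
# debug vmlinux for a given kernel release (defaults to the running kernel).
debug_vmlinux() {
    release=${1:-$(uname -r)}
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$release"
}

# Usage on the netdump server (not run here). The release must be the one
# the client was running when it crashed:
#   cd /var/crash/192.168.1.103-2007-06-24-09:44
#   crash "$(debug_vmlinux <client-kernel-release>)" vmcore
```

Inside crash, commands such as bt (backtrace), log (kernel message buffer) and ps (process list) are a reasonable place to start.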
