24 Jun 2007

Netdump

RHEL provides a crash dump facility called netdump (short for network crash dump). Traditionally, UNIX writes the kernel dump to the swap partition, so a classical crash dump facility first needs to recover the dump before the partition is reused as swap. Other crash dump facilities enable kernel dumps to be written to disk; diskdump is one such facility. However, great care must be taken not to overwrite important data on the file system. Netdump avoids that problem by sending the kernel dump over the network to a netdump server.

You might argue that the kernel never crashes. Unfortunately, that is not true. Kernel crashes can be caused by software and hardware bugs (Oops, BUG(), panic). The kernel then responds by dumping as much information as it can (processor state, stack trace and so on) to the console. This might be enough for an experienced kernel hacker to find out what went wrong, but some crashes require an analysis of the kernel's memory dump.

Netdump server:
1. Install the netdump-server package.
2. Set a password for the netdump user:

  # passwd netdump
  Changing password for user netdump.
  New UNIX password:

3. Make sure /var/crash has enough free space. Netdump writes its dumps there, and a kernel dump can take anywhere from 500 MiB to several GiB, depending on the amount of memory used on the client.
4. Start the netdump-server:

  # service netdump-server start
  Starting netdump server:                           [  OK  ]
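
Since a single dump can run to several GiB, it may be worth checking the free space under /var/crash before relying on the server. The following is only a sketch, assuming POSIX "df -P" output; the function names avail_kib and has_room_for_dump are made up for this example and are not part of netdump:

```shell
#!/bin/sh
# Print the space available (in KiB) on the file system holding directory $1.
avail_kib() {
    # -P forces the POSIX one-line-per-filesystem format;
    # column 4 of the data line is the "Available" figure.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the file system holding directory $1 has at least $2 KiB free.
has_room_for_dump() {
    [ "$(avail_kib "$1")" -ge "$2" ]
}

# Example: is there room for a 4 GiB dump (4 * 1024 * 1024 KiB)?
# has_room_for_dump /var/crash 4194304 && echo "enough space"
```

The threshold is a guess you have to make yourself; the amount of memory in your largest client is a reasonable starting point.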

Netdump client:
1. Install the netdump package.
2. Edit /etc/sysconfig/netdump and add the address of the netdump server:

  ...
  NETDUMPADDR=192.168.1.104
  ...

3. Propagate the shared secret to the server. This just copies the client's SSH public key to the netdump server:

  # service netdump propagate
  netdump@192.168.1.104's password:

The above command just does:

  cat /etc/sysconfig/netdump_id_dsa.pub | \
  ssh -x netdump@$NETDUMPADDR cat '>>' /var/crash/.ssh/authorized_keys2

4. Restart netdump:

  # service netdump restart
  initializing netdump                              [  OK  ]
  initializing netconsole                           [  OK  ]
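
If you want to double-check which server the client is pointed at, you can pull NETDUMPADDR out of the config file without sourcing it. A minimal sketch, assuming the plain KEY=value form shown in step 2 (no quotes around the value); netdump_addr is a hypothetical helper name:

```shell
#!/bin/sh
# Print the NETDUMPADDR value from the sysconfig-style file given as $1.
# If the key is assigned more than once, the last assignment is printed,
# which matches what a shell sourcing the file would end up with.
netdump_addr() {
    sed -n 's/^NETDUMPADDR=//p' "$1" | tail -n 1
}

# Example:
# netdump_addr /etc/sysconfig/netdump
```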

At the netdump server, a client directory is created for dump files:

 /var/crash/192.168.1.103-2007-06-24-09:44
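
Judging from this one example, the directory name is the client's IP address followed by a timestamp. If you collect dumps from many clients, splitting the name can help with sorting. A sketch under that assumption (IPv4 address, then "-", then the timestamp); dump_client_ip and dump_timestamp are names invented here:

```shell
#!/bin/sh
# Split a dump directory name of the assumed form <client-ip>-<timestamp>.

# Print the client address: everything before the first "-".
dump_client_ip() {
    name=$(basename "$1")
    printf '%s\n' "${name%%-*}"
}

# Print the timestamp: everything after the first "-".
dump_timestamp() {
    name=$(basename "$1")
    printf '%s\n' "${name#*-}"
}

# Example:
# dump_client_ip /var/crash/192.168.1.103-2007-06-24-09:44
# dump_timestamp /var/crash/192.168.1.103-2007-06-24-09:44
```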

Time for some testing! Let's crash the client! We crash the client by using sysrq. Read more about sysrq in the "Magical SysRq" post.

 # sysctl -w kernel/sysrq=1
 kernel.sysrq = 1
 # echo "c" > /proc/sysrq-trigger

The kernel now crashes, but right before it reboots, it dumps to the netdump server (UDP port 6666). At the end of the dump, a SysRq-t is performed. SysRq-t dumps a list of current tasks and their information.

Note! While the dump is in progress, interrupts are disabled. One consequence of this is that the keyboard is unresponsive.


At the server, two files are generated:

  # ls -lh /var/crash/192.168.1.103-2007-06-24-09\:44/
  -rw-------    1 netdump  netdump      1.3K Jun 24 09:44 log
  -rw-------    1 netdump  netdump      510M Jun 24 09:44 vmcore
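
A quick sanity check on a dump directory is to confirm that both files exist and are not empty. This is only a sketch: a non-empty vmcore does not prove the dump finished, and dump_has_files is a hypothetical name:

```shell
#!/bin/sh
# Succeed if directory $1 contains non-empty "log" and "vmcore" files.
dump_has_files() {
    [ -s "$1/log" ] && [ -s "$1/vmcore" ]
}

# Example:
# dump_has_files "/var/crash/192.168.1.103-2007-06-24-09:44" && echo "dump present"
```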

You can now analyze the dump ("vmcore") using gdb, crash or similar to figure out what went wrong. Enjoy!

18 Jun 2007

Magical SysRq

SysRq (System Request) is probably one of those keys on your keyboard that you rarely use. On Linux, you can use it to perform system functions if the system becomes unresponsive. You can sync disks, reboot or crash the kernel if that is what you want. To enable the "magical" sysrq, you need to have it compiled into the kernel. Luckily, all major Linux distributions today have sysrq compiled in by default. To see the status of sysrq, issue:

  $ cat /proc/sys/kernel/sysrq
  1

By default this value is "1" on Debian/Ubuntu and "0" on RHEL. "0" disables sysrq and "1" enables all functions of sysrq. Other values exist, see Documentation/sysrq.txt. You can also use sysctl to check and enable sysrq:

  # sysctl kernel/sysrq
  kernel.sysrq = 0
  # sysctl -w kernel/sysrq=1
  kernel.sysrq = 1
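
The "other values" are a bitmask of allowed function groups. As I read Documentation/sysrq.txt, 16 is sync, 32 is remount read-only and 128 is reboot/poweroff, but check the file shipped with your kernel before trusting these numbers. A small sketch that tests whether a given setting permits a group; sysrq_allows is a made-up helper, not a real command:

```shell
#!/bin/sh
# Succeed if sysrq setting $1 permits the function group with bit value $2.
# 0 disables everything, 1 enables everything, larger values are a bitmask.
sysrq_allows() {
    case "$1" in
        0) return 1 ;;
        1) return 0 ;;
    esac
    [ $(( $1 & $2 )) -ne 0 ]
}

# Example: does the current setting allow emergency sync (bit value 16)?
# sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 16 && echo "sync allowed"
```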

To "s"ync all filesystems, press "Alt+SysRq+s". You'll then see on the console:

 SysRq  :  Emergency Sync
 Emergency Sync complete

Other sysrq functions include "b"oot, "c"rash and "u"mount. See Documentation/sysrq.txt for the full list.

A quick way to reboot, and a little nicer than using the power-button, is to:

1. Sync disks using: "Alt+SysRq+s":

 SysRq  :  Emergency Sync
 Emergency Sync complete

2. Remount all disks read-only: "Alt+SysRq+u":

 SysRq  :  Emergency Remount R/O
 Emergency Remount complete

3. Reboot: "Alt+SysRq+b":

 SysRq  :  Resetting

An (impatient) colleague of mine uses this procedure to shut down his laptop all the time...

If you're not on the console, you can still use sysrq. Just echo the command key to /proc/sysrq-trigger. So to crash the running server, do:

  # echo "c" > /proc/sysrq-trigger

Note: Crashing the running kernel using kexec/kdump is not supported in Debian 4.0 (Etch).
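
The same redirect works for the sync, remount, reboot sequence described earlier. A minimal sketch; sysrq_send is a name invented for this example, and the trigger file is passed as an argument so the function can be tried against an ordinary file first:

```shell
#!/bin/sh
# Write each sysrq command key, in order, to a trigger file.
# $1 is the trigger file; the remaining arguments are the keys.
# Note: this does not wait for "Emergency Sync complete" or similar
# console messages between keys, unlike a human at the keyboard.
sysrq_send() {
    trigger=$1
    shift
    for key in "$@"; do
        printf '%s\n' "$key" > "$trigger" || return 1
    done
}

# Example (as root; this reboots the machine immediately):
# sysrq_send /proc/sysrq-trigger s u b
```

If you care about the sync actually finishing, put a short sleep between the keys or send them by hand.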