Troubleshooting

PXE installations work fine for a while, but eventually clients no longer boot the autoinstall kernel.

Are you using xinetd? xinetd 2.1.8.9pre11 (and presumably earlier versions) had a race condition that causes the tftp server to suddenly stop responding. This has supposedly been fixed in version 2.1.8.9pre13. Also, xinetd and inetd have default limits on how many connections can be spawned in a 60-second period. See the inetd or xinetd manpage for details on increasing this limit.

My client fails to assign a hostname to itself, making the autoinstall fail.

Ryan Braby reported the following problem:

[The install] fails when you have addresses like the following:

###.###.###.2
###.###.###.21
###.###.###.22 
###.###.###.23
###.###.###.24
etc.

For the node with the ###.###.###.2 address, the script tries to assign multiple hostnames, and then fails to get any install script.

This should be fixed in the next release (1.4.2).
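
The report does not say why the node ending in .2 picks up several hostnames. One plausible cause, offered here only as an assumption, is an unanchored text match of the client's IP address against a hosts-style table, where "x.x.x.2" is also the leading part of "x.x.x.21", "x.x.x.22", and so on. The sketch below uses a made-up table and hypothetical function names to show the difference between a substring match and an exact match on the address field.

```python
# Hypothetical hosts table; addresses and names are invented for illustration.
HOSTS = """\
10.0.0.2   node02
10.0.0.21  node21
10.0.0.22  node22
"""

def lookup_substring(ip, hosts_text):
    """Buggy variant: returns every hostname whose line merely contains the IP text."""
    return [line.split()[1] for line in hosts_text.splitlines() if ip in line]

def lookup_exact(ip, hosts_text):
    """Safer variant: compares the whole address field, so .2 cannot match .21."""
    names = []
    for line in hosts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == ip:
            names.append(fields[1])
    return names

print(lookup_substring("10.0.0.2", HOSTS))  # three names: the symptom described above
print(lookup_exact("10.0.0.2", HOSTS))      # one name
```

If you cannot upgrade yet, giving the affected node an address that is not a leading substring of another node's address would sidestep this kind of match, assuming that is indeed the cause.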

My client autoinstallation/update hangs, crashes, or is ridiculously slow.

Goran Pocian reported an instance of horrible updateclient performance which went away when he upgraded from kernel 2.2.17 to 2.2.18.

He also noted that if you mount an NFS filesystem after executing prepareclient, getimage will retrieve its contents. This can greatly increase network load and so degrade performance.
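
To check for this before running getimage, you can look for NFS mounts on the client. The following is a minimal sketch, not part of SystemImager: it assumes a Linux client where /proc/mounts lists one mount per line with the filesystem type in the third field, and the function name is hypothetical.

```python
def nfs_mounts(proc_mounts_text):
    """Return the mount points whose filesystem type is nfs or nfs4."""
    points = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in {"nfs", "nfs4"}:
            points.append(fields[1])
    return points

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            for point in nfs_mounts(f.read()):
                print("NFS mounted at", point)
    except OSError:
        print("could not read /proc/mounts")
```

Any mount point it prints is a candidate for unmounting before the image is pulled.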

Brian Finley reported other possible causes:

Every once in a while, someone reports some mysterious hanging, or transfer interruption issue related to rsync. I had a chance to speak with Andrew Tridgell in person today to discuss these issues.

There are two known issues that could be the source of these symptoms. One is a kernel issue, and one is an rsync issue. The kernel issue is supposedly resolved in 2.4.x series kernels (SystemImager has not yet been "officially" tested with 2.4.x kernels) and, I believe, may not be present in all 2.2.x series kernels.

The rsync bug will be fixed in the rsync 2.4.7 release (to happen "Real Soon Now (TM)"). The bug is caused by an excessive number of errors filling the error queue, which causes a race condition. However, until rsync 2.4.7 has been out for some time, I still recommend using v2.4.6 unless you specifically experience one of these issues.

Here's a hack that seems to work for Chris Black. Add "--bwlimit=10000" right after "rsync" in each rsync command in the <image>.master script.
Change: "rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/"
To:     "rsync --bwlimit=10000 -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/"
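
If the master script contains many rsync commands, the same change can be applied mechanically. This is only a sketch: it assumes each rsync command starts its own line (after optional indentation), and the function name is hypothetical. Work on a copy of the script and review the result before using it.

```python
import re

def add_bwlimit(script_text, kbps=10000):
    """Insert --bwlimit=<kbps> right after "rsync" on each command line lacking it."""
    out = []
    for line in script_text.splitlines(keepends=True):
        if "--bwlimit" not in line:
            # Only touch lines where rsync is the command word at the line start.
            line = re.sub(r"^(\s*rsync)(\s)", rf"\1 --bwlimit={kbps}\2", line, count=1)
        out.append(line)
    return "".join(out)

before = "rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/\n"
print(add_bwlimit(before), end="")
```

Lines that already carry --bwlimit are left alone, so running it twice does not add the option twice.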
	  

Here are some tips on diagnosing the problem:

  • If you get an error message in /var/log/messages that looks like:

    Jan 23 08:49:42 mybox rsyncd[19347]: transfer interrupted (code 30) at io.c(65)

    You can look up the code number in the errcode.h file, which is part of the rsync source code.

  • To diagnose the kernel bug: Run netstat -tn. Here is some sample output (from a properly working system):

      $ netstat -tn
      Active Internet connections (w/o servers)
      Proto Recv-Q Send-Q Local Address           Foreign Address         State
      tcp        1      0 192.168.1.149:1094      216.62.20.226:80        CLOSE_WAIT
      tcp        1      0 192.168.1.149:1090      216.62.20.226:80        CLOSE_WAIT
      tcp        1      0 192.168.1.149:1089      216.62.20.226:80        CLOSE_WAIT
      tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
      tcp        0      0 127.0.0.1:1029          127.0.0.1:16001         ESTABLISHED
      tcp        0      0 127.0.0.1:16001         127.0.0.1:1028          ESTABLISHED
      tcp        0      0 127.0.0.1:1028          127.0.0.1:16001         ESTABLISHED

    The symptoms are:

    • machine A has data in its Send-Q

    • machine B has no data in its Recv-Q

    • the data in machine A's Send-Q is not being reduced

    What's happening is:

    1. one or both kernels aren't honoring the other's send/receive window settings (these are dynamically calculated)

    2. the result is the kernel(s) aren't getting data from machine A to machine B

    3. rsync, therefore, isn't getting data on the receive side

    4. the process appears to hang

  • Details about the rsync bug:

    What happens:

    1. a large number of errors clogs the error pipe between the receiver and generator

    2. all progress stops

    3. again, the process appears to hang

I hope this information helps...
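
The kernel symptom described above (data sitting in machine A's Send-Q and not draining) is easier to spot by comparing two netstat -tn snapshots taken some seconds apart. The sketch below is an illustration only: the function names are hypothetical, the sample rows are invented, and it assumes the column order shown in the sample output above.

```python
def send_queues(netstat_output):
    """Map (local, foreign) -> Send-Q for each TCP row of `netstat -tn` output."""
    queues = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[2].isdigit():
            queues[(fields[3], fields[4])] = int(fields[2])
    return queues

def stalled(before, after):
    """Connections whose Send-Q was non-zero at first and has not shrunk since."""
    first, second = send_queues(before), send_queues(after)
    return sorted(conn for conn, q in first.items()
                  if q > 0 and second.get(conn, 0) >= q)

# Invented snapshots: one connection to an rsync daemon (port 873) is not draining.
SNAP = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  34752 192.168.1.149:1094      192.168.1.1:873         ESTABLISHED
tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
"""
print(stalled(SNAP, SNAP))
```

A connection that stays in the result across several snapshots on machine A, while machine B shows an empty Recv-Q for the same connection, matches the symptoms listed above.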

A possible solution, suggested by Robert Berkowitz, is to add --bwlimit=10000 to the rsync options in the rsync initscript.
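
For the log-message tip earlier in this answer, the code number can be extracted and looked up with a few lines of script. The table below is a partial list taken from the EXIT VALUES section of the rsync man page; values may differ between rsync versions, so errcode.h from your own rsync source remains the authority. The function name is hypothetical.

```python
import re

# Partial table from the rsync man page (EXIT VALUES); check errcode.h for your version.
RSYNC_CODES = {
    10: "error in socket I/O",
    11: "error in file I/O",
    12: "error in rsync protocol data stream",
    20: "received SIGUSR1 or SIGINT",
    30: "timeout in data send/receive",
}

def explain(log_line):
    """Return (code, meaning) for an rsync log line, or None if it has no code."""
    match = re.search(r"\(code (\d+)\)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, RSYNC_CODES.get(code, "unknown; check errcode.h")

line = "Jan 23 08:49:42 mybox rsyncd[19347]: transfer interrupted (code 30) at io.c(65)"
print(explain(line))
```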

My autoinstallcd doesn't boot.

There was a problem with the following RPM: syslinux-1.48-1.i386.rpm. Download and install a newer syslinux RPM from http://systemimager.org/.

When making the autoinstalldiskette, my system gives me an error involving "dd" or "mount".

You are using a pre-v0.19 version of SystemImager. Please download the latest version from http://systemimager.org/.

If you must use a pre-v0.19 version for some reason, be sure that your kernel has "ramdisk" support, either built in or as a module. If you are using a module, be sure that it is loaded with the modprobe command.
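
One way to check whether ramdisk support is currently available is to look for a "ramdisk" entry under "Block devices:" in /proc/devices. This is a sketch under that assumption, with a hypothetical function name; it only tells you whether the driver is registered right now, not which module provides it.

```python
def has_ramdisk(proc_devices_text):
    """True if a 'ramdisk' block driver is listed in /proc/devices-style text."""
    in_block = False
    for line in proc_devices_text.splitlines():
        if line.startswith("Block devices:"):
            in_block = True
        elif line.startswith("Character devices:"):
            in_block = False
        elif in_block and line.split()[1:2] == ["ramdisk"]:
            # Rows look like "  1 ramdisk": major number, then driver name.
            return True
    return False

if __name__ == "__main__":
    try:
        with open("/proc/devices") as f:
            print("ramdisk support present:", has_ramdisk(f.read()))
    except OSError:
        print("could not read /proc/devices")
```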

But it's probably easier to just get the latest version...

My client failed to autoinstall, and when I run an rsync command on it manually it takes forever for the image server to respond.

Be sure that the image server can look up the client's hostname based on its IP address. The easiest way to do this is to have an entry in the image server's /etc/hosts file for the client system.
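
To confirm that the reverse lookup works, and how long it takes, you can run a small check on the image server. The sketch below uses Python's standard socket.gethostbyaddr; the function name and the client address in the usage note are hypothetical.

```python
import socket
import time

def timed_reverse_lookup(ip):
    """Return (hostname or None, seconds taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        name = None
    return name, time.monotonic() - start

if __name__ == "__main__":
    name, seconds = timed_reverse_lookup("127.0.0.1")
    print(name, round(seconds, 3))
```

Replace 127.0.0.1 with the client's address, for example 192.168.1.50. A result of None, or a lookup that takes many seconds, suggests the image server cannot resolve the client and that an /etc/hosts entry is worth adding.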

My autoinstall client booted up and said "dhcp didn't work", but when I do an ifconfig eth0 it has an IP address.

Are you using a pre-1.0 version of SystemImager? If so, please upgrade.

If for some reason you can't upgrade, then:

Are you connected to a switch? Most switches wait a period of time (usually 30 seconds) after a connected system's interface has come up before transmitting on that port. Newer versions of the autoinstalldiskette bring the ethernet interface up and wait 45 seconds or so before making a DHCP request. They then wait 35 seconds or so to give the system time to receive an address. You could be using an autoinstalldiskette that does not wait long enough for your switch and gives up before it should.

Be sure that you are using the latest version of SystemImager and that you are using the autoinstalldiskette image that comes with that version. Note that the version numbers may not match. See the VERSION file.