Are you using xinetd? xinetd 2.1.8.9pre11 (and presumably earlier versions) had a race condition that causes the tftp server to suddenly stop responding. This has supposedly been fixed in version 2.1.8.9pre13. Also, both xinetd and inetd have default limits on how many connections can be spawned in a 60-second period. See the inetd or xinetd manpage for details on increasing this limit.
Ryan Braby reported the following problem:
[The install] fails when you have addresses like the following:
###.###.###.2
###.###.###.21
###.###.###.22
###.###.###.23
###.###.###.24
etc.
For the node with the ###.###.###.2 address, the script tries to assign multiple hostnames, and then fails to get any install script.
This should be fixed in the next release (1.4.2).
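The failure described above is consistent with an unanchored text match, where a search for the address ending in .2 also hits the addresses ending in .21, .22, and so on. The report does not state the actual cause, so the following Python sketch is only an illustration under that assumption. The hosts data, the addresses (taken from the 192.0.2.0/24 documentation range), and the function names are all hypothetical, and none of this is SystemImager's actual code.

```python
# Hypothetical illustration: why an unanchored match on an IP address can
# return several hostnames. This is not SystemImager's actual code.

HOSTS = """\
192.0.2.2   node2
192.0.2.21  node21
192.0.2.22  node22
"""

def lookup_unanchored(ip, hosts_text):
    """Return hostnames from every line that merely starts with the address text."""
    return [line.split()[1] for line in hosts_text.splitlines()
            if line.startswith(ip)]

def lookup_exact(ip, hosts_text):
    """Return hostnames only from lines whose first field equals the address."""
    return [line.split()[1] for line in hosts_text.splitlines()
            if line.split() and line.split()[0] == ip]

print(lookup_unanchored("192.0.2.2", HOSTS))  # ['node2', 'node21', 'node22']
print(lookup_exact("192.0.2.2", HOSTS))       # ['node2']
```

Comparing the whole address field, rather than a prefix of the line, is what keeps the .2 node from picking up its neighbours' hostnames.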
Goran Pocian reported an instance of horrible updateclient performance which went away when he upgraded from kernel 2.2.17 to 2.2.18.
Also, he noted that if you mount an NFS filesystem after executing prepareclient, getimage will retrieve its contents. As this can heavily increase network load, it can also cause bad performance.
Brian Finley reported other possible causes:
Every once in a while, someone reports some mysterious hanging, or transfer interruption issue related to rsync. I had a chance to speak with Andrew Tridgell in person today to discuss these issues.
There are two known issues that could be the source of these symptoms. One is a kernel issue, and one is an rsync issue. The kernel issue is supposedly resolved in 2.4.x series kernels (SystemImager has not yet been "officially" tested with 2.4.x kernels) and, I believe, may not be present in all 2.2.x series kernels.
The rsync bug will be fixed in the rsync 2.4.7 release (to happen "Real Soon Now (TM)"). The bug is triggered when an excessive number of errors fills the error queue, which leads to a race condition. However, until rsync 2.4.7 has been out for some time, I will still recommend using v2.4.6 unless you specifically experience one of these issues.
Here's a hack that seems to work for Chris Black. Add "--bwlimit=10000" right after "rsync" in each rsync command in the <image>.master script.
Change:
    rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/
To:
    rsync --bwlimit=10000 -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/
Here are some tips on diagnosing the problem:
If you get an error message in /var/log/messages that looks like:
Jan 23 08:49:42 mybox rsyncd[19347]: transfer interrupted (code 30) at io.c(65)
You can look up the code number in the errcode.h file which you can find in the rsync source code.
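As a convenience, the lookup can be sketched in Python. The table below is only a subset of the exit values listed in the rsync man page, and it is an assumption that the numbers match the errcode.h of the rsync release you run, so check that file before relying on it. The function name explain is made up for this example.

```python
import re

# Subset of rsync exit codes as listed in the rsync man page ("EXIT VALUES").
# Assumption: these match the errcode.h shipped with your rsync version.
RSYNC_CODES = {
    1: "syntax or usage error",
    2: "protocol incompatibility",
    10: "error in socket I/O",
    11: "error in file I/O",
    12: "error in rsync protocol data stream",
    20: "received SIGUSR1 or SIGINT",
    30: "timeout in data send/receive",
}

# Matches the "(code N) at file.c(LINE)" tail of an rsync log message.
LOG_PATTERN = re.compile(r"\(code (\d+)\) at (\S+)\((\d+)\)")

def explain(log_line):
    """Return (code, meaning, source_file, source_line), or None if no code is found."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    meaning = RSYNC_CODES.get(code, "unknown; see errcode.h")
    return code, meaning, match.group(2), int(match.group(3))

line = "Jan 23 08:49:42 mybox rsyncd[19347]: transfer interrupted (code 30) at io.c(65)"
print(explain(line))  # (30, 'timeout in data send/receive', 'io.c', 65)
```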
To diagnose the kernel bug: Run netstat -tn. Here is some sample output (from a properly working system):
$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 192.168.1.149:1094      216.62.20.226:80        CLOSE_WAIT
tcp        1      0 192.168.1.149:1090      216.62.20.226:80        CLOSE_WAIT
tcp        1      0 192.168.1.149:1089      216.62.20.226:80        CLOSE_WAIT
tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
tcp        0      0 127.0.0.1:1029          127.0.0.1:16001         ESTABLISHED
tcp        0      0 127.0.0.1:16001         127.0.0.1:1028          ESTABLISHED
tcp        0      0 127.0.0.1:1028          127.0.0.1:16001         ESTABLISHED
The symptoms are:
machine A has data in its Send-Q
machine B has no data in its Recv-Q
the data in machine A's Send-Q is not being reduced
What's happening is:
one or both kernels aren't honoring the other's send/receive window settings (these are dynamically calculated)
the result is the kernel(s) aren't getting data from machine A to machine B
rsync, therefore, isn't getting data on the receive side
the process appears to hang
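The Send-Q symptom on machine A can be checked by taking two netstat -tn samples some seconds apart and comparing them. The Python sketch below assumes the six-column output format shown above. The sample values, addresses, and function names are invented for illustration, and it covers only the sending side, so machine B's Recv-Q still has to be checked separately.

```python
def parse_netstat(output):
    """Map (local, foreign) address pairs to (recv_q, send_q) for each tcp line."""
    queues = {}
    for line in output.splitlines():
        fields = line.split()
        # Header lines do not start with "tcp" and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            queues[(fields[3], fields[4])] = (int(fields[1]), int(fields[2]))
    return queues

def stalled_senders(earlier, later):
    """Connections whose Send-Q held data in the first sample and did not shrink."""
    stalled = []
    for conn, (_recv_before, send_before) in earlier.items():
        if conn in later:
            send_after = later[conn][1]
            if send_before > 0 and send_after >= send_before:
                stalled.append(conn)
    return stalled

# Made-up samples from a hypothetical machine A, taken a few seconds apart.
SAMPLE_EARLIER = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  34752 192.168.1.1:873         192.168.1.149:1094      ESTABLISHED
tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
"""
SAMPLE_LATER = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  34752 192.168.1.1:873         192.168.1.149:1094      ESTABLISHED
tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
"""

earlier = parse_netstat(SAMPLE_EARLIER)
later = parse_netstat(SAMPLE_LATER)
print(stalled_senders(earlier, later))  # [('192.168.1.1:873', '192.168.1.149:1094')]
```

A busy but healthy connection can also show a full Send-Q for a short time, so a connection reported here is a candidate for closer inspection, not proof of the kernel bug.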
Details about the rsync bug:
What happens:
a large number of errors clogs the error pipe between the receiver and generator
all progress stops
again, the process appears to hang
I hope this information helps...
A possible solution, suggested by Robert Berkowitz, is to add --bwlimit=10000 to the rsync options in the rsync initscript.
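Whether the option goes into the initscript or, as in Chris Black's hack, into the rsync commands of the <image>.master script, the edit is the same: put --bwlimit=10000 right after the rsync command word. A minimal Python sketch of that edit for a single line follows. The function name add_bwlimit is hypothetical, and the sketch assumes the command word is followed by whitespace.

```python
import re

def add_bwlimit(line, limit=10000):
    """Insert --bwlimit=<limit> right after the first rsync command word."""
    if line.lstrip().startswith("#") or "--bwlimit" in line:
        return line  # comment, or already limited: leave unchanged
    # The lookahead requires whitespace after "rsync", so names such as
    # "rsyncd" or "rsync.conf" are not touched.
    return re.sub(r"\brsync(?=\s)", f"rsync --bwlimit={limit}", line, count=1)

before = "rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/"
print(add_bwlimit(before))
# rsync --bwlimit=10000 -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/
```

The match is purely textual, so a line that only mentions rsync inside a message would be changed as well. Review the result against the original script before using it.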
There was a problem with the syslinux-1.48-1.i386.rpm package. Download and install a newer syslinux RPM from http://systemimager.org/.
You are using a pre v0.19 version of SystemImager. Please download the latest version from http://systemimager.org/.
If you must use a pre v0.19 version for some reason, be sure that your kernel has "ramdisk" support, either built in or as a module. If you are using a module, be sure that it is loaded with the modprobe command.
But it's probably easier to just get the latest version...
Be sure that the image server can look up the client's hostname based on its IP address. The easiest way to do this is to have an entry in the image server's /etc/hosts file for the client system.
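One way to confirm that the reverse lookup works is to ask the system resolver directly from the image server. The Python sketch below wraps the standard socket.gethostbyaddr call, which consults /etc/hosts or DNS according to the system's resolver configuration. The function name reverse_lookup and its resolver parameter are made up for this example.

```python
import socket

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the primary hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname
```

Run on the image server, reverse_lookup("192.168.1.149") (a hypothetical client address) should return the client's hostname. A result of None means the image server cannot resolve that address, and an /etc/hosts entry for the client is the simplest fix.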
Are you using a pre 1.0 version of SystemImager? If so, please upgrade.
If for some reason you can't upgrade, then:
Are you connected to a switch? Most switches will wait a period of time (usually 30 seconds) after a connected system's interface has come up before transmitting on that port. Newer versions of the autoinstalldiskette bring the ethernet interface up and wait 45 seconds or so before making a DHCP request. They then wait 35 seconds or so to give the system time to receive an address. You could be using an autoinstalldiskette that does not wait long enough for your switch and gives up before it should.
Be sure that you are using the latest version of SystemImager and that you are using the autoinstalldiskette image that comes with that version. Note that the version numbers may not match. See the VERSION file.