3AM. It's always 3AM when these things happen.
Last night, my cellphone started beeping, and after it finally woke me up, I cracked open an eye and checked the screen. Text messages from Nagios, telling me that my main FreeBSD mail/Web server was incommunicado. Lovely.
I crawled out of bed and logged into my MacBook Pro. I had an open SSH session to that box, but it was all but unusable, echoing back a character every few seconds. An eventual 'uptime' showed the 5-minute load at over 300. Three hundred processes in the run queue basically means the box is thrashing wildly... but why?
The Nagios client had respawned a hundred or so times, sshd, snmpd, and inetd were all running 60-70% CPU utilization, completely consuming both CPUs. Everything had come to a standstill. I killed the offending processes from the console (hooray for Raritan KVM-over-IP!) and the box settled back down.
I first started sshd back up, and didn't see the load rise, but as soon as I attempted to SSH back into the box, it spiked to 100% utilization. I killed it, and rebuilt openssh-portable from ports, wondering if I'd been hacked, or the sshd binary had somehow become corrupt. I ran the newly-built sshd manually in debug mode, and watched the same problems occur. Obviously, this wasn't good. Checks of dmesg and /var/log/messages showed literally no problems whatsoever. The I/O subsystem seemed fine, as did all normal server operations -- I could SSH out, Apache, MySQL and sendmail were working, but there was obviously something very wrong.
The uptime on this server was 525 days. Generally speaking, I refrain from rebooting a box unless absolutely necessary, but in this case, I felt that I had to start with a clean slate. For the first time since September of 2006, I rebooted my main workhorse server.
It came back up without issue, other than the same sshd, snmpd, and inetd problems. The reboot was ultimately unnecessary. But what could be causing this problem? As I was making a cup of coffee, I thought that I might try removing hosts.deny to see if that made a difference. That did the trick -- all was well without it. But what caused that?