3AM. It's always 3AM when these things happen.
Last night, my cellphone started beeping, and after it finally woke me up, I cracked open an eye and checked the screen. Text messages from Nagios, telling me that my main FreeBSD mail/Web server was incommunicado. Lovely.
I crawled out of bed and logged into my MacBook Pro. I had an open SSH session to that box, but it was all but unusable, echoing back a character every few seconds. An eventual 'uptime' showed the 5-minute load at over 300. Three hundred processes in the run queue basically means the box is thrashing wildly... but why?
The Nagios client had respawned a hundred or so times; sshd, snmpd, and inetd were each running at 60-70% CPU, completely consuming both CPUs. Everything had come to a standstill. I killed the offending processes from the console (hooray for Raritan KVM-over-IP!) and the box settled back down.
I first started sshd back up and didn't see the load rise, but as soon as I attempted to SSH back into the box, it spiked to 100% utilization. I killed it and rebuilt openssh-portable from ports, wondering if I'd been hacked, or if the sshd binary had somehow become corrupt. I ran the newly-built sshd manually in debug mode, and watched the same problems occur. Obviously, this wasn't good. Checks of dmesg and /var/log/messages showed no problems whatsoever. The I/O subsystem seemed fine, as did all normal server operations -- I could SSH out, and Apache, MySQL, and sendmail were working -- but there was obviously something very wrong.
The uptime on this server was 525 days. Generally speaking, I refrain from rebooting a box unless absolutely necessary, but in this case, I felt that I had to start with a clean slate. For the first time since September of 2006, I rebooted my main workhorse server.
It came back up without issue, other than the same sshd, snmpd, and inetd problems. The reboot was ultimately unnecessary. But what could be causing this problem? As I was making a cup of coffee, I thought that I might try removing hosts.deny to see if that made a difference. That did the trick -- all was well without it. But what caused that?
A while ago, I wrote a quick script to scan /var/log/auth.log for brute-force SSH login attempts and add the offending IP addresses to /etc/hosts.deny for sshd. This worked extremely well, reducing the effectiveness of these attacks to all but zero. The problem, as it turned out, was that the script had eventually written over 140 IPs onto a single line in /etc/hosts.deny, which either triggered a bug or exceeded a line-length limit that I'm unaware of. Removing that line caused all previously-misbehaving services to return to normal, and after some time to settle down, the server was back to handling a few hundred thousand emails a day, alongside Web and DNS services. I rewrote the brute-force detection script to add IPs to a pf table instead of /etc/hosts.deny, and parsed the previous hosts.deny list into the table to retain that information. Of course, this is how I should have done it to begin with. It took two cups of coffee, but I was out of the woods.
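For anyone wanting to make the same migration, here's a rough sketch of the hosts.deny-to-pf-table conversion. The table name ("bruteforce") and the hosts.deny entry format ("sshd : ip ip ... : deny") are assumptions -- adjust both to match your setup -- and the table has to exist in pf.conf first, e.g. "table <bruteforce> persist" followed by "block in quick from <bruteforce>".

```shell
# Sketch: pull the IPs out of an existing hosts.deny sshd line (however
# long it got) and print them one per line, ready to feed to pfctl.
# Assumes entries look like "sshd : 192.0.2.1 192.0.2.2 : deny".
deny_ips() {
    grep '^sshd' "$1" |
        sed 's/^sshd[[:space:]]*:[[:space:]]*//' |   # strip the "sshd :" prefix
        tr ' ,' '\n\n' |                             # one token per line
        grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'      # keep only IPv4 addresses
}

# Usage (requires pf enabled and a "bruteforce" table defined in pf.conf):
#   deny_ips /etc/hosts.deny | xargs pfctl -t bruteforce -T add
```

The nice part of the pf-table approach is that blocking happens in the packet filter, before sshd ever sees the connection, so a huge blocklist can't drag userland daemons down the way that one long hosts.deny line did.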
This was a decidedly non-obvious solution to a decidedly bizarre problem. I'd still like to know if I hit a bug in the BSD stack, or what the hosts.deny line-length limits are. Anyone? Bueller?