I’ve been having a rough week with my computers.
It started out alright on Saturday when I successfully upgraded my main Ubuntu server. The upgrade consisted of the following:
- Moved to the other Dell 860 chassis, which has a 3.0GHz CPU instead of the 2.8GHz one I was running on (every little bit helps).
- Re-installed the OS and applications from scratch – this allowed me to migrate from the 32-bit version (no idea why I was on the 32-bit version in the first place) to the 64-bit version of Ubuntu 188.8.131.52 LTS and all the latest packages.
- Moved from a single 500GB hard drive to a pair of RAID 1 mirrored 2TB drives. Lots of space and fault tolerant. What’s not to like?
- Rsynced all the home dirs, databases and config files over from the old server.
That move was quite successful.
On Sunday I migrated a somewhat large site to my server. www.shapeoko.com had been hosted on a t1.micro instance at Amazon Web Services and it was dog-slow. Since my server was running all the time anyway, I volunteered to host it.
So I rsynced the files and dumped the MySQL databases and got it up and running on my server. It was actually pretty easy and I was feeling good about it.
The ShapeOko forums and Wiki get a fair amount of traffic, certainly lots more than my blog does, but the server was churning along with no issues.
Until around 9:40 in the morning on Tuesday, when my server and all the sites on it fell off the internet.
My ISP decided to perform service-interrupting maintenance (replacing some fiber switches) in the middle of the day. You can read about that here: Three hour fiber outage this morning.
So, after a three hour outage….
About 9:00 or so in the evening I was instant messaging with the owner of the ShapeOko forum, asking a question about a config, when I clicked on the admin panel link and it froze. In the space of about two minutes my server went from a normal load of about 1.5 to 115 and was swapping like mad. It was so unresponsive that I had to go down to the basement and hit the power button to reboot it.
I still don’t know what caused the issue. The logs are all normal and all the graphs just go vertical. I have a suspicion that the CrashPlan backup software has a memory leak, but it has not repeated the problem, so I don’t know for sure. I tweaked the Apache configs a bit to reduce the number of workers and to restart them sooner in case there is a leak somewhere else too.
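A load jump like that one (about 1.5 to 115) is the kind of thing a small watchdog could flag before the box stops responding. Here is a minimal sketch in Python, not something that was running on the server at the time. The function names and the threshold of 4.0 per core are assumptions made up for illustration; only `os.getloadavg()` and `os.cpu_count()` are real standard-library calls.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Normalize the 1-minute load average by the number of CPU cores."""
    return load_1min / max(cores, 1)


def is_overloaded(load_1min: float, cores: int, threshold: float = 4.0) -> bool:
    """Return True when per-core load exceeds the threshold.

    The default threshold of 4.0 is an arbitrary, assumed value.
    """
    return load_per_core(load_1min, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    if is_overloaded(load_1min, cores):
        print(f"WARNING: load {load_1min:.1f} on {cores} core(s)")
    else:
        print(f"ok: load {load_1min:.1f} on {cores} core(s)")
```

Run from cron every minute, something like this could send a warning while the machine is still reachable. With an assumed two cores, a load of 1.5 would pass and a load of 115 would trip the warning.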
Then on Wednesday I was with Roz at TCJRD roller derby practice. I had brought my laptop along and was doing some work on my servers at home. I had built another server that afternoon and was going to move graphite, the graph web pages and the awstats display pages to it to reduce some of the load on the main web server.
I was working along and had gotten to the point where I wanted to add a NAT to my firewall for the new server. I typed in the IP address and hit “apply”…
I immediately lost all connections to any server in my house. Whisky-Tango-Foxtrot?
Luckily Liz was at practice too, so I packed up and headed home. I expected to see the server or firewall completely crashed, but no, everything looked fine. Except I could not get out to the internet. Was it the ISP again?
I did some troubleshooting, including rebooting the fiber modem, removing the NAT (not that it should have been the issue, but it was the last thing I changed), and rebooting the firewall. Rebooting the firewall was painful because the disks had not been checked for 285 days and it forced an fsck. Root wasn’t too bad, but the /var/log partition is 79GB and that took about 15 minutes to check. It’s not the fastest of disks, and it’s not the fastest of machines.
Still no internet.
So I called the ISP. Again, surprisingly, I got to a tech right away, and again, surprisingly, he was competent and helpful. He looked at things on his end and said “that’s weird, I’m showing that your firewall has a MAC address with an asterisk in it” (paraphrased). He had me reboot the firewall and modem again (it went faster this time), but he still saw the weird (impossible) MAC address. So finally he flushed the ARP cache on the switch and bada-bing, I had internet again.
I kept him on the line while I added the NAT to my firewall again, but it did not repeat the weird performance.
That outage lasted almost exactly an hour.
So, if you are keeping track at home, since I started hosting www.ShapeOko.com I’ve had three outages for a total downtime of more than four hours. Two of the outages have no explanation as to their causes.
I was totally frustrated after Wednesday’s outage.
Considering that in the past I’ve had uptimes on my firewall and servers of more than 250 days, I’m starting to feel picked on.
On the other hand, today I got all the graphite, graphs and awstats stuff up and running on the second server, so that goes in the ‘win’ column I think.