Old 11-07-12, 04:16 PM
  #1  
NeilGunton
Crazyguyonabike
 
Join Date: Nov 2003
Location: Lebanon, OR
Posts: 697

Bikes: Co-Motion Divide

crazyguyonabike down for server move

Unfortunately I have had to take crazyguyonabike offline in order to move the box to a new datacenter. Short story is that there is a problem with one of the RAID drives, and I was not getting the remote help I needed from the people at the datacenter in Las Vegas where the machine has been colocated for several years. This, combined with a series of mysterious, sporadic and unexplained network outages, has finally made me decide to bite the bullet and just move the thing closer to where we are living, so that I can work on it myself when I need to. So, I have found an ISP here in Corvallis OR (Peak Internet) which seems reasonable.

The big downer to all this is that I need to take the server completely down for a few days while it is shipped to me. I just took it down today, and the current datacenter people should be in the process of figuring out shipping, boxing up etc. I hope the machine will be with me sometime early next week, assuming we are able to do 2-day shipping without it being completely extortionate. I will then replace the failed hard drive, and install the server in the new datacenter.

We are approaching capacity on the current set of hard drives, so I will need to buy new drives (8 of them, yikes) eventually, but I've decided to try to leave that until a bit later. For now I will focus on just getting the server to the new colo, swapping out the failed drive, rebuilding the array onto the replacement, and getting things back online again asap. Then I will do the system rebuild with all new drives sometime down the line, when I am able to give users of the site more notice and everything can be planned.

I'm sorry I wasn't able to give much notice today, but things have not been moving quickly at all with the failed drive, and I have been frustrated that we couldn't just do this apparently simple thing. It probably would have taken about 10 minutes if they could simply take the drives out one by one and look at serial numbers, since I had the S/N of the failed drive in a log file. But no, instead they insisted that I use a very obscure command line utility to make an LED light up on the front of the server so they could tell which drive it was. The catch was that I couldn't do this: the drive had completely gone and so didn't even have a device number, which the command line utility needed. Telling them this was like talking to a brick wall.

Anyway, the server has been running in a degraded state (with the one drive down) since mid October, and if the other drive that is paired with the failed one also went down, then we would lose everything on the server (it's RAID10, for anyone who's interested). Well, we wouldn't lose everything permanently, since I have MySQL replication both to a remote backup server in Germany and to my home workstation, and the pics are also replicated every minute, so worst case we would lose the last minute's worth of pic uploads if the server went down hard. But then I would need to rebuild the system from scratch, which is always a pain.

I am keen to get it working properly again, and in the new datacenter, so once I had the new place lined up, I decided to just get it done asap rather than waiting around. I know this will be bad for some people who are on the road at the moment and trying to update their journals, but I just have to bite that bullet for a few days of downtime. If/when I get bigger then I can perhaps afford to buy hot spare servers, but I'm not remotely there yet.
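
For anyone wondering why one dead drive in a RAID10 array is worrying but not yet fatal, here is a rough Python sketch of the reasoning. It is only an illustration under assumptions: the drive names and the layout of 8 drives as 4 mirrored pairs are made up for the example and are not taken from the actual server.

```python
# Hypothetical layout (an assumption for illustration): 8 drives arranged as
# 4 mirrored pairs, with data striped across the pairs. Drive names are made up.
MIRROR_PAIRS = [("d0", "d1"), ("d2", "d3"), ("d4", "d5"), ("d6", "d7")]


def array_survives(failed, pairs=MIRROR_PAIRS):
    """RAID10 keeps its data while every mirrored pair has a working drive."""
    failed = set(failed)
    return all(not (a in failed and b in failed) for a, b in pairs)


def fatal_next_failures(failed, pairs=MIRROR_PAIRS):
    """List the working drives whose failure would now destroy the array."""
    failed = set(failed)
    drives = [d for pair in pairs for d in pair]
    return [
        d for d in drives
        if d not in failed and not array_survives(failed | {d}, pairs)
    ]


if __name__ == "__main__":
    print(array_survives({"d2"}))        # one drive down: degraded but alive
    print(fatal_next_failures({"d2"}))   # only the mirror partner is critical
    print(array_survives({"d2", "d3"}))  # both halves of a pair gone: data lost
```

In this toy model, a single failure leaves exactly one drive (the failed drive's mirror partner) whose loss would take the whole array with it, which is the situation described above.
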
I don't want to use the backup server in Germany as a hot spare, because I am in the middle of moving house at the moment, so my own workstation (the only other replica) may well be taken offline at any time to go to the new place. So, if the German server were the only one hosting the live website, it would have no backups. If it went down, we would potentially lose everything that had changed since the last snapshot, and I don't want to risk that. So, distasteful as it is, we have a few days of downtime, sorry about that. It doesn't happen very often, fwiw. Feel free to let me know I'm a crappy admin, unprofessional etc, I probably deserve it, but I'm just trying to fix what I regard as a critical problem as quickly as possible.

Thanks for your patience,

Neil