Over the last couple of days, we’ve been working hard to improve the reliability and performance of Whereoscope — on the Web, on the phone, everywhere. This is a big deal, because we know that if you can’t trust Whereoscope to be online, all the time, then you really can’t trust it at all.
The piece of the system that has been letting us down the most, as regular readers will know, is our database. A bunch of things have gone wrong with it, so we decided that enough is enough, and took fairly drastic steps to overhaul it. The technical explanation of what we’ve done is to implement “replica sets”. Without going too far into the details, the way we were operating previously was akin to driving a car while carrying a spare tire: if a tire goes flat, you can get back on the road, but you’ll need to pull over, swap the tire, probably inflate it, curse and swear a lot and so on. It takes time. Replica sets are more like having 3 complete cars driving around with you (a bit like the President!): if one of them gets a flat, we leave it by the side of the road and jump into the next one. We then call up our local dealer (or Secret Service outpost) and ask them to have a new car waiting when we get there: that is to say, there is little to no delay in getting back up and running in the case of a flat tire. Importantly, there is also no delay when any other kind of failure occurs. This matters because every time Whereoscope has gone down, it has been because of something we didn’t anticipate; relieving us of the responsibility to foresee failures is probably the single best thing we can do for reliability. Computers are much better at failing than we are at predicting how they’ll do it.
So Whereoscope now has 3 separate database servers, and every time we get data coming in, we write that data out to all three of them. There are definitely still ways things can go wrong, but the database was by far the most fragile piece. By having these “hot spares” running all the time, we’re seriously limiting the impact a database failure can have on the system as a whole. The replication system also has automated failover, meaning that it will automatically shut down the errant server and switch over to one of the spares before we even notice anything is wrong.
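For the curious, here’s a toy sketch of the behaviour described above: every write is replicated to all members, and if the primary fails, a live spare takes over automatically. All the class and server names here are made up for this post; this is an illustration of the idea, not our actual database code.

```python
class Member:
    """One database server in the replica set (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

class ReplicaSet:
    def __init__(self, members):
        self.members = members

    @property
    def primary(self):
        # Automatic failover: the first member still alive serves as primary,
        # so a dead server is simply skipped -- no manual intervention.
        for m in self.members:
            if m.alive:
                return m
        raise RuntimeError("no live members")

    def write(self, key, value):
        # Replicate every incoming write to all live members, not just one.
        for m in self.members:
            if m.alive:
                m.data[key] = value

    def read(self, key):
        return self.primary.data[key]

rs = ReplicaSet([Member("db1"), Member("db2"), Member("db3")])
rs.write("alice", "home")
rs.members[0].alive = False  # the primary gets a "flat tire"
print(rs.read("alice"))      # prints "home" -- served by db2 instead
```

The key property is visible in the last three lines: killing the primary loses nothing, because the spares already hold every write.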
There’s another side benefit to this, too: since we now have 3 servers where we used to have one, we’re able to use the extra database servers to speed things up (something the President’s motorcade can’t do). We’re being cautious in rolling this out, because there are some subtleties to it, but moving forward, we’re optimistic that this will just magically make everything go faster with no work for us (or you!).
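The speed-up works by spreading read traffic across all three servers instead of piling it on one. A minimal sketch of that idea, with hypothetical server names (one of the subtleties we’re cautious about is that a spare can briefly lag behind the primary, which this sketch ignores):

```python
import itertools

class ReadRouter:
    """Round-robin reads across all database servers (illustrative only)."""
    def __init__(self, servers):
        # itertools.cycle yields servers in order, forever, so each server
        # handles an equal share of the read traffic.
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)

router = ReadRouter(["db1", "db2", "db3"])
picks = [router.next_server() for _ in range(6)]
print(picks)  # ['db1', 'db2', 'db3', 'db1', 'db2', 'db3']
```

With three servers sharing the reads, each one does roughly a third of the work it did before.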
Kudos to our admins on this migration: we managed to do the whole switch-over with only about 3 minutes of downtime, very late last night.
And just in case anyone was wondering, we also have an array of “hot-spare” webservers, and a load balancer with automatic failover, so we’re safe if our webservers start misbehaving, too.
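The load balancer’s job can be sketched the same way: health-check each webserver and only send traffic to the ones that respond. Again, the names and logic below are illustrative (the real setup uses dedicated load-balancing software), but the failover idea is the same:

```python
class LoadBalancer:
    """Routes requests to healthy backends, skipping dead ones (toy model)."""
    def __init__(self, backends, health_check):
        self.backends = backends
        self.health_check = health_check  # callable: backend -> bool
        self._i = 0

    def route(self):
        # Try each backend at most once per request; the first healthy one
        # gets the traffic -- this is the "automatic failover".
        for _ in range(len(self.backends)):
            backend = self.backends[self._i % len(self.backends)]
            self._i += 1
            if self.health_check(backend):
                return backend
        raise RuntimeError("no healthy webservers")

healthy = {"web1": False, "web2": True}  # web1 is misbehaving
lb = LoadBalancer(["web1", "web2"], lambda b: healthy[b])
print(lb.route())  # prints "web2" -- traffic skips the dead server
```

From a user’s point of view, a dead webserver simply vanishes from the rotation; requests keep flowing to the survivors.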
Our goal is to get to a place where we don’t even have to think about the reliability of the system: to have it completely functional 100% of the time, whether or not we’re around to look after it (after all, we do occasionally sleep). To be clear, this is just one step on the road to that goal, but it’s a big one — the database server is simultaneously the most critical and fragile component of the system, so it’s great to have it replicated and safe. We’ll keep you posted on further improvements as they happen.
CTO & Co-Founder.