A day to remember!
It was an interesting day. At around 8:40 (GMT) we started getting notifications from our monitoring tools that our database cluster was failing over, one node at a time.
Now this alone normally wouldn’t be a problem: we host with Amazon EC2, so new nodes should automatically fire up and join the cluster, keeping it alive.
But this was no “normal” day. New nodes didn’t fire up, and the cluster did go down.
About Amazon EC2
It’s safe to say that Amazon EC2 is an amazing implementation of cloud computing, and the mechanisms they have in place to allow companies like ours to scale and stay up are the best I have seen in the past 12 years of working in and running internet companies.
Amazon allow us to create “instances”, which are VMs running our own software stack. We can create as many of them as we need, and we can switch them on and off again at will.
Amazon also have a storage solution called EBS (Elastic Block Store), which allows us to provision virtual disks that can be attached to and detached from instances at will.
For the last four years this system has given us ~99.9% uptime and has allowed us to scale to meet our customers’ demands.
What happened to Amazon yesterday?
Yesterday Amazon had a network failure that triggered a re-mirror of all the EBS volumes in the Northern Virginia data centre. This re-mirror, combined with a large number of customers trying to fire up new instances to recover from the network failure, caused Amazon to hit a storage capacity problem.
Basically they didn’t have enough disk space to support the re-mirror, and as their thousands upon thousands of customers tried to fire up replacement instances, the problem compounded itself.
Now it’s not for me to comment on Amazon running out of disk space, as I’m sure there is more to it than that, but those are the details that Amazon have shared with us.
So why did Code Spaces go down?
Take a quick look at the infrastructure overview below. This is a very high-level view of our world: each cluster relies on EBS volumes to store data and to perform block-level (real-time) backups to the offsite backup devices:
Our DB cluster was in the affected region when Amazon had its network failure yesterday, and all our DB volumes were trying to re-mirror and failing due to the capacity issues described above.
Had we known that the outage was going to be as long as it was (Amazon is still experiencing these issues 27 hours later), we would have moved our DB cluster into a new availability zone and started new volumes from our backups.
Unfortunately Amazon took a long time (8 hours) to let us know that this was a big issue, and for those 8 hours we sat on our hands wondering if it would come back… It didn’t, and at around 22:00 GMT we decided we (and you) had waited long enough and started our disaster recovery process.
Why did you wait 8 hours?
This is where we made our mistake. We thought (even after 8 hours) that Amazon would be only minutes away from solving this. Why? Well, Amazon didn’t communicate the severity of the issue; in fact, for the first 8 hours of the outage their network status page indicated only that they had “Performance Issues” (see screenshot). As soon as they admitted that this was a severe issue we started our disaster recovery process. It took us 15 minutes to move everything to a new availability zone and have everything up and running with full data integrity (no data loss).
![Screenshot of Amazon’s network status page reporting “Performance Issues”]()
So you waited 8 hours to do a 15 minute recovery?
In a nutshell, yes! We waited (and waited) when in reality we could have implemented our recovery process and had you guys back up in 15 minutes. For that we are genuinely sorry!
Lesson Learned.
It’s still too early to fully understand the impact of this. Amazon is still experiencing these issues, and many (many) sites are still down as a result (see http://ec2disabled.com for a complete list). However, we have certainly learned a lot, and will be implementing the following ASAP:
Communication Channels
We need to be able to let you guys know what’s happening in these extreme cases, so we have provisioned a server outside of Amazon’s infrastructure to hold a service status page. It will also hold a landing page for www.codespaces.com in the event that this happens again.
We will also be setting up a regular email to all customers (you can opt in or out), which will enable us to communicate with you and give you a way to communicate back with us.
Strict Processes
We have a disaster recovery process, and we have the mechanisms in place to implement it in 15 minutes. We have changed our internal processes and added a few more to ensure that on any kind of outage we are ready to push the button and get everything back up and running, rather than waiting for our hosting company to do it for us.
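To make the idea of “being ready to push the button” concrete, here is a minimal sketch of a time-boxed escalation rule. It is an illustration only, not our actual tooling: the names `OutagePolicy` and `should_start_recovery` are hypothetical, and the 30-minute threshold is an assumed figure chosen for the example, not a number from our real process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class OutagePolicy:
    # Assumed threshold (hypothetical): how long to wait for the hosting
    # provider to recover before starting our own disaster recovery.
    max_wait: timedelta = timedelta(minutes=30)


def should_start_recovery(
    outage_started: datetime,
    now: datetime,
    provider_says_severe: bool,
    policy: OutagePolicy = OutagePolicy(),
) -> bool:
    """Return True once we should stop waiting and run disaster recovery."""
    # If the provider has already declared a severe issue, do not wait.
    if provider_says_severe:
        return True
    # Otherwise wait only for a bounded time, whatever the status page says.
    return now - outage_started >= policy.max_wait


if __name__ == "__main__":
    # Arbitrary example date; only the elapsed time matters here.
    started = datetime(2000, 1, 1, 8, 40, tzinfo=timezone.utc)
    for minutes in (10, 30, 480):
        now = started + timedelta(minutes=minutes)
        decision = should_start_recovery(started, now, provider_says_severe=False)
        print(f"{minutes:>3} min into outage -> start recovery: {decision}")
```

The point of a rule like this is that the decision depends on elapsed time as well as on what the provider reports, so a status page that understates the problem cannot keep us waiting indefinitely.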
The not so lucky ones!
It appears that many sites are still down as a result of this. I can only assume that these sites do not employ the level of backup that we do, and do not have the kind of well-practised and thorough recovery process that we do.
When (or if) these companies will get their data back is still not clear, and I really hope that these guys learn the larger lessons from this.
A long hard look at ourselves
I think this raises an interesting conundrum for anyone who relies on a third party for a core part of their business, be it you guys who use us for hosting your valuable code or companies like us who invest heavily in hosting companies to keep us going. You need to be sure that you can rely on the third party to “do the right thing”, and you need provisions in place to ensure that even in the worst eventuality your investment is safe.
While I wish we had reacted quicker and started our recovery process earlier (and we will learn from this), I hope we have proved to you that even in the worst eventuality (and it doesn’t get much worse than this; again, look at the growing list at ec2disabled.com) we have the systems in place. We have invested, in your best interests, in infrastructure that not only scales but also gives us the confidence that we can keep your data safe and recover from the worst kind of outage.
Here’s to a new day, with new (but slightly less chaotic) challenges!
If you would like to discuss this further, please feel free to email me at f[dot]price[at]codespaces.com. I will also be holding a number of web chats with people, so please contact me if you would like to join one of those.

I can only imagine the anxiety. The important thing is we did not lose any data which is why your service is valuable to us. I appreciate your forthright communication on this.
I think everyone on the net learned something yesterday.
The outages that Amazon have caused are really amazing, and it’s kinda nuts to think the cloud at large could be crippled quite so easily. With that said, I know this wasn’t a “Code Spaces” issue; you were a part of something greater. I appreciate your full disclosure and keeping us informed (unlike Amazon), and I have no doubt you will continue to provide the same level of service and support that you have the whole time I’ve been a member.
Thanks for the open explanation of your problems; it gives me confidence that you are learning from this. Being able to move the status/landing page outside the cloud is a great idea.
Good to hear you were prepared with disaster recovery and real-time backups. I would really love to see a good writeup on the specifics of your infrastructure setup. I am a sysadmin not by choice but because as a small business I have to do it all, so I would love to learn from you how to set it up right.
Good, honest, transparent communication/updates like this sure set a high benchmark for other companies to follow. Amazon EC2 should have had every little angle covered. However, it’s very impressive to know you are prepared for situations like these. Well done Code Spaces.
The ability to identify and admit your own failings is a great quality and inspires far more confidence than excuses and cover-ups. If only our governments could do the same.