Sep 5 2011

Image Source

http://stuff.nigel.mcnie.name/ccnz.jpg

It's been a year and a day since the first of a series of earthquakes hit Canterbury, which peaked in February this year with a 6.3 magnitude quake that took over 180 lives. While I was unaffected by the quakes themselves, I have since thought that I (along with everyone else in Wellington) have "gotten off lightly", due to the simple fact that Wellington straddles a known fault line with an 11% chance of rupture in the next 100 years [1].

The events following the quakes show that Wellingtonians, and indeed all New Zealanders, take this risk seriously. The eq.org.nz project was a fantastic grass-roots effort that showed the "can-do" attitude in our country. People weren't happy just to donate or leave it to the government - instead they pitched in and did something that helped everyone.

After eq.org.nz was shut down, there were some postmortems. Richard Clark wrote on realtime communication in a crisis, development at crisis speed and a look at how technology could automatically swing into effect in a disaster. InternetNZ facilitated a discussion between the internet community and the Wellington City Council, and a great video was made summarising the project. As I was a part of the tech team for the site, I took one lesson away in particular - the tech was not as ready as it should have been.

Yes, we managed to set up a site in a matter of hours after the quake. Yes, the site got significant use, and yes it was helpful to some. But I feel that we can do much better in the future, particularly with regard to the technology. The two areas we could improve on are being prepared and improving the tools.

Being Prepared

From quake to functioning map took us a few hours - a commendable effort, made possible thanks to Ushahidi, their hosted Crowdmap product, and of course the efforts of everyone involved. However, I feel this was much too long.

First, we were working in an emotionally charged, chaotic environment - one in which mistakes would have been easy to make. Second, we had a lot more luck than we could have expected, thanks to the Ushahidi team co-operating with us over the migration away from Crowdmap, and the help of CrisisCommons in getting us started [2]. And third, during the first few hours, stuff.co.nz and NZHerald (the two major online news sources in NZ) both set up their own maps. Had it not been for our connections and their willingness to help, three competing maps could easily have sprung up, none of which would have been much help at all.

Spending hours setting up the site manually is the wrong way to do things, and we all know it. However, we can't predict disasters in advance. So it seems to me that we either need the ability to set up a site in minutes, or a "general" site prepared in advance, which can be adapted within minutes for whatever disaster is taking place.

Let's explore both options briefly. The ability to set a site up in minutes requires some prior planning. However, it makes no assumptions about the nature of the disaster, and allows us to set up and tear down sites as we see fit. Disaster hits two cities? Just deploy two sites. During down time, as long as a minimal amount of maintenance is done, we can keep the process well oiled. Furthermore, we can develop "template" sites - one for earthquakes, one for tornadoes etc. - and deploy the right "type" of site during a disaster, providing great agility.
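To make the "template" idea concrete, here is a minimal sketch of how per-disaster templates could feed a deploy step. Everything in it is an assumption for illustration: the template contents, the function names and the example.org domain are hypothetical, and it only builds the settings for each site rather than installing anything.

```python
from copy import deepcopy

# Hypothetical templates: the category names are illustrative only,
# they are not the categories used on eq.org.nz.
SITE_TEMPLATES = {
    "earthquake": {"categories": ["Hazards", "Water", "Shelter", "Medical"]},
    "tornado": {"categories": ["Debris", "Power outages", "Shelter"]},
}


def build_site_config(disaster_type, city, hostname):
    """Return the settings for one fresh site, built from a named template."""
    if disaster_type not in SITE_TEMPLATES:
        raise ValueError(f"no template for {disaster_type!r}")
    # Copy the template so that editing one site never changes another.
    config = deepcopy(SITE_TEMPLATES[disaster_type])
    config.update({"disaster_type": disaster_type, "city": city, "hostname": hostname})
    return config


def plan_deployments(disaster_type, cities, domain):
    """Build one site config per affected city."""
    return [
        build_site_config(disaster_type, city, f"{city.lower()}.{domain}")
        for city in cities
    ]
```

Calling plan_deployments("earthquake", ["Wellington", "Christchurch"], "example.org") yields two independent configs, one per city, which a real deploy script would then turn into running sites.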

Setting up a site (or sites) in advance has a different set of complications and benefits. A URL that everyone knows can be published in advance of any disaster (e.g. wellington.disaster.org.nz [3]), so when a disaster hits, more people know where to go. However, it may take more work to re-configure the site for different disasters, as opposed to simply deploying a fresh site. And naturally, the site will need maintaining during the time between disasters, which could be many years.

The other aspect of being prepared, as alluded to here, is having data prepared. That doesn't just mean having categories prepared - that means having map data for everything of significance ready. Why should it take until a disaster for us to plot Wellington's ATM locations on a map? Why can't we prepare this data beforehand, ready for use the moment it's needed?

Improving the Tools

We made great strides towards improving Ushahidi for the task of disaster mapping on the eq.org.nz project. However, I fear that a lot of our good work hasn't made it as far as it could. The code sits (rots?) in a github repository, many of our changes having not made it anywhere near upstream - despite still being needed.

I've helped out with a couple of disasters since the quakes, and in each case, I found myself doing the same setup work, then fixing the same bugs, as we did previously. Clearly, pushing some fixes upstream would be a good start. However, the issues with using Ushahidi for crisis mapping run deeper than just a few fixes.

We made many changes that improved the usability of the map and homepage, improved the stability/scalability of the platform, and streamlined the workflows of people approving reports. We didn't push these changes upstream, and I'm reminded of this with every disaster that has occurred since then. The eq.org.nz site really did finish as a much better product, but it's a height we haven't attained since.

To be clear, I'm not blaming anyone for this. People burned out, interest waned and we all had our lives to get back to. But I think it would be great if we could "close the loop". After all, you never know when it's you who will be needing the map...

Closing the Loop

I found out about the Standby Task Force after the Christchurch Quake. They're a network of "adhoc groups of tech-savy mapping volunteers that emerge around crises into a flexible, trained and prepared network ready to deploy." In other words, they're the eq.org.nz team, except larger, more organised, and with a global focus. When disaster strikes, if the locals ask for help, the SBTF are ready to respond, providing a map, volunteers and expertise to get things rolling. Now that I think of it, they're a little like International Rescue (from Thunderbirds), in its infancy. No rocket ships I'm afraid ;), but a group of people willing and able to help in times of emergency.

Somehow, through a combination of my motivation to improve the tech for future disasters, and the prodding of George Chamales and Kirk Morris, I've fallen into the position of SBTF Tech Team Leader. My plan is to, as part of this team, work on both of the technical aspects outlined above. The goal: that anyone will be able to deploy a map ready to handle a disaster within minutes of it occurring.

I've donated some code from Get Your Game On to get started on a system for performing one-command deployments. We're also planning on maintaining a branch of Ushahidi [4] optimised for crisis mapping, to which I hope we can apply many of our patches from eq.org.nz and other sources over time.

With this, I hope we can truly close the loop - so that when Wellington, or anywhere else, is struck by disaster, our response will be as good as it can be.

Final thoughts: I'm only focusing on the tech. There are clearly other issues we need to work through, not least the political ones. I hope next time to see a much closer relationship between government and such volunteer efforts, although it's not an issue I feel I can influence personally.

[1]In researching this I was glad to discover that the 11% figure is a 50% decrease from what was commonly believed before the "It's Our Fault" study. All the same, 11% is not a particularly comforting figure.
[2]In comparison, imagine what it would have been like if the disaster had struck 10 years ago. No CrisisCommons, no Ushahidi, barely any internet to speak of. We are truly lucky.
[3]Naturally, something shorter would be better!
[4]Naturally, we will try to push as much upstream as possible. However, the simple reality is that not all patches may be suitable.

Like this post? Subscribe to my RSS feed and follow me on twitter to hear about new posts early.


Aug 10 2011

Note: I have updated the presentation since first giving it - check out the new one. The new post also includes tips for converting an existing software project to CD.

I gave a talk on Continuous Deployment today. Here are the slides and here are my speakers notes - which will probably be more interesting.

My thinking on CD has advanced since last year, but the essentials remain the same. To do CD is to make a strategic decision to remove fear from the deployment process; to treat your test suite as an asset of the highest value; to truly value user feedback; to remove deployment as an obstacle to any other activity.

What has changed? Over the last year, I've done CD on one project and worked on another using a fortnightly release schedule. I've been able to compare the two and observe first hand just how beneficial CD can be.

Continuous Deployment - Going Fast With Confidence

On the CD project, the complete lack of effort required to deploy changes has been a huge timesaver. I have never felt the need to wait before deploying one change, even if I was about to work on another. Little fixes, in other words, made it to production very quickly, pleasing my customers far more than assurances of "it'll be fixed next week" would have.

I think the best moment was when I noticed a user trying and failing to complete a wizard due to a bug. I fixed the bug and deployed - allowing them, on their sixth try (and probably to their complete surprise), to complete it. If ever there was a moment where I appreciated the value of a robust, quick deployment process, this was it.

Furthermore, this experience highlights one of the key benefits of CD. I could have hacked a fix on production - but it was easier to use the CD process, which included a full test suite run. There's simply no way any other process could have provided the same speed with the same level of assurance - hacking on production would have been the only faster option, and it would have been wildly dangerous. [1]
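The "deploy includes a full test suite run" idea can be sketched in a few lines. This is a minimal illustration rather than the process I actually used: the function name is made up, and the test and deploy commands are whatever your own project uses.

```python
import subprocess
import sys


def deploy_if_tests_pass(test_cmd, deploy_cmd):
    """Run the test suite; run the deploy command only if the tests exit with 0."""
    tests = subprocess.run(test_cmd)
    if tests.returncode != 0:
        # A red test suite stops the deploy, which is what makes deploying safe.
        print("Tests failed; not deploying.", file=sys.stderr)
        return False
    # check=True raises CalledProcessError if the deploy step itself fails.
    subprocess.run(deploy_cmd, check=True)
    return True
```

Both arguments are command lists as accepted by subprocess.run, for example [sys.executable, "-m", "unittest", "discover"] for the tests. The point of the design is that there is only one path to production, and that path always goes through the tests.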

Fortnightly Deployment - The Lie

On the fortnightly deploy project on the other hand, we encountered all the same issues that I'm so tired of.

We'd do a release, then for the next two weeks, some fixes would be marked as so urgent that we had to do a deployment of just that fix, immediately. We'd made sure deployment was as close to a one-command process as possible. However, the process of patching and testing the stable branch was an annoying break in rhythm, given that we were doing most development on trunk [2].

This was actually a point raised by Andy Chilton at the talk today. It seems that many project teams realise that there are some fixes that just have to make it out fast, and as a result they build a separate "hotpatch" channel to accommodate them.

In my view this is madness, no matter how well tended the "hotpatch" process is. Do your hotpatches go through the test suite? They certainly should! And why create a "fast path", and then forbid its use in ways that would delight your customers?

But I think my biggest objection is this: why have two processes when you could just have one? We coders know the evil that lies in needless duplication and complexity - which is exactly what a "hotpatch" system is. Duplication and complexity.

The whole idea of having a separate deployment process exposes the "fortnightly" claim as a lie anyway. Who can honestly claim they deploy every fortnight, if they're hotpatching? [3]

Objections to CD

Perhaps the strongest objection that came up was that clients wouldn't tolerate the possibility of things breaking without them being aware of it. To me, this objection has a slight air of childishness about it - I'd give it more credit if clients ever bothered to hire a world-class QA team, but they never do, and they miss bugs slipping into production all the time even with their checking. I think there's just our old friend, the "Cover Your Ass" policy, at work here.

Besides, nothing about CD precludes the possibility that they can still have a QA team checking things - with the able assistance of feature flags that limit features under development to just them. And I'd contend that the QA team would be just as delighted as the client themselves when told a bug they found half an hour ago was not only fixed on production and ready for them to check again, but that a test had been written to make sure it never happens again.
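For anyone who hasn't met feature flags, here is a minimal sketch of the idea, assuming a simple in-memory flag table. The flag name, the group names and the function are all hypothetical; real projects usually keep flags in configuration or a database.

```python
# Hypothetical flag table: each in-progress feature lists the groups allowed to see it.
FEATURE_FLAGS = {
    "new_signup_wizard": {"qa"},
}


def feature_enabled(feature, user_groups):
    """True if the feature is fully released or the user is in an allowed group."""
    if feature not in FEATURE_FLAGS:
        return True  # no flag entry means the feature is released to everyone
    return bool(FEATURE_FLAGS[feature] & set(user_groups))
```

With this in place, unfinished code can be deployed continuously while only the QA group sees it. Releasing the feature to everyone is then a matter of removing its entry from the table.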

Having said all of this, Brenda Wallace made the point that it all depends on the client, regardless of how good the idea sounds. Some simply won't change from what they know, and at the end of the day it's their project. Perhaps this is why CD is doing so well in the tech startup world - it's the startups themselves who are the clients [4].

Try It For Yourself - I'll Help

All up, it was a great discussion, and it seemed like many there could at least see how CD could be better. If you count yourself among their number, I encourage you to try it out on the next project you do, and see how you go. I'm more than happy to chat with you about it and share experiences if you do, so feel free to contact me if you want to discuss anything about it.


As an aside, I do intend to continue my Web App Performance series; I've just been focused on other things recently. Apart from business, I've joined the Standby Task Force and am developing scripts to automatically deploy an Ushahidi instance within a few minutes of a disaster occurring. More on that in a future post.

[1]I'm the first to admit that this particular example was rather fortuitous, but I think it's even more relevant as your site gets busier. You'll see the errors occurring, diagnose and fix the problem, deploy - and it's inevitable that some customers will then begin to succeed at what they were doing. Contrast with hotpatching, where you could break the site for more people - or a slower deployment process where more people would encounter the problem.
[2]The CD example (the wizard fix) just goes to show how artificial this problem is. We were pushing back because our process made it harder than it should have been. Software development teams around the world do this all the time - lowering customer expectations about how long it takes to fix problems. I think we're doing our clients a disservice.
[3]Substitute "weekly", "monthly" etc. as appropriate. If you tell me you deploy weekly, I bet you do more than 52 deployments in a year.
[4]"Client" is defined here as "the organisation that uses the project for their benefit". For example, Fairfax uses Catalyst IT to develop stuff.co.nz. Fairfax is the client. In a tech startup, it's the startup themselves that gets the benefit from the project, so they're their own client.


May 20 2011

I've just seen this article in ComputerWorld about how the NZ Red Cross scaled its website to handle the aftermath of the Canterbury earthquake on February 22.

I don't know how much the Red Cross themselves know about the details of how their site was fixed, but I know one thing for sure. It was Arjen Lentz and I who fixed it. Not Netspace.

The Story

Here is what the story says happened:

To cope with the spike in traffic Netspace Services Limited added a reverse proxy to handle the static content delivery, tuned PHP to include an op-code cache and also trimmed the PHP modules to the bare minimum. The Apache daemon was also tuned for high turnover of processes to prevent memory bloat. The web server was put onto a new 64-bit Debian operating system and the database was shifted to its own 64-bit hardware and operating system.

The MySQL database was tuned, with indexing optimisations applied and alterations to cache settings to increase performance while reducing overall load. The presentation code was rewritten to push more work to the SQL database, to optimise queries, and to streamline processing. Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world, and return to business as usual.

The above has some grains of truth in it. However, it wasn't Netspace that eventually fixed the site. Read on for what really happened.

The Facts

Immediately after the quake, the NZ IT community pooled together to begin work on what would eventually be eq.org.nz - the Christchurch Recovery Map. I had previously worked on scaling a map for the Queensland Floods earlier in the year, so I became part of the team that worked on the site.

We migrated the site from a temporary home on Crowdmap to dedicated servers, configured and tuned the stack supporting the site (PHP under fastcgi, MySQL, the Ushahidi software that powered the site), and monitored it while the site reached 100,000 visits in a week. A more detailed list of what happened was published in the NZCS Newsletter on the 18th of March.

After the quake, the NZ Red Cross site was very slow to respond, if it responded at all, to most requests. It was obviously under high load with much of the country wanting to donate, but it was clear that most people were not even getting through to the site.

While we were working on eq.org.nz, we saw the Red Cross was having trouble, and the group made the decision to reach out to help. As a result, I was put in touch with Gerard Creamer at Netspace [1]. He connected me to Fletch, who works for them as a sysadmin.

The site was struggling. Fletch was preparing to move the site to a new server, having seemingly abandoned the existing one as a lost cause. I got access to it to see what was going on. Digging around, I found out why it was so slow. The apache powering the site was woefully misconfigured.

The hardware the site was on was, I believed, more than capable of handling the load the site was under. Unfortunately, the software hadn't been configured in any kind of sensible fashion. I made the decision that time would be better spent fixing the software rather than moving to new hardware - which wouldn't have made a button of difference if the site was again misconfigured!

After a pause while we had to reboot the server when it completely locked up, I got to work. I started by configuring the apache to allow many more connections and turned off keepalive, which was just an immediate fix to make the site a little more responsive.
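I didn't record the exact values I chose, but a common rule of thumb for a prefork apache is that the connection limit should never allow more processes than fit in RAM, otherwise the box starts swapping. A rough sketch of that arithmetic follows. The 8 GB figure matches this server; the reserved memory and per-process size are assumptions for illustration, not measurements from the Red Cross machine.

```python
def max_apache_workers(total_ram_mb, reserved_mb, per_process_mb):
    """Largest number of worker processes that fits in RAM without swapping."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return max(0, (total_ram_mb - reserved_mb) // per_process_mb)


# 8192 MB is the server's RAM; 1024 MB reserved and 50 MB per process are assumed.
print(max_apache_workers(total_ram_mb=8192, reserved_mb=1024, per_process_mb=50))
# prints 143
```

The per-process figure matters most: measure it on the real server (for example with ps or top) rather than guessing, because PHP-heavy apache processes can be several times larger than this.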

I then began work on putting nginx in front of the site. Nginx is the reverse proxy mentioned in the story. The idea is that it handles the connections between visitors' browsers and the website, and serves any static files (images/CSS/javascript), instead of the slower, more memory-hungry apache.
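The split of work between the two servers can be modelled in a few lines. This is only a sketch of the routing decision, not nginx configuration: the extension list and the example URLs are assumptions, and the real setup expressed the same rule in nginx's own config.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of file types that the proxy serves straight from disk.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".ico"}


def route(url):
    """Return 'nginx' for static files served directly, 'apache' for everything else."""
    path = urlsplit(url).path  # the query string is ignored when deciding
    extension = posixpath.splitext(path)[1].lower()
    return "nginx" if extension in STATIC_EXTENSIONS else "apache"
```

Under this rule, route("/css/site.css") goes to nginx and route("/index.php") goes to apache. A hypothetical URL like "/image.php?file=logo.png" also goes to apache, which is why images served through PHP scripts get little benefit from the proxy.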

However, while working on nginx, I noticed that at random intervals, anywhere from 30 seconds to 10 minutes after apache was restarted, it would race to MaxClients and the site would lock up. Very strange behaviour, which I initially put down to the high load. I figured nginx would fix it, so I kept an eye on it and restarted apache every time it locked up, while continuing to work on nginx.

After a while, I got nginx in front of the apache. That made an immediate difference to the site performance, now that apache was only handling PHP requests [2]. However, the random apache lockups continued.

Arjen came on the scene somewhere around this point, and began looking at the database. Everything that Netspace claimed they did regarding database tuning, he did. Index optimisations, alterations to cache settings, etc. In fact, Arjen even gave a talk about what he did at the Brisbane PHP & MySQL user group.

We began investigating the lockups in more detail. I've actually blogged about how they were caused and how we fixed them before - "the case of the crashing website" was actually "the case of the crashing NZ Red Cross website". Once we fixed it, success! The site was running smoothly.

The Story vs The Facts

Any mention in the story of Netspace having tuned software is dubious at best. I honestly can't remember whether I installed an opcode cache, whether they did, or whether there even was one. However, the reverse proxy setup and apache configuration were my work. The site was, according to Fletch, running on the existing hardware that it always used to run on, so I'm not convinced it was ever "put onto a new 64-bit Debian operating system" - and that wouldn't have made any difference if it had.

Regarding the MySQL - apart from moving it to a separate machine (which may not have even happened - all I know is that it was on a separate machine when I got there) - all of the tuning was done by Arjen. He was MySQL employee #25, and now runs OpenQuery, a MySQL consultancy. There are tickets in OpenQuery's issue tracking system where he detailed all of the work he did.

The funniest part about the whole story is this:

Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world...

The solution was specified to handle? The solution was a horribly misconfigured apache, an unconfigured MySQL and no reverse proxy, as well as a broken search system (see the case of the crashing website). The "wall" of traffic would only have had to have been 10 visitors at once to floor it.

The solution Arjen and I put in place dealt with the wall with ease. When I logged out of the web server, load was 0.1, and it was using just 700M of ram - in comparison to the swapping-like-crazy, constantly-locking-up mess it was in (aka: load of 20+, all 8G of ram + all swap used).

Gerard sent me an e-mail afterwards:

Hi Nigel,

Thank you so much for your help with Red Cross - we really appreciate it,
and so do the folks at Red Cross.  Your efforts have enabled Red Cross to
accept over 7500 donations since noon yesterday - over $800,000 going
directly to where it's needed.  And that number will continue to climb.

You seem to still have 10 open sessions on the redxprdww02 server - let me
know if you still need them and I can kill them off if you're done.

If there are any expenses or costs please don't hesitate to invoice them to
us.

Netspace Services Limited
PO Box 404
Palmerston North
New Zealand

Thank you again for all of your help.

Regards,
Gerard

Q & A

Q: Nigel, isn't this sour grapes?

I don't think so. There's no mention of us in the article. People will get the impression that it's Netspace that tuned the site if I don't say anything.

I wouldn't have had a problem with it if the article had said that Netspace accepted the help of the team that worked on eq.org.nz, who did [insert all the configuration stuff here].

ComputerWorld is a widely read publication; it doesn't seem right to let such inaccuracies go without correction.

Q: What about the other stuff in the article?

We only ever dealt with the website. Everything else wasn't us, and I know nothing about it.

I never actually dealt with Charles Ranby, the Red Cross IT Manager. I wonder how much he knows.

Q: So what do you think of Netspace?

I hope they simply forgot to mention they had so much help.

Fletch was good to work with - responsive and helpful.

[1]I was actually connected to him at Face, but Face and Netspace seem to be interchangeable - see how Face claims Red Cross New Zealand as their client, at the bottom of the page.
[2]Unfortunately though, not as much as it could have, because the CMS they're using serves images through PHP scripts
