I've just seen this article in ComputerWorld about how the NZ Red Cross scaled its website to handle the aftermath of the Canterbury earthquake on February 22.
I don't know how much the Red Cross themselves know about the details of how their site was fixed, but I know one thing for sure. It was Arjen Lentz and I who fixed it. Not Netspace.
The Story
Here is what the story says happened:
To cope with the spike in traffic Netspace Services Limited added a reverse proxy to handle the static content delivery, tuned PHP to include an op-code cache and also trimmed the PHP modules to the bare minimum. The Apache daemon was also tuned for high turnover of processes to prevent memory bloat. The web server was put onto a new 64-bit Debian operating system and the database was shifted to its own 64-bit hardware and operating system.
The MySQL database was tuned, with indexing optimisations applied and alterations to cache settings to increase performance while reducing overall load. The presentation code was rewritten to push more work to the SQL database, to optimise queries, and to streamline processing. Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world, and return to business as usual.
The above has some grains of truth in it. However, it wasn't Netspace that eventually fixed the site. Read on for what really happened.
The Facts
Immediately after the quake, the NZ IT community pooled together to begin work on what would eventually be eq.org.nz - the Christchurch Recovery Map. I had previously worked on scaling a map for the Queensland Floods earlier in the year, so I became part of the team that worked on the site.
We migrated the site from a temporary home on Crowdmap to dedicated servers, configured and tuned the stack supporting the site (PHP under fastcgi, MySQL, the Ushahidi software that powered the site), and monitored it while the site reached 100,000 visits in a week. A more detailed list of what happened was published in the NZCS Newsletter on the 18th of March.
After the quake, the NZ Red Cross site was very slow to respond, if it responded at all, to most requests. It was obviously under high load with much of the country wanting to donate, but it was clear that most people were not even getting through to the site.
While we were working on eq.org.nz, we saw the Red Cross was having trouble, and the group made the decision to reach out to help. As a result, I was put in touch with Gerard Creamer at Netspace [1]. He connected me to Fletch, who works for them as a sysadmin.
The site was struggling. Fletch was preparing to move the site to a new server, having seemingly abandoned the existing one as a lost cause. I got access to it to see what was going on. Digging around, I found out why it was so slow. The apache powering the site was woefully misconfigured.
The hardware the site was on was, I believed, more than capable of handling the load the site was under. Unfortunately, the software hadn't been configured in any kind of sensible fashion. I made the decision that time would be better spent fixing the software rather than moving to new hardware - which wouldn't have made a button of difference if the site was again misconfigured!
After a pause while we had to reboot the server when it completely locked up, I got to work. I started by configuring the apache to allow many more connections and turned off keepalive, which was just an immediate fix to make the site a little more responsive.
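To see why turning off keepalive is such a quick win, here's a rough back-of-the-envelope sketch. It assumes a process-per-connection apache, where an idle keepalive connection still ties up a whole worker, and uses Little's law to estimate how many workers are busy on average. The traffic figures are made up for illustration, not measurements from the Red Cross server.

```python
def workers_busy(connections_per_second: float,
                 service_seconds: float,
                 keepalive_idle_seconds: float = 0.0) -> float:
    """Little's law: average busy workers = arrival rate x time each
    connection holds on to a worker (serving plus idling on keepalive)."""
    return connections_per_second * (service_seconds + keepalive_idle_seconds)


# Hypothetical numbers: 50 new connections/s, 0.2 s to serve a page,
# and a 5 s keepalive timeout that each client sits out in full.
with_keepalive = workers_busy(50, 0.2, keepalive_idle_seconds=5.0)
without_keepalive = workers_busy(50, 0.2)

print(f"keepalive on:  ~{with_keepalive:.0f} workers tied up")
print(f"keepalive off: ~{without_keepalive:.0f} workers tied up")
```

This is a worst case (it assumes nobody reuses their connection), but it shows the shape of the problem: under a flood of one-off visitors, most workers end up waiting on idle connections rather than serving pages.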
I then began work on putting nginx in front of the site. Nginx is the reverse proxy mentioned in the story. The idea is that it handles the connections between visitors' browsers and the website, and serves any static files (images/CSS/javascript), instead of the slower, more memory-hungry apache.
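As a minimal sketch of that split (not the actual nginx configuration, which isn't reproduced here), the routing decision boils down to something like the following. The extension list and the `route` function name are my own assumptions for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical list of file types the proxy serves straight from disk.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico"}


def route(request_path: str) -> str:
    """Decide who answers a request: the proxy itself ("static")
    or the apache behind it ("backend")."""
    path = request_path.split("?", 1)[0]  # ignore the query string
    suffix = PurePosixPath(path).suffix.lower()
    return "static" if suffix in STATIC_EXTENSIONS else "backend"


for example in ("/css/site.css", "/images/logo.PNG", "/index.php", "/image.php?id=3"):
    print(example, "->", route(example))
```

Note the last example: an image generated by a PHP script still goes to the backend, so it gets none of the benefit of the static file path.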
However, while working on nginx, I noticed that at random intervals, anywhere from 30 seconds to 10 minutes after apache was restarted, it would race to its MaxClients limit and the site would lock up. Very strange behaviour, which I initially put down to the high load. I figured nginx would fix it, so I kept an eye on it and restarted apache every time it locked up, while continuing to work on nginx.
After a while, I got nginx in front of the apache. That made an immediate difference to the site performance, now that apache was only handling PHP requests [2]. However, the random apache lockups continued.
Arjen came on the scene somewhere around this point, and began looking at the database. Everything that Netspace claimed they did regarding database tuning, he did. Index optimisations, alterations to cache settings, etc. In fact, Arjen even gave a talk about what he did at the Brisbane PHP & MySQL user group.
We began investigating the lockups in more detail. I've actually blogged about how they were caused and how we fixed them before - "the case of the crashing website" was actually "the case of the crashing NZ Red Cross website". Once we fixed it, success! The site was running smoothly.
The Story vs The Facts
Any mention in the story of Netspace having tuned software is dubious at best. I honestly can't remember whether I installed an opcode cache, whether they did, or whether there even was one. However, the reverse proxy setup and apache configuration were my work. The site was, according to Fletch, running on the existing hardware that it always used to run on, so I'm not convinced it was ever "put onto a new 64-bit Debian operating system" - and that wouldn't have made any difference if it had.
Regarding the MySQL - apart from moving it to a separate machine (which may not have even happened - all I know is that it was on a separate machine when I got there) - all of the tuning was done by Arjen. He was MySQL employee #25, and now runs OpenQuery, a MySQL consultancy. There are tickets in OpenQuery's issue tracking system where he detailed all of the work he did.
The funniest part about the whole story is this:
Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world...
The solution was specified to handle? The solution was a horribly misconfigured apache, an unconfigured MySQL and no reverse proxy, as well as a broken search system (see the case of the crashing website). The "wall" of traffic would only have had to have been 10 visitors at once to floor it.
The solution Arjen and I put in place dealt with the wall with ease. When I logged out of the web server, load was 0.1, and it was using just 700M of ram - in comparison to the swapping-like-crazy, constantly-locking-up mess it was in (aka: load of 20+, all 8G of ram + all swap used).
Gerard sent me an e-mail afterwards:
Hi Nigel,
Thank you so much for your help with Red Cross - we really appreciate it, and so do the folks at Red Cross. Your efforts have enabled Red Cross to accept over 7500 donations since noon yesterday - over $800,000 going directly to where it's needed. And that number will continue to climb.
You seem to still have 10 open sessions on the redxprdww02 server - let me know if you still need them and I can kill them off if you're done.
If there are any expenses or costs please don't hesitate to invoice them to us.
Netspace Services Limited, PO Box 404, Palmerston North, New Zealand
Thank you again for all of your help.
Regards,
Gerard
Q & A
Q: Nigel, isn't this sour grapes?
I don't think so. There's no mention of us in the article. People will get the impression that it's Netspace that tuned the site if I don't say anything.
I wouldn't have had a problem with it if the article had said that Netspace accepted the help of the team that worked on eq.org.nz, who did [insert all the configuration stuff here].
ComputerWorld is a widely read publication, and it doesn't seem right to let such inaccuracies go without correction.
Q: What about the other stuff in the article?
We only ever dealt with the website. Everything else wasn't us, and I know nothing about it.
I never actually dealt with Charles Ranby, the Red Cross IT Manager. I wonder how much he knows.
Q: So what do you think of Netspace?
I hope they simply forgot to mention they had so much help.
Fletch was good to work with - responsive and helpful.
[1] I was actually connected to him at Face, but Face and Netspace seem to be interchangeable - see how Face claims Red Cross New Zealand as their client, at the bottom of the page.
[2] Unfortunately though, not as much as it could have, because the CMS they're using serves images through PHP scripts.
