Nigel McNie http://nigel.mcnie.name/ Website and blog of Nigel McNie, entrepreneur and founder of Shoptime (prior: Mahara and GeSHi). Nigel writes about webapps and being remarkable. en Fri, 04 Nov 2011 12:59 NZDT Fri, 04 Nov 2011 12:59 NZDT nigel@mcnie.name Veedub 5 Nigel McNie http://nigel.mcnie.name/static/images/nigel.jpg http://nigel.mcnie.name/ Realtime Conversations Made Easy (my latest job) http://nigel.mcnie.name/blog/realtime-conversations-made-easy-buzzumi <p>Why is it so hard - technically - to have a realtime conversation with people all over the world?</p> <p><em>&quot;It's easy, just use Skype!&quot;</em>. Not so fast. People need to install Skype - this is a task too hard for Grandma (unless you visit first and do it for her). And there's the whole contacts process. If you want to talk to me, we have to find each other and add contacts. After that, Skype will bother you when it's my birthday even though you only wanted to have one little chat. I could go on... <a class="footnote-reference" href="#id5" id="id1">[1]</a></p> <p><em>&quot;You're a geek, use IRC!&quot;</em>. Of course, IRC is basically unusable by many people whom you'd like to talk to - some co-workers, your family, certainly not Grandma. At least there's no contacts to manage. But there's no audio/video, and you can't &quot;drop in and out&quot; of an ongoing conversation very smoothly - if you lose your connection to the chat, you lose the messages sent while you were gone <a class="footnote-reference" href="#id6" id="id2">[2]</a>.</p> <p><em>&quot;Use google chat/jabber!&quot;</em>. Worst of both worlds. Low penetration, high complexity.</p> <p><em>OK then, just use... hmm</em>. Exactly.</p> <div class="section" id="ad-hoc-easy"> <h1>Ad-hoc &amp; Easy</h1> <p>There are two problems that current solutions have. One is that our conversations are ad-hoc, sometimes with people you've never met before and only have a fleeting connection to. 
Skype's contacts are not ad-hoc.</p> <p>The other is that people need easy. Configuring IRC is hard. Heck, installing a damn program is hard for many people. Twitter and Facebook get close, but holding a decent conversation on Twitter is impossible, and Facebook is too autistic for people to use for ad-hoc communication. Besides, even <em>signing up</em> is often too hard. How many services do you not use because you would have needed to sign up first?</p> <p>When I worked on <a href="http://www.mahara.org/">Mahara</a>, we used IRC. A few enthusiastic types joined us, but by and large, IRC remains to this day inaccessible to many people. Not all users of Mahara are willing or able to use it.</p> <p>And when the Christchurch Earthquake hit this year, the response team that built eq.org.nz used Skype - which was by most accounts a terrible solution. Skype chat simply has too many bugs when you get to rooms of more than a few people - and we had over a hundred. Richard wrote a post-mortem that <a href="http://phirate.posterous.com/real-time-communication-in-a-crisis">lays out the issues for communicating in a crisis</a>.</p> </div> <div class="section" id="introducing-buzzumi"> <h1>Introducing buzzumi</h1> <p>So it's been my great privilege to have worked on a new service that addresses these problems. With <a href="http://phirate.posterous.com/">Richard</a> (a fantastic webapp developer who happens to be my cousin), we've been the tech team behind <a href="https://buzzumi.com/">buzzumi</a>, which just launched (he's lead, I'm wingman). It's a webapp that lets you create ad-hoc discussions - text chat, audio and video (A/V is completely optional). Your chat is accessible to anyone who can click on a link, <strong>without them logging in</strong> <a class="footnote-reference" href="#id7" id="id3">[3]</a>. 
It's simple, beautiful, and lightning fast.</p> <img alt="http://f.dollyfish.net.nz/5d7c4c" src="http://f.dollyfish.net.nz/5d7c4c" style="height: 290px;" /> <img alt="http://stuff.nigel.mcnie.name/buzz-video.png" src="http://stuff.nigel.mcnie.name/buzz-video.png" style="height: 290px;" /> <p>The chat host can set the background, and it changes for everyone in the chat. There's no limit on participants, and up to six people can use A/V at once <a class="footnote-reference" href="#id8" id="id4">[4]</a>. It's great for team meetings, one on one discussions, and even webinars with hundreds of guests watching a broadcast. There's nothing to install (except flash), and no barrier to entry. You don't have to know the other participants, and when you're done, you can close the chat and never see them again.</p> <p>Perhaps one of the coolest features is that you can make a chat have a fee for entry. buzzumi handles all of the payments for you. So you could hold a webinar on a topic you're an expert in, charge $50 to enter, and simply collect your profits when you're done (buzzumi takes a 10% cut).</p> </div> <div class="section" id="please-try-it-out"> <h1>Please Try It Out!</h1> <p>If you're in education, how could you use this for your school/university? If you're in disaster response, can we please ditch Skype for this? And if you know me from somewhere else, I'd love to hear your thoughts about it, technical or otherwise.</p> <p><a href="https://buzzumi.com/">Give it a try</a>, and let me know what you think! 
I have created <a href="https://buzzumi.com/nigel/BqI2Wq7lFX">a chat which I&#39;ll hang out in for a while</a> as well, come by and say hello if you have a second.</p> <p>ps: <a href="http://www.getyourgameon.co.nz/">Get Your Game On</a> is still rolling, we are busy running much of the summer football around Wellington.</p> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>.. in this footnote :). Skype can <a class="reference external" href="http://en.wikipedia.org/wiki/Supernode_(networking)">suck bandwidth when you're not around</a>, the new UI <a href="http://blog.brizk.com/post/4508062266/on-the-skype-ui-disaster">is</a> <a href="http://community.skype.com/t5/Windows/Problems-with-Skype-UI-since-5-x-even-to-5-5/td-p/80402">crap</a>, it <a href="http://en.wikipedia.org/wiki/Skype_security#Flaws_and_potential_flaws">promises encryption but has flaws including rumoured government backdoors</a>, and it's now owned by Microsoft. Eew.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Geeks have ways to get around this, through hax that allow them to use IRC on internet connections more reliable than their own. 
This is a workaround, not a true solution.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>To create chats, you do have to sign up - but your guests do not.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>Just quietly, the real limit is quite a bit higher, although we give no guarantees that it'll work properly beyond 6. I think the current record is 14. And note that this limit applies to people <em>publishing</em> their A/V, so you can have just one or two publishers with many more watching.</td></tr> </tbody> </table> </div> Continuous Deployment: Reprise http://nigel.mcnie.name/blog/continuous-deployment-reprise <p>Here are my <a href="http://www.slideshare.net/nigelmcnie/the-why-and-how-of-continuous-delivery/download">current slides</a> (which include my speaker notes) for the CD talk I've been doing around Wellington recently. Here it is embedded (CC-BY-SA):</p> <div style="width:425px; margin: 0 auto" id="__ss_9327958"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/nigelmcnie/the-why-and-how-of-continuous-delivery" title="The Why and How of Continuous Delivery" target="_blank">The Why and How of Continuous Delivery</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/9327958" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe></div><p>There are some changes from <a href="/blog/continuous-deployment-a-better-software-deployment-strategy">the first time I gave it</a>. 
Sadly, I have failed to shorten it, but here are the main changes:</p> <ul class="simple"> <li>I've renamed it to Continuous <em>Delivery</em>. This is simply making the point that the aim is Continuous Deployment <strong>to production</strong>. If you're doing it up until staging/test, and not production, you won't be getting the full range of benefits - although getting as far as staging is a good start if you're converting an existing project to CD.</li> <li>I have backed down slightly regarding my position on Feature Flags. Previously, I believed they were essential, but <a href="http://scottchacon.com/2011/08/31/github-flow.html">Github clearly manage without them</a>. As such, I've downgraded them to a &quot;strong recommendation&quot;. Feature Flags give you dark rollouts/kill switches and so on, which feature branches can never provide.</li> </ul> <p>I saw some feedback saying that there were not enough concrete examples. I agree, to some extent - however, I still find that many people haven't even <em>heard</em> of CD. As such, my presentation is designed as a consciousness-raising one - hopefully people who are interested will search around and find out more. If I had more time, it'd be great to walk through an example (IMVU springs to mind), but the hard reality is that I need to look for things to take <em>out</em>.</p> <p>There were also a couple of questions around how to convert an existing project to CD. So here's my suggested strategy for accomplishing that:</p> <div class="section" id="start-with-a-test-suite"> <h1>1. Start with a Test Suite</h1> <p>Remember the goal is to <strong>go fast with confidence</strong>. If you're just going fast, you're simply speeding, and will soon become another statistic.</p> <p>Okay, end of car analogy. However, you need to build confidence, which is what the test suite is for. Start there. You don't need to over-do it, but be smart. 
Start by ensuring it's easy to add tests, and then write relatively comprehensive tests for the critical flows in your application. What these are is entirely up to you - if you don't know them, maybe you had better start there :).</p> <p>You should also set up a CI server, and get the tests running on every commit. If nothing else, this should enhance your existing development process.</p> </div> <div class="section" id="rig-up-continuous-deployment-to-staging"> <h1>2. Rig up Continuous Deployment to Staging</h1> <p>Staging, test, UAT, whatever you call these environments that aren't production - get continuous deployment working at least that far. These environments can be broken with relative inconsequence, so you can work with confidence initially. Over time, you should find their uptime improves, especially as your test suite begins to show its strength.</p> <p>In particular, in this step, your aim is to put the test suite in the way of deployment. It should be impossible to deploy without a test run having completed successfully on the code being deployed.</p> <p>Note that the presentation suggests that you employ a &quot;Commit triggers tests, tests passing triggers deployment&quot; tactic. This is a simplification. In larger teams, it may make more sense to work more like this: &quot;tests are run every 15 minutes, with success resulting in deployment&quot;. This allows many commits to be made in one window (common for large teams), with the knowledge that anything you commit will take at worst 30 minutes to make it to production (assuming the tests pass).</p> </div> <div class="section" id="the-management-gambit"> <h1>3. The Management Gambit</h1> <p>You will know you've done the first two steps when the tech team thinks &quot;this is working nicely all the way to staging - why not to production?&quot;. 
There's one final obstacle in your way - the fear of everyone outside the tech team.</p> <p>On some level, management/clients are going to have to get comfortable with the idea of code being deployed without their approval, and often even without their knowledge. You can use presentations like mine to help here. Remember, if you can win over just one person, they can help sell the idea to everyone else.</p> <p>At a base level, they most likely will fear the lack of control they will have. To combat this, tell them that you'll implement the system, but they have the right to demand change freezes - periods of time where you won't be allowed to deploy. They probably already have this right, but re-affirm it anyway.</p> <p>This is false security for them. Martyn has a sign on his desk, which reads <em>&quot;Change Freeze (n): A time in which more changes are made to production than usual&quot;</em>. When there are urgent problems, change freezes be damned - changes are made, and we all know it. The next time they have a change freeze, they'll ask for a change, and instead of flailing about you'll be able to make it with a smile. They'll never have another freeze again.</p> <hr class="docutils" /> <p>I also mentioned during the talk a USB rocket launcher for retaliating against people who break the build. 
<a href="http://www.papercut.com/blog/chris/2011/08/19/who-broke-the-build/">Enjoy!</a></p> </div> Closing The Loop - The Canterbury Quakes, eq.org.nz And The SBTF http://nigel.mcnie.name/blog/closing-the-loop-the-canterbury-quakes-eq-org-nz-and-the-sbtf <p><em><a href="http://www.flickr.com/photos/greenfluoro/5472977899/in/photostream/">Image Source</a></em></p> <img align="right" alt="http://stuff.nigel.mcnie.name/ccnz.jpg" class="align-right" src="http://stuff.nigel.mcnie.name/ccnz.jpg" /> <p>It's been a year and a day since the first of a series of earthquakes hit Canterbury, which peaked in February this year with a <a href="http://en.wikipedia.org/wiki/February_2011_Christchurch_earthquake">6.3 magnitude quake that took over 180 lives</a>. While I was unaffected by the quakes themselves, I have since thought that I (along with everyone else in Wellington) have &quot;gotten off lightly&quot;, due to the simple fact that <a href="http://db.nzsee.org.nz/2010/Paper23.pdf">Wellington straddles a known fault line with an 11% chance of rupture in the next 100 years</a> <a class="footnote-reference" href="#id5" id="id1">[1]</a>.</p> <p>The events following the quakes show that Wellingtonians, and indeed all New Zealanders, take this risk seriously. The <a href="http://eq.org.nz">eq.org.nz</a> project was a fantastic grass-roots effort that showed the &quot;can-do&quot; attitude in our country. People weren't happy just to donate or leave it to the government - instead they pitched in and did something that helped everyone.</p> <p>After eq.org.nz was shut down, there were some postmortems. 
Richard Clark wrote on <a href="http://phirate.posterous.com/real-time-communication-in-a-crisis">realtime communication in a crisis</a>, <a href="http://phirate.posterous.com/development-at-crisis-speed">development at crisis speed</a> and a look at <a href="http://phirate.posterous.com/many-minds-the-value-of-the-crowd-in-a-disast">how technology could automatically swing into effect in a disaster</a>. InternetNZ facilitated a <a href="http://internetnz.net.nz/news/blog/2011/Helping-Wellington-Get-Thru">discussion between the internet community and the Wellington City Council</a>, and <a href="http://www.youtube.com/watch?v=Zv7gEhKEMmw">a great video was made summarising the project</a>. As I was a part of the tech team for the site, I took one lesson away in particular - <strong>the tech was not as ready as it should have been</strong>.</p> <p>Yes, we managed to set up a site in a matter of hours after the quake. Yes, the site got significant use, and yes it was helpful to some. But I feel that we can do much better in the future, particularly with regards to the technology. The two areas we could improve on are <strong>being prepared</strong> and <strong>improving the tools</strong>.</p> <div class="section" id="being-prepared"> <h1>Being Prepared</h1> <p>From quake to functioning map took us a few hours - a commendable effort, made possible thanks to <a href="http://www.ushahidi.com/">Ushahidi</a>, their hosted <a href="http://www.crowdmap.com/">Crowdmap</a> product, and of course the efforts of everyone involved. However, I feel this was much too long.</p> <p>Firstly, we were working in an emotionally charged, chaotic environment - one in which mistakes would have been easy to make. 
Second, we were given a lot more luck than we may have otherwise expected, thanks to the Ushahidi team co-operating with us over the migration away from Crowdmap, and the help of <a href="http://crisiscommons.org/">CrisisCommons</a> in getting us started <a class="footnote-reference" href="#id6" id="id2">[2]</a>. And third, during the first few hours, stuff.co.nz and NZHerald (the two major online news sources in NZ) both set up their own maps, and had it not been for our connections and their willingness to help, three competing maps could easily have sprung up, none of which would have been much help at all.</p> <p>Spending hours setting up the site manually is the wrong way to do things, and we all know it. However, we can't predict disasters in advance. So it seems to me that we either need the ability to set up a site in minutes, or a &quot;general&quot; site prepared in advance, which can be adapted within minutes for whatever disaster is taking place.</p> <p>Let's explore both options briefly. The ability to set a site up in minutes requires some prior planning, however it makes no assumptions about the nature of the disaster, and allows us to set up and tear down sites as we see fit. Disaster hits two cities? Just deploy two sites. During down time, as long as a minimal amount of maintenance is done, we can keep the process well oiled. Furthermore, we can develop &quot;template&quot; sites - one for earthquakes, one for tornadoes etc., and deploy the right &quot;type&quot; of site during a disaster, providing great agility.</p> <p>Setting up a site (or sites) in advance has a different set of complications and benefits. A URL can be published in advance of any disaster, that everyone knows, so when a disaster hits more people know where to go (e.g. wellington.disaster.org.nz <a class="footnote-reference" href="#id7" id="id3">[3]</a>). 
However, it may take more work to be able to re-configure the site for different disasters, as opposed to simply deploying a fresh site. And naturally, the site will need maintaining during the time between disasters, which could be many years.</p> <p>The other aspect of being prepared, as alluded to here, is <strong>having data prepared</strong>. That doesn't just mean having categories prepared - that means having map data for everything of significance ready. Why should it take until a disaster for us to plot Wellington's ATM locations on a map? Why can't we prepare this data beforehand, ready for use the moment it's needed?</p> </div> <div class="section" id="improving-the-tools"> <h1>Improving the Tools</h1> <p>We made great strides towards improving Ushahidi for the task of disaster mapping on the eq.org.nz project. However, I fear that a lot of our good work hasn't made it as far as it could. <a href="https://github.com/ccnz/Ushahidi_Web">The code</a> sits (rots?) in a github repository, many of our changes having not made it anywhere near upstream - despite still being needed.</p> <p>I've helped out with a couple of disasters since the quakes, and in each case, I found myself doing the same setup work, then fixing the same bugs, as we did previously. Clearly, pushing some fixes upstream would be a good start. However, the issues with using Ushahidi for crisis mapping run deeper than just a few fixes.</p> <p>We made many changes that improved the usability of the map and homepage, improved <a href="fixing-website-performance-issues-whack-a-mole">the stability/scalability of the platform</a>, and streamlined the workflows of people approving reports. We didn't push these changes upstream, and I'm reminded of this with every disaster that has occurred since then. The eq.org.nz site really did finish as a much better product, but it's a height we haven't attained since.</p> <p>To be clear, I'm not blaming anyone for this. 
People burned out, interest waned and we all had our lives to get back to. But I think it would be great if we could &quot;close the loop&quot;. After all, you never know when it's you who will be needing the map...</p> </div> <div class="section" id="closing-the-loop"> <h1>Closing the Loop</h1> <p>I found out about the <a href="http://blog.standbytaskforce.com/about/">Standby Task Force</a> after the Christchurch Quake. They're a network of <strong>&quot;adhoc groups of tech-savy mapping volunteers that emerge around crises into a flexible, trained and prepared network ready to deploy.&quot;</strong> In other words, they're the eq.org.nz team, except larger, more organised, and with a global focus. When disaster strikes, if the locals ask for help, the SBTF are ready to respond, providing a map, volunteers and expertise to get things rolling. Now I think of it, they're a little like International Rescue (from Thunderbirds), in its infancy. No rocket ships I'm afraid ;), but a group of people willing and able to help in times of emergency.</p> <p>Somehow, through a combination of my motivation to improve the tech for future disasters, and the prodding of <a href="http://twitter.com/georgechamales">George Chamales</a> and <a href="http://twitter.com/Jasper_Johns">Kirk Morris</a>, I've fallen into the position of SBTF Tech Team Leader. My plan is to, as part of this team, work on both of the technical aspects outlined above. The goal: that <strong>anyone will be able to deploy a map ready to handle a disaster within minutes of it occurring</strong>.</p> <p>I've donated some code from <a href="http://www.getyourgameon.co.nz/">Get Your Game On</a> to get started on a <a href="https://github.com/StandbyTaskForce/sbtf-ops">system for performing one-command deployments</a>. 
We're also planning on maintaining a branch of Ushahidi <a class="footnote-reference" href="#id8" id="id4">[4]</a> optimised for crisis mapping, to which I hope we can apply many of our patches from eq.org.nz and other sources over time.</p> <p>With this, I hope we can truly close the loop - so that when Wellington, or anywhere else, is struck by disaster, our response will be as good as it can be.</p> <p><em>Final thoughts: I'm only focusing on the tech. There are clearly other issues we need to work through, not least the political ones. I hope next time to see a much closer relationship between government and such volunteer efforts, although it's not an issue I feel I can influence personally.</em></p> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>In researching this I was glad to discover that the 11% figure is a 50% decrease in what was commonly believed before the &quot;It's Our Fault&quot; study. All the same, 11% is not a particularly comforting figure.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>In comparison, imagine what it would have been like if the disaster had struck 10 years ago. No CrisisCommons, no Ushahidi, barely any internet to speak of. 
We are truly lucky.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Naturally, something shorter would be better!</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>Naturally, we will try and push as much upstream as possible, however there's a simple reality that not all patches may be suitable.</td></tr> </tbody> </table> </div> Continuous Deployment - A Better Software Deployment Strategy http://nigel.mcnie.name/blog/continuous-deployment-a-better-software-deployment-strategy <p><em>Note: I have updated the presentation since first giving it - <a href="/blog/continuous-deployment-reprise">check out the new one</a>. The new post also includes tips for converting an existing software project to CD.</em></p> <p>I gave a talk on Continuous Deployment today. Here are the <a href="http://stuff.nigel.mcnie.name/continuous-deployment/">slides</a> and here are my <a href="https://docs.google.com/document/d/1S-rftz6BmPosEw2T5Qf2-JA88wcnyTIfGogvbUEJSjg/edit?hl=en_US">speakers notes</a> - which will probably be more interesting.</p> <p>My thinking on CD has advanced since last year, but the essentials remain the same. To do CD is to make a strategic decision to remove fear from the deployment process; to treat your test suite as an asset of the highest value; to truly value user feedback; to remove deployment as an obstacle to any other activity.</p> <p>What has changed? Over the last year, I've done CD on one project and worked on another using a fortnightly release schedule. 
I've been able to compare the two and observe first hand just how beneficial CD can be.</p> <div class="section" id="continuous-deployment-going-fast-with-confidence"> <h1>Continuous Deployment - Going Fast With Confidence</h1> <p>On the CD project, the complete lack of effort required to deploy changes has been a <em>huge</em> timesaver. I have never felt the need to wait before deploying one change, even if I was about to work on another. Little fixes, in other words, made it to production very quickly, pleasing my customers far more than assurances of &quot;it'll be fixed next week&quot; would have.</p> <p>I think the best moment was when I noticed a user trying and failing to complete a wizard due to a bug. I fixed the bug and deployed - allowing them, on their sixth try (and probably to their complete surprise), to complete it. If ever there was a moment where I appreciated the value of a robust, quick deployment process, this was it.</p> <p>Furthermore, this experience highlights one of the key benefits of CD. I could have hacked a fix on production - <strong>but it was easier to use the CD process, which included a full test suite run</strong>. There's simply no way any other process could have provided the same speed with the same level of assurance - hacking on production would have been the only faster option, and it would have been wildly dangerous. <a class="footnote-reference" href="#id5" id="id1">[1]</a></p> </div> <div class="section" id="fortnightly-deployment-the-lie"> <h1>Fortnightly Deployment - The Lie</h1> <p>On the fortnightly deploy project on the other hand, we encountered all the same issues that I'm so tired of.</p> <p>We'd do a release, then for the next two weeks, some fixes would be marked as so urgent that we had to do a deployment of just that fix, immediately. We'd made sure deployment was as close to a one-command process as possible. 
However, the process of patching and testing the stable branch was an annoying break in rhythm, given that we were doing most development on trunk <a class="footnote-reference" href="#id6" id="id2">[2]</a>.</p> <p>This was actually a point raised by <a href="http://www.chilts.org/">Andy Chilton</a> at the talk today. It seems that many project teams realise that there are some fixes that just have to make it out fast, and as a result they build a separate &quot;hotpatch&quot; channel to accommodate them.</p> <p>In my view this is madness, no matter how well tended the &quot;hotpatch&quot; process is. Do your hotpatches go through the test suite? They certainly should! And why create a &quot;fast path&quot;, and then forbid its use in ways that would delight your customers?</p> <p>But I think my biggest objection is this: why have two processes when you could just have one? We coders know the evil that lies in needless duplication and complexity - which is exactly what a &quot;hotpatch&quot; system is. Duplication and complexity.</p> <p>The whole idea of having a separate deployment process exposes the &quot;fortnightly&quot; claim as a lie anyway. Who can honestly claim they deploy every fortnight, if they're hotpatching? <a class="footnote-reference" href="#id7" id="id3">[3]</a></p> </div> <div class="section" id="objections-to-cd"> <h1>Objections to CD</h1> <p>Perhaps the strongest objection that came up was that clients wouldn't tolerate the possibility of things breaking without them being aware of it. To me, this objection has a slight air of childishness about it - I'd give it more credit if clients ever bothered to hire a world-class QA team, but they never do, and they miss bugs slipping into production all the time even with their checking. 
I think there's just our old friend, the &quot;Cover Your Ass&quot; policy, at work here.</p> <p>Besides, nothing about CD precludes the possibility that they can still have a QA team checking things - with the able assistance of feature flags that limit features under development to just them. And I'd contend that the QA team would be just as delighted as the client themselves when told a bug they found half an hour ago was not only fixed on production and ready for them to check again, but that a test had been written to make sure it never happens again.</p> <p>Having said all of this, <a href="http://coffee.geek.nz/">Brenda Wallace</a> made the point that it all depends on the client, regardless of how good the idea sounds. Some simply won't change from what they know, and at the end of the day it's their project. Perhaps this is why CD is doing so well in the tech startup world - it's the startups themselves who are the clients <a class="footnote-reference" href="#id8" id="id4">[4]</a>.</p> </div> <div class="section" id="try-it-for-yourself-i-ll-help"> <h1>Try It For Yourself - I'll Help</h1> <p>All up, it was a great discussion, and it seemed like many there could at least see how CD could be better. If you count yourself among their number, I encourage you to try it out on the next project you do, and see how you go. I'm more than happy to chat with you about it and share experiences if you do, so feel free to contact me if you want to discuss anything about it.</p> <hr class="docutils" /> <p><em>As an aside, I do intend to continue my <a href="fixing-website-performance-issues-hardware">Web App Performance series</a>, I've just been focused on other things recently. Apart from business, I've joined the <a href="http://blog.standbytaskforce.com/">Standby Task Force</a> and am developing scripts to automatically deploy an <a href="http://www.ushahidi.com/">Ushahidi</a> instance within a few minutes of a disaster occurring. 
More on that in a future post.</em></p> <table class="docutils footnote" frame="void" id="id5" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>I'm the first to admit that this particular example was rather fortuitous, but I think it's even more relevant as your site gets busier. You'll see the errors occurring, diagnose and fix the problem, deploy - and it's inevitable that some customers will then begin to succeed at what they were doing. Contrast with hotpatching, where you could <em>break</em> the site for more people - or a slower deployment process where <em>more</em> people would encounter the problem.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id6" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>The CD example (the wizard fix) just goes to show how artificial this problem is. We were pushing back because our process made it harder than it should have been. Software development teams around the world do this all the time - lowering customer expectations about how long it takes to fix problems. I think we're doing our clients a disservice.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Substitute &quot;weekly&quot;, &quot;monthly&quot; etc. as appropriate. 
If you tell me you deploy weekly, I bet you do more than 52 deployments in a year.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>&quot;Client&quot; is defined here as &quot;the organisation that uses the project for their benefit&quot;. For example, Fairfax uses Catalyst IT to develop stuff.co.nz. Fairfax is the client. In a tech startup, it's the startup themselves that gets the benefit from the project, so they're their own client.</td></tr> </tbody> </table> </div> NZ Red Cross Website Performance http://nigel.mcnie.name/blog/nz-red-cross-website-performance <p>I've just seen <a href="http://computerworld.co.nz/news.nsf/news/emergency-response-red-cross-it-and-the-quake-aftermath">this article in ComputerWorld</a> about how the NZ Red Cross scaled its website to handle the aftermath of the Canterbury earthquake on February 22.</p> <p>I don't know how much the Red Cross themselves know about the details of how their site was fixed, but I know one thing for sure. It was <a href="http://twitter.com/arjenlentz">Arjen Lentz</a> and I who fixed it. Not Netspace.</p> <div class="section" id="the-story"> <h1>The Story</h1> <p>Here is what the story says happened:</p> <blockquote> <p>To cope with the spike in traffic Netspace Services Limited added a reverse proxy to handle the static content delivery, tuned PHP to include an op-code cache and also trimmed the PHP modules to the bare minimum. The Apache daemon was also tuned for high turnover of processes to prevent memory bloat. The web server was put onto a new 64-bit Debian operating system and the database was shifted to its own 64-bit hardware and operating system.</p> <p>The MySQL database was tuned, with indexing optimisations applied and alterations to cache settings to increase performance while reducing overall load. 
The presentation code was rewritten to push more work to the SQL database, to optimise queries, and to streamline processing. Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world, and return to business as usual.</p> </blockquote> <p>The above has some grains of truth in it. However, it wasn't Netspace that eventually fixed the site. Read on for what really happened.</p> </div> <div class="section" id="the-facts"> <h1>The Facts</h1> <p>Immediately after the quake, the NZ IT community pooled together to begin work on what would eventually be <a href="http://eq.org.nz">eq.org.nz</a> - the Christchurch Recovery Map. I had previously worked on scaling a <a href="http://queenslandfloods.crowdmap.com/">map for the Queensland Floods</a> earlier in the year, so I became part of the team that worked on the site.</p> <p>We migrated the site from a temporary home on <a href="http://www.crowdmap.com/">Crowdmap</a> to dedicated servers, configured and tuned the stack supporting the site (PHP under fastcgi, MySQL, the <a href="http://www.ushahidi.com/">Ushahidi</a> software that powered the site), and monitored it while the site reached 100,000 visits in a week. A <a href="http://www.nzcs.org.nz/newsletter/article/94">more detailed list of what happened</a> was published in the NZCS Newsletter on the 18th of March.</p> <p>After the quake, the <a href="http://www.redcross.org.nz/">NZ Red Cross</a> site was very slow to respond, if it responded at all, to most requests. It was obviously under high load with much of the country wanting to donate, but it was clear that most people were not even getting through to the site.</p> <p>While we were working on eq.org.nz, we saw the Red Cross was having trouble, and the group made the decision to reach out to help. 
As a result, I was put in touch with Gerard Creamer at Netspace <a class="footnote-reference" href="#id3" id="id1">[1]</a>. He connected me to Fletch, who works for them as a sysadmin.</p> <p>The site was struggling. Fletch was preparing to move the site to a new server, having seemingly abandoned the existing one as a lost cause. I got access to it to see what was going on. Digging around, I found out why it was so slow. The apache powering the site was <em>woefully</em> misconfigured.</p> <p>The hardware the site was on was, I believed, more than capable of handling the load the site was under. Unfortunately, the software hadn't been configured in any kind of sensible fashion. I made the decision that time would be better spent fixing the software rather than moving to new hardware - which wouldn't have made a button of difference if the site was again misconfigured!</p> <p>After a pause while we had to reboot the server when it completely locked up, I got to work. I started by configuring the apache to allow many more connections and turned off keepalive, which was just an immediate fix to make the site a little more responsive.</p> <p>I then began work on putting nginx in front of the site. Nginx is the reverse proxy mentioned in the story. The idea is that it handles the connections between visitors' browsers and the website, and serves any static files (images/CSS/javascript), instead of the slower, more memory-hungry apache.</p> <p>However, while working on nginx, I noticed that at random intervals every 30 seconds to 10 minutes after apache was restarted, it would race to MaxChildren and the site would lock up. Very strange behaviour, which I initially put down to the high load. I figured nginx would fix it, so I kept an eye on it and restarted apache every time it locked up, while continuing to work on nginx.</p> <p>After a while, I got nginx in front of the apache. 
That made an immediate difference to the site performance, now that apache was only handling PHP requests <a class="footnote-reference" href="#id4" id="id2">[2]</a>. However, the random apache lockups continued.</p> <p>Arjen came on the scene somewhere around this point, and began looking at the database. Everything that Netspace claimed they did regarding database tuning, he did. Index optimisations, alterations to cache settings, etc. In fact, Arjen even <a href="http://www.brisbanephp.net/events/16760466/?eventId=16760466&amp;action=detail">gave a talk about what he did at the Brisbane PHP &amp; MySQL user group</a>.</p> <p>We began investigating the lockups in more detail. I've actually <a href="fixing-website-performance-issues-evidence">blogged about how they were caused and how we fixed them before</a> - &quot;the case of the crashing website&quot; was actually &quot;the case of the crashing NZ Red Cross website&quot;. Once we fixed it, success! The site was running smoothly.</p> </div> <div class="section" id="the-story-vs-the-facts"> <h1>The Story vs The Facts</h1> <p>Any mention in the story of Netspace having tuned software is dubious at best. I honestly can't remember whether I installed an opcode cache, whether they did, or whether there even was one. However, the reverse proxy setup and apache configuration was me. The site was, according to Fletch, running on the existing hardware that it always used to run on, so I'm not convinced it was ever &quot;put onto a new 64-bit Debian operating system&quot; - and that wouldn't have made any difference if it had.</p> <p>Regarding the MySQL - apart from moving it to a separate machine (which may not have even happened - all I know is that it was on a separate machine when I got there) - all of the tuning was done by Arjen. He was MySQL employee #25, and now runs <a href="http://openquery.com/">OpenQuery</a>, a MySQL consultancy. 
There are tickets in OpenQuery's issue tracking system where he detailed all of the work he did.</p> <p>The funniest part about the whole story is this:</p> <blockquote> Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world...</blockquote> <p>The solution was specified to handle? The solution was a horribly misconfigured apache, an unconfigured MySQL and no reverse proxy, as well as a broken search system (see <a href="fixing-website-performance-issues-evidence">the case of the crashing website</a>). The &quot;wall&quot; of traffic would only have had to have been 10 visitors at once to floor it.</p> <p>The solution Arjen and I put in place dealt with the wall with ease. When I logged out of the web server, load was 0.1, and it was using just 700M of ram - in comparison to the swapping-like-crazy, constantly-locking-up mess it was in (aka: load of 20+, all 8G of ram + all swap used).</p> <p>Gerard sent me an e-mail afterwards:</p> <pre class="literal-block"> Hi Nigel, Thank you so much for your help with Red Cross - we really appreciate it, and so do the folks at Red Cross. Your efforts have enabled Red Cross to accept over 7500 donations since noon yesterday - over $800,000 going directly to where it's needed. And that number will continue to climb. You seem to still have 10 open sessions on the redxprdww02 server - let me know if you still need them and I can kill them off if you're done. If there are any expenses or costs please don't hesitate to invoice them to us. Netspace Services Limited PO Box 404 Palmerston North New Zealand Thank you again for all of your help. 
Regards, Gerard </pre> </div> <div class="section" id="q-a"> <h1>Q &amp; A</h1> <div class="section" id="q-nigel-isn-t-this-sour-grapes"> <h2>Q: Nigel, isn't this sour grapes?</h2> <p>I don't think so. There's no mention of us in the article. People will get the impression that it's Netspace that tuned the site if I don't say anything.</p> <p>I wouldn't have had a problem with it if the article had said that Netspace accepted the help of the team that worked on eq.org.nz, who did [insert all the configuration stuff here].</p> <p>ComputerWorld is a widely read publication, and it doesn't seem right to let such inaccuracies go without correction.</p> </div> <div class="section" id="q-what-about-the-other-stuff-in-the-article"> <h2>Q: What about the other stuff in the article?</h2> <p>We only ever dealt with the website. Everything else wasn't us, and I know nothing about it.</p> <p>I never actually dealt with Charles Ranby, the Red Cross IT Manager. I wonder how much he knows.</p> </div> <div class="section" id="q-so-what-do-you-think-of-netspace"> <h2>Q: So what do you think of Netspace?</h2> <p>I hope they simply forgot to mention they had so much help.</p> <p>Fletch was good to work with - responsive and helpful.</p> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>I was actually connected to him at <a href="http://face.co.nz/">Face</a>, but Face and Netspace seem to be interchangeable - see how <a href="http://face.co.nz/our-work">Face claims Red Cross New Zealand as their client</a>, at the bottom of the page.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Unfortunately though, not as much as it could have, because 
the CMS they're using serves images through PHP scripts.</td></tr> </tbody> </table> </div> </div> Fixing Website Performance Issues V: Server Resources http://nigel.mcnie.name/blog/fixing-website-performance-issues-server-resources <p><em>This is the fifth post in a series on fixing website performance issues. See <a href="fixing-website-performance-issues-hardware">here</a> for an index of all posts in the series. Previously: <a href="fixing-website-performance-issues-page-request-theory">Page Request Theory</a>.</em></p> <div class="section" id="server-resources"> <h1>Server Resources</h1> <p><a href="fixing-website-performance-issues-page-request-theory">Last time</a>, we walked through what happens when a visitor requests a page from your website, at a high level. Now we are going to &quot;zoom in&quot; on the server side, and examine what resources are required to handle the page request on the server <a class="footnote-reference" href="#id7" id="id1">[1]</a>.</p> <p>Knowing how resources are used is useful, because many performance issues boil down to an overuse of one or more of them, and fixing the cause of the overuse is how you fix the issue.</p> <p>We saw last time what happens when someone visits a page on your website:</p> <ul class="simple"> <li>Around 6 connections will be made, all of which will be involved with serving the page until the visitor's browser has downloaded all the resources required - which could take a few seconds.</li> <li>Generally, at least one of the connections will result in a dynamic page needing to be served. Even more if you're using AJAX.</li> <li>After the page is downloaded, the connections may be kept open, reserved for the visitor, if keepalive is on.</li> </ul> <p>So, what resources does the server need to handle this? 
Let's go through each step of a request and list them all.</p> <div class="section" id="step-1-visitor-starts-a-request"> <h2>Step 1: Visitor starts a request</h2> <p>To handle a request, a server needs a connection. To handle the entire page, ideally the server needs six free connections. Connections are finite, and it is definitely possible to run out of them.</p> <p>Remember that initially, only one connection is required. If the visitor is asking for a web page or some other resource that needs more resources to be correctly rendered, the browser will try to open more connections for those.</p> <p>Once the connection is established, the visitor will send their request. This can actually be reasonably slow, for a few reasons:</p> <ul class="simple"> <li>On slow connections, or connections with asymmetric bandwidth (e.g. a 2M down/128K up ADSL connection), the headers simply cannot be sent very quickly <a class="footnote-reference" href="#id8" id="id2">[2]</a>;</li> <li>HTTP headers are not sent compressed; and</li> <li>If a request involves cookies, the request size could potentially be up to a few kilobytes (larger than some pages and many small images!).</li> </ul> <p>During this time, apart from holding the connections open, the server doesn't have much to do <a class="footnote-reference" href="#id9" id="id3">[3]</a>.</p> </div> <div class="section" id="step-2-server-inteprets-the-request"> <h2>Step 2: Server interprets the request</h2> <p>For each request, the server needs to decide what to do. This should hopefully be quick, although inefficient server implementations or configuration can slow this down.</p> <p>For example, <tt class="docutils literal"><span class="pre">mod_rewrite</span></tt> can cause slowdowns, if the various rewrite rules are inefficiently written or targeted. 
If you're checking every single request to see if it matches a certain regular expression, or whether it should be modified to be a request for some other resource, this all adds up.</p> <p>There's always going to have to be some processing done, however. We all like friendly URLs (and in particular, Google does). You just have to be careful that you're not causing more work than necessary.</p> <div class="section" id="requests-for-dynamic-resources-e-g-php-pages"> <h3>Requests for dynamic resources (e.g. PHP pages)</h3> <p>For these requests, the code needs to be run. This is easily the most expensive part of the entire request process:</p> <ul> <li><p class="first">First, an interpreter needs to be ready to run the script. Normally, this has been done by the web server before any requests arrive, but if the server has no free processes available to handle the request, it may have to start a new one. This can take CPU and RAM to accomplish, and any interpreters will generally cost at least a few MB of RAM just to be ready for processing.</p> </li> <li><p class="first">Then the code runs. This takes the most resources - CPU, RAM, network bandwidth. It also can cause disk reads and writes, both of which can be very slow.</p> <p>While the code is executing, it will want as much CPU as it can get. However, there will be times when it needs to wait for something in order to continue - for example, when it has issued a query to a database. During those times, it won't consume much CPU, although it will still hold any RAM it requires.</p> <p>The RAM requirements will depend largely on how the code is written and what it's doing. One way of solving a problem may involve grabbing some data, processing it, then grabbing some more until all the data is gone. Another way may be to grab all of the data into RAM at once, then process it all (contrast SAX vs DOM parsing for XML). 
In general, most PHP based applications I've worked on need 20MB or less of RAM for the average page request <a class="footnote-reference" href="#id10" id="id4">[4]</a>.</p> <p>Network bandwidth is generally only used between the web server and any services running on other servers (e.g. database server). This generally isn't a problem unless you're hauling massive datasets into your script.</p> <p>IO bandwidth is, depending on the system, one of the rarest resources you have. In particular, safely writing to disk is likely to be the slowest operation you normally do <a class="footnote-reference" href="#id11" id="id5">[5]</a>. Reducing the number of database/filesystem writes you do is a great way to improve performance, as is making sure you only read the information you need to.</p> </li> </ul> <p>Processing dynamic requests is almost certainly the most intensive part of serving a page. Therefore, it's an area ripe for performance optimisations to be made (though <a href="fixing-website-performance-issues-evidence">don&#39;t rush in until you know it&#39;s the real problem</a>).</p> </div> <div class="section" id="requests-for-static-resources-e-g-css-files"> <h3>Requests for static resources (e.g. CSS files)</h3> <p>For static requests it's easier. The file just has to be read. The kernel may have already cached the file contents in RAM, making it very quick.</p> <p>Seriously, it's that simple. Because it's such a different task to handling the dynamic requests, it makes sense to make sure these requests aren't handled in the same way. Splitting how they're handled is often at the heart of initial performance bottlenecks, which is something I'll blog about in future.</p> </div> </div> <div class="section" id="step-3-the-response-is-sent-back"> <h2>Step 3: The response is sent back</h2> <p>This can be the slowest part, as it's affected by the download speed of the visitor. 
This holds the connection open, so if you're sending back lots of data, remember that the connection is in use all that time! Fortunately, it's not very CPU-intensive.</p> <p>Finally, the request/response cycle is done; the connection can close or be left open based on keepalive.</p> </div> <div class="section" id="improving-performance"> <h2>Improving Performance</h2> <p>We've seen what resources can be consumed while serving a web page. Knowing what is consumed, and when it might be consumed, is good background information when you go looking to improve the performance of a website.</p> <p>For example, if a site is slow, you now understand that it could be caused by the overuse of one or more of the resources mentioned. It could be that all RAM is used, causing the machine to swap; too many disk reads/writes causing slow disk operations; or perhaps the machine is simply using all CPU flat out.</p> <p>However, it could be none of these reasons. Sites can also be slow for reasons that examining the server wouldn't reveal. 
For example, missing Expires headers or lack of gzip on page components can cause sites downloaded over slower connections to crawl.</p> <p>Next time, we'll look at the ways in which you can inspect a server &amp; website to find performance issues <a class="footnote-reference" href="#id12" id="id6">[6]</a>.</p> <table class="docutils footnote" frame="void" id="id7" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>I'm aware that last time I said the next post would be about what happens when many people visit your site, but I decided we need a little more background before that post will make sense.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id8" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>You may not think this matters much, but remember that many requests might be sent in parallel, which - combined with large request sizes and a lack of compression - can add up. Not to mention two people using the internet on the same connection...</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id9" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>SSL introduces some complexity, and of course a little RAM is needed to hold the request data, but ignoring these there's very little happening.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id10" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>However, there are some cases where a script simply needs more. 
<a href="http://qbnz.com/highlighter">GeSHi</a> needs to store the HTML for syntax highlighting source code in RAM, and if the input script is very large, the HTML for highlighting is a whole bunch larger again. It all depends on what you're doing.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id11" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id5">[5]</a></td><td>If your script makes network requests to remote hosts, that'll likely be slower again, although you wouldn't normally do this in the context of serving a web page precisely for this reason.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id12" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id6">[6]</a></td><td>This is perhaps getting a little off the topic of showing how most websites on the 'net can be served off a $40 VPS, like <a href="fixing-website-performance-issues-page-request-theory">I promised I&#39;d do</a>, but I promise we'll get there in the end!</td></tr> </tbody> </table> </div> </div> Fixing Website Performance Issues IV: Page Request Theory http://nigel.mcnie.name/blog/fixing-website-performance-issues-page-request-theory <p><em>This is the fourth post in a series on fixing website performance issues. See <a href="fixing-website-performance-issues-hardware">here</a> for an index of all posts in the series. Previously: <a href="fixing-website-performance-issues-whack-a-mole">Whack-a-mole</a>. Next: <a href="fixing-website-performance-issues-server-resources">Server Resources</a>.</em></p> <div class="section" id="page-request-theory"> <h1>Page Request Theory</h1> <p>I previously made a claim that <a href="fixing-website-performance-issues-hardware">99% of sites on the internet can be served by a $40/mo VPS with 500M of RAM</a>. 
Now I'm going to back that up.</p> <p>It will take a few posts. The first two are theory. In this one, we'll examine what happens when someone requests a page from a website. The next will show what happens when <em>lots</em> of people request pages.</p> <p>Then we'll move to practice. The third will dissect a typical LAMP application, and analyse what goes wrong when many visitors are using it. And for the finale, we'll examine how we can remove the bottlenecks we found to improve performance &amp; scalability.</p> </div> <div class="section" id="what-happens-when-a-page-is-requested"> <h1>What Happens When A Page Is Requested?</h1> <p>I promise I'll keep this readable (at the expense of a few details, so experts - please no mocking kthx).</p> <div class="section" id="step-1-visitor-starts-a-request"> <h2>Step 1: Visitor starts a request</h2> <p>A visitor chooses to visit a page on your site, by clicking on a link or bookmark, typing the page address into their browser, etc. However they choose to do this, the result is that they are now requesting some resource (a page, image, video or whatever) from you.</p> <p>Their browser opens a connection to your server. The inner workings of this aren't important for now, but the end result is one connection is now established between them and you.</p> </div> <div class="section" id="step-2-visitor-asks-for-something"> <h2>Step 2: Visitor asks for something</h2> <p>The visitor then sends information about what they want, and how they want it. These instructions are known as <strong>the request</strong>.</p> <p>The request is a list of instructions that says something like &quot;I want resource <tt class="docutils literal"><span class="pre">yoursite.com/photos/index.php</span></tt>&quot;, and then a bunch of info about how they'd like it. 
&quot;I could handle it if you sent it back <tt class="docutils literal"><span class="pre">gzip</span></tt> compressed; I don't mind if you send back an older one; you sent me a cookie that hasn't expired so here it is again&quot;...</p> <p>Think of the request as a list of demands. The server will try and fulfill these demands.</p> </div> <div class="section" id="step-3-server-responds"> <h2>Step 3: Server responds</h2> <p>Your server receives the list of instructions (request), and with it, works out how to respond.</p> <p>If the requested page was 'index.php', your server will probably be configured to run PHP over the <tt class="docutils literal"><span class="pre">index.php</span></tt> file, and return the output of the PHP script. If the requested page was actually an image or some other file, your server will probably just send it back as-is.</p> <p>Actually, what is done for each request is totally up to you. You could tell the server to send back <tt class="docutils literal"><span class="pre">/home/you/bananaphone.mp3</span></tt> for any request if you really wanted. While this might be annoying for the visitor, there's nothing really stopping the server doing it, other than that a server <em>should</em> be trying its best to fulfill the demands made by the visitor.</p> <p>Then your server sends the data - which is known as <strong>the response</strong> - back, over the same connection. Depending on how much data it is, and on the download speed of the visitor, this could take a while!</p> </div> <div class="section" id="step-4-server-finishes-request-maybe"> <h2>Step 4: Server finishes request (maybe!)</h2> <p>Finally, all the data is sent. The connection <em>may</em> be closed, depending on whether a feature of connections called Keepalive is on <a class="footnote-reference" href="#id3" id="id1">[1]</a>. 
If the visitor's browser said (in the request instructions) that they support keepalive, and the server is configured to allow it, then the connection will be kept open. You'll see why this can be a good thing shortly.</p> </div> <div class="section" id="summary-so-far"> <h2>Summary (so far)</h2> <p>Does it make sense? Connection opened, instructions sent to server, server responds, connection (possibly) closed. This is how any individual resource is requested from a web server.</p> <p>The story isn't quite finished however. Most pages are made up of <em>many</em> resources - images, stylesheets, javascript etc. Where do they fit in?</p> </div> <div class="section" id="step-5-visitor-processes-response"> <h2>Step 5: Visitor processes response</h2> <p>If the response is an HTML page, it can contain references to other resources.</p> <p>As the browser parses the page, it will find these references. As it does, it will make requests to the server for them.</p> <p>If Keepalive was not enabled, it will start at step 1 again for the new resource - by making a new connection.</p> <p>However, if Keepalive <em>was</em> enabled and the previous connection is still open, the browser will jump straight to step 2, re-using the existing connection.</p> <p>This is clever, because the process of starting a request can take a decent amount of time. When a page has many resources, that would otherwise mean many connections needed to be made, each incurring the delay.</p> <p>There's one last detail to know for now. Browsers don't normally just open one connection to a server. They'll open between two and 10 at once! 
<a class="footnote-reference" href="#id4" id="id2">[2]</a> This is because they can open the connections at the start, and then feed the requests for other resources into them immediately when they're spotted in the HTML.</p> </div> <div class="section" id="step-6-keepalive-d-connections-close"> <h2>Step 6: Keepalive'd connections close</h2> <p>Connections that are using Keepalive have a timeout. If the browser makes no new requests for a certain amount of time (chosen by the server), then the server will close the connection.</p> <p>If this time is quite long - e.g. 60 seconds - then if a visitor clicks on a new page within a minute of loading the last page, they'll again re-use the connection. Cool!</p> </div> </div> <div class="section" id="example-of-a-page-request"> <h1>Example Of A Page Request</h1> <p>So what does requesting a page look like, at a high level?</p> <p>Here's a screenshot of Google Chrome's network inspector, when loading <a class="reference external" href="http://nigel.mcnie.name/">http://nigel.mcnie.name/</a> .</p> <div align="center" class="figure"> <a class="reference external image-reference" href="http://f.dollyfish.net.nz/d01cea"><img alt="Screenshot of inspector. 17 requests, beginning with the page itself, then CSS, then images" src="http://f.dollyfish.net.nz/d01cea" style="width: 600px;" /></a> <p class="caption">Notice how the first request is on its own, then the stylesheet, then how a bunch of image requests begin simultaneously.</p> </div> <p>In the image, the bars represent the time when the resource was being requested and received. For each bar, the transparent bit on the left is (roughly) the time spent requesting the resource, and the solid bit is the time spent downloading the response.</p> <p>The first request was for the HTML page, and starts at time zero. 
See how while the request is being made, no other requests are in progress.</p> <p>Then in the response part of the first request, see how the next request starts (for <tt class="docutils literal"><span class="pre">style.css</span></tt>). This request is significantly faster (the bar is shorter), as Keepalive is enabled and no new connection has to be made. This request was started as soon as Chrome knew it needed <tt class="docutils literal"><span class="pre">style.css</span></tt> - which is part way through parsing the HTML returned by the first request.</p> <p>The next two requests come &quot;from cache&quot; - that is, the browser already had a copy of them and doesn't need to download them again. We can ignore those.</p> <p>Then, 11 requests begin for images. The last one is from google analytics, which we'll ignore for now.</p> <p>You can't actually see it in the picture, but what actually happens is the browser opens five more connections. The first image (<tt class="docutils literal"><span class="pre">html_repeat.png</span></tt>) is requested using the first connection we already had open, then the next five are requested using new connections. The last four images are requested as the first images complete downloading. They re-use the connections, saving time.</p> <p>The last requests (for the google analytics <tt class="docutils literal"><span class="pre">__utm.gif</span></tt> and the one from <tt class="docutils literal"><span class="pre">collect.clooso.org</span></tt>), are actually triggered from javascript, which is why they're so late. The browser will open new connections for those too, because the site they're getting those resources from is different. 
So it's quite possible for a browser to have dozens of connections open at once, although it will only tend to open about six to each individual server.</p> <div class="section" id="try-it-yourself"> <h2>Try It Yourself</h2> <p>If you're using Chrome, hit <tt class="docutils literal"><span class="pre">Ctrl+Shift+i</span></tt>. You'll see the Chrome Inspector. Click on the network tab, then reload this page. You'll see the requests come in, and even better, you can hover over the bars to get more information about what they're made up of.</p> <p>You can also try it in Firefox, if you have installed the <a href="http://getfirebug.com/">Firebug extension</a>.</p> <p>If you're using IE, <a href="http://firefox.com">here&#39;s a download link for Firefox</a>. You're welcome.</p> </div> </div> <div class="section" id="summary"> <h1>Summary</h1> <p>Requesting a page is quite an involved process, but I hope I've managed to explain it in reasonably clear terms. You should now know:</p> <ul class="simple"> <li>How many connections are initially opened by a browser</li> <li>What information a browser sends to the server</li> <li>What kind of things a server could send back</li> <li>How many connections a browser may end up opening</li> <li>Why some connections may stay open even after a browser has received a response</li> </ul> <p>Feel free to ask any questions in the comments. If you can think of a way this post could be clearer, let me know as I'm happy to update it.</p> <p>Also, a few people have been asking if I'm going to cover &lt;topic X&gt; in this series on performance. 
If there's something you would like to hear about, let me know!</p> <p><em>Next post: <a href="fixing-website-performance-issues-server-resources">Server Resources</a></em></p> <table class="docutils footnote" frame="void" id="id3" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>This is a feature of <em>HTTP</em> connections, not of all connections. My summary simplifies or omits a number of details that you don't need in order to grasp the basic concepts of how pages are requested.</td></tr> </tbody> </table> <table class="docutils footnote" frame="void" id="id4" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>The actual number depends on the browser, and what the server will let the browser get away with. Typically, 6 is a common number.</td></tr> </tbody> </table> </div> Fixing Website Performance Issues III: Whack-a-mole http://nigel.mcnie.name/blog/fixing-website-performance-issues-whack-a-mole <p><em>This is the third post in a series on fixing website performance issues. See <a href="fixing-website-performance-issues-hardware">here</a> for an index of all posts in the series. Previously: <a href="fixing-website-performance-issues-evidence">Follow The Evidence</a>. Next: <a href="fixing-website-performance-issues-page-request-theory">Page Request Theory</a>.</em></p> <div class="section" id="whack-a-mole"> <h1>Whack-a-mole</h1> <p>Fixing performance issues is a game of whack-a-mole.</p> <p>For anyone not familiar with this game, the idea is that moles poke their heads out of holes, and you severely beat them with a hammer until they go back in. Bad for the moles, great for your morale. If you like moles, pretend we're playing whack-a-boss instead.</p> <p>Why is it whack-a-mole? 
Because there are many steps in the chain of serving a website, and inevitably, one of them will be a bottleneck before the others.</p> <p>For example, a system may have plenty of CPU and RAM, but if apache is configured with default settings, it will only take a few concurrent users to max out the number of connections it can have open. The server will idle along just fine, while the website will effectively be unavailable.</p> <p>Alternatively, the system could be tweaked to perfection, but some poor javascript could cause the user's browser to stutter and hang.</p> </div> <div class="section" id="example-fixing-ushahidi-map-performance"> <h1>Example: Fixing Ushahidi map performance</h1> <p>As part of my work on <a href="http://eq.org.nz">eq.org.nz</a> - an instance of <a href="http://www.ushahidi.com/">Ushahidi</a> set up to help with the Christchurch Earthquake, I was told that people were finding navigating the map to be slow. A quick play with it revealed that if you zoomed in and out, it could take up to five seconds to recalculate and display the numbered red circles that indicated reports for the new zoom level. And that was on my broadband connection, let alone the poor 3g connections some of the people in Christchurch would have had. Ouch.</p> <div class="section" id="gathering-evidence"> <h2>Gathering Evidence</h2> <p>The first step was to identify the biggest problem. What happens when you zoom? Chrome's inspector revealed an AJAX request to <tt class="docutils literal"><span class="pre">/json/cluster</span></tt> that would take at least a second, and sometimes up to four seconds. Well that obviously wasn't good, but at least I thought I had a pretty good idea of why the map was so slow.</p> <p>PHP scripts, correctly configured on modern hardware, shouldn't take long to process. 
The server was doing FastCGI and had APC, and <a href="http://twitter.com/arjenlentz">Arjen Lentz</a> had configured the MySQL database <a class="footnote-reference" href="#id2" id="id1">[1]</a>. I was seeing processing times of 100-250ms for most scripts, which seemed reasonable in the environment. The requests to <tt class="docutils literal"><span class="pre">/json/cluster</span></tt> were taking more like 1000-4000ms. Plenty of room for improvement!</p> <p>After verifying the problem occurred on the staging server too (I didn't have a dev setup), the first step was to add timing information to the action serving that URL (Ushahidi is built on <a href="http://www.kohanaphp.com/">Kohana</a>, so it's MVC). After every &quot;paragraph&quot; of code, I added a debugging line that output the accumulated time since the start of the action. The output looked something like this:</p> <pre class="literal-block">
[Tue Mar 01 04:05:04 2011] CLUSTER BEGIN, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] init: 0.000 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] category: 0.000 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after incident sql: 0.002 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after incident geometries: 0.078 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after incident_categories: 0.078 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after locations: 0.083 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after creating markers: 0.084 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after clustering: 0.161 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after json for clusters: 0.167 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] after json for singles: 0.208 s, referer: http://dev.eq.org.nz/
[Tue Mar 01 04:05:04 2011] done: 0.208s, referer: http://dev.eq.org.nz/
</pre> <p>It's a bit messy, but you get the idea. 
The times in this example are from near the end, when I'd done most of the work.</p> <p>The nice thing about this format is that it's really easy to see where the big jumps are. For example, the difference between &quot;after incident sql&quot; and &quot;after incident geometries&quot; is about 75ms - an area to investigate. The same again for &quot;after clustering&quot;, and then for &quot;after json for singles&quot;.</p> </div> <div class="section" id="fixing-the-code"> <h2>Fixing The Code</h2> <div class="section" id="round-1-the-sqloop"> <h3>Round 1 - the Sqloop</h3> <p>I then picked the first major slowdown, and analysed the code further. What I found was an Sqloop. Simplified, it looked like this:</p> <pre class="literal-block">
foreach ($incidents as $incident) {
    $geometry_data = $db-&gt;query('SELECT ... FROM geometry WHERE incident_id = XXX');
    // Do stuff with data
}
</pre> <p>This is evil. As the number of incidents grows, so does the number of queries. We had over 1,000 incidents on production. So that was 1,000 queries for each time anybody zoomed the map. MySQL's query cache was probably helping to mask the effects of this, but issuing queries still has a non-zero cost.</p> <p>The fix is to drag the SQL outside the loop. Get the data once, without the WHERE clause, and do the WHERE in code. It costs a little more RAM, but it's generally not a problem (and you can grab the data in chunks if you really want).</p> <p><a href="https://github.com/ccnz/Ushahidi_Web/commit/82e4ef3886ce93ff8f6e5b1fc3c03a7813df9677">Here&#39;s the patch that I produced</a>. You can see that it actually has little effect on the existing code. Instead of doing the SQL, it calls a function which returns exactly the same data that the SQL would have returned. 
That method will do a query to get data for all incidents the first time it's called, then cache it in <tt class="docutils literal"><span class="pre">self::$geometry_data</span></tt> for future calls.</p> <p>I like this because it has little effect on the existing code. Because I'm not familiar with what the code should do on a high level, I'm happy with simply making sure it behaves exactly as it did before. An Ushahidi developer may have a better idea for optimising this, however.</p> </div> <div class="section" id="round-2-iterating"> <h3>Round 2 - Iterating</h3> <p>A little further down, some code involving the ORM was running pretty slowly, especially given it seemed to simply be a wrapper around one query:</p> <pre class="literal-block">
if (count($location_ids) &gt; 0) {
    $locations_result = ORM::factory('location')-&gt;in('id',implode(',',$location_ids))-&gt;find_all();
} else {
    $locations_result = ORM::factory('location')-&gt;find_all();
}
$locations = array();
foreach ($locations_result as $loc) {
    $locations[$loc-&gt;id]['lat'] = $loc-&gt;latitude;
    $locations[$loc-&gt;id]['lon'] = $loc-&gt;longitude;
}
</pre> <p>This segment of code was taking 270ms!</p> <p>My first instinct was that the first half would be the slow bit. <tt class="docutils literal"><span class="pre">ORM::factory('location')-&gt;</span></tt>.. blah blah blah that must be slow. But splitting the timing of the two segments revealed that it was actually the second half that was crawling. What could be wrong? It's a simple foreach loop - could there be millions of iterations? Nope, <tt class="docutils literal"><span class="pre">count($locations_result)</span></tt> said 510.</p> <p>The only answer left was that whatever was being iterated was bad at it. It turns out the location result was not a simple array, but some kind of Iterable ORM object. 
I don't really know much about them, other than if they're that bad at being iterated over, we'd better not use them ;).</p> <p>I patched this by <a href="https://github.com/ccnz/Ushahidi_Web/commit/91b8f185c63444fe06b063929e161a4ee12a0847">doing a straight query rather than using the ORM</a>. Again, there are probably other ways, but in this case, this worked fine. The resulting iteration on the raw array was an order of magnitude faster.</p> </div> <div class="section" id="round-3-json-generation"> <h3>Round 3 - JSON generation</h3> <p>I found several instances in the code of <a href="https://github.com/ccnz/Ushahidi_Web/blob/7ccfb17210f8da7fcfdf467be4c3908af21d17d8/application/controllers/json.php#L396">JSON generation by string building</a>.</p> <p>Not only is this code copy/pasted several times throughout the file (leaving it open to bugs when someone fixes one part and not the others), there's an inbuilt PHP function for doing this: <a href="http://php.net/json_encode">json_encode</a>.</p> <p>While the json generation was definitely a slow point, actually refactoring all of that code to use <tt class="docutils literal"><span class="pre">json_encode</span></tt> would have taken more time than I had available, so I ignored it. Hopefully it'll get fixed upstream some time.</p> </div> <div class="section" id="round-4-caching"> <h3>Round 4 - Caching</h3> <p>The last thing I decided to do was implement a bit of caching.</p> <p>Caching is generally the <strong>last</strong> thing you should do. If you cache before you have dealt with any other bottlenecks, you'll mask their effects, but not kill them completely. This can come back and badly bite you later.</p> <p>For example, caching the JSON generation earlier would have meant we may never have spotted the wasteful sqloop, and the problem would have simply been suppressed until the site was large and active - not really a time when you want to be debugging performance issues! 
Also, the caching would have meant the problem would have been &quot;intermittent&quot; - with the poor users who get the uncached pages suffering while the others notice nothing wrong.</p> <p>When you cache, you're looking to eliminate as much work as you can, preferably without ever serving stale data.</p> <p>One option is sending expires headers. This is a poor choice in this case, because the same URL can send back different data (e.g. when a new incident is added). It also gives no benefit as more visitors hit the site. They all have to request the data anyway.</p> <p>In this case, it's better to cache on the server, in the action itself. The reasons for this are that we can cache once for all users, and we can generate a sensible cache key - which means we can avoid ever serving stale data.</p> <p>The simplest caches are key/value stores. You put a value in the store (e.g. the JSON that the action generated), and the key you use to get it is something that you can quickly work out the next time you want the data back (e.g. the GET parameters to the script).</p> <p>The trick is to generate the cache keys in such a way that if the underlying data changes, the key will change. Sometimes this means you can't just use the input parameters to the script. <a href="https://github.com/ccnz/Ushahidi_Web/commit/7ccfb17210f8da7fcfdf467be4c3908af21d17d8#diff-0">Here&#39;s my patch</a>. The key is made up of the GET parameters, and the IDs of all of the incidents. It's then <tt class="docutils literal"><span class="pre">md5</span></tt>'d to make sure it only has known characters in it, and used to access a file on disk.</p> <p>There are a few side effects of this caching implementation:</p> <ol class="arabic simple"> <li>As incidents are added or removed, all of the entries in the cache will effectively be worthless. 
But that's fine, there are many more users of the map than there are incidents being added, so we'll get benefits in most cases.</li> <li>If an incident changes (e.g. its latitude/longitude), the cache key will NOT change, which means the incident won't move on the map.</li> <li>The amount of space in <tt class="docutils literal"><span class="pre">/tmp</span></tt> that will be used is uncertain and should be monitored.</li> </ol> <p>I suggest a cache cleaning implementation in the commit message that will handle most of these issues. A cronjob that looks for files that haven't been changed in the last 120 minutes and removes them will clean out the old cache files as the incidents change, and the time (120 minutes) gives a rudimentary knob with which we can manage the cache size. The only thing not handled is if incidents change. However, the usual usage pattern for an Ushahidi instance is for events to be constantly added, and the 120 minute limit puts an upper bound on how long the information will be wrong for anyway.</p> </div> </div> <div class="section" id="results"> <h2>Results</h2> <p>This optimised the script down to a far more sensible runtime, normally between 70 and 250ms.</p> <p>This did prove to make a big difference to the perceived responsiveness of the map, but it wasn't a full fix. I found afterwards that the client-side map code (OpenLayers) seems to wait about 1.5 seconds after you finish a zoom operation (e.g. using the mousewheel) before bothering to fire the event which does the AJAX request.</p> <p>Though I looked around and asked in their IRC channel, there doesn't seem to be an event that can be hooked into earlier. I tried using other events and data in the JS, but couldn't come up with anything that would let the request be sent earlier.</p> <p>It's unfortunate, because while the map now performs &quot;acceptably&quot;, it's not fast. Fixing the map event firing problem is the next &quot;mole&quot;. 
Although there are other wins to be had, like the <tt class="docutils literal"><span class="pre">json_encode</span></tt> fix, it won't get nearly the benefits that fixing the map will.</p> <p>This finding points out the value of <em>checking your results</em>. The end goal was that the perceived performance of the map was faster. While all of the fixes I made were valuable improvements, they didn't fully fix the problem, and I couldn't really call &quot;mission accomplished&quot;. Although in this case I did, as everyone agreed that chasing down javascript mapping bugs wasn't going to have a major benefit compared to the time involved.</p> <p><em>Next post: <a href="fixing-website-performance-issues-page-request-theory">Page Request Theory</a></em></p> <table class="docutils footnote" frame="void" id="id2" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>I'm not entirely sure of everything he did, but when MySQL employee #25 configures your database, it's not going to be the bottleneck :)</td></tr> </tbody> </table> </div> </div> Fixing Website Performance Issues II: Evidence http://nigel.mcnie.name/blog/fixing-website-performance-issues-evidence <p><em>This is the second post in a series on fixing website performance issues. See <a href="fixing-website-performance-issues-hardware">here</a> for an index of all posts in the series. Previously: <a href="fixing-website-performance-issues-hardware">Hardware</a>. Next: <a href="fixing-website-performance-issues-whack-a-mole">Whack-a-mole</a>.</em></p> <div class="section" id="follow-the-evidence"> <h1>Follow The Evidence</h1> <p>In order to fix performance problems, you need evidence, and you need to understand what the evidence tells you. 
This is harder than it sounds, but an extremely valuable skill for a developer or sysadmin to have, if you can train yourself.</p> <p>I've seen people with pet solutions to performance issues before:</p> <ul class="simple"> <li>&quot;I just crank Yslow and get to 100. Works every time.&quot;</li> <li>&quot;Performance issues? Load balancing is the key!&quot;</li> <li>&quot;Tuning MySQL's query cache can give huge performance gains&quot;</li> <li>&quot;We should rewrite using MongoDB. It's web scale!&quot;</li> </ul> <p>Such suggestions make sense - <strong>if the evidence backs them</strong>. How is Yslow going to help if the problem is a lack of server children? How will tuning MySQL's cache help if the problem is a saturated network link in the cluster? And I think there's any number of things you should do <a href="fixing-website-performance-issues-hardware">before giving up and throwing more hardware at a problem</a>.</p> <p>Of course, a high Yslow score and correct configuration of server software are worth doing, but they're not the first thing you should do. You should gather information, then act on what it tells you.</p> </div> <div class="section" id="scenario-the-case-of-the-crashing-website"> <h1>Scenario: the case of the crashing website</h1> <p>A site I worked on recently was basically unresponsive. Some requests would respond after 30-60 seconds, others would just time out. Logging into the server, I had a look around. <tt class="docutils literal"><span class="pre">top</span></tt> showed that the machine was deep into swap (not really surprising, slow sites often have one machine deep into swap), but also that there were some apache processes that were <strong>1.3G</strong> resident.</p> <p>Whoa - how does that happen? The site is a PHP/MySQL application. Typically, you'll see 10-20M apache children. 
This is a very interesting piece of evidence, and given the machine had 8G of RAM, we can't really call the performance problems solved until we can stop the apaches blowing out to such a huge size.</p> <p>Unfortunately, at this point, the server locked up completely and had to be restarted. After the restart, the site worked - albeit slowly - for a few minutes. Then I observed a bunch of load, apache raced to MaxChildren processes (which I observed was set way too low), and the whole site hung.</p> <p>Restarting apache led to the same behaviour, repeatably. The site would work for a random amount of time from 30 seconds to almost 10 minutes, then race to MaxChildren and hang. Given the site needed to stay up, I didn't have the luxury of examining the 'hang' deeply, but in hindsight that would have been a worthwhile exercise.</p> <p>What could cause such behaviour? Maybe you can guess by now. Sadly, I didn't. I decided the first thing to do was to configure apache properly. While this was almost certainly the cause of the slowness while the site was running, it wasn't the most critical problem. But, I reasoned, it was easy to do, and might help.</p> <p>It took about an hour of configuring apache, then putting nginx in front of it, to realise that the problem wasn't going away. The site would now perform quickly for 30 seconds to 10 minutes, then die. I needed more evidence.</p> <p>Eventually, prodding about with <tt class="docutils literal"><span class="pre">jnettop</span></tt> revealed that at about the time the site failed, there was a phenomenal amount of traffic being done transferring data from the database server to the web server. Belatedly, it dawned on me - rogue script. 
Some page, that wasn't being hit too often, was causing all the trouble.</p> <p><tt class="docutils literal"><span class="pre">jnettop</span></tt> gave me a port number, <tt class="docutils literal"><span class="pre">lsof</span></tt> matched that port to a process, and apache's <tt class="docutils literal"><span class="pre">server-status</span></tt> mapped that process to a script. search.php! After removing the search boxes from the site, it remained stable.</p> <p>Follow-up investigation showed that a bad join in search.php was resulting in a query asking for a cartesian product of two rather large tables in the search script, and due to how the data was being collected it wasn't hard to imagine how this would eventually cause 1.3G resident apache processes <a class="footnote-reference" href="#id2" id="id1">[1]</a>.</p> </div> <div class="section" id="the-lesson-learned"> <h1>The Lesson Learned</h1> <p>I know this already, but continually have to re-learn it. <strong>Follow the evidence</strong>. In this case, the evidence was the 1.3G processes and the irregular times between site hangs.</p> <p>Irregular can (and did in this case) indicate &quot;something a site visitor did&quot;. There's also &quot;regular&quot; which generally would mean a scheduled task, and &quot;immediately&quot;, which could say any number of things, the simplest of which is &quot;the site is receiving too much traffic to handle!&quot;.</p> <p>The consequence of this was an extra hour of intermittent outages for the site. This could be costly. 
If I'd worked out the problem earlier, I could have stopped the crashing quickly, then fixed the sluggish response time soon afterwards.</p> <p><em>Next post: <a href="fixing-website-performance-issues-whack-a-mole">Whack-a-mole</a></em></p> <table class="docutils footnote" frame="void" id="id2" rules="none"> <colgroup><col class="label" /><col /></colgroup> <tbody valign="top"> <tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>It just occurred to me while writing this that I never found a PHP fatal error for a script running out of memory, nor did I check what the limit was. Of course, I wasn't thinking it was a script problem until late in the process. Also, if an apache process had made it to 1.3G, it's possible the memory limit wasn't in play for some reason. Not that it matters - if you know a rogue script is causing the problem, disable it first and analyse it second - if you even want to keep it!</td></tr> </tbody> </table> </div> Fixing Website Performance Issues I: Hardware http://nigel.mcnie.name/blog/fixing-website-performance-issues-hardware <p>I hate slow websites. You hate them too. If you need to fix a slow site, this series is for you!</p> <p>Making sites perform is something I do. Since there aren't enough fast websites out there, I figured that if I blogged about it, maybe somebody (you?) will listen, contribute ideas of their own, and we can all get better at it. 
Then, maybe, I can finally stop getting so angry when I click on a link and nothing happens!</p> <div class="section" id="series-index"> <h1>Series Index</h1> <ol class="arabic simple"> <li>(this post) - Hardware Is A Last Resort</li> <li><a href="fixing-website-performance-issues-evidence">Follow The Evidence</a></li> <li><a href="fixing-website-performance-issues-whack-a-mole">Whack-a-mole</a></li> <li><a href="fixing-website-performance-issues-page-request-theory">Page Request Theory</a></li> </ol> <p>Today's topic is...</p> </div> <div class="section" id="hardware-is-a-last-resort"> <h1>Hardware Is A Last Resort</h1> <p>99% of sites on the internet can be served by a $40/mo VPS with 500M of RAM.</p> <p>Unless you work for a major website - in which case you probably already have a performance ninja or sysadmin team - you don't need more. Throwing hardware at a site to make it scale is worthwhile <em>after you have exhausted all other options to make the site perform</em>. Cloud servers may be cheap, but 10 minutes spent looking for the biggest performance holdup on your site has an exponential ROI.</p> <p>How does that work? Let's say your bottleneck is RAM, and your site can handle 20 concurrent requests per server. If you just buy servers, you'll get to 100 concurrent requests at five servers. But if you find an easy change that lets you handle 30 concurrent requests, you get there with just four servers. And if you want 1,000, you would have needed 50 servers - but now only need 34.</p> <p>Where does the exponential ROI come in? It's in the accumulated cost of running those servers as your site grows:</p> <!-- image: --> <div align="center" class="figure"> <img alt="Server cost ROI" src="http://stuff.nigel.mcnie.name/server-cost-roi.png" /> <p class="caption">Over time and site growth, small time investments add up to big savings</p> </div> <p>This graph assumes a 10%/mo increase in requests. 
While this growth isn't limitless of course, you can see why the likes of Google and Facebook put so much effort into squeezing just a little more out of their hardware. If you get big, you'll save plenty of money over time.</p> <p>The other reason it's exponential is that in general, there's plenty of low hanging fruit around, which can allow you to get a 10x increase in concurrent requests. I'll share some ideas on those in future posts.</p> <p>What about smaller sites? There's a cost even in moving from one server to two. Not only did you just double your hosting bill, you made the architecture considerably more complicated (as now you have firewalls to deal with as well as two machines to keep patched).</p> <p>If you're serving sites en masse, the maths works out again - if you can split the load across the two servers in a sensible way. This is probably a web/db split. But again, this only makes sense if you know that your one server can't handle web and db loads. Until you know, look for easier wins.</p> <p><em>Next post: <a href="fixing-website-performance-issues-evidence">Follow The Evidence</a></em></p> </div>
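<p>If you want to play with the server arithmetic from the "Hardware Is A Last Resort" section yourself, here's a rough sketch in Python. It's only an illustration: the function names are made up for this example, and the starting load (100 concurrent requests) and the 24 month window are assumptions of mine. The 10%/mo growth figure is the one the graph uses.</p>

```python
def servers_needed(target_concurrent, per_server):
    """Smallest number of servers that can carry the target load.

    Uses integer ceiling division, so 100 requests at 30 per server
    needs 4 servers, not 3.33.
    """
    return -(-target_concurrent // per_server)


def cumulative_server_months(start_concurrent, per_server, months, growth=0.10):
    """Total server-months you pay for while load grows every month.

    start_concurrent and months are hypothetical inputs for illustration;
    growth=0.10 mirrors the 10%/mo assumption behind the graph.
    """
    total = 0
    load = float(start_concurrent)
    for _ in range(months):
        # Round the load up to whole requests before sizing the cluster.
        whole_requests = int(load) if load == int(load) else int(load) + 1
        total += servers_needed(whole_requests, per_server)
        load *= 1 + growth
    return total


if __name__ == "__main__":
    for per_server in (20, 30):
        print(per_server, "requests/server:",
              servers_needed(100, per_server), "servers for 100,",
              servers_needed(1000, per_server), "servers for 1,000,",
              cumulative_server_months(100, per_server, 24),
              "server-months over 24 months")
```

<p>The per-month saving looks small, but because you pay for every server every month, the gap between the two totals keeps widening as the load grows. Swap in your own numbers before drawing any conclusions.</p>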