May 20 2011

I've just seen this article in ComputerWorld about how the NZ Red Cross scaled its website to handle the aftermath of the Canterbury earthquake on February 22.

I don't know how much the Red Cross themselves know about the details of how their site was fixed, but I know one thing for sure. It was Arjen Lentz and I who fixed it. Not Netspace.

The Story

Here is what the story says happened:

To cope with the spike in traffic Netspace Services Limited added a reverse proxy to handle the static content delivery, tuned PHP to include an op-code cache and also trimmed the PHP modules to the bare minimum. The Apache daemon was also tuned for high turnover of processes to prevent memory bloat. The web server was put onto a new 64-bit Debian operating system and the database was shifted to its own 64-bit hardware and operating system.

The MySQL database was tuned, with indexing optimisations applied and alterations to cache settings to increase performance while reducing overall load. The presentation code was rewritten to push more work to the SQL database, to optimise queries, and to streamline processing. Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world, and return to business as usual.

The above has some grains of truth in it. However, it wasn't Netspace that eventually fixed the site. Read on for what really happened.

The Facts

Immediately after the quake, the NZ IT community pooled together to begin work on what would eventually be eq.org.nz - the Christchurch Recovery Map. I had previously worked on scaling a map for the Queensland Floods earlier in the year, so I became part of the team that worked on the site.

We migrated the site from a temporary home on Crowdmap to dedicated servers, configured and tuned the stack supporting the site (PHP under fastcgi, MySQL, the Ushahidi software that powered the site), and monitored it while the site reached 100,000 visits in a week. A more detailed list of what happened was published in the NZCS Newsletter on the 18th of March.

After the quake, the NZ Red Cross site was very slow to respond, if it responded at all, to most requests. It was obviously under high load with much of the country wanting to donate, but it was clear that most people were not even getting through to the site.

While we were working on eq.org.nz, we saw the Red Cross was having trouble, and the group made the decision to reach out to help. As a result, I was put in touch with Gerard Creamer at Netspace [1]. He connected me to Fletch, who works for them as a sysadmin.

The site was struggling. Fletch was preparing to move the site to a new server, having seemingly abandoned the existing one as a lost cause. I got access to it to see what was going on. Digging around, I found out why it was so slow. The apache powering the site was woefully misconfigured.

The hardware the site was on was, I believed, more than capable of handling the load the site was under. Unfortunately, the software hadn't been configured in any kind of sensible fashion. I made the decision that time would be better spent fixing the software rather than moving to new hardware - which wouldn't have made a button of difference if the site was again misconfigured!

After a pause while we had to reboot the server when it completely locked up, I got to work. I started by configuring the apache to allow many more connections and turned off keepalive, which was just an immediate fix to make the site a little more responsive.

I then began work on putting nginx in front of the site. Nginx is the reverse proxy mentioned in the story. The idea is that it handles the connections between visitors' browsers and the website, and serves any static files (images/CSS/javascript), instead of the slower, more memory-hungry apache.

However, while working on nginx, I noticed that at random intervals, anywhere from 30 seconds to 10 minutes after apache was restarted, it would race to MaxClients and the site would lock up. Very strange behaviour, which I initially put down to the high load. I figured nginx would fix it, so I kept an eye on it and restarted apache every time it locked up, while continuing to work on nginx.

After a while, I got nginx in front of the apache. That made an immediate difference to the site performance, now that apache was only handling PHP requests [2]. However, the random apache lockups continued.

Arjen came on the scene somewhere around this point, and began looking at the database. Everything that Netspace claimed they did regarding database tuning, he did. Index optimisations, alterations to cache settings, etc. In fact, Arjen even gave a talk about what he did at the Brisbane PHP & MySQL user group.

We began investigating the lockups in more detail. I've actually blogged about how they were caused and how we fixed them before - "the case of the crashing website" was actually "the case of the crashing NZ Red Cross website". Once we fixed it, success! The site was running smoothly.

The Story vs The Facts

Any mention in the story of Netspace having tuned software is dubious at best. I honestly can't remember whether I installed an opcode cache, whether they did, or whether there even was one. However, the reverse proxy setup and apache configuration was me. The site was, according to Fletch, running on the existing hardware that it always used to run on, so I'm not convinced it was ever "put onto a new 64-bit Debian operating system" - and that wouldn't have made any difference if it had.

Regarding the MySQL - apart from moving it to a separate machine (which may not have even happened - all I know is that it was on a separate machine when I got there) - all of the tuning was done by Arjen. He was MySQL employee #25, and now runs OpenQuery, a MySQL consultancy. There are tickets in OpenQuery's issue tracking system where he detailed all of the work he did.

The funniest part about the whole story is this:

Within 36 hours of the site getting hit with a wall of traffic ten times larger than the solution was specified to handle, Netspace had reduced processing and memory load on the servers while being able to serve the higher connection demands from the world...

The solution was specified to handle? The solution was a horribly misconfigured apache, an unconfigured MySQL and no reverse proxy, as well as a broken search system (see the case of the crashing website). The "wall" of traffic would only have had to have been 10 visitors at once to floor it.

The solution Arjen and I put in place dealt with the wall with ease. When I logged out of the web server, load was 0.1, and it was using just 700M of ram - in comparison to the swapping-like-crazy, constantly-locking-up mess it was in (aka: load of 20+, all 8G of ram + all swap used).

Gerard sent me an e-mail afterwards:

Hi Nigel,

Thank you so much for your help with Red Cross - we really appreciate it,
and so do the folks at Red Cross.  Your efforts have enabled Red Cross to
accept over 7500 donations since noon yesterday - over $800,000 going
directly to where it's needed.  And that number will continue to climb.

You seem to still have 10 open sessions on the redxprdww02 server - let me
know if you still need them and I can kill them off if you're done.

If there are any expenses or costs please don't hesitate to invoice them to
us.

Netspace Services Limited
PO Box 404
Palmerston North
New Zealand

Thank you again for all of your help.

Regards,
Gerard

Q & A

Q: Nigel, isn't this sour grapes?

I don't think so. There's no mention of us in the article. People will get the impression that it's Netspace that tuned the site if I don't say anything.

I wouldn't have had a problem with it if the article had said that Netspace accepted the help of the team that worked on eq.org.nz, who did [insert all the configuration stuff here].

ComputerWorld is a widely read publication; it doesn't seem right to let such inaccuracies go without correction.

Q: What about the other stuff in the article?

We only ever dealt with the website. Everything else wasn't us, and I know nothing about it.

I never actually dealt with Charles Ranby, the Red Cross IT Manager. I wonder how much he knows.

Q: So what do you think of Netspace?

I hope they simply forgot to mention they had so much help.

Fletch was good to work with - responsive and helpful.

[1]I was actually connected to him at Face, but Face and Netspace seem to be interchangeable - see how Face claims Red Cross New Zealand as their client, at the bottom of the page.
[2]Unfortunately though, not as much as it could have, because the CMS they're using serves images through PHP scripts

Like this post? Subscribe to my RSS feed and follow me on twitter to hear about new posts early.

Apr 18 2011

This is the fifth post in a series on fixing website performance issues. See here for an index of all posts in the series. Previously: Page Request Theory.

Server Resources

Last time, we walked through what happens when a visitor requests a page from your website, at a high level. Now we are going to "zoom in" on the server side, and examine what resources are required to handle the page request on the server [1].

Knowing how resources are used is useful, because many performance issues boil down to an overuse of one or more of them, and fixing the cause of the overuse is how you fix the issue.

We saw last time what happens when someone visits a page on your website:

  • Around 6 connections will be made, all of which will be involved with serving the page until the visitor's browser has downloaded all the resources required - which could take a few seconds.
  • Generally, at least one of the connections will result in a dynamic page needing to be served. Even more if you're using AJAX.
  • After the page is downloaded, the connections may be kept open, reserved for the visitor, if keepalive is on.

So, what resources does the server need to handle this? Let's go through each step of a request and list them all.

Step 1: Visitor starts a request

To handle a request, a server needs a connection. To handle the entire page, ideally the server needs six free connections. Connections are finite, and it is definitely possible to run out of them.

Remember that initially, only one connection is required. If the visitor is asking for a web page or some other resource that needs more resources to be correctly rendered, the browser will try to open more connections for those.

Once the connection is established, the visitor will send their request. This can actually be reasonably slow, for a few reasons:

  • On slow connections, or connections with asymmetric bandwidth (e.g. a 2M down/128K up ADSL connection), the headers simply cannot be sent very quickly [2].
  • HTTP headers are not sent compressed; and
  • If a request involves cookies, the request size could potentially be up to a few kilobytes (larger than some pages and many small images!)
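
To put a rough number on that, here's a small Python sketch. It assumes "128K up" means 128 kilobits per second, ignores latency and TCP overhead entirely, and uses a hypothetical 2 KB request. The function name is made up for illustration.

```python
def upload_seconds(request_bytes, uplink_kbit_per_s):
    """Time to push a request up the link, ignoring latency and protocol overhead."""
    bits = request_bytes * 8
    return bits / (uplink_kbit_per_s * 1000)

# Hypothetical request: 2 KB of headers and cookies over a 128 kbit/s uplink.
one_request = upload_seconds(2048, 128)
print(f"one request: {one_request:.3f}s")

# Six requests sent in parallel share the same uplink, so the time adds up.
print(f"six requests sharing the uplink: {6 * one_request:.3f}s")
```

Even under these generous assumptions, a page's worth of requests can spend a noticeable fraction of a second just getting to the server.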

During this time, apart from holding the connections open, the server doesn't have much to do [3].

Step 2: Server interprets the request

For each request, the server needs to decide what to do. This should hopefully be quick, although inefficient server implementations or configuration can slow this down.

For example, mod_rewrite can cause slowdowns, if the various rewrite rules are inefficiently written or targeted. If you're checking every single request to see if it matches a certain regular expression, or for whether it should be modified to be a request for some other resource, this all adds up.
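
Here's a sketch of the idea in Python rather than in mod_rewrite syntax. The rules, paths and the `skip_prefixes` parameter are all hypothetical; the point is only to count how many regular expression checks a request costs, and how a cheap string test up front can avoid them for requests that can never match.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement). Not taken from any real site.
RULES = [
    (re.compile(r"^/photos/(\d+)/?$"), r"/photos/index.php?id=\1"),
    (re.compile(r"^/blog/(\d{4})/(\d{2})/?$"), r"/blog/archive.php?y=\1&m=\2"),
    (re.compile(r"^/user/([a-z0-9_]+)/?$"), r"/profile.php?name=\1"),
]

def rewrite(path, skip_prefixes=()):
    """Return (rewritten_path, number_of_regex_checks_performed)."""
    # Cheap string test first: paths under these prefixes skip every regex.
    if path.startswith(tuple(skip_prefixes)):
        return path, 0
    checks = 0
    for pattern, replacement in RULES:
        checks += 1
        if pattern.match(path):
            return pattern.sub(replacement, path), checks
    # Nothing matched, but every rule was still tested.
    return path, checks

print(rewrite("/photos/42"))
print(rewrite("/static/logo.png"))
print(rewrite("/static/logo.png", skip_prefixes=("/static/",)))
```

A request for a static file matches none of the rules, yet without the prefix test it still pays for all of them.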

There's always going to have to be some processing done, however. We all like friendly URLs (and in particular, Google does). You just have to be careful that you're not causing more work than necessary.

Requests for dynamic resources (e.g. PHP pages)

For these requests, the code needs to be run. This is easily the most expensive part of the entire request process:

  • First, an interpreter needs to be ready to run the script. Normally, this has been done by the web server before any requests arrive, but if the server has no free processes available to handle the request, it may have to start a new one. This can take CPU and RAM to accomplish, and any interpreters will generally cost at least a few MB of RAM just to be ready for processing.

  • Then the code runs. This takes the most resources - CPU, RAM, network bandwidth. It can also cause disk reads and writes, both of which can be very slow.

    While the code is executing, it will want as much CPU as it can get. However, there will be times when it needs to wait for something in order to continue - for example, when it has issued a query to a database. During those times, it won't consume much CPU, although it will still hold any RAM it requires.

    The RAM requirements will depend largely on how the code is written and what it's doing. One way of solving a problem may involve grabbing some data, processing it, then grabbing some more until all the data is gone. Another way may be to grab all of the data into RAM at once, then process it all (contrast SAX vs DOM parsing for XML). In general, most PHP based applications I've worked on need 20MB or less of RAM for the average page request [4].

    Network bandwidth is generally only used between the web server and any services running on other servers (e.g. database server). This generally isn't a problem unless you're hauling massive datasets into your script.

    IO bandwidth is, depending on the system, one of the rarest resources you have. In particular, safely writing to disk is likely to be the slowest operation you normally do [5]. Reducing the number of database/filesystem writes you do is a great way to improve performance, as is making sure you only read the information you need to.
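
The contrast drawn above between grabbing all the data at once and processing it piece by piece can be sketched in Python. This is an illustration only: `ElementTree.iterparse` stands in for SAX-style parsing, and the XML layout and function names are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

def make_xml(count):
    """Build a hypothetical document of <item> elements, each with a <price>."""
    items = "".join(f"<item><price>{i}</price></item>" for i in range(count))
    return f"<items>{items}</items>".encode("utf-8")

def total_all_at_once(xml_bytes):
    # DOM style: the whole document becomes a tree in RAM before any work starts.
    root = ET.fromstring(xml_bytes)
    return sum(int(item.findtext("price")) for item in root.iter("item"))

def total_streaming(xml_bytes):
    # Streaming style: handle each item as soon as its closing tag is parsed.
    total = 0
    for _event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if element.tag == "item":
            total += int(element.findtext("price"))
            # Release the item's children and text; the emptied element
            # itself stays attached to the root.
            element.clear()
    return total

data = make_xml(1000)
print(total_all_at_once(data), total_streaming(data))
```

Both functions give the same answer. The difference is how much of the document has to sit in RAM while they work, which matters more as the input grows.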

Processing dynamic requests is almost certainly the most intensive part of serving a page. Therefore, it's an area ripe for performance optimisations to be made (though don't rush in until you know it's the real problem).

Requests for static resources (e.g. CSS files)

For static requests it's easier. The file just has to be read. The kernel may have already cached the file contents in RAM, making it very quick.

Seriously, it's that simple. Because it's such a different task to handling the dynamic requests, it makes sense to make sure these requests aren't handled in the same way. Splitting how they're handled is often at the heart of initial performance bottlenecks, which is something I'll blog about in future.

Step 3: The response is sent back

This can be the slowest part, as it's affected by the download speed of the visitor. This holds the connection open, so if you're sending back lots of data, remember that this connection is in use all that time! Fortunately, it's not very CPU intensive.
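
As a back-of-the-envelope sketch of why this matters: the first function below estimates how long one response ties up a connection, and the second uses Little's law (average number in progress = arrival rate × time each takes) to estimate how many connections are busy at once. The response size, link speed and request rate are hypothetical, and latency is ignored.

```python
def transfer_seconds(response_bytes, download_kbit_per_s):
    """Time the connection is held while the response downloads."""
    return (response_bytes * 8) / (download_kbit_per_s * 1000)

def connections_busy(responses_per_second, seconds_each):
    """Little's law: average number of connections in use at any moment."""
    return responses_per_second * seconds_each

# Hypothetical: a 500,000 byte response to a visitor on a 2000 kbit/s link.
held = transfer_seconds(500_000, 2000)
print(f"connection held for {held:.1f}s")

# Hypothetical: 20 such responses starting every second.
print(f"about {connections_busy(20, held):.0f} connections busy on average")
```

The server's CPU is nearly idle during all of this, but the connections are still used up.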

Finally, the request/response cycle is done; the connection can close or be left open based on keepalive.

Improving Performance

We've seen what resources can be consumed while serving a web page. Knowing what is consumed, and when it might be consumed, is good background information when you go looking to improve the performance of a website.

For example, if a site is slow, you now understand that it could be caused by the overuse of one or more of the resources mentioned. It could be that all RAM is used, causing the machine to swap; too many disk reads/writes causing slow disk operations; or perhaps the machine is simply using all CPU flat out.
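
The RAM case is easy to estimate roughly. The sketch below takes the 20MB per request figure mentioned earlier and a 500M machine, and assumes (hypothetically) that 100MB is set aside for the operating system and other services. Real numbers depend on the application.

```python
def max_processes(total_ram_mb, reserved_mb, per_process_mb):
    """How many request-handling processes fit in RAM before swapping starts."""
    usable = max(total_ram_mb - reserved_mb, 0)
    return usable // per_process_mb

# 500MB machine, hypothetical 100MB reserved, 20MB per PHP request.
limit = max_processes(500, 100, 20)
print(f"room for about {limit} processes")

# Allowing more processes than this is what pushes the machine into swap.
for allowed in (15, 50):
    state = "fits in RAM" if allowed <= limit else "will swap under load"
    print(f"{allowed} processes allowed: {state}")
```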

However, it could be none of these reasons. Sites can also be slow due to reasons that examining the server wouldn't notice. For example, missing expires headers or lack of gzip on page components can cause sites downloaded over slower connections to crawl.
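
To get a feel for what gzip can save, here's a small sketch using Python's `gzip` module. The stylesheet is a made-up, highly repetitive one, so it compresses far better than a typical real file would; treat the ratio as an illustration, not a benchmark.

```python
import gzip

def gzip_sizes(payload):
    """Return (original_size, gzipped_size) in bytes."""
    return len(payload), len(gzip.compress(payload))

# Hypothetical stylesheet: the same rule repeated 200 times.
css = (".post h2 { margin: 0 0 1em 0; color: #333333; }\n" * 200).encode("ascii")

original, compressed = gzip_sizes(css)
print(f"{original} bytes uncompressed, {compressed} bytes gzipped")
```

None of that saving shows up when you look at CPU, RAM or disk on the server, which is why this kind of problem is easy to miss.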

Next time, we'll look at the ways in which you can inspect a server & website to find performance issues [6].

[1]I'm aware that last time I said the next post would be about what happens when many people visit your site, but I decided we need a little more background before that post will make sense.
[2]You may not think this matters much, but remember that many requests might be sent in parallel, which, combined with large request sizes and a lack of compression, can add up. Not to mention two people using the internet on the same connection...
[3]SSL introduces some complexity, and of course a little RAM is needed to hold the request data, but ignoring these there's very little happening.
[4]However, there are some cases where a script simply needs more. GeSHi, for example, needs to store the HTML for syntax highlighting source code in RAM, and if the input script is very large, the HTML for highlighting is a whole bunch larger again. It all depends on what you're doing.
[5]If your script makes network requests to remote hosts that'll likely be slower again, although you wouldn't normally do this in the context of serving a web page precisely for this reason.
[6]This is perhaps getting a little off the topic of showing how most websites on the 'net can be served off a $40 VPS, like I promised I'd do, but I promise we'll get there in the end!


Mar 31 2011

This is the fourth post in a series on fixing website performance issues. See here for an index of all posts in the series. Previously: Whack-a-mole. Next: Server Resources.

Page Request Theory

I previously made a claim that 99% of sites on the internet can be served by a $40/mo VPS with 500M of RAM. Now I'm going to back that up.

It will take a few posts. The first two are theory. In this one, we'll examine what happens when someone requests a page from a website. The next will show what happens when lots of people request pages.

Then we'll move to practice. The third will dissect a typical LAMP application, and analyse what goes wrong when many visitors are using it. And for the finale, we'll examine how we can remove the bottlenecks we found to improve performance & scalability.

What Happens When A Page Is Requested?

I promise I'll keep this readable (at the expense of a few details, so experts - please no mocking kthx).

Step 1: Visitor starts a request

A visitor chooses to visit a page on your site, by clicking on a link or bookmark, typing the page address into their browser, etc. However they choose to do this, the result is that they are now requesting some resource (a page, image, video or whatever) from you.

Their browser opens a connection to your server. The inner workings of this aren't important for now, but the end result is one connection is now established between them and you.

Step 2: Visitor asks for something

The visitor then sends information about what they want, and how they want it. These instructions are known as the request.

The request is a list of instructions that says something like "I want resource yoursite.com/photos/index.php", and then a bunch of info about how they'd like it. "I could handle it if you sent it back gzip compressed; I don't mind if you send back an older one; you sent me a cookie that hasn't expired so here it is again"...
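
Roughly, that list of instructions is just lines of text. Here's a sketch in Python that builds one. The host and path come from the example above; the cookie value is hypothetical, and a real browser would send more headers than this.

```python
def build_request(host, path, cookie=None):
    """Build the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",       # "I want this resource"
        f"Host: {host}",              # which site it belongs to
        "Accept-Encoding: gzip",      # "I could handle it gzip compressed"
    ]
    if cookie:
        lines.append(f"Cookie: {cookie}")  # "here's the cookie you sent me"
    # Each line ends with CRLF, and a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"

request = build_request("yoursite.com", "/photos/index.php", cookie="session=abc123")
print(request)
print(len(request.encode("ascii")), "bytes")
```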

Think of the request as a list of demands. The server will try and fulfill these demands.

Step 3: Server responds

Your server receives the list of instructions (request), and with it, works out how to respond.

If the requested page was 'index.php', your server will probably be configured to run PHP over the index.php file, and return the output of the PHP script. If the requested page was actually an image or some other file, your server will probably just send it back as-is.

Actually, what is done for each request is totally up to you. You could tell the server to send back /home/you/bananaphone.mp3 for any request if you really wanted. While this might be annoying for the visitor, there's nothing really stopping the server doing it, other than that a server should be trying its best to fulfill the demands made by the visitor.
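
As a toy illustration of that decision (not how any particular web server is implemented), here's a Python function that picks an action based on the requested path. The function name and the `override` parameter are invented for the example.

```python
import posixpath

def choose_action(path, override=None):
    """Describe what this toy server would do for a requested path."""
    # The server owner can decide to ignore the request entirely.
    if override is not None:
        return f"send {override}"
    extension = posixpath.splitext(path)[1].lower()
    if extension == ".php":
        return f"run PHP over {path} and send its output"
    # Images, stylesheets and other files are sent back untouched.
    return f"send {path} as-is"

print(choose_action("/photos/index.php"))
print(choose_action("/images/cat.jpg"))
print(choose_action("/photos/index.php", override="/home/you/bananaphone.mp3"))
```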

Then your server sends the data - which is known as the response - back, over the same connection. Depending on how much data it is, and on the download speed of the visitor, this could take a while!

Step 4: Server finishes request (maybe!)

Finally, all the data is sent. The connection may be closed, depending on whether a feature of connections called Keepalive is on [1]. If the visitor's browser said (in the request instructions) that they support keepalive, and the server is configured to allow it, then the connection will be kept open. You'll see why this can be a good thing shortly.

Summary (so far)

Does it make sense? Connection opened, instructions sent to server, server responds, connection (possibly) closed. This is how any individual resource is requested from a web server.

The story isn't quite finished however. Most pages are made up of many resources - images, stylesheets, javascript etc. Where do they fit in?

Step 5: Visitor processes response

If the response is an HTML page, it can contain references to other resources.

As the browser parses the page, it will find these references. As it does, it will make requests to the server for them.
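
Here's a simplified sketch of that reference-finding step, using Python's built-in `html.parser`. The page is made up, and a real browser looks at many more tags and attributes than the three handled here.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collect the URLs of images, scripts and stylesheets as tags are parsed."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script") and attributes.get("src"):
            self.resources.append(attributes["src"])
        elif (tag == "link" and attributes.get("rel") == "stylesheet"
              and attributes.get("href")):
            self.resources.append(attributes["href"])

# Hypothetical page. The plain <a> link is not fetched automatically.
page = """<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/site.js"></script>
</head><body>
<img src="/images/html_repeat.png">
<a href="/about">About</a>
</body></html>"""

finder = ResourceFinder()
finder.feed(page)
print(finder.resources)
```

Each URL in that list becomes a new request to the server.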

If Keepalive was not enabled, it will start at step 1 again for the new resource - by making a new connection.

However, if Keepalive was enabled and the previous connection is still open, the browser will jump straight to step 2, re-using the existing connection.

This is clever, because the process of starting a request can take a decent amount of time. When a page has many resources, that would otherwise mean many connections needed to be made, each incurring the delay.
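
A quick sketch of the saving, under deliberately simple assumptions: requests go one after another, and each new connection costs a fixed, hypothetical 0.05 seconds to set up.

```python
def setup_delay(resource_count, setup_seconds, keepalive):
    """Total time spent opening connections for a page's resources."""
    # With keepalive only the first request pays the setup cost;
    # without it, every request opens its own connection.
    connections_opened = 1 if keepalive else resource_count
    return connections_opened * setup_seconds

# Hypothetical page with 12 resources.
print(f"without keepalive: {setup_delay(12, 0.05, keepalive=False):.2f}s")
print(f"with keepalive:    {setup_delay(12, 0.05, keepalive=True):.2f}s")
```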

There's one last detail to know for now. Browsers don't normally just open one connection to a server. They'll open between two and 10 at once! [2] This is because they can open the connections at the start, and then feed the requests for other resources into them immediately when they're spotted in the HTML.

Step 6: Keepalive'd connections close

Connections that are using Keepalive have a timeout. If the browser makes no new requests for a certain amount of time (chosen by the server), then the server will close the connection.

If this time is quite long - e.g. 60 seconds - then if a visitor clicks on a new page within a minute of loading the last page, they'll again re-use the connection. Cool!

Example Of A Page Request

So what does requesting a page look like, at a high level?

Here's a screenshot of Google Chrome's network inspector, when loading http://nigel.mcnie.name/ .

[Screenshot of Chrome's network inspector: 17 requests, beginning with the page itself, then CSS, then images]

Notice how the first request is on its own, then the stylesheet, then how a bunch of image requests begin simultaneously.

In the image, the bars represent the time when the resource was being requested and received. For each bar, the transparent bit on the left is (roughly) the time spent requesting the resource, and the solid bit is the time spent downloading the response.

The first request was for the HTML page, and starts at time zero. See how while the request is being made, no other requests are in progress.

Then in the response part of the first request, see how the next request starts (for style.css). This request is significantly faster (the bar is shorter), as Keepalive is enabled and no new connection has to be made. This request was started as soon as Chrome knew it needed style.css - which is part way through parsing the HTML returned by the first request.

The next two requests come "from cache" - that is, the browser already had a copy of them and doesn't need to download them again. We can ignore those.

Then, 11 requests begin for images. The last one is from google analytics, which we'll ignore for now.

You can't actually see it in the picture, but what actually happens is the browser opens five more connections. The first image (html_repeat.png) is requested using the first connection we already had open, then the next five are requested using new connections. The last four images are requested as the first images complete downloading. They re-use the connections, saving time.

The last requests (for the google analytics __utm.gif and the one from collect.clooso.org), are actually triggered from javascript, which is why they're so late. The browser will open new connections for those too, because the site they're getting those resources from is different. So it's quite possible for a browser to have dozens of connections open at once, although it will only tend to open about six to each individual server.
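
The connection re-use described above (ten images shared across six connections) can be simulated in a few lines of Python. This is a simplified model: every image is assumed to take the same hypothetical 0.3 seconds, and each request simply goes to whichever connection becomes free first.

```python
import heapq

def schedule(durations, connection_count):
    """Return a (start, finish, connection) tuple for each request, in order."""
    # Heap of (time the connection becomes free, connection id).
    free_at = [(0.0, c) for c in range(connection_count)]
    heapq.heapify(free_at)
    timeline = []
    for duration in durations:
        start, connection = heapq.heappop(free_at)
        finish = start + duration
        timeline.append((start, finish, connection))
        heapq.heappush(free_at, (finish, connection))
    return timeline

# Ten images, six connections, hypothetical 0.3s per image.
timeline = schedule([0.3] * 10, 6)
for number, (start, finish, connection) in enumerate(timeline, 1):
    print(f"image {number:2d}: connection {connection}, {start:.1f}s to {finish:.1f}s")
```

The first six images start immediately, and the last four wait for a connection to free up. Call `schedule([0.3] * 10, 1)` to see how much longer the same page takes over a single connection.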

Try It Yourself

If you're using Chrome, hit Ctrl+Shift+i. You'll see the Chrome Inspector. Click on the network tab, then reload this page. You'll see the requests come in, and even better, you can hover over the bars to get more information about what they're made up of.

You can also try it in Firefox, if you have installed the Firebug extension.

If you're using IE, here's a download link for Firefox. You're welcome.

Summary

Requesting a page is quite an involved process, but I hope I've managed to explain it in reasonably clear terms. You should now know:

  • How many connections are initially opened by a browser
  • What information a browser sends to the server
  • What kind of things a server could send back
  • How many connections a browser may end up opening
  • Why some connections may stay open even after a browser has received a response

Feel free to ask any questions in the comments. If you can think of a way this post could be clearer, let me know as I'm happy to update it.

Also, a few people have been asking if I'm going to cover <topic X> in this series on performance. If there's something you would like to hear about, let me know!

Next post: Server Resources

[1]This is a feature of HTTP connections, not of all connections. My summary contains a number of simplifications and omissions which aren't needed to help you grasp the basic concepts of how pages are requested.
[2]The actual number depends on the browser, and what the server will let the browser get away with. Six is a common number.
