This is the third post in a series on fixing website performance issues. See here for an index of all posts in the series. Previously: Follow The Evidence. Next: Page Request Theory.
Whack-a-mole
Fixing performance issues is a game of whack-a-mole.
For anyone not familiar with this game, the idea is that moles poke their heads out of holes, and you severely beat them with a hammer until they go back in. Bad for the moles, great for your morale. If you like moles, pretend we're playing whack-a-boss instead.
Why is it whack-a-mole? Because there are many steps in the chain of serving a website, and inevitably, one of them will be a bottleneck before the others.
For example, a system may have plenty of CPU and RAM, but if apache is configured with default settings, it will only take a few concurrent users to max out the number of connections it can have open. The server will idle along just fine, while the website will effectively be unavailable.
Alternatively, the system could be tweaked to perfection, but some poor javascript could cause the user's browser to stutter and hang.
Example: Fixing Ushahidi map performance
As part of my work on eq.org.nz - an instance of Ushahidi set up to help with the Christchurch Earthquake, I was told that people were finding navigating the map to be slow. A quick play with it revealed that if you zoomed in and out, it could take up to five seconds to recalculate and display the numbered red circles that indicated reports for the new zoom level. And that was on my broadband connection, let alone the poor 3g connections some of the people in Christchurch would have had. Ouch.
Gathering Evidence
The first step was to identify the biggest problem. What happens when you zoom? Chrome's inspector revealed an AJAX request to /json/cluster that would take at least a second, and sometimes up to four seconds. Well that obviously wasn't good, but at least I thought I had a pretty good idea of why the map was so slow.
PHP scripts, correctly configured on modern hardware, shouldn't take long to process. The server was doing fastcgi and had APC, and Arjen Lentz had configured the MySQL database [1]. I was seeing processing times of 100-250ms for most scripts, which seemed reasonable in the environment. The requests to /json/cluster were taking more like 1000-4000ms. Plenty of room for improvement!
After verifying the problem occurred on the staging server too (I didn't have a dev setup), the first step was to add timing information to the action serving that URL (Ushahidi is based off Kohana, it's MVC). After every "paragraph" of code, I added a debugging line that output the accumulated time since the start of the action. The output looked something like this:
[Tue Mar 01 04:05:04 2011] CLUSTER BEGIN, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] init: 0.000 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] category: 0.000 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after incident sql: 0.002 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after incident geometries: 0.078 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after incident_categories: 0.078 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after locations: 0.083 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after creating markers: 0.084 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after clustering: 0.161 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after json for clusters: 0.167 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] after json for singles: 0.208 s, referer: http://dev.eq.org.nz/ [Tue Mar 01 04:05:04 2011] done: 0.208s, referer: http://dev.eq.org.nz/
It's a bit messy, but you get the idea. The times in this example are from near the end, when I'd done most of the work.
The nice thing about this format is that it's really easy to see where the big jumps are. For example, the difference between "after incident sql" and "after incident geometries" is about 75ms - an area to investigate. The same again for "after clustering", and then for "after json for singles".
Fixing The Code
Round 1 - the Sqloop
I then picked the first major slowdown, and analysed the code further. What I found was an Sqloop. Simplified, it looked like this:
foreach ($incidents as $incident) {
$geometry_data = $db->query('SELECT ... FROM geometry WHERE incident_id = XXX');
// Do stuff with data
}
This is evil. As the number of incidents grows, so does the number of queries. We had over 1,000 incidents on production. So that was 1,000 queries for each time anybody zoomed the map. MySQL's query cache was probably helping to mask the effects of this, but issuing queries still has a non-zero cost.
The fix is to drag the SQL outside the loop. Get the data once, without the WHERE clause, and do the WHERE in code. It costs a little more RAM, but it's generally not a problem (and you can grab the data in chunks if you really want).
Here's the patch that I produced. You can see that it actually has little effect on the existing code. Instead of doing the SQL, it calls a function which returns exactly the same data that the SQL would have returned. That method will do a query to get data for all incidents the first time it's called, then cache it in self::$geometry_data for future calls.
I like this because it has little effect on the existing code. Because I'm not familiar with what the code should do on a high level, I'm happy with simply making sure it behaves exactly as it did before. An Ushahidi developer may have a better idea for optimising this however.
Round 2 - Iterating
A little further down, some code involving the ORM was running pretty slowly, especially given it seemed to simply be a wrapper around one query:
if (count($location_ids) > 0)
{
$locations_result = ORM::factory('location')->in('id',implode(',',$location_ids))->find_all();
}else{
$locations_result = ORM::factory('location')->find_all();
}
$locations = array();
foreach ($locations_result as $loc)
{
$locations[$loc->id]['lat'] = $loc->latitude;
$locations[$loc->id]['lon'] = $loc->longitude;
}
This segment of code was taking 270ms!
My first instinct was that the first half would be the slow bit. ORM::factory('location')->.. blah blah blah that must be slow. But splitting the timing of the two segments revealed that it was actually the second half that was crawling. What could be wrong? It's a simple foreach loop - could there be millions of iterations? Nope count($locations_result) said 510.
The only answer left was that whatever was being iterated was bad at it. It turns out the location result was not a simple array, but some kind of Iterable ORM object. I don't really know much about them, other than if they're that bad at being iterated over, we'd better not use them ;).
I patched this by doing a straight query rather than using the ORM. Again, there are probably other ways, but in this case, this worked fine. The resulting iteration on the raw array was an order of magnitude faster.
Round 3 - JSON generation
I found several instances in the code of JSON generation by string building.
Not only is this code copy/pasted several times throughout the file (leaving it open to bugs when someone fixes one part and not the others), there's an inbuilt PHP function for doing this: json_encode.
While the json generation was definitely a slow point, actually refactoring all of that code to use json_encode would have taken more time than I had available, so I ignored it. Hopefully it'll get fixed upstream some time.
Round 4 - Caching
The last thing I decided to do was implement a bit of caching.
Caching is generally the last thing you should do. If you cache before you have dealt with any other bottlenecks, you'll mask their effects, but not kill them completely. This can come back and badly bite you later.
For example, caching the JSON generation earlier would have meant we may never have spotted the wasteful sqloop, and the problem would have simply been suppressed until the site was large and active - not really a time when you want to be debugging performance issues! Also, the caching would have meant the problem would have been "intermittent" - with the poor users who get the uncached pages suffering while the others notice nothing wrong.
When you cache, you're looking to eliminate as much work as you can, preferrably without ever serving stale data.
One option is sending expires headers. This is a poor choice in this case, because the same URL can send back different data (e.g. when a new incident is added). It also gives no benefit as more visitors hit the site. They all have to request the data anyway.
In this case, it's better to cache on the server, in the action itself. The reasons for this are that we can cache once for all users, and we can generate a sensible cache key - which means we can avoid ever serving stale data.
The simplest caches are key/value stores. You put a value in the store (e.g. the JSON that the action generated), and the key you use to get it is something that you can quickly work out the next time you want the data back (e.g. the GET parameters to the script).
The trick is to generate the cache keys in such a way that if the underlying data changes, the key will change. Sometimes this means you can't just use the input parameters to the script. Here's my patch. The key is made up of the GET parameters, and the IDs of all of the incidents. It's then md5'd to make sure it only has known characters in it, and used to access a file on disk.
There are few side effects of this caching implementation:
- As incidents are added or removed, all of the entries in the cache will effectively be worthless. But that's fine, there are many more users of the map then there are incidents being added, so we'll get benefits in most cases.
- If an incident changes (e.g. its latitude/longitude), the cache key will NOT change, which means it won't move on the map.
- The amount of space in /tmp that will be used is uncertain and should be monitored.
I suggest a cache cleaning implementation in the commit message that will handle most of these issues. A cronjob that looks for files that haven't been changed in the last 120 minutes and removes them will clean out the old cache files as the incidents change, and the time (120 minutes) gives a rudimentary knob with which we can manage the cache size. The only thing not handled is if incidents change, however the usual usage pattern for an Ushahidi is for events to be constantly added, and the the 120 minute limit puts an upper bound on how long the information will be wrong for anyway.
Results
This optimised the script down to a far more sensible runtime, normally between 70 and 250ms.
This did prove to make a big difference to the perceived responsiveness of the map, however it wasn't a full fix. I found afterwards that the client-side map code (OpenLayers) seems to wait about 1.5 seconds after you finish a zoom operation (e.g. using the mousewheel) before bothering to fire the event which does the AJAX request.
Though I looked around and asked in their IRC channel, there doesn't seem to be an event that can be hooked into earlier. I tried using other events and data in the JS, but couldn't come up with anything that would let the request be sent earlier.
It's unfortunate, because while the map now performs "acceptably", it's not fast. Fixing the map event firing problem is the next "mole". Although there are other wins to be had, like the json_encode fix, it won't get nearly the benefits that fixing the map will.
This finding points out the value of checking your results. The end goal was that the perceived performance of the map was faster. While all of the fixes I made were valuable improvements, they didn't fully fix the problem, and I couldn't really call "mission accomplished". Although in this case I did, as everyone agreed that chasing down javascript mapping bugs wasn't going to have a major benefit compared to the time involved.
Next post: Page Request Theory
| [1] | I'm not entirely sure of everything he did, but when MySQL employee #25 configures your database, it's not going to be the bottleneck :) |
Like this post? Subscribe to my RSS feed and follow me on twitter to hear about new posts early.
Want to share this post? Tweet
Hi and welcome! In 2009 I quit my job to become an entrepreneur, founding 
