Gist of the Day: Analyzing Performance with Benchmark.pm

In my post on Inline::C a few days ago I mentioned The Rules of Optimization Club, and then I ranted a little bit about how, if you cannot measure a performance problem, you don’t have a performance problem. That’s not to say you’re wrong to assert that you have a performance problem; it’s only to say that you cannot point to any particular part of the code as the problem until you have measured it.
There are a great number of ways to measure performance problems; profiling is probably the most useful for large applications where you want to measure performance under real-life conditions. For profiling, I recommend you look at Devel::NYTProf, which is exceptionally feature-rich and has a boatload of useful functionality. All that said, here I’m only going to talk about benchmarking small pieces of code.

Step One: Forming your hypothesis

Assume I have an application which does a lot of searching through big in-memory lists for specific items. There are a number of reasons you might want to do this, and a number of ways to do it. The application is already written, and it has a function which does nothing but the search. Let’s take that function and wrap some benchmark code around it.

The purpose of the snippet below is to extract that function from the code for further analysis. We want to run it a lot of times, and we probably want a sizable list to search through. Let’s use 100,000 iterations with a list of 1,000 elements. We also don’t want to test only one position in the array; we want to test various locations, so let’s look for a random item in the list each time.
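Here is a rough sketch of what that benchmark might look like, using Benchmark’s timethis. The find_item routine below is a stand-in linear scan with placeholder names; the application’s real search function would simply be dropped in its place.

#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(timethis);

# A sizable list to search through; plain integers stand in for the
# application's real data.
my @list = ( 1 .. 1_000 );

# Stand-in for the application's search function: a simple linear scan.
sub find_item {
    my ( $needle, @haystack ) = @_;
    for my $item (@haystack) {
        return 1 if $item == $needle;
    }
    return 0;
}

# 100,000 iterations, looking for a random element each time so we hit
# different positions in the list.
timethis( 100_000, sub { find_item( int( rand(1_000) ) + 1, @list ) } );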
Here is the output from that test:
timethis 100000:  5 wallclock secs ( 4.87 usr +  0.00 sys =  4.87 CPU) @ 20533.88/s (n=100000)
The rate at the end of that line tells us that this function processed approximately 20,500 iterations per second. That’s pretty fast, but I think we can do better. Let’s try a different approach.

Step Two: Coming up with a fix, and then testing it

We could use a binary search instead. Now, I don’t want to change the code all over the place; I want a drop-in replacement. Since a binary search requires sorted input, we’ll just make our new function sort the input as it comes in.
Note: We are only changing one factor here, and that is very important. Software can get pretty big, and there are a lot of things you could optimize, but doing too much at once increases the risk of breaking a lot of stuff. It’s better to do one thing at a time, and to measure and test all along the way. Failing to test will likely result in large flames shooting out of your application (metaphorically, of course).
So let’s write our new search function and benchmark it exactly as we did before.
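A sketch of what that drop-in replacement might look like, dropped into the same harness as the first benchmark (the names here are placeholders, and the comparison is numeric):

# Drop-in replacement: same calling convention as find_item, but it
# sorts the input itself and then binary-searches it.
sub find_item_binary {
    my ( $needle, @haystack ) = @_;

    # Sort on every call so callers don't have to change; this is the
    # part we're about to regret.
    my @sorted = sort { $a <=> $b } @haystack;

    my ( $low, $high ) = ( 0, $#sorted );
    while ( $low <= $high ) {
        my $mid = int( ( $low + $high ) / 2 );
        if    ( $sorted[$mid] == $needle ) { return 1 }
        elsif ( $sorted[$mid] < $needle )  { $low  = $mid + 1 }
        else                               { $high = $mid - 1 }
    }
    return 0;
}

timethis( 100_000, sub { find_item_binary( int( rand(1_000) ) + 1, @list ) } );

And here’s the result: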

timethis 100000: 19 wallclock secs (18.54 usr + 0.01 sys = 18.55 CPU) @ 5390.84/s (n=100000)
Whoa! We thought we were going to improve performance, but we actually made it much worse. What’s wrong? Well, if we analyze our first algorithm we see that it costs about O(n): the function walks the list until it finds the item we’re looking for, so in the worst case, when the item is at the very end, it touches every element. That’s why we went with the O(log n) binary search, whose cost grows only logarithmically with the size of the list (doubling the list adds roughly one more comparison). So what went wrong?

Step Three: Refining the fix, and then testing it

Well, upon further analysis we remember that we’re doing our sort inline with our function. Perl’s default sort (a stable mergesort as of 5.8) is O(n log n), so every call pays that sorting cost before the O(log n) search even begins. On top of the sort, we’re also moving a lot of memory: the list gets copied into the function on every call, which takes time and memory as well.
So, what would happen if we were to require that the input be sorted before it comes in? Let’s try an implementation of binary search which assumes sorted input and operates on an arrayref rather than on a copy of an array.
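A sketch of that version, again with placeholder names; it takes an array reference and trusts the caller to have sorted the list numerically:

sub find_item_sorted {
    my ( $needle, $haystack ) = @_;

    # No sort and no copy: we binary-search the referenced array directly.
    my ( $low, $high ) = ( 0, $#{$haystack} );
    while ( $low <= $high ) {
        my $mid = int( ( $low + $high ) / 2 );
        if    ( $haystack->[$mid] == $needle ) { return 1 }
        elsif ( $haystack->[$mid] < $needle )  { $low  = $mid + 1 }
        else                                   { $high = $mid - 1 }
    }
    return 0;
}

# The list is sorted once, outside the benchmarked code.
my @sorted_list = sort { $a <=> $b } @list;
timethis( 100_000, sub { find_item_sorted( int( rand(1_000) ) + 1, \@sorted_list ) } );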

Running that, here’s what we get:
timethis 100000: 1 wallclock secs ( 0.81 usr + 0.00 sys = 0.81 CPU) @ 123456.79/s (n=100000)
Holy cow! That’s a big improvement, and now it’s a proven one, but it also means I have to sort my input before sending it to the search. There are a few places that sort could happen: sort the input file as it comes in, have the SELECT statement that feeds the data return its rows already sorted, or sort the array once you’re done adding to it, as sketched below.
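For example, sorting the array once after it is fully populated, and reusing that sorted copy for every lookup, might look something like this (using the hypothetical find_item_sorted from the sketch above):

# Populate the array however the application normally does, then sort
# it once before the searches start.
my @items  = map { int( rand(10_000) ) } 1 .. 1_000;
my @sorted = sort { $a <=> $b } @items;

for my $wanted ( 42, 4_242, 9_999 ) {
    print "$wanted: ", ( find_item_sorted( $wanted, \@sorted ) ? 'found' : 'not found' ), "\n";
}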
After I put this into my application and switch the code over to sorted inputs, I still need to make sure I write my tests (or, better yet, update the ones I already have), and then run the profiler on the overall application to confirm that I’ve actually achieved the performance I needed.

Conclusion

Benchmarking and profiling are tasks which aren’t done nearly often enough. They are excellent tools when you think you have performance problems, and they’re really the best way of finding a problem and then proving it.
