Dealing with Performance Improvements

I hope this post is educational and helps those among us doing performance optimisations without any kind of measurement. If you do these things without a benchmark, you are either a genius or, far more likely, your application is going to run slower. I’m not going to talk about performance analysis right now, but tools like OProfile, callgrind, sysprof and speedprof are very handy utilities. The reason I’m writing this up is that I saw a performance regression in one of my testcase reductions, which is something I don’t appreciate. In general I see a lot of claims about performance tuning but very little about measurements, and that part is very worrying.

For QtWebKit we have the performance repository with utilities, high-level tests and something I have labeled reductions. In detail we have the following things:

  1. Macros for benchmarking. I started with the QBENCHMARK macros, but they didn’t really provide what I needed and changing them turned out to be a task I didn’t have time for, so I created WEB_BENCHMARK macros that work the same way as the QBENCHMARK macros. One of the benefits is better statistics: the mean, standard deviation and so on are printed at the end of the run. They also use a different metric for measuring time: the setitimer(2) syscall, which measures the CPU time spent executing in userspace and in kernelspace on behalf of the application. This metric is robust against issues like CPU scheduling; it would be the wrong metric for measuring latency though, as we are not executing anything while waiting. A small sketch of this timing approach follows after this list.
  2. Pick the area you want to optimize. With the QtWebKit performance repository we have a set of reductions. These reductions consist of real code, a test pattern and test data. The real code comes from WebCore and drives Qt, the test pattern comes from loading real webpages (it is created by adding printf and the like to the code), and the test data is the data that was used when creating the test pattern. We have these reductions for all the image decoding operations we perform on webpages, for our font usage and for QTextLayout usage.
    The really awesome bit about these reductions is that they generate stable timings and are (or should be) fully deterministic. This allows me to really measure any change I make to, let’s say, QImageReader and the decoders.
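
To make the timing part more concrete, below is a minimal sketch, not the actual WEB_BENCHMARK code, of how user and kernel CPU time can be measured with setitimer(2)/getitimer(2) around a tiny image-decoding reduction. The file name, iteration count and helper names are made up for illustration.

    // Minimal sketch: ITIMER_PROF counts down while the process executes in
    // userspace and while the kernel executes on its behalf, so reading how
    // far it has counted down gives the consumed CPU time.
    #include <QImageReader>
    #include <QImage>
    #include <sys/time.h>
    #include <cstdio>

    static void startCpuTimer()
    {
        // Arm the profiling timer with a huge value so it never fires;
        // we only ever read how much it has counted down.
        itimerval value = {};
        value.it_value.tv_sec = 1000000;
        setitimer(ITIMER_PROF, &value, 0);
    }

    static double elapsedCpuSeconds()
    {
        itimerval now = {};
        getitimer(ITIMER_PROF, &now);
        double remaining = now.it_value.tv_sec + now.it_value.tv_usec / 1e6;
        return 1000000.0 - remaining;
    }

    int main()
    {
        startCpuTimer();

        // The "reduction": drive QImageReader the way WebCore would,
        // using recorded test data (here a single hypothetical file).
        for (int i = 0; i < 100; ++i) {
            QImageReader reader(QLatin1String("testdata/sample.png"));
            QImage image = reader.read();
            if (image.isNull())
                return 1;
        }

        std::printf("CPU time: %.3f s\n", elapsedCpuSeconds());
        return 0;
    }

The real macros go further and report the mean and standard deviation over repeated runs, which this sketch does not attempt.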

Using the setitimer(2) syscall we get a pretty accurate picture of the CPU usage of the benchmark, and using glibc's /lib/libmemusage.so we get an accurate graph of the memory usage of the application. It is simple to create a benchmark, it is simple to run the benchmark, and it is simple to run the benchmark with memory profiling. By looking at both CPU and memory usage it becomes pretty clear if and where you have tradeoffs between memory and CPU.
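
For reference, running a benchmark binary under the memory profiler should be as simple as preloading the library, along the lines of LD_PRELOAD=/lib/libmemusage.so ./tst_imagedecoding (the binary name here is made up); the memusage preload then prints a heap usage summary when the process exits.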

And I think that is the key to a benchmark. It must be simple, so people can understand what is going on, and it must be simple to execute, so everyone can do their own measurements and verify your claims. Especially having a benchmark and having people verify your measurements keeps you honest.

Finally, the commit message should state that you have measured the change, it should show the result of the measurement, and it should contain some interpretation, e.g. if you are optimizing for memory usage then a small CPU usage hit is acceptable…
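
As an illustration (the change and the placeholder values are made up), such a commit message could end with something like:

    Measured with the image decoding reduction: mean CPU time went from
    <before> s to <after> s, peak heap usage (libmemusage) dropped from
    <before> KB to <after> KB. We trade a small CPU hit for lower memory
    usage.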
