Using setitimer of your favorite POSIX kernel

As this was requested in a comment on a previous post, and because knowing your kernel helps you write better performing systems, here is a small writeup on how to use the interval timers provided by your POSIX kernel.

What is the interval timer?

The interval timer is managed and provided by your kernel. Every time the interval of the timer expires, the kernel will send a signal to your application. The kernel provides three different interval timers for every application: ITIMER_REAL measures the real time passing on the system, ITIMER_VIRTUAL measures the time your application is actually executing, and ITIMER_PROF, the profiling timer, measures the time your application is executing plus the time the system is executing on behalf of your application. More information can be found in the setitimer manpage.

Why is it useful?

In the QtWebKit Performance Measurement Utilities we are using the interval timer as the timing implementation for our Benchmark Macros. To be more precise, we are using ITIMER_PROF to measure the time we spend executing in the system and in the application, with the smallest possible precision of this timer: one microsecond. The big benefit of using this instead of elapsed real time, e.g. QTime::elapsed, is that we do not depend so much on system scheduling. This is really nice because even on a lightly crowded system we can generate stable timings; the only thing influencing the timing is the clock speed of the CPU.

How is it implemented?

It is a kernel timer, meaning it is implemented in your kernel. In the case of Linux you should be able to find a file called kernel/itimer.c, which defines the setitimer syscall at the bottom of the file. In our case the SIGPROF appears to be generated in kernel/posix-cpu-timers.c in the check_cpu_itimer routine. Of course the timer needs to be accounted for by code like kernel/sched.c when scheduling tasks to run…

How to make use of it?

We want to use ITIMER_PROF, which according to the manpage will generate SIGPROF. This means we need a signal handler for that signal, and then we need a way to start the timer. So let us start with the SIGPROF handling.

Elapsed time handling
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static volatile sig_atomic_t sig_prof = 0;  /* ticks since the timer started */
static void sig_profiling(int signum)       /* sa_handler must take an int */
{
    ++sig_prof;
}

The signal handler
    struct sigaction sa;
    sa.sa_handler = sig_profiling;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, 0) != 0) {
        fprintf(stderr, "Failed to register signal handler.\n");
        exit(-1);
    }

Start the timer
static void startTimer()
{
    sig_prof = 0;
    struct itimerval tim;
    tim.it_interval.tv_sec = 0;
    tim.it_interval.tv_usec = 1;   /* one microsecond, the smallest interval */
    tim.it_value.tv_sec = 0;
    tim.it_value.tv_usec = 1;
    setitimer(ITIMER_PROF, &tim, 0);
}
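
Stop the timer
To actually read a measurement we also need to stop the timer. The following counterpart to startTimer() is my own hypothetical addition, not from the original macros: setting it_value to zero disarms the timer, and the tick count then approximates the consumed CPU time in microseconds.

static unsigned int stopTimer()
{
    struct itimerval tim;
    tim.it_interval.tv_sec = 0;
    tim.it_interval.tv_usec = 0;
    tim.it_value.tv_sec = 0;
    tim.it_value.tv_usec = 0;      /* a zero it_value disarms the timer */
    setitimer(ITIMER_PROF, &tim, 0);
    return (unsigned int)sig_prof; /* each tick is roughly one microsecond */
}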

Discussion of the implementation

What is missing? We are using the sigaction API… we should make use of the siginfo_t passed into the signal handler.
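
A minimal sketch of what that could look like: setting SA_SIGINFO makes the kernel deliver a siginfo_t, and the handler is installed through sa_sigaction instead of sa_handler. The installHandler name is made up for this example.

static void sig_profiling_info(int signum, siginfo_t *info, void *context)
{
    (void)signum;
    (void)context;
    ++sig_prof;
    /* info->si_code describes how the signal was raised */
}

static int installHandler()
{
    struct sigaction sa;
    sa.sa_sigaction = sig_profiling_info;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_SIGINFO;  /* request the siginfo_t */
    return sigaction(SIGPROF, &sa, 0);
}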

What if we need a higher precision or need to handle overflows?
There is the POSIX.1b timer API which provides timers with nanosecond resolution and also provides information about overruns (e.g. when a signal could not be delivered in time). More information can be found by looking at the timer_create function.
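
A minimal sketch of that API, assuming Linux (older systems need to link with -lrt): create a timer on the process CPU-time clock, arm it with a nanosecond period, and call timer_getoverrun() in the signal handler to learn how many expirations were merged into one delivered signal. The startPosixTimer name and the chosen period are made up for this example.

#include <string.h>
#include <time.h>

static timer_t prof_timer;

static int startPosixTimer()
{
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;   /* deliver a signal on expiry */
    sev.sigev_signo = SIGPROF;
    if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &prof_timer) != 0)
        return -1;

    struct itimerspec its;
    its.it_interval.tv_sec = 0;
    its.it_interval.tv_nsec = 500;     /* 500ns period, finer than 1us */
    its.it_value = its.it_interval;    /* first expiry after one period */
    return timer_settime(prof_timer, 0, &its, 0);
    /* in the handler: timer_getoverrun(prof_timer) counts missed expirations */
}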

When is the interval timer not useful?

Imagine you want to measure the time it takes to complete a download, and someone wrote code like this:

QTimer::singleShot(300000, this, SLOT(finishDownload()));

In this case a lot of real time will pass before the download finishes and the app might be considered very slow, but in terms of the itimer only very little execution takes place, as the time we spend sleeping is not accounted to us. This means the itimer can be the wrong tool when you want to measure real time, e.g. latency or the time to complete network operations.

Dealing with Performance Improvements

I hope this post is educational and helps those among us doing performance optimisations without any kind of measurement. If you do these things without a benchmark, you are either a genius or, far more likely, your application is going to run slower. I’m not going to talk about performance analysis right now, but tools like OProfile, callgrind, sysprof and speedprof are very handy utilities. The reason I’m writing this up is that I saw a performance regression in one of my testcase reductions, which is something I don’t appreciate; in general I see a lot of claims about performance tuning but very little in regard to measurement, and that part is worrying.

For QtWebKit we have the performance repository with utilities, high level tests and something I labeled reductions. In detail we do have the following things:

  1. Macros for benchmarking. I started with the QBENCHMARK macros but they didn’t really provide what I needed, and changing them turned out to be a task I didn’t have time for. So I created WEB_BENCHMARK macros that work the same way as the QBENCHMARK macros (a hypothetical usage sketch follows this list). One of the benefits is better statistics: they print the mean, the standard deviation and similar figures at the end of the run. They also use a different metric for measuring time: the setitimer(2) syscall, measuring the CPU time we are executing in userspace and kernelspace on behalf of the application. This metric is a robust way to avoid issues like CPU scheduling. It would be the wrong metric to measure latency and the like, though, as we are not executing anything while waiting.
  2. Pick the area you want to optimize. With the QtWebKit performance repository we do have a set of reductions. These reductions consist of real code, a test pattern and test data. The real code comes from WebCore and drives Qt; the test pattern comes from loading real webpages and is created by adding printf and the like to the code; the test data is the data that was used when creating the test pattern. We have these reductions for all the image decoding operations we do on the webpages, for our font usage and for our QTextLayout usage.
    The really awesome bit about these reductions is that they generate stable timings and are (or should be) fully deterministic. This makes it possible to really measure any change I make to, let’s say, QImageReader and the decoders.
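
A hypothetical usage sketch of the macros from point 1; the WEB_BENCHMARK name is from the repository, but the exact form shown here is my assumption, modeled on how the QBENCHMARK macros are used:

void tst_Loading::load()
{
    /* the body is executed and measured repeatedly; the macro prints
       mean, standard deviation etc. at the end of the run */
    WEB_BENCHMARK {
        loadPage();    /* hypothetical workload under test */
    }
}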

Using the setitimer(2) syscall we get a pretty accurate picture of the CPU usage of the benchmark; using GLIBC’s /lib/libmemusage.so (preloaded via LD_PRELOAD) we get an accurate graph of the memory usage of the application. It is simple to create a benchmark, simple to run the benchmark, and simple to run the benchmark with memory profiling. By looking at both CPU and memory usage it becomes pretty clear if and where you have tradeoffs between memory and CPU.

And I think that is the key property of a benchmark. It must be simple so people can understand what is going on, and it must be simple to execute so everyone can do their own measurements and verify your claims. Having a benchmark and having people verify your measurements keeps you honest.

Finally, the commit message should state that you have measured the change, it should show the result of the measurement, and it should contain some interpretation, e.g. you are optimizing for memory usage and a small CPU usage hit is acceptable…

Conclusions of my QtWebKit performance work

My work on QtWebKit performance came to a surprising end late last month. It might be interesting for others how QtWebKit compares to the various other WebKit ports, where we have some strong points, where we have some homework left to do, and where to pick up from where I had to leave off.

Memory consumption

Before I started our ImageDecoderQt was decoding every image as soon as the data was complete. The biggest problem with that is that the ImageSource we are embedded into does not tell the WebCore::Cache about the size of the images we already have decoded.

In this case there was no need to decode the whole image as soon as the data comes in; instead we wait for the ImageSource to request the image size and the image data. This makes a noticeable difference in memory benchmarks and allows the WebCore::Cache to control the lifetime of the decoded image data.

We still have one case where we have more image data allocated than the WebCore::Cache knows about. This is the case for GIF images, as we decode every frame just to figure out how many frames we have.

To fix that we should patch the ImageSource to ask the ImageDecoder for “extra” allocated data, and we should fix/verify the GIF Image Reader so we can jump to a given GIF frame and decode it. This means we should remember where certain frames begin…

Performance

Networking

Markus Götz and Peter Hartmann are busy working on the QNetworkAccessManager stack. Their work includes improving the parsing speed of HTTP headers and making sure HTTP connections start after the first iteration of the mainloop instead of the third.

In one of my tests wget is still twice as fast as the Qt stack at downloading the same set of files, and wget is using one connection at a time with no pipelining… while Qt attempts up to 6 connections in parallel. This means there is still some work to do in reducing latency and improving the scheduling of requests. I’m pretty confident that Markus and Peter will work on this!

Images

The biggest limitation of the Qt image decoders is that, in general, progressive loading is not possible. On the other hand, unless I have messed up my reduction, the Qt image decoders are faster than the ones we have in WebCore.

With some of my reductions I can make some operations twice as fast for the access pattern QtWebKit produces on QImageReader. Currently, when asking the QImageReader for the size, the GIF decoder will decode the full frame (size + image data). For the JPEG decoder we start the JPEG decompression separately for getting the size, the image and the image format.

A proof of concept patch for the JPEGReader to reuse the decompression handler showed that I can cut the runtime of the image_cycling reduction by 50%.

Misc

One miscellaneous performance goal is to remove temporary allocations, e.g. removing QString::detach() calls from the paint path and not copying data when converting QString to WebCore::String or QByteArray to WebCore::String. Some of this includes not using WebCore::String::utf8(), but instead having a zero cost conversion from WebCore::String to QString and using Qt’s utf8()…

Text

But the biggest problem of QtWebKit performance is text, and I started to work on this. For Qt we always have to go through the complex text path of WebCore, which means we will end up at QTextLayout, which will ask harfbuzz to shape the text.

There are two things to consider here. For QtWebKit we are using Lars’s QTextBoundaryFinder instead of ICU. I’m not sure we have ever compared how ICU and QTextBoundaryFinder split text. We might do more work than is necessary; at least it would be good to know. Especially for Japanese and Korean we might split words too early, creating more work for our complex text layout path.

The second part is to look at our QTextLayout usage pattern and start to optimize for it… The quick solutions of asking QFont not to do kerning and not to do font merging (to avoid QFontEngineMulti) didn’t make a noticeable difference… To get an idea of the size of the problem: on loading pages like the Wikipedia article on the Maxwell equations, we spend as much time in WebCore::Font::floatWidthForComplexText as other ports like WebKit/GTK+ take to load the entire page. This also seems to be the case for sites like Google News.

And this is exactly where I would have loved to continue working, but that is now pushed back to my spare time, where it needs to compete with the other hobby projects.

Reverse engineering with okteta

In the last week I was hacking on OpenBSC to make GSM 12.21 Software Load usable for the ip.access nanoBTS. The difficulty was not within GSM 12.21, as Harald had already implemented it for the Siemens BS11 BTS. The difficulty was that some messages need to contain parameters that come directly from the firmware file, which ultimately means that one needs to understand the firmware file format to extract them. Okteta came to my rescue and was extremely good at this.

Okteta has not only the hex view one expects but also some useful utilities. Select a couple of bytes and the “Decoding Table” will tell you their values in different endiannesses. So whenever I suspected something was a file length, I would select bytes, look into the “Decoding Table”, see how many bytes I had selected and whether the value made sense; Okteta can also calculate various checksums over a selection.

Thanks a lot for Okteta, it saved my day!

Looking back to 2009

My second year as a part time freelancer has passed.

Looking back, the most significant things are:

  • Signing the contribution agreement for gdb and glibc with the Free Software Foundation and trying to contribute to both projects. So picking future work will always have to be compatible with this.
  • Hacked on OpenBSC. At first just simple stuff like a telnet interface, paging and later doing paid work for On Waves to add SCCP over IP, GSM 08.08 and other things for “toy” integration of OpenBSC into a real network.
  • In the middle of this year I asked Nokia if they had work for me in Asia; later I started focusing on QtWebKit performance. This allowed me to improve QtWebKit and Qt (which will benefit a lot more users), but also to look into various tools like OProfile, memprof, memusagestat and to get to know netfilter queues… more on this later.
  • I have done my usual things on OpenEmbedded, working on landing patches through the patchwork queue, finally redoing the Bitbake parser and working on the Qt recipes.
  • I didn’t manage to make a Linux kernel contribution. I wanted to write an I2C driver for an FM radio chip but I fried my hardware with a broken power supply, and my MIPS patches are not yet done. So if you know of any kernel work where the results can be released/upstreamed, please let me know!

(Qt)WebKit Sprint in Wiesbaden

The sprint has been over for some time. You can see summaries of the different sessions and some slides in the wiki. Besides talking about QtWebKit and how to improve it (API, features, speed, making people aware that they can contribute, influencing the release schedule, policies… *hint*), the thing that impressed me the most is unrelated to coding.

We all hear when someone from our community is leaving the Qt department, and we always wonder how life will continue and who will fill the gap. In the last year a couple of new people got hired in Oslo, and I’m really impressed that they keep finding such capable people who are technically skilled and willing to move to Oslo. Kudos!

Talking about performance measurements at foss.in

It is the second time I’m at foss.in, and this time I talked about my current work on QtWebKit. Nokia is kind enough to give me enough time to explore the performance of QtWebKit (mostly on Qt Embedded Linux and ARM) and to do fixes across the stack in WebKit, Qt or wherever we think they are necessary.

Performance for me comes down to memory footprint and runtime speed (how long does it take?), and for this I have experimented with OProfile, Memprof/Memusage and QBENCHMARK, but I also wrote some WebKit specific tools: e.g. a tool that allows me to mirror webpages to turn them into a benchmark (which still has quite some problems), a simple HTTP server to serve the content, and some testcase reductions to look into specific areas like networking, image decoding, painting and fonts.

The slides and links can be found here; they link back to the WebKit wiki where you can find an introduction to the (Qt)WebKit specific tools, a set of bugs and pending patches, and a set of issues that are known but not yet handled.

The main message of the talk is to not optimise by myth, but to use a stable environment and one of the existing tools to see what is going on. It is really easy.

Attending foss.in

Thanks to generous sponsoring I managed to make it to Bangalore for FOSS.IN, and Girish kindly agreed to provide accommodation. It is really great to be in India again, to see the streets and the local market, and to catch up with friends from India and Europe.

Girish is currently struggling with plugin code in QtWebKit for Mac, Windows and X11, in both windowed and windowless mode, in the good old world and in QGraphicsView… I’m analyzing the loading behavior of a particular website, and now I need to find out why we take a whole second to do layout and sometimes just actively do nothing… It is really awesome to have some clever company here in India.

Collecting hints to increase performance in Qt (and apps)

I’m working part time on improving the performance of QtWebKit (memory usage and raw speed) and I have created some tools to create offline copies of a number of webpages (Gmail, Yahoo Mail, Google, news sites…).

Using these sites I have created special purpose benchmark reductions, e.g. doing only the image operations we do while loading, or while loading and painting, or loading all network resources. One thing I have noticed is that with a couple of small changes one can achieve a stable and noticeable speedup. These include not calling QImage::scanLine from within a loop, avoiding QByteArray::toLower, and not using QByteArray::append(char) in a loop without a QByteArray::reserve first.
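
As a hedged illustration of the last hint (copyBytes and its loop are made up for this example; only QByteArray::reserve and QByteArray::append are real API):

#include <QByteArray>

/* Without reserve() every append(char) may trigger a reallocation;
   with it the buffer grows exactly once. */
QByteArray copyBytes(const QByteArray &input)
{
    QByteArray out;
    out.reserve(input.size());       /* allocate the final size up front */
    for (int i = 0; i < input.size(); ++i)
        out.append(input.at(i));
    return out;
}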

I have created a small guide to Qt Performance; I will keep it updated and would like to hear more small hints that can be used to improve things. If it makes sense I can migrate it to the techbase as well.

Painting on ARM

I’m currently working on making QtWebKit faster on ARM (hopefully MIPS hardware later), and in my current sprint I’m focused on painting speed. Thanks to Samuel Rødal my work is easier than before: he added a new paint engine and graphics system that allow tracing the painting done with QPainter and replaying it later. Some of you might feel reminded of Carl Worth’s post that did mostly the same for cairo.

How to make painting faster? The Setup

  1. Record a paint trace of your favorite app with tst_cycler -graphicssystem trace, do the rendering and on exit the trace will be generated
  2. Use qttracereplay to replay the trace on your hardware (I had some issues on my target hardware though)
  3. Use OProfile to look where the time is spent and do something about it…
  4. Change code and go back to qttracereplay…

What did I do so far?
Most samples are recorded in the comp_func_SourceOver routine. After some searching in the MMX optimized routines and talking to the rasterman, I’m doing the following things to improve the const_alpha=255 path (a rough sketch follows the list below). In qttracereplay I go from about 17.4 fps to around 26 fps on my beagleboard with Qt Embedded Linux on the plain OMAP3 framebuffer, but I still need to do a more careful visual inspection of the result.

  • Handle alpha=0x00 on the source special by not doing anything
  • Handle alpha=0xff on the source special by simply copying it to the dest
  • Unroll the above block eight times interleaved with preloads…
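
A rough sketch of those fast paths (assumptions: plain C++, premultiplied ARGB32, no unrolling or preloads; the helper mirrors what Qt’s BYTE_MUL does, but this is not the actual Qt code):

/* multiply all four bytes of x by a/255, two channels at a time */
static inline unsigned int byteMul(unsigned int x, unsigned int a)
{
    unsigned int t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;
    unsigned int u = ((x >> 8) & 0xff00ff) * a;
    u = (u + ((u >> 8) & 0xff00ff) + 0x800080);
    u &= 0xff00ff00;
    return u | t;
}

static void sourceOver(unsigned int *dest, const unsigned int *src, int length)
{
    for (int i = 0; i < length; ++i) {
        unsigned int s = src[i];
        unsigned int alpha = s >> 24;
        if (alpha == 0x00)
            continue;                /* transparent: leave dest alone */
        else if (alpha == 0xff)
            dest[i] = s;             /* opaque: plain copy */
        else
            dest[i] = s + byteMul(dest[i], 255 - alpha);
    }
}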

I will have to clean all this up and merge it with the Symbian optimized copies (which sometimes require ARMv6 or later)… I will probably look at BYTE_MUL next and see if I can make it faster without using an ARMv6 or later instruction… or, honestly, first understand how the current BYTE_MUL works…