Collection of WebKit ports

WebKit is a very successful project. It is that in many ways: the code produced seems to be very fast, the code is nice to work on, the people are great, and the parties involved collaborate with each other in the interest of the project. The project is also very successful in the mobile/smartphone space. All the major smartphone platforms but Windows Phone 7 are using WebKit. This all looks great, a big success, but there is one thing that stands out.

Of all the smartphone platforms, not one has fully upstreamed its port. There might be many reasons for that, and I think the most commonly heard one is the time needed to get a port upstreamed. It is especially difficult in a field that is moving as fast as the mobile industry. And then again there is absolutely no legal obligation to work upstream.

For most of today I collected the ports I am aware of, put them into one git repository, tried to find the points where they were branched, and rebased their changes. The goal is to make it easier to find interesting things and move them back upstream. One can find the combined git tree with the tags here. I started with WebOS, moved to iOS, then to Bada and stopped at Android, as I would have to pick the source code for each Android release for each phone from each vendor. I think I will just be happy with the Android git tree for now. At this point I would like to share some of my observations in the order I did the import.

Palm

Palm’s release process is manual. In the last two releases they called the file .tgz but forgot to gzip it; in 2.0.0 the tarball name was in camel case. The thing that is very nice about Palm is that they provide their base and their changes (patch) separately. From looking at the 2.1.0 release it looks as if they want to implement complex font rendering for the desktop version. Earlier versions (maybe it is still the case) lack support for animated GIFs.

iOS

Apple’s release process seems to be very structured. The source can be downloaded here. Worth noting is that the release tarball contains some implementations of WebCore only as .o files, and that Apple has stopped releasing the WebKit source code beginning with iOS 4.3.0.

Bada

This port is probably not known by many. The release process seems to be manual as well, the names of directories changed a lot between the releases, they come with a WML Script engine, and they do ship something they should not ship.

I really hope that this combined tree is useful for porters that want to see the tricks used in the various ports and don’t want to spend the time looking for each port separately.

How to make the GNU Smalltalk Interpreter slower

This is another post about a modern Linux based performance measurement utility. It is called perf, it is included in the Linux kernel sources, and it entered the kernel in v2.6.31-rc1. In many ways it is obsoleting OProfile; in fact for many architectures oprofile is just a wrapper around the perf support in the kernel. perf comes with a few nice applications: perf top provides statistics about which symbols in user and in kernel space are called, perf record records a running application or starts an application to record it, and perf report lets one browse this recording with a very simple CLI utility. There are tools to bundle the recording and the application in an archive, and a diff utility.

For the last year I have been playing a lot with GNU Smalltalk, and someone posted the results of a very simplistic VM benchmark run across many different Smalltalk implementations. In one of the benchmarks GNU Smalltalk scores last among the interpreters, and I wanted to understand why it is slower. In many ways the JavaScriptCore interpreter is a lot like the GNU Smalltalk one: a simple direct-threaded bytecode interpreter that uses computed goto (it is even compiled with -fno-gcse as indicated by the online help, not that it changed anything for JSC) and heavily inlines many functions.

There are also some differences: the GNU Smalltalk implementation is a lot older and written in C. The first notable difference is that it is a stack machine and not register based; there are global pointers for the SP and the IP. Some magic makes sure that in the hot loop the IP/SP is ‘local’ in a register and, depending on the available registers, the current argument is kept in one as well. The interpreter definition is in a special file format but mostly similar to how Interpreter::privateExecute looks. The global state mostly comes from the fact that it needs to support switching processes, and there might be some event during the run that requires access to the IP in order to store it and resume the old process. But in general the implementation is already optimized, there is little low-hanging fruit, and most experiments result in a slowdown.

The two important things are again: having a stable benchmark and having a tool that helps you know where to look. In my case the important tools are perf stat, perf record, perf report and perf annotate. I have put a copy of the output at the end of this blog post. The stat utility provides one with the number of instructions executed, branches, branch misses (e.g. badly predicted), L1/L2 cache hits and cache misses.

The stable benchmark helps me to judge if a change is good, bad or neutral for performance within the margin of error of the test. E.g. if I attempt to reduce the code size, the number of instructions executed should decrease; if I start putting __builtin_expect… into my code, the number of branch misses should go down as well. The other useful utility is perf report, which allows one to browse the recorded data. This can help to identify the methods one wants to start to optimize; it allows annotating these functions inside the simple TUI interface, but does not support searching in it.

Because the codebase is already highly optimized, any of my attempts should either decrease the code size (and the pressure on the i-cache), decrease the data size (d-cache), remove stores or loads from memory (e.g. by reordering instructions), or fix branch predictions. The sad truth is that most of my changes were either slowdowns or neutral to the performance, and it is really important to undo these changes and not have false pride (unless the change was also a code cleanup or such).

So after about 14 hours of toying with it, the speed-ups I have managed to make come from inlining a method to unwind a context (callframe), reordering some compares on the GC path, and disabling the __builtin_expect branch hints, as they were mostly wrong (something the kernel people found to be true in 2010 as well). I will just try harder, or try to work on the optimizer, or attempt something more radical…

$ perf stat gst -f Bench.st
219037433 bytecodes/sec; 6025895 sends/sec

Performance counter stats for 'gst -f Bench.st':

17280.101683 task-clock-msecs # 0.969 CPUs
2076 context-switches # 0.000 M/sec
123 CPU-migrations # 0.000 M/sec
3925 page-faults # 0.000 M/sec
22215005506 cycles # 1285.583 M/sec (scaled from 70.02%)
40593277297 instructions # 1.827 IPC (scaled from 80.00%)
5063469832 branches # 293.023 M/sec (scaled from 79.98%)
70691940 branch-misses # 1.396 % (scaled from 79.98%)
27844326 cache-references # 1.611 M/sec (scaled from 20.02%)
134229 cache-misses # 0.008 M/sec (scaled from 20.03%)

17.838888599 seconds time elapsed

PS: The perf support probably works best on Intel based platforms, and the biggest remaining problem is that perf annotate has some issues when the code is included from other C files.

Using systemtap userspace tracing…

At the 27C3 we were running a GSM network, and during the preparation I noticed a strange performance problem coming from the database library we are using. I filled our database with some dummy data, created a file with the queries we normally run, and executed time cat queries | sqlite3 file as a mini benchmark. I also hacked this code into our main routine and ran it with time as well. For some reason the code running through the database library was five times slower.

I was a bit puzzled and decided to use systemtap to explore this, to build a hypothesis and to also have the tools to verify it. I wanted to find out if it is slow because our database library is doing some heavy work in the implementation, or because we execute a lot more queries behind the back. I created the probe below:

probe process("/usr/lib/libsqlite3.so.0.8.6").function("sqlite3_get_table")
{
    a = user_string($zSql);
    printf("sqlite3_get_table called '%s'\n", a);
}

This probe will be executed whenever the sqlite3_get_table function of the mentioned library is called. $zSql is an argument passed to sqlite3_get_table and contains the query to be executed. I am converting the pointer to a local string and can then print it. Using this simple probe helped me to see which queries were executed by the database library and to do an easy optimisation.

In general it could be very useful to build a set of probes (I think such a set is called a tapset) that check for API misuse, e.g. calling functions with certain parameters where something else might be better. E.g. in GLib use truncate instead of assigning “” to the GString, or check for calls to QString::fromUtf16 coming from Qt code itself. On second thought this might be better as a GCC plugin, or both.

In the name of performance

I tend to see people doing weird things and then claiming that the change improves performance. This can be re-ordering instructions to help the compiler, attempting to use multiple cores of your system, or writing a memfill in assembly. On the one hand people can be right and the change makes things faster; on the other hand they could be using assembly to make things look very complicated and justify their pay, and you might feel awkward questioning whether it makes any sense.

In the last couple of weeks I have stumbled over some of those things. For some reason I found this bug report about GLIBC changing the memcpy routine for SSE and breaking the flash plugin (because it uses memcpy in the wrong way). The breakage was justified by the claim that the new memcpy is optimized and faster. As Linus points out with his benchmark, the performance improvement is mostly just wishful thinking.

Another case was someone providing MIPS optimized pixman code to speed-up all drawing which turned out to be wishful thinking as well…

The conclusion is: if someone claims that things are faster with his patch, do not simply trust him. Make sure he refers to his benchmark, provides numbers of before and after, and maybe even try to run it yourself. If he can not provide this, you should wonder how he measured the speed-up! There should be no place for wishful thinking in benchmarking. This is one of the areas where Apple’s WebKit team is constantly impressing me.

Deploying WebKit, common issues

From my exposure to people deploying QtWebKit or WebKit/GTK+ there are some things that re-appear and I would like to discuss these here.
  • Weird compile error in JavaScript?
  • It is failing in JavaScriptCore as it is the first that is built. It is most likely that the person that provided you with the toolchain has placed a config.h into it. There are some resolutions to it. One would be to remove the config.h from the toolchain (many things will break), or use -isystem instead of -I for system includes.
    The best way to find out if you suffer from this problem is to use -E instead of -c to only pre-process the code and see where the various includes are coming from. It is a strategy that is known to work very well.
  • No pages are loaded.
  • Most likely you do not have a DNS Server set, or no networking, or the system your board is connected to is not forwarding the data. Make sure you can ping a website that is supposed to work, e.g. ping www.yahoo.com, the next thing would be to use nc to execute a simple HTTP 1.1 get on the site and see if it is working. In most cases you simply lack networking connectivity.
  • HTTPS does not work
  • It might be either an issue with Qt or an issue with your system time. SSL Certificates at least have two dates (Expiration and Creation) and if your system time is after the Expiration or before the Creation you will have issues. The easiest thing is to add ntpd to your root filesystem to make sure to have the right time.
    The possible issue with Qt is a bit more complex. You can build Qt without OpenSSL support, you can make it link to OpenSSL, or you can make it dlopen OpenSSL at runtime. If SSL does not work it is most likely that you have either built it without SSL support, or with runtime support but have failed to install the OpenSSL library.
    Depending on your skills it might be best to go back to ./configure and make Qt link to OpenSSL to avoid the runtime issue. strings is a very good tool to find out if your libQtNetwork.so contains SSL support; together with objdump -x and searching for _NEEDED you will find out which configuration you have.
  • Local pages are not loaded
  • This is a pretty common issue for WebKit/GTK+. In WebKit/GTK+ we are using GIO for local files and to determine the filetype it is using the freedesktop.org shared-mime-info. Make sure you have that installed.
  • The page only displays blank
  • This is another issue that comes back from time to time. It only appears on WebKit/GTK+ with the DirectFB backend but sadly people never report back if and how they have solved it. You could make a difference and contribute back to the WebKit project.
In general most of these issues can be avoided by using a pre-packaged Embedded Linux Distribution like Ångström (or even Debian). The biggest benefit of that approach is that someone else made sure that when you install WebKit, all dependencies will be installed as well and it will just work for your ARM/MIPS/PPC system. It will save you a lot of time.
Coscup2010/GNOME.Asia with strong web focus

On the following weekend the Coscup 2010/GNOME.Asia is taking place in Taipei. The organizers have decided to have a strong focus on the Web, as can be seen in the program.

On Saturday there is a keynote and various talks about HTML5 and node.js. The Sunday will see three talks touching WebKit/GTK+. There is one about building a tablet OS with WebKit/GTK+, one by Xan Lopez on how to build hybrid applications (a topic I have devoted moiji-mobile.com to) and a talk by me using gdb to explain how WebKit/GTK+ works and how the porting layer interacts with the rest of the code.

I hope the audience will enjoy the presentations, and I am looking forward to attending the conference; there is also a strong presence of the ex-Openmoko Taiwan Engineering team. See you on Saturday/Sunday and drop me an email if you want to talk about WebKit or GSM…

Hybrid Application Example with QtWebKit

In general one of the fascinating aspects of WebKit is the focus on just being a Web Content Engine, as can be seen in the Project Goals. One of the results is that one can easily build a web browser around it, or embed it into a mail client, a chat client, or into your application to handle payment in Amazon, display Wikipedia or similar things.

On the other hand it is possible to embed native widgets into the web content using WebKit/GTK+ and QtWebKit, or use the JavaScript engine API to bind native objects into the JavaScript world. I have started using the moiji-mobile.com name to push WebKit usage for this kind of hybrid application. I am going to create examples showing how to create applications using QtWebKit, how to deploy them on real hardware, how to handle content development on mobile devices and more.

The first part of this series is a simple example of what could be a SIP based desk telephone using QtWebKit and native technology. In this example I am embedding QObjects to provide information normally not available in the Web, I embed a QWebView inside the content to provide a browser, and I am painting on top of the HTML content from the QWebView to provide touch feedback. The code can be found in my git repository.

Using setitimer of your favorite posix kernel

As this was requested in a comment on a previous post, and since knowing your kernel helps to write better performing systems, here is a short introduction to the interval timers provided by your POSIX kernel.

What is the interval timer?

The interval timer is managed and provided by your kernel. Every time the interval of the timer expires, the kernel will send a signal to your application. The kernel provides three different interval timers for every application. The different timers are for measuring the real time passed on the system, the time your application is actually executed, and finally the profiling timer, which times both the time when your application is executed and when the system is executing on behalf of your application. More information can be found in the manpage with the name setitimer.

Why is it useful?

In the QtWebKit Performance Measurement Utilities we are using the interval timer as the timing implementation for our Benchmark Macros. To be more precise, we are using ITIMER_PROF to measure the time we spend executing in the system and in the application, and we are using the smallest possible precision of this timer, one microsecond. The big benefit of using this instead of elapsed real time, e.g. with QTime::elapsed, is that we are not depending so much on system scheduling. This can be really nice, as even on a lightly crowded system we can generate stable times; the only thing influencing the timing is the MHz of the CPU.

How is it implemented?

It is a kernel timer, meaning it is implemented in your kernel. In case of Linux you should be able to find a file called kernel/itimer.c; it defines the syscall setitimer at the bottom of the file. In our case the SIGPROF seems to be generated in kernel/posix-cpu-timers.c in the check_cpu_itimer routine. Of course the timer needs to be accounted by things like kernel/sched.c when scheduling tasks to run…

How to make use of it?

We want to use ITIMER_PROF; according to the manpage this will generate SIGPROF. This means we need a signal handler for that, and then we need a way to start the timer. So let us start with the SIGPROF handling.

Elapsed time handling

static volatile unsigned int sig_prof = 0;
static void sig_profiling(int signo)
{
    (void) signo;
    ++sig_prof;
}

The signal handler

    struct sigaction sa;
    sa.sa_handler = sig_profiling;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, 0) != 0) {
        fprintf(stderr, "Failed to register signal handler.\n");
        exit(-1);
    }

Start the timer

static void startTimer()
{
    sig_prof = 0;
    struct itimerval tim;
    tim.it_interval.tv_sec = 0;
    tim.it_interval.tv_usec = 1;
    tim.it_value.tv_sec = 0;
    tim.it_value.tv_usec = 1;
    setitimer(ITIMER_PROF, &tim, 0);
}

Discussion of the implementation

What is missing? We are using the plain sigaction API; to get hold of the siginfo_t passed along with the signal we would have to set SA_SIGINFO and use the sa_sigaction callback instead.

What if we need a higher precision or need to handle overflows?
There is the POSIX.1b timer API which provides timers with nanosecond resolution and also provides information about overruns (e.g. when the signal could not be delivered in time). More information can be found by looking at the timer_create function.

When is the interval timer not useful?

Imagine you want to measure the time it takes to complete a download and someone wrote code like this:

QTimer::singleShot(300000, this, SLOT(finishDownload()));

In this case a lot of real time will pass until the download finishes and the app might be considered very slow, but in terms of the itimer only little will be executed, as the time we just sleep is not accounted to us. This means the itimer can be the wrong thing to use when you want to measure real time, e.g. latency or time to complete network operations.

Dealing with Performance Improvements

I hope this post is educational and helps the ones among us doing performance optimisations without any kind of measurement. If you do these things without a benchmark you are either a genius, or very likely your application is going to run slower. I’m not going to talk about performance analysis right now, but tools like OProfile, callgrind, sysprof and speedprof are very handy utilities. The reason I’m writing this up is that I saw a performance regression in one of my testcase reductions, which is something I don’t appreciate; in general I see a lot of claims about performance tuning but little in regard to measurements, and this part is very worrying.

For QtWebKit we have the performance repository with utilities, high level tests and something I labeled reductions. In detail we do have the following things:

1. Macros for benchmarking. I started with the QBENCHMARK macros but they didn’t really provide what I needed, and changing them turned out to be a task I didn’t have time for. I created WEB_BENCHMARK macros that work the same as the QBENCHMARK macros. One of the benefits is better statistics: they print the mean, the standard deviation and similar figures at the end of the run. And they use a different metric for measuring time. I’m using the setitimer(2) syscall to measure the CPU time we are executing in userspace and kernelspace on behalf of the application. This metric is a robust way to avoid issues like CPU scheduling and such. It would be the wrong metric to measure latency and such though, as we are not executing anything when waiting.
2. Pick the area you want to optimize. With the QtWebKit performance repository we do have a set of reductions. These reductions consist of real code, a test pattern and test data. The real code is coming from WebCore and is driving Qt, the test pattern comes from loading real webpages. It is created by adding printf and such to the code, and the test data is the data that was used when creating the test pattern. We do have these reductions for all image decoding operations we are doing on the webpages, for our font usage, and for QTextLayout usage.
   The really awesome bit about these reductions is that they generate stable timings and are/should be fully deterministic. This allows really measuring any change I’m doing to, let’s say, QImageReader and the decoders.

Using the setitimer(2) syscall we will have pretty accurate CPU usage of the benchmark; using the /lib/libmemusage.so of GLIBC we should have an accurate graph of the memory usage of the application. It is simple to create a benchmark, it is simple to run the benchmark, it is simple to run the benchmark with memory profiling. By looking both at CPU and memory usage it will become pretty clear if and where you have tradeoffs between memory and CPU.

And I think that is the key of a benchmark. It must be simple so people can understand what is going on, and it must be simple to execute so everyone can do their own measurements and verify your claims. And specially having a benchmark and having people verify your measurements is keeping you honest.

Finally the commit message should state that you have measured the change, it should show the result of the measurement and it should contain some interpretation. E.g. you are optimizing for memory usage and then a small CPU usage hit is acceptable…

Conclusions of my QtWebKit performance work

My work on QtWebKit performance came to a surprising end late last month. It might be interesting for others how QtWebKit compares to the various other WebKit ports, where we have some strong points, where we have some homework left to do, and where to pick up from where I had to leave it.

Memory consumption

Before I started, our ImageDecoderQt was decoding every image as soon as the data was complete. The biggest problem with that is that the ImageSource we are embedded into does not tell the WebCore::Cache about the size of the images we already have decoded.

There is no need to decode the whole image as soon as the data comes in; instead we can wait for the ImageSource to request the image size and the image data. This makes a noticeable difference on memory benchmarks and allows us to have the WebCore::Cache control the lifetime of decoded image data.

We still have one case where we have more image data allocated than the WebCore::Cache thinks. This is the case for GIF images, as we are decoding every frame to figure out how many images we have there.

To fix that we should patch the ImageSource to ask the ImageDecoder for “extra” allocated data, and we should fix/verify the GIF Image Reader so we can jump to a given GIF frame and decode it. This means we should remember where certain frames begin…

Performance

Networking

Markus Götz and Peter Hartmann are busy working on the QNetworkAccessManager stack. Their work includes improving the parsing speed of HTTP headers and making sure to start HTTP connections after the first iteration of the mainloop instead of the third.

In one of my tests wget is still twice as fast as the Qt stack to download the same set of files. And wget is using one connection at a time, no pipelining… and Qt is attempting to have up to 6 connections in parallel. This means there is still some work to do in reducing latency and improving scheduling of requests. I’m pretty confident that Markus and Peter will work on this!

Images

The biggest limitation of the Qt image decoders is that in general progressive loading is not possible, and unless I have messed up my reduction the Qt image decoders are faster than the ones we have in WebCore.

With some of my reductions I can make some stuff twice as fast for the pattern QtWebKit is having on QImageReader. Currently when asking the QImageReader for the size, the GIF decoder will decode the full frame (size + image data), and for the JPEG decoder we start the JPEG decompression separately for getting the size, the image and the image format.

A proof of concept patch for the JPEGReader to reuse the decompression handler showed that I can cut the runtime of the image_cycling reduction by 50%.

Misc

One misc. performance goal is to remove temporary allocations. E.g. remove QString::detach() calls from the paint path, and do not copy data when moving from QString to WebCore::String or from QByteArray to WebCore::String. Some of this includes not using WebCore::String::utf8(), but having a zero cost conversion of WebCore::String to QString and using Qt’s utf8()…

Text

But the biggest problem of QtWebKit performance is text, and I started to work on this. For Qt we always have to go through the complex text path of WebCore, which means we will end up in QTextLayout, which will ask harfbuzz to shape the text.

There are two things to consider here. For QtWebKit we are using Lars’s QTextBoundaryFinder instead of ICU. I’m not sure if we have ever compared how ICU and QTextBoundaryFinder split text. We might do more work than is necessary; at least it would be good to know. Especially for Japanese and Korean we might split words too early, creating more work for our complex text layout path.

The second part is to look at our QTextLayout usage pattern and start to optimize for it… the quick solutions of asking QFont to not do kerning, and not to do font merging (to not use the QFontEngineMulti), didn’t really make a noticeable difference… To get an idea of the size of the problem: on loading pages like the Wikipedia article on the Maxwell Equations we are spending as much time in WebCore::Font::floatWidthForComplexText as other ports like WebKit/GTK+ take to load the entire page. This also seems to be the case for sites like Google News.

And this is exactly where I would have loved to continue to work on it, but that is now pushed back to my spare time where it needs to compete with the other hobby projects.