How to make the GNU Smalltalk Interpreter slower

How to make the GNU Smalltalk Interpreter slower

This is another post about a modern Linux based performance measurement utility. It is called perf, it is included in the Linux kernel sources and it entered the kernel in v2.6.31-rc1. In many ways it is obsoleting OProfile, in fact for many architectures oprofile is just a wrapper around the perf support in the kernel. perf comes with a few nice application. perf top provides a statistics about which symbols in user and in kernel space are called, perf record to record an application or to start an application to record it and then perf report to browse this report with a very simple CLI utility. There are tools to bundle the record and the application in an archive, a diff utility.

For the last year I was playing a lot with GNU Smalltalk and someone posted the results of a very simplistic VM benchmark ran across many different Smalltalk implementations. In one of the benchmarks GNU Smalltalk is scoring last among the interpreters and I wanted to understand why it is slower. In many cases the JavaScriptCore interpreter is a lot like the GNU Smalltalk one, a simple direct-threaded bytecode interpreter, uses computed goto (even is compiled with -fno-gcse as indicated by the online help, not that it changed something for JSC), heavily inlined many functions.

There are also some differences, the GNU Smalltalk implementation is a lot older and in C. The first notable is that it is a Stack Machine and not register based, there are global pointers for the SP and the IP. Some magic to make sure that in the hot loop the IP/SP is ‘local’ in a register, depending on the available registers also keep the current argument in one, the interpreter definition is in a special file format but mostly similar to how Interepreter::privateExecute is looking like. The global state mostly comes from the fact that it needs to support switching processes and there might be some event during the run that requires access to the IP to store it to resume the old process. But in general the implementation is already optimized and there is little low hanging fruits and most experiments result in a slow down.

The two important things are again: Having a stable benchmark, having a tool to help to know where to look for things. In my case the important tools are perf stat, perf record, perf report and perf annotate. I have put a copy of the output to the end of this blog post. The stat utility provides one with number of instructions executed, branches, branch misses (e.g. badly predicted), L1/L2 cache hits and cache misses.

The stable benchmark helps me to judge if a change is good, bad or neutral for performance within the margin of error of the test. E.g. if I attempt to reduce the code size the instructions executed should decrease, if I start putting __builtin_expect.. into my code the number of branch misses should go down as well. The other useful utility is to the perf report that allows one to browse the recorded data, this can help to identify the methods one wants to start to optimize, it allows to annotate these functions inside the simple TUI interface, but does not support searching in it.

Because the codebase is already highly optimized any of my attempts should either decrease the code size (and the pressure on the i-cache), the data size (d-cache), remove stores or loads from memory (e.g. reorder instructions), fix branch predictions. The sad truth is that most of my changes were either slow downs or neutral to the performance and it is really important to undo these changes and not have false pride (unless it was also a code cleanup or such).

So after about 14 hours of toying with it the speed ups I have managed to make come from inlining a method to unwind a context (callframe), reordering some compares on the GC path and disabling the __builtin_expect branch hints as they were mostly wrong (something the kernel people found to be true in 2010 as well). I will just try harder, or try to work on the optimizer or attempt something more radical…

$ perf stat gst -f Bench.st
219037433 bytecodes/sec; 6025895 sends/sec

Performance counter stats for ‘gst -f Bench.st’:

17280.101683 task-clock-msecs # 0.969 CPUs
2076 context-switches # 0.000 M/sec
123 CPU-migrations # 0.000 M/sec
3925 page-faults # 0.000 M/sec
22215005506 cycles # 1285.583 M/sec (scaled from 70.02%)
40593277297 instructions # 1.827 IPC (scaled from 80.00%)
5063469832 branches # 293.023 M/sec (scaled from 79.98%)
70691940 branch-misses # 1.396 % (scaled from 79.98%)
27844326 cache-references # 1.611 M/sec (scaled from 20.02%)
134229 cache-misses # 0.008 M/sec (scaled from 20.03%)

17.838888599 seconds time elapsed

PS: The perf support probably works best on Intel based platforms and the biggest other problem is that perf annotate has some issues when the code is included from other c files.

Comments are closed.