Boehm GC, virtual inheritance and finalizers.

I’m trying to get a speedup for the cpp backend for haxe by using garbage collection. Initial results are very promising – potentially about twice as fast. Howerver I spent a good few hours getting the the bottom of a little problem. Boehm garbage collection is a very impressive piece of work – it has all sorts of magic that does magical things, such as deal with virtual inheritance. This was a bit of a surprise because you do not always get “real” pointers, when you store an object pointer, you get one with an offset. However it seemed to work. Until I added finalizers to the external draw objectes used by the renderer. Apparently, you can only add finalizers to the “real” pointers (ie, those returned from “GC\_MALLOC” et al), rather than a pointer to the same object related by virtual inheritance. The symptom was that the object gets finalized in the first “gc_collect”, even though it was still “used” as far as I was concerned. I guess this is not too surpising, and the fix was pretty easy, but the fact that everything else worked so well lulled be into not suspecting this initially.

C++ backend for haXe

I have just completed an alpha release of a c++ backend for haxe. This means that you can complile haxe code into a 100% compiled executable. You can download the demo file in hxcpp-01.zip. Sorry, windows only at this stage.

The distribution contains a new cpp backend for haxe. It has been based on a 2.0 version of haxe, which may be a tiny bit out of date. Most of the changes are in the new “gencpp.ml”, and to the standard library files, with a few little extra bits here and there. You can re-compile the
haxe compiler if you have ocaml by using the supplied install.ml script.

To try this version for yourself, first backup your haxe distro and copy then supplied “compile/bin/haxe.exe” and “compiler/std/*” files over the top. Use the “-cpp cpp_directory” command line to generate a directory that contains src, include and nmake files. You can then compile these using the microsoft visual studio “nmake” utility. The build system requires the library, include, make and dlls from the “hxcpp” directory. To access these, you should set the environment variable “HXCPP” to point to hxcpp directory extracted from this distribution. This can be done from right-click-“My Computer”/Properties/Advanced/Environment Variables, or from the commandline before compiling.
These resulting “exe” file also needs the hxcpp.dll file from the hxcpp/dll directory. The should be in your “path”, or simply copy it next to your exe.

You can recompile the hxcpp.dll using the nmake file in the directory. You can change the compile flags from the $HXCPP/nmake.setup file (eg, turn on debug).

Demos

Two demos have been included – “perf”, a small benchmark program I found on the net
and a “Physaxe” demo. The source is included (slightly modified), and so are the binaries.
The cpp src and include directories have been included to give you taste of the
output if you can’t be bothered setting up the compiler yourself.
The binaries can be found in demos/bin, and are compiled for neko, swf and cpp.
The neko version can be run with “neko phx.n” or “neko TestRunner.n”. You do not
need a very recent version of neko, but you do need the included “nme.ndll” findable
by neko (next to it will work).

The cpp version of Physaxe uses the cpp verion of NME. This was compiled from
the same code base as the neko version, except it uses the “neko.h” file found
in the hxcpp directoty instead of the one that comes with neko. The nme.dll should
be next to the compiled exe.

If you want to compile the nme versions yourself, you will need the latest nme and neash
versions from code.google.com:
http://code.google.com/p/nekonme/source/checkout
http://code.google.com/p/neash/source/checkout

Performance
————
The flash version of physaxe runs the fastest, with the cpp version about 70% of the
speed (when using the opengl version), and neko about 20% of the speed.

One of the problems is that the cpp version uses the neko api, which required fields
to be looked up by name, which is quite slow in this implementation. A faster version
could link directly to the hxcpp objects – but then it could not use the same API.
This problem is made far worse by the fact the physaxe re-renders each point in each
object every frame, rather then simply adjusting the matrix of existing objects.

I think the most significant loss of perfromance is coming from the reference counting
housekeeping. I will look into a garbage collected system soon.

The results from the “TestRunner” are mixed with flash being faster for stings, but
cpp faster for maths and looping. Neko is fastest for the sting sort in this case,
but this is unusual because the stings are already sorted. When they are not, neko
is very slow. The cpp string code is very simple, so there is scope for improvement there.

TODO
—-
There is still plenty to do.

  • A lot of the operators (eg, “*=”) have not been looked at.
  • The actual formatting of the generated code needs a complete overhaul.
  • The ml code needs some simplifying/cleaning.
  • The standard libraries (eg, xml,regex)
  • Need some way of locating the various dlls etc.
  • Splitup/refactor the HObject.h et al files.
  • Returning values from blocks/swithes.
  • Complete neko.h
  • Look at GC.

Plenty more, I’m sure.

Neash/NME 0.8 released

While this blog may have been quiet, I have been busy. The next version of Neash and NME has been released on haxelib, making it very easy to upgrade. Some cool features include:

  • Fonts
  • Filters (some)
  • Bitmap Caching
  • Improved opengl speed
  • SWF reading/playing
  • Scroll rects (axis aligned)
  • Haxe 2.0 upgrade
  • But perhaps an introduction is in order. The purpose of Neash is to allow you to create programs that run on both flash and also natively, say as a downloadable program. You start with a simple (or complex) haxe program, targetting flash.


    import flash.display.Sprite;
    import flash.display.Shape;

    class Simple extends Sprite
    {

    public function new()
    {
    super();
    flash.Lib.current.addChild(this);

    // creating a new shape instance
    var circle:Shape = new Shape( );
    // starting color filling
    circle.graphics.beginFill( 0xff9933 , 1 );
    // drawing circle
    circle.graphics.drawCircle( 0 , 0 , 40 );
    // repositioning shape
    circle.x = 80;
    circle.y = 80;

    // adding displayobject to the display list
    addChild( circle );
    }

    static public function main()
    {
    new Simple();
    }
    }

    And you get a SWF that renders something like this. This works well in a browser, but what if you wanted to distrubute this as a stand-alone exe? You would of course use one of the may flash-to-exe tools around. These all, one way or another, involve packaging up the flash runtime, and this has licensing implications. Also, it can be difficult to add DRM or other native extensions to the code. So the alternative offered here is to compile it to neko!

    There are 3 simple steps for compiling to neko. 1. Get the libraries, 2. create the compiler command-line and 3. some very minor source code mods.

    1. Getting the libraries. Simply use the “haxelib” tool that comes with haxe to download and install the “NME” and “Neash” libraries. From the command (shell) prompt, type: haxelib install nme followed by haxelib install neash
    2. Create the compiler command. Rather that typing haxe -main …… every time, you can create a “.hxml” file that contains the commands, then you can simply use haxe file.hxml. The hxml file contains flags or key-value pairs of command-line arguements. You can also use the “–next” to compile to more than one target from the single invovation of haxe.

      To use neash, you will need the nme and neash libraries. To add these, you can use the command-line options “-lib neash” and “-lib nme”. For the neko target, you will also need to redirect the “flash” code to use “neash” instead. This is easily done with the “–remap flash:neash” command.

      So a hxml file that targets both flash and neko looks something like this:

      -main Simple
      -swf Simple.swf
      -swf-version 9
      -swf-header 640:480:100:334433
      -lib neash
      -cmd echo SWF done
      
      --next
      -main Simple
      -neko Simple.n
      --remap flash:neash
      -lib nme
      -lib neash
      -cmd echo Neko done
      

    3. Source code mod. You need to do some very minor modifications to run the neko version using neash. Specifically, you need to call “Init” and “Run” and the first and last things you do in your main routine. eg:

         static public function main()
         {
            neash.Lib.Init("Simple",640,480);
            neash.Lib.SetBackgroundColour(0x334433);
      
            new Simple();
      
            neash.Lib.Run();
         }
      


      Currently there is no way to get the command-line flash header data into the neko programme. The neash calls are perfectly safe under flash, so it is safe to include these in both flash and neko projects. However, you will then need “-lib neash” when compiling your flash version. The alternative is to have some “#if neko” directives in the static main routine, and carry on normally from there.

    So running haxe on this hxml file will produce both “Simple.swf” and a “Simple.n” files. You can run neko Simple.n to run the neko program, producing much the same result. You can use neko Simple.n -opengl to run with opengl acceleration – although that will no be much use in the simple case.

    All these project files, along with many others, can be found in the “samples” area of the neash library that you get when you use haxelib to install neash.

    Has It Really Been A Year?

    Well, one year on and I’ve had some fun, but still no games! The technology side of things has certainly come into focus in the last 12 months. I’ve seen haXe double and redouble performance to become my preferred option for web game development. I’ve been actively working on the NME to get the downloadable side of things going, and I’m almost done (just add one more feature…). Once that is done, I will pull my finger out and actually try to release a game – even if it is a little one.

    I’m really enjoying the haXe/flash development, and looking back at my original ideas about trying java, I think I will have to change my game-plan (or “mission statement”) to just look at haXe/as3 based programming. The technology is pretty good these days, so it just comes down to productivity, and having fun.

    Ogre Sprite Demo

    ogredemo.PNG

    Well, it’s been a while since I’ve written anything – too many projects. I have been looking at doing a 2D game using flash9/neko and have produced a small demo. You can see it action on this page,
    and you can download the windows version here.

    Looks like things could work out well!

    HaXe to AS3 converted.

    HaXe now comes with the ability to output AS3 code directly. Hopefully this ability will close the performance gap between haXe and AS3.

    I first used this option straight out fo the box to convert the three benchmarks used on this site. I had to modify the generated code to include the “override” and “virtual” keywords – but I imagine this will be fixed in later versions of haXe.
    Then I came accross a curious result – the AS3 code was no faster than the haXe for the CarDemo code. However, since the code was right there in front of me, I could compare the differences with the AS3 code before it was converted to hx. A quick inspection showed the the culprit was a type-cast used in the inner loop of the collision detection code. So I rewote the code to avoid type-casting if possible, and got a 2 times speedup on the haXe code! See [CarDemo hx v2](/?page_id=30) for this compiled haXe programme. I then ran the hx->as3 converter again and got something that ran faster than the original as3 code!

    AS3 haXe haXe typed haXe via AS3 haXe – less casting
    CarDemo 7.6 ms 38.6 ms 38.9 ms 6.5 ms 15.0 ms
    SimpleLoop – for 115 ms 1218 ms 74 ms 40 ms
    SimpleLoop – while 115 ms 1020 ms 76 ms 50 ms
    PerfTest 1 – string manipulation, join 238 ms 383 ms 720 ms 229 ms
    PerfTest 2 – string sort 620 ms 1284 ms 195 ms 1989 ms
    PerfTest 3 – string addition 125 ms 138 ms 130 ms 130 ms
    PerfTest 4 – bit string manipulation 125 ms 134 ms 125 ms 117 ms
    PerfTest 5 – floating point ops 2 ms 24 ms 2 ms 2 ms
    PerfTest 6 – substr 2 ms 7 ms 7 ms 3 ms
    PerfTest 7 – string indexOf 58 ms 75 ms 75 ms 56 ms
    PerfTest 8 – Math.round/random 6 ms 38 ms 24 ms 6 ms
    PerfTest 9 – integer addition 31 ms 343 ms 27 ms 29 ms
    PerfTest 10 – string function calls 34 ms 103 ms 104 ms 45 ms
    PerfTest 11 – MD5 68 ms 163 ms 171 ms 49 ms
    PerfTest 12 – integer function calls 1 ms 14 ms 9 ms 2 ms

    For all but the string-sort code, the results were as good as the AS3 code. This is an extremely satisfting result because it means that we can write our web programmes in haXe, without fear of performace loss.

    This concludes the investigation into the viability of using haXe for web based games, and next I will turn my attention to using the same haXe code base for stand alone game development.

    Adding some type info to haXe

    There are two good things about open source software; it’s open, and there’s source. So I had a hack with haXe and produced haxe-typed.exe, as a replacement for haxe.exe. This executable adds strong typing to strings, ints, floats and bools when they are local variables. It does not add types to function parameters and return types.

    The results were promising:

    AS3 haXe haXe typed
    CarDemo 7.6 ms 38.6 ms 38.9 ms
    SimpleLoop – for 115 ms 1218 ms 74 ms
    SimpleLoop – while 115 ms 1020 ms 76 ms
    PerfTest 1 – string manipulation, join 238 ms 383 ms 720 ms
    PerfTest 2 – string sort 620 ms 1284 ms 195 ms
    PerfTest 3 – string addition 125 ms 138 ms 130 ms
    PerfTest 4 – bit string manipulation 125 ms 134 ms 125 ms
    PerfTest 5 – floating point ops 2 ms 24 ms 2 ms
    PerfTest 6 – substr 2 ms 7 ms 7 ms
    PerfTest 7 – string indexOf 58 ms 75 ms 75 ms
    PerfTest 8 – Math.round/random 6 ms 38 ms 24 ms
    PerfTest 9 – integer addition 31 ms 343 ms 27 ms
    PerfTest 10 – string function calls 34 ms 103 ms 104 ms
    PerfTest 11 – MD5 68 ms 163 ms 171 ms
    PerfTest 12 – integer function calls 1 ms 14 ms 9 ms

    Here we see that where the processing is local, and in a tight loop, huge performance gains can be achieved. When functions calls are involved, there are no gains – sometimes it is even slower (presumably because of the conversion of specific type to generic type). Unfortunately, CarDemo makes extensive use of properties and object orientation nad therefore does not benefit at all from these optimisations.

    HaXe even outperforms in a few tests – the integer loop and the sort function. My initial guess is haXe’s use of local variables for constants, rather than integer constants is what helps this loop, and I have no idea about the sort function. Perhaps there is a bug somewhere.

    Implementation
    Use the above haxe-typed.exe at your own risk. There it was more hacking than science used to produce it. Also, there is a bunch of text-out that probably means nothing – this is basically my attempt at understanding what was going on. If you are after the source-code changes just let me know. The changes are probably not very nice to look at, or done very well – but at least I got it compiling.

    The exe is not very widely tested (only on these 3 programmes !). When there is a type mismatch, the flash player throws an exception and the execution stops – often leaving you looking at a white screen. Use the “player/debug/SAFlashPlayer.exe”, provided with the flex SDK, to see these exceptions. More often than not, they are something like “Int and * can not be reconciled”. This means there is a bug in bytecode created by the modified haXe. The process of fixing the exe was to basically track these down one by one.

    The code also relies on expressions having the correct type. This is generally the case with haXe, except for the case of Floats with constant ints, eg: “var f:Float = 2”. I had to change the source to “var f:Float = 2.0” to get these to compile.

    I had no reference to the AS3 assembly syntax other than what I could see in the haXe code and what I could deduce from the “abcdump.exe” of both haXe and flex outputs. The only principle for typing local varibles I can see is to initialise them with a specifically typed value. And this must be done at the “top” of the function – I’m not sure what logic is used to determine where the “top” finishes, but the following pseudo code works:

      var i:Int = 0;
      while (something)
      {
         ....
      }
    

    Whereas the following does not define i as an integer:

      While (something)
      {
         var i:Int = 0;
         ....
      }
    

    So the task of adding type information amounts to initialising all the local variables (including those that are limited in scope by blocks) at the top of the function.

    I did not add strong types to object variables. This should be quite possible, since haXe knows the obejct types. External information suggests that this would be a very good thing to do, and I may get around to it one day. Maybe.

    I also did not add types to function arguements or return values. I’m sure doing so would close the gap between haXe and AS3 native code, however the genswf9.ml code removes this type information reasonably early and it would take a bit of effort to get this working.

    I found editing the ocaml code a bit of a slog. Firstly, “ml” style languages were new to me and it took me a while to grok it. It took me a while to workout how to even decompose the syntax into a series of statements (oh, it’s returning a function!). Secondly, the code is pretty much bereft of comments so it was hard to know where to start – I couldn’t really look a a bit of code in isolation whithout a good feeling for the whole code base. And Lastly, the ocaml penchant for terse variable names (eg, “let field_access ctx get f t e p =”).

    This may be as far as I take these optimisations for now. The main goal was to see if the performance can be improved – and I think the answer is yes. I don’t think my changes are of high enough quality to re-submit them into the code base (in fact, I would almost certainly do a lot differently if I were to start again). It has been a great exercise and I’ve learnt a lot – including a new programming language.

    Decompiling the differences

    I have created about the simplest test I can, the “SimpleLoop” programmes.
    [AS3 version](http://gamehaxe.com/?page_id=19)
    [HaXe version](http://gamehaxe.com/?page_id=17)
    The difference in performace is dramatic – 10 to 20 fold. But all is not as bad as it seems…

    I have been looking at the *very* interesting tool, abcdump.exe,
    [as described here](http://www.5etdemi.com/blog/archives/2007/01/as3-decompiler/).

    You can really get a feel for the differences between the outputs. And there is a simple explaination for the differences in file size – haXe includes a small library in the SWF. Fair enough.

    I will concentrate on the “Run2” test – both using “while” loops, rather than “for” loops. HaXe’s iterator-style for loops are slower than its while style loops – I’m hoping that this will not always be the case _\[Edit: actually the timings both vary and seem about the same\]_. (Although as a side note, I hope haXe will support *both* styles of for loops, since it makes porting easier amongst other reasons).

    A quick look at the decompile of the Run2 function (two nested while… loops) shows the use of the command “iflt”, presumably “if less than”, which seems ideal for these loops. HaXe uses 3 statements here: coerce\_a, lessthan, iffalse. I believe that haXe could easily use this optimisation, especially considering the for(i in 1…1000) style syntax. Also the increment operation. AS3 uses “inclocal\_i”, where haXe uses 4 statements: getlocal, increment, coerce\_a, setlocal. Again some low-hanging fruit for haXe to pick up.

    Another trick is “pushshort” rather than “pushint” where size will allow, and it seems haXe integer constants are followed by “coerce_a”, whereas AS3 ones are not. AS3 used “convert\_i” whereas haXe uses “coerce\_a”. I’m not sure of the performace implications of this.

    So, after some initial doubts, now I think haXe could get about a 10 fold increase in speed (in these very tight loops) pretty easily. HaXe (especially for flash 9) is very new, and I’m condifent these optimisation will come soon enough.

    AS3 haXe
    function Run2():int	/* disp_id 0*/
    {
    // local_count=4 max_scope=1
    // max_stack=2 code_len=60
    0     getlocal0
    1     pushscope
    2     pushbyte      	0
    4     setlocal1
    5     pushbyte      	0
    7     setlocal2
    8     pushbyte      	0
    10    setlocal3
    11    pushbyte      	0
    13    setlocal1
    14    pushbyte      	0
    16    setlocal2
    17    jump          	L1
    
    
    L2:
    21    label
    22    pushbyte      	0
    24    setlocal1
    25    pushbyte      	0
    27    setlocal3
    28    jump          	L3
    
    
    L4:
    32    label
    33    getlocal1
    34    getlocal3
    35    add
    36    convert_i
    37    setlocal1
    38    inclocal_i    	3
    
    L3:
    40    getlocal3
    41    pushshort     	10000
    44    iflt          	L4
    
    48    inclocal_i    	2
    
    L1:
    50    getlocal2
    51    pushshort     	1000
    54    iflt          	L2
    
    58    getlocal1
    59    returnvalue
    }
    
    
        function Run2():*	/* disp_id 0*/
    {
    // local_count=4 max_scope=1
    // max_stack=2 code_len=70
    0     getlocal0
    1     pushscope
    2     pushbyte      	0
    4     coerce_a
    5     setlocal1
    6     pushbyte      	0
    8     coerce_a
    9     setlocal2
    10    jump          	L1
    
    
    L2:
    14    label
    
    L1:
    15    getlocal2
    16    pushint       	1000	// 0x3e8
    18    coerce_a
    19    lessthan
    20    iffalse       	L3
    
    24    pushbyte      	0
    26    coerce_a
    27    setlocal1
    28    pushbyte      	0
    30    coerce_a
    31    setlocal3
    32    jump          	L4
    
    
    L5:
    36    label
    
    L4:
    37    getlocal3
    38    pushint       	10000	// 0x2710
    40    coerce_a
    41    lessthan
    42    iffalse       	L6
    
    46    getlocal1
    47    getlocal3
    48    add
    49    coerce_a
    50    setlocal1
    51    getlocal3
    52    increment
    53    coerce_a
    54    setlocal3
    55    jump          	L5
    
    L6:
    59    getlocal2
    60    increment
    61    coerce_a
    62    setlocal2
    63    jump          	L2
    
    L3:
    67    getlocal1
    68    returnvalue
    69    returnvoid
    }
    
    

    Performance Anxiety

    The game code was heavy on maths, so I have decided to pinpoint where the difference in performance is coming from. I found an AS3 performance test on [AS3 performace test](http://www.onflex.org/perf/) and ported it to hAxe – again, not necessarily perfectly.

    You can check out the results here:

    * [AS3 test](/?page_id=11)
    * [haXe test](/?page_id=8)

    The tests can be seen at [http://www.onflex.org/perf/srcview](http://www.onflex.org/perf/srcview).

    As you can see, the haXe code is slower – sometimes a lot slower – than the AS3 code. Of note:

    * Test 2 : sort. This is perhaps not surprising because the haXe version is using a custom string compare function, rather than a built-in one.
    * Many of the functions are to do with string manipulation. I suspect the probelms here are introduced though the layering of the library via “Std.” and the internal string functions.
    * Function 9, the loop code, shows a huge difference. Here I think there must be some loop optimisation that hAxe is missing out on.
    * Function 12, a subroutine call, is also very very slow. I think there may be something simple that haXe is not doing.

    So there you have it. Loops and function calls – building blocks of most programmes, are very slow.

    These differences are way bigger than the file-size differences would suggest, so I think there is still hope that haXe speed can be greatly improved.

    Physical Test

    I am interested in many types of games, but I find physical simulations particularly cool. So I did some fishing on the internet for flash-based (flash9 = actionscript 3=AS3) physics code, and I found APE as a freeware physics engine. Since I had the source code, I though this would be a good test to compare haXe and AS3. This should be an interesting test because they both run on the same virtual machine, so it is just the code-generation differences that we are testing. I ported the demo to haXe – now I admit that this may not have been the best port, but still the results are a bit disappointing. Checkout the comparison for yourself:

    AS3 Version

    haXe Version

    I get AS3 version 7.05ms/frame calculation, and haXe 37.8ms. That’s about 5 times slower. Also of interest is the difference in file size. AS3 swf file is 10051bytes, and the haXe file is 11907. I will have to find the source of this difference if performance libraries are to be written in haXe