Huge speedups for flash9 with haXe 1.15, hxasm investigated.

Due to the great work of Nicolas Cannasse, most of the results below have to be re-written! HaXe now as stong typing in flash9, significantly improving performance. I also have a new machine, so some of the results will not be directly comparable, but you will get the idea. I have also added a new one: inline-grid-while, that uses while loops instead of for loops.

With the new version of haXe comes some very interestesting technology - hxasm. This allows you to use haXe syntax to write flash9 “bytecode”. This gives the possibility of decoupling the “per object” bit of the grid iteration from the looping bit by concatenating chunks of bytecode. In theory, you should be able to achieve optimal performance using this method, since you can write any bytecode you like. However, currently I can’t quite get the performance I think because ultimately the function is called through a “dynamic” interface, rather than a strongly typed one.

Writing hxasm from scratch can be quite difficult. For starters, the flash api requires time to compile the code, so the api involves a callback to complete the compilation. Also, the haXe syntax is not that of a “proper” assembler, so jumps etc take a bit of work. And sometimes it is a bit hard to know where to start. To help with this, I’ve written a tool that takes compiled hx code, via the output of “abcdump”, and converts it to hxasm. You can find this code in abctools.zip.

Examining the hxasm code, you can see the difference between the for and while loops. Interestingly, other “hand optimisations” did not seem to give much better results - I suspect the flash vm is doing some pretty good optimisation as it goes. So I think the way to optimise is probably to change the original hx code, rather than the hxasm code (eg, using while loops instead of for loops). Another optimisation I looked at was to “burn in” runtime values. So rather than using the op code to get a member variable, you can burn this variable in as a constant into the bytecode. I think this gave a small improvement - I could not really tell. Infact, this last optimisation is really the only performace increment to be gained from runtime compilation - the rest could in theory be done in the production of the swf file. However, it does present a very interesting solution the the code decoupling!

The source code can be found in src2.zip. Unfortunately, this breaks the ability to compile for neko. Also, it requires a small mod to hxasm 1.03, using an additional offset of -4 on the “backwardJump” call in Context.hx.

MethodTime (ms/frame)ProsCons
Object List 8.1 Easy to understand/debug. Slowest. Causes stutter while garbage collection runs
HaXe Iterator 10.1 Improved performace over Object List. Direct “drop in” replacement for Object List. Decoupled data. Slightly complex to write. Slightly slower than most.
While Iterator 7.1 Slightly faster than for-iterator. Slightly easier to write Slightly more complex to use.
Closure/Callback 13.9 Slightly faster than for-iterator. Decoupled. Interesting way of writing code. Interesting way of writing code.
Member Callback 6.0 Faster than anonymous callback. Member function name is explicit in code.
Inline GOB 6.4 Faster. Couples GOB code to grid implementation. Requires separate code for each function
Inline Grid - for 4.5 Fast. Easy to understand/debug. Not as badly coupled as Inline GOB. Couples Grid code to GOB implementation. Requires separate code for each function
Inline Grid - while 4.0 Fastest. Same as “for” loop, but slightly faster, and slightly more verbose. Couples Grid code to GOB implementation. Requires separate code for each function
HxASM inline code 5.1 Fast and decoupled. Requires writing “raw” hxasm callback. 2-phase setup

Out of all this, the conclusion is pretty similar - the tighter coupling creates faster code - but all the code is faster now, which is great. The inline hxasm is very interesting, and while probably not appropriate for this application, shows some promise for certain applications.

Iteration/looping

The following discussion is based on the source code :1000OgresOource.zip. This code uses the “xinf” haxelib module to provide support for cross platform (browser, downloadable) structures.

The Ogre demo uses a grid to check for collisions between objects. So,rather than checking 1000 sprites against 1000 others, requiring 1000000 checks per frame, each sprite only checks sprites in the local viscinity, running much faster. The 2D grid is independent of the tile grid, and its spacing can be optimised based on object size and density etc.

The code deals, in part, with “GOB”s (Game OBjects) and the GOBGrid. I tried to decouple the grid from the objects, but I could not, because the haXe template system is not powerful enough. The coding issue I’m going to talk about here is how to best separate the task of examining objects in the local visinity, from how the objects are stored in the grid. In other words, iterators.

The algorithm I’m going to talk about is something like the following pseudo code fragment:

GOB::Move()
{
   x += velocity_x
   y += velocity_y

   for_all_nearby_objects_in_grid
     if (obj_is_close_to_me)
        -> dont move.

The question is, what does the “for_all_nearby_objects_in_grid” look like. I have tried the following:

Object List. Here, the GOBGrid produces an Array of candidate objects. The GOB then iterates over these, checking distances between the potential move position and these candidate objects. An important point to note is that the following:

   var objs = mGrid.GetCloseObjs(x,y);
   for(obj in objs)
      ...

was much slower than:

   var objs = mGrid.GetCloseObjs(x,y);
   for(i in 0...objs.length)
   {
      var obj = objs[i];
      …

this should be considered when writing high-performance code.

HaXe Iterator. Writing the iterator was slightly tricky, because you need to think in a slightly different way than you would normally. Here I have made the assumption that “getNext” will be called exactly once after each successful “hasNext” call. I’m pretty sure this is right. This assumption places all the logic in “hasNext” and makes “getNext” trivial. The big advantage of the iterator is that it is syntactically identical to the object list code above (first example), eg:

   var objs = mGrid.GetCloseObjs(x,y);
   for(obj in objs)
      ...

and runs much faster. This leaves open the possibility of staring with a list and then moving to an iterator if the performace is required. The iterator code looks like this:

class GOBIterator
{
   var mGrid:GOBsList;
   var mGridPos:Int;
   var mGridEnd : Int;
   var mYStep:Int;
   var mWidth:Int;

   var mCurrentList : GOBs;
   var mListPos : Int;
   var mX:Int;

   var mNext : GOB;

   public function new(inGrid:GOBsList,
            inX0:Int,inY0:Int, inX1:Int,inY1:Int, inWidth:Int)
   {
      mGrid = inGrid;
      mWidth = inX1-inX0;
      mYStep = inWidth - mWidth + 1;
      mX = 0;
      mGridPos = inY0*inWidth + inX0;
      mGridEnd = (inY1-1)*inWidth + inX1;
      mCurrentList = mGrid[mGridPos];
      mListPos = 0;
   }

   // Haxe iterator interface
   public function hasNext()
   {
      if (mGridPos >= mGridEnd)
         return false;

      while(true)
      {
         if (mListPos=mGridEnd)
               return false;
         }
         else
         {
            mGridPos++;
         }
         mCurrentList = mGrid[mGridPos];
         mListPos = 0;
      }
      return false;
   }

   public function next() : GOB
   {
      return mNext;
   }

}

The mGrid is an Array of cells, each of which is an array of GOBs that are centred in that cell. To go from (x,y) coordinate to cell, the x and y are first quantised and then an index is calculated using cell=y*xcells + x. Another possiblity would be to have a 2D array of cells. I have not tried this, and it may be better or worse, I don’t know.

HaXe while loop. This is very similar to the above code, except that the getNext and hasNext code are combined, and return “null” at the end. The code is simiar- it uses the same constuctor and the function:

   // This combines hasNext with next, and returns null when done.
   public function getNext() : GOB
   {
      if (mGridPos >= mGridEnd)
         return null;

      while(true)
      {
         //var n = mWidth + mYStep - 1;
         //trace( "[" + (mGridPos%n) + "," + Math.floor( mGridPos/n ) + "]” );
         if (mListPos=mGridEnd)
               return null;
         }
         else
         {
            mGridPos++;
         }
         mCurrentList = mGrid[mGridPos];
         mListPos = 0;
      }
      return null;
   }

The problem with this is that you have to use the “while” loop, rather than the “for”, taking 3 lines instead of 1.

Closure/Callback. This method keeps the grid and GOB decoupled by asking the grid to iterate over the neadby objects, calling a callback function for each candidate object.

      var self = this;

      return mGrid.VisitCloseClosure( mMoveX, mMoveY, m2Rad,
                 function(inObj:GOB)
                 {
                    var obj:GOB = inObj;
                    if (obj==self) return true;

                    var dx = self.mMoveX-obj.mX;
                    var dy = self.mMoveY-obj.mY;
                    return dx*dx+dy*dy >= 2;
                 } );

This type of inline-function definition is just the sort of thing I’ve been craving in C++ for years. It takes a bit to get your brain around, but it does provide a very elegant way of decoupling code.

The above 4 methods are attractive because there is a large decoupling between the grid and the objects it stores. The grid could quite easily deal simply with dynamic objects, and the GOB need only know that the grid returns some kind of logical list. Unfortunately, they are not the fastest methods. The following methods introduce tighter coupling between the grid and the GOB in order to improve speed.

Visitor Callback. This method is very similar the the callback method above, except that the grid is passed an object of known type and calls a particular member function on it, rather than a anonymous function, for each candidate object. The problem is that this can only call one particular function, and thus can’t be adapted to a different function.

Inline GOB. This method, the GOB knows everything about the grid implementation and iterates over the elements directly. While it is not too much code in this case, this may soon grow unweildly if we consider such things as multi-resolution grids. This does not allow us the change the grid implementation without changing the GOB code too.

Inline Grid. Here the grid knows about GOB collisions and interrogates the objects directly. This binds part of the GOB implementation to the Grid and is also specialised for one particular function (eg, “collision detection”). However, it does let us change the grid implementation without changing the GOB code.

Results

The results are summarised in the following table.

MethodTime (ms/frame)ProsCons
Object List 31.8 Easy to understand/debug. Slowest. Causes stutter while garbage collection runs
HaXe Iterator 21.0 Improved performace over Object List. Direct “drop in” replacement for Object List. Decoupled data. Slightly complex to write. Slightly slower than most.
While Iterator 20.0 Slightly faster than for-iterator. Slightly easier to write Slightly more complex to use.
Closure/Callback 20.7 Slightly faster than for-iterator. Decoupled. Interesting way of writing code. Interesting way of writing code.
Member Callback 16.4 Faster than anonymous callback. Member function name is explicit in code.
Inline GOB 17.6 Faster. Couples GOB code to grid implementation. Requires separate code for each function
Inline Grid 14.8 Fastest. Easy to understand/debug. Not as badly coupled as Inline GOB. Couples Grid code to GOB implementation. Requires separate code for each function

So, there you have it. No definitive answers!. Decoupling is sacraficed for performance in most cases. Except perhaps that the grid should loop over the objects, rather than the other way around. I think I will use the Inline Grid method for collision detection.

However, if I need to write code like “All ogres run away from all skeletons” then I will need one of the first 4 generic ways of iterating. The iterator methods may get too complex if I have “multi-resolution” grids, in which case, the anonymous function callback may be the way to go. There may also be a way to bring the anonymous function performace up to match the member-function performace - this would be the best of all worlds (fully customisable, and only slightly slower than Inline Grid). Any ideas anyone?

You can download the code and comment/uncomment these various options in GOB.hx.

Source Code

Here is the source code for the APE post. The latest version of haXe supports now supports as3 virtual/override, so the resulting as3 code compiles without modification. However, I have modified it for performance reasons by changing the typecasting to the native function.

Ogre Sprite Demo

ogredemo.PNG

Well, it’s been a while since I’ve written anything - too many projects. I have been looking at doing a 2D game using flash9/neko and have produced a small demo. You can see it action on this page, and you can download the windows version here.

Looks like things could work out well!

HaXe to AS3 converted.

HaXe now comes with the ability to output AS3 code directly. Hopefully this ability will close the performance gap between haXe and AS3.

I first used this option straight out fo the box to convert the three benchmarks used on this site. I had to modify the generated code to include the “override” and “virtual” keywords - but I imagine this will be fixed in later versions of haXe. Then I came accross a curious result - the AS3 code was no faster than the haXe for the CarDemo code. However, since the code was right there in front of me, I could compare the differences with the AS3 code before it was converted to hx. A quick inspection showed the the culprit was a type-cast used in the inner loop of the collision detection code. So I rewote the code to avoid type-casting if possible, and got a 2 times speedup on the haXe code! See CarDemo hx v2 for this compiled haXe programme. I then ran the hx->as3 converter again and got something that ran faster than the original as3 code!

AS3 haXe haXe typed haXe via AS3 haXe - less casting
CarDemo7.6 ms38.6 ms38.9 ms 6.5 ms 15.0 ms
SimpleLoop - for115 ms1218 ms74 ms 40 ms
SimpleLoop - while115 ms1020 ms76 ms 50 ms
PerfTest 1 - string manipulation, join 238 ms 383 ms 720 ms 229 ms
PerfTest 2 - string sort 620 ms 1284 ms 195 ms 1989 ms
PerfTest 3 - string addition 125 ms 138 ms 130 ms 130 ms
PerfTest 4 - bit string manipulation 125 ms 134 ms 125 ms 117 ms
PerfTest 5 - floating point ops 2 ms 24 ms 2 ms 2 ms
PerfTest 6 - substr 2 ms 7 ms 7 ms 3 ms
PerfTest 7 - string indexOf 58 ms 75 ms 75 ms 56 ms
PerfTest 8 - Math.round/random 6 ms 38 ms 24 ms 6 ms
PerfTest 9 - integer addition 31 ms 343 ms 27 ms 29 ms
PerfTest 10 - string function calls 34 ms 103 ms 104 ms 45 ms
PerfTest 11 - MD5 68 ms 163 ms 171 ms 49 ms
PerfTest 12 - integer function calls 1 ms 14 ms 9 ms 2 ms

For all but the string-sort code, the results were as good as the AS3 code. This is an extremely satisfting result because it means that we can write our web programmes in haXe, without fear of performace loss.

This concludes the investigation into the viability of using haXe for web based games, and next I will turn my attention to using the same haXe code base for stand alone game development.

Adding some type info to haXe

There are two good things about open source software; it’s open, and there’s source. So I had a hack with haXe and produced haxe-typed.exe, as a replacement for haxe.exe. This executable adds strong typing to strings, ints, floats and bools when they are local variables. It does not add types to function parameters and return types.

The results were promising:

AS3 haXe haXe typed
CarDemo7.6 ms38.6 ms38.9 ms
SimpleLoop - for115 ms1218 ms74 ms
SimpleLoop - while115 ms1020 ms76 ms
PerfTest 1 - string manipulation, join 238 ms 383 ms 720 ms
PerfTest 2 - string sort 620 ms 1284 ms 195 ms
PerfTest 3 - string addition 125 ms 138 ms 130 ms
PerfTest 4 - bit string manipulation 125 ms 134 ms 125 ms
PerfTest 5 - floating point ops 2 ms 24 ms 2 ms
PerfTest 6 - substr 2 ms 7 ms 7 ms
PerfTest 7 - string indexOf 58 ms 75 ms 75 ms
PerfTest 8 - Math.round/random 6 ms 38 ms 24 ms
PerfTest 9 - integer addition 31 ms 343 ms 27 ms
PerfTest 10 - string function calls 34 ms 103 ms 104 ms
PerfTest 11 - MD5 68 ms 163 ms 171 ms
PerfTest 12 - integer function calls 1 ms 14 ms 9 ms

Here we see that where the processing is local, and in a tight loop, huge performance gains can be achieved. When functions calls are involved, there are no gains - sometimes it is even slower (presumably because of the conversion of specific type to generic type). Unfortunately, CarDemo makes extensive use of properties and object orientation nad therefore does not benefit at all from these optimisations.

HaXe even outperforms in a few tests - the integer loop and the sort function. My initial guess is haXe’s use of local variables for constants, rather than integer constants is what helps this loop, and I have no idea about the sort function. Perhaps there is a bug somewhere.

Implementation Use the above haxe-typed.exe at your own risk. There it was more hacking than science used to produce it. Also, there is a bunch of text-out that probably means nothing - this is basically my attempt at understanding what was going on. If you are after the source-code changes just let me know. The changes are probably not very nice to look at, or done very well - but at least I got it compiling.

The exe is not very widely tested (only on these 3 programmes !). When there is a type mismatch, the flash player throws an exception and the execution stops - often leaving you looking at a white screen. Use the “player/debug/SAFlashPlayer.exe”, provided with the flex SDK, to see these exceptions. More often than not, they are something like “Int and * can not be reconciled”. This means there is a bug in bytecode created by the modified haXe. The process of fixing the exe was to basically track these down one by one.

The code also relies on expressions having the correct type. This is generally the case with haXe, except for the case of Floats with constant ints, eg: “var f:Float = 2″. I had to change the source to “var f:Float = 2.0″ to get these to compile.

I had no reference to the AS3 assembly syntax other than what I could see in the haXe code and what I could deduce from the “abcdump.exe” of both haXe and flex outputs. The only principle for typing local varibles I can see is to initialise them with a specifically typed value. And this must be done at the “top” of the function - I’m not sure what logic is used to determine where the “top” finishes, but the following pseudo code works:

  var i:Int = 0;
  while (something)
  {
     ....
  }

Whereas the following does not define i as an integer:

  While (something)
  {
     var i:Int = 0;
     ....
  }

So the task of adding type information amounts to initialising all the local variables (including those that are limited in scope by blocks) at the top of the function.

I did not add strong types to object variables. This should be quite possible, since haXe knows the obejct types. External information suggests that this would be a very good thing to do, and I may get around to it one day. Maybe.

I also did not add types to function arguements or return values. I’m sure doing so would close the gap between haXe and AS3 native code, however the genswf9.ml code removes this type information reasonably early and it would take a bit of effort to get this working.

I found editing the ocaml code a bit of a slog. Firstly, “ml” style languages were new to me and it took me a while to grok it. It took me a while to workout how to even decompose the syntax into a series of statements (oh, it’s returning a function!). Secondly, the code is pretty much bereft of comments so it was hard to know where to start - I couldn’t really look a a bit of code in isolation whithout a good feeling for the whole code base. And Lastly, the ocaml penchant for terse variable names (eg, “let field_access ctx get f t e p =”).

This may be as far as I take these optimisations for now. The main goal was to see if the performance can be improved - and I think the answer is yes. I don’t think my changes are of high enough quality to re-submit them into the code base (in fact, I would almost certainly do a lot differently if I were to start again). It has been a great exercise and I’ve learnt a lot - including a new programming language.

Decompiling the differences

I have created about the simplest test I can, the “SimpleLoop” programmes. AS3 version HaXe version The difference in performace is dramatic - 10 to 20 fold. But all is not as bad as it seems…

I have been looking at the very interesting tool, abcdump.exe, as described here.

You can really get a feel for the differences between the outputs. And there is a simple explaination for the differences in file size - haXe includes a small library in the SWF. Fair enough.

I will concentrate on the “Run2″ test - both using “while” loops, rather than “for” loops. HaXe’s iterator-style for loops are slower than its while style loops - I’m hoping that this will not always be the case [Edit: actually the timings both vary and seem about the same]. (Although as a side note, I hope haXe will support both styles of for loops, since it makes porting easier amongst other reasons).

A quick look at the decompile of the Run2 function (two nested while… loops) shows the use of the command “iflt”, presumably “if less than”, which seems ideal for these loops. HaXe uses 3 statements here: coerce_a, lessthan, iffalse. I believe that haXe could easily use this optimisation, especially considering the for(i in 1…1000) style syntax. Also the increment operation. AS3 uses “inclocal_i”, where haXe uses 4 statements: getlocal, increment, coerce_a, setlocal. Again some low-hanging fruit for haXe to pick up.

Another trick is “pushshort” rather than “pushint” where size will allow, and it seems haXe integer constants are followed by “coerce_a”, whereas AS3 ones are not. AS3 used “convert_i” whereas haXe uses “coerce_a”. I’m not sure of the performace implications of this.

So, after some initial doubts, now I think haXe could get about a 10 fold increase in speed (in these very tight loops) pretty easily. HaXe (especially for flash 9) is very new, and I’m condifent these optimisation will come soon enough.

AS3 haXe
function Run2():int /* disp_id 0*/
{
// local_count=4 max_scope=1
// max_stack=2 code_len=60
0     getlocal0         
1     pushscope         
2     pushbyte          0
4     setlocal1         
5     pushbyte          0
7     setlocal2         
8     pushbyte          0
10    setlocal3         
11    pushbyte          0
13    setlocal1         
14    pushbyte          0
16    setlocal2         
17    jump              L1


L2: 
21    label             
22    pushbyte          0
24    setlocal1         
25    pushbyte          0
27    setlocal3         
28    jump              L3


L4: 
32    label             
33    getlocal1         
34    getlocal3         
35    add               
36    convert_i         
37    setlocal1         
38    inclocal_i        3

L3: 
40    getlocal3         
41    pushshort         10000
44    iflt              L4

48    inclocal_i        2

L1: 
50    getlocal2         
51    pushshort         1000
54    iflt              L2

58    getlocal1         
59    returnvalue       
}

    function Run2():*   /* disp_id 0*/
{
// local_count=4 max_scope=1
// max_stack=2 code_len=70
0     getlocal0         
1     pushscope         
2     pushbyte          0
4     coerce_a          
5     setlocal1         
6     pushbyte          0
8     coerce_a          
9     setlocal2         
10    jump              L1


L2: 
14    label             

L1: 
15    getlocal2         
16    pushint           1000    // 0x3e8
18    coerce_a          
19    lessthan          
20    iffalse           L3

24    pushbyte          0
26    coerce_a          
27    setlocal1         
28    pushbyte          0
30    coerce_a          
31    setlocal3         
32    jump              L4


L5: 
36    label             

L4: 
37    getlocal3         
38    pushint           10000   // 0x2710
40    coerce_a          
41    lessthan          
42    iffalse           L6

46    getlocal1         
47    getlocal3         
48    add               
49    coerce_a          
50    setlocal1         
51    getlocal3         
52    increment         
53    coerce_a          
54    setlocal3         
55    jump              L5

L6: 
59    getlocal2         
60    increment         
61    coerce_a          
62    setlocal2         
63    jump              L2

L3: 
67    getlocal1         
68    returnvalue       
69    returnvoid        
}

Performance Anxiety

The game code was heavy on maths, so I have decided to pinpoint where the difference in performance is coming from. I found an AS3 performance test on AS3 performace test and ported it to hAxe - again, not necessarily perfectly.

You can check out the results here:

The tests can be seen at http://www.onflex.org/perf/srcview.

As you can see, the haXe code is slower - sometimes a lot slower - than the AS3 code. Of note:

  • Test 2 : sort. This is perhaps not surprising because the haXe version is using a custom string compare function, rather than a built-in one.
  • Many of the functions are to do with string manipulation. I suspect the probelms here are introduced though the layering of the library via “Std.” and the internal string functions.
  • Function 9, the loop code, shows a huge difference. Here I think there must be some loop optimisation that hAxe is missing out on.
  • Function 12, a subroutine call, is also very very slow. I think there may be something simple that haXe is not doing.

So there you have it. Loops and function calls - building blocks of most programmes, are very slow.

These differences are way bigger than the file-size differences would suggest, so I think there is still hope that haXe speed can be greatly improved.

Physical Test

I am interested in many types of games, but I find physical simulations particularly cool. So I did some fishing on the internet for flash-based (flash9 = actionscript 3=AS3) physics code, and I found APE as a freeware physics engine. Since I had the source code, I though this would be a good test to compare haXe and AS3. This should be an interesting test because they both run on the same virtual machine, so it is just the code-generation differences that we are testing. I ported the demo to haXe – now I admit that this may not have been the best port, but still the results are a bit disappointing. Checkout the comparison for yourself:

AS3 Version

haXe Version

I get AS3 version 7.05ms/frame calculation, and haXe 37.8ms. That’s about 5 times slower. Also of interest is the difference in file size. AS3 swf file is 10051bytes, and the haXe file is 11907. I will have to find the source of this difference if performance libraries are to be written in haXe

The Plan

Well, as the URL of the site may suggest, my initial plan is to use the “haXe” programming language.  And I plan to do this by generating byte-code for flash-player 9 for the web, and linking the neko virtual machine to a suitable “engine” for the stand-alone executables.

The possibilities I see are:

  1.  Use haXe to generate flash9 and neko code. All logic written in haXe. Plus a simple C++ engine to link the neko into.
  2. If haXe is not fast enough on flash9, write performace code in AS3, and replicate this in C++.  Then use haXe for more high-level scripting.
  3. Write all in AS3, and buy a stand-alone flash player.
  4. Write in Java for the web and link a JVM into a C++ executable.
  5. Write all in Java and simply bundle it into a stand-alone player.

This is pretty much the order I think I will try things.  I’m hoping I can get point 1 to work - this will give the greatest ability to get a simple web version going, with the ability to download a stand-alone version that is far more complex and not performace bound.

I prefer Flash to Java mainly because of end-user acceptance.