How I Improved Hxcpp Speed 6x

Let me start by qualifying the title with the obligatory phrase in some situations. TL;DR – it was probably too slow to start with.

The performance story starts, like all good performance stories do, with a benchmark. I was looking for a benchmark to measure the allocation system, but also mix in a little calculation, in as few lines as possible – so I settled on the Mandelbrot program. But with a little “deoptimization” to use classes for the complex arithmetic rather than inlining the components. On a whim, I also decided to store the resulting image as an array of RGB classes, rather that an array of Int, which turned out to also have significant implications.

You can see the code in the haxe repo. You may note there are 2 versions controlled by a define, a class based one and an anonymous object (typedef) based one. The typedef option is for future optimization attempts. Also, the performance figures I will be talking about have the “SIZE” parameter increased to 40, to allow more accurate timing. The results are predictable, if not beautiful.


This benchmark is all about allocation and garbage collection (GC). The allocation part calls the object “new” routine to find some space on one of the memory blocks, and the collection part must scan the allocation space to reclaim object space that is no longer used. These phases have quite different optimization characteristics – the allocation part gets called literally millions of times, and therefore micro-optimizations at the code level may be appropriate, while the collection code get run less frequently, but must do quite a lot of work, so algorithmic optimizations might be more appropriate.

The initial results with hxcpp 3.2.102 were really quite bad. Hxcpp 3.2.102 takes approximately 35 seconds on a windows 32 build. Compare this to the node/js solution of 6.1 second. It should be noted that this test is measuring mostly allocations, and code base is small, so node can quite efficiently inline the code and use its highly-tuned single-thread allocation code. But still, it should be possible to improve things.

Before even looking at the low-level code, one thing that has a big effect on speed is the amount of free space available. The Gc allocates into its free memory, and when it’s full, it reclaims/collects the unused portions. The speed of the reclaiming depends mostly on the the amount of “live” memory after the collection. By increasing the amount of free memory, the time between collections increases, while keeping the collection time roughly the same, thereby improving the overall speed. Hxcpp uses a heuristic to determine the free space it should use depending on the “live” size at last collection. There was a fixed ratio in 3.2.103, but since then I have exposed the settings and increased the minimum working memory to 20M on desktop and 8M on mobile. This alone gives a nice improvement.

After creating a benchmark, the next step in optimization is to run a profiler. And the profiler did indeed show that most of the time was spent in the allocation code. It did not appear to be in one particular part of the code, but instead spread over the whole routine. This implies that I really need to get the line count down, and to that end I decided to simplify some the auxiliary data structures at the expense of memory. The hxcpp Gc uses conservative marking on the stack. That is, it scans the memory and looks for things that look like pointers to Gc objects. It then needs to know with 100% certainty that the thing pointed to is (or was recently) an actual object, otherwise the marking process could trash memory. The means that some extra information needs to be used to track this. Initially, I used what I thought was a clever trick to reuse the byte I was already using to indicate whether a line was marked as a head for a mini linked-list of live-objects. This worked and did not require extra memory, but the maintenance of the list complicated the allocation routine. I switched to using a extra bitmap, which uses 1 bit per 4 bytes of Gc memory to indicate whether an object starts at the particular location. This is 3% extra memory, but it greatly simplifies the allocation logic, allowing for a neater implementation.

The next point of simplification was pre-calculating the “gaps” in the Gc memory blocks. Initially, this was done in the allocation code, but by doing it in advance, it simplifies allocation code that is getting run millions of times. Theoretically, there are a similar number of operations going on here, but practically the separation makes the code much neater and therefore faster. Later, I moved this calculation to a separate thread so it can be done in parallel with the normal code.

I also rearranged the object header to allow the marking code to test whether an object needs marking with a single bit test, and therefore avoid an expensive marking function call in some circumstances. The header also includes a neater row count which makes the actual row marking faster.

These changes sped the allocations up considerably, but the profiler still showed most of the time running the allocation code. The main issue then became the function call depth. Modern c++ compilers can do a pretty good job at inlining code, but they are not magic, so some extra help is required. First a quick review of what happens when you call “new Complex” in haxe. Hxcpp generates a standard c++ “new” call, which calls into hx::Object “operator new” which is where the Gc memory allocation routine is called. C++ then initialises the object by inserting the vtable pointer and hxcpp then calls “__construct” on the object – which is the “new” function defined in haxe. It is split like this to allow the haxe code to call “super” in ways that c++ does not allow with its native constructors. The “operator new” code takes a parameter defining whether the resulting object contains pointers to other object – this allows the marking code to make some optimizations. In the original code, this then called an “InternalNew” function which checked the object size and delegated to either a thread-local allocator, or a large allocator. This allows independent threads to allocate at full speed without needing locks. The problem with this arrangement is that several function calls are required per allocation. The solution here was it inline quite a lot of code into the “operator new” function, and also to observe that haxe objects always allocate from the local allocator, so some some checks can be avoided. Inlining means that some of the guts of the allocator need to be exposed to the haxe code. However, this is kept under control by keeping the complex case (when there is not enough room) in the separate file, and only inlining the “easy” case, which is also helped by the simplifications made earlier. You can see the code here.

There are still a few things that could be done. There are some checks that could be avoided in single threaded app. Another idea I had was to move some of the housekeeping (setting of the header and start bits) to a separate thread to be done later. There are still 2 function calls going on here – the “operator new” (which gets inlined) and the “__construct” call (which does not). It should be possible to inline the construct call if it is simple enough.

These were the changes to the allocation code. As mentioned earlier, there were some improvements to the marking code. Additional things were also done in the collector. The reclaimed memory is cleared in a background thread. This is actually is quite a nice improvement because up to 7% of the time was spent doing the clear. One thing this particular benchmark highlighted was the marking of a single large array (the image array) actually got slower when using multi-threaded marking, due to a slight overhead. I’m not sure how applicable this is to general code, however there was a solution. The multi-threaded array marking code now splits large array marking up if there are idle marking threads.

So with all these improvements, did it beat the node/js speed? Just about – it was a close thing. One more change was really required – compile a 64 bit target instead of a 32 bit target. Then finally the time came in at 5.6 seconds, less than 1/6 the original time. Node is doing a pretty awesome job here and I’m sure it is only going to get better so it will take some work to keep up. If I change the benchmark to use member functions instead of static-inline functions, then hxcpp becomes significantly faster than node, which is nice to know. It should also be noted that I’m cheating a bit here by using multiple threads to get the job done. But I’m not above a bit of cheating.

This brings me to the Java target – 2 seconds. Wow. I’m not actually sure if it is doing any allocations here at all – maybe it has completely inlined the code. Or maybe it just really really good. That sets a seriously high benchmark

Is there still room for improvement? I think there is – particularly, a generational Gc would reduce the marking time to practically nothing in this case, but it requires a bit of effort on the haxe compiler side. Inlining the constructor for simple classes (like the “Complex” class in this code) should also be a nice win.

It has been an interesting experience getting the code faster. I think that it highlights one key thing – if you want the hxcpp code to run faster, then create a simple benchmark.

Posted in Blog, hxcpp | 5 Comments

Cross compile from Mac to Linux with hxcpp

It is possible to use the hxcpp build tool to compile linux 32/64 binaries from a Mac Box. “Why would you want to do this?” I hear you ask. Well, I use this so I can run all my builds from a mac-mini, with windows running in VirtualBox. I know there are other ways, but having a single little box doing all the work is pretty easy to manage.

To do this, you just need to setup the HXCPP_XLINUX variables. There are a number of ways you can do this, but the ~/.hxcpp_config.xml file is probably the cleanest place.

Having installed the cross-compilers from, is it just a matter of pointing to the executables using some xml entries:

<set name="HXCPP_XLINUX64_CXX" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-g++" />
<set name="HXCPP_XLINUX64_STRIP" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-strip" />
<set name="HXCPP_XLINUX64_RANLIB" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-ranlib" />
<set name="HXCPP_XLINUX64_AR" value="/usr/local/gcc-4.8.1-for-linux64/bin/x86_64-pc-linux-ar" />

<set name="HXCPP_XLINUX32_CXX" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-g++" />
<set name="HXCPP_XLINUX32_STRIP" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-strip" />
<set name="HXCPP_XLINUX32_RANLIB" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-ranlib" />
<set name="HXCPP_XLINUX32_AR" value="/usr/local/gcc-4.8.1-for-linux32/bin/i586-pc-linux-ar" />

Then you can build from a “build.n” (such as in $HXCPP/project) file using:

neko build.n linux

Posted in hxcpp, linux | Leave a comment

NME Still Rocks

NME is over 8 years old now, and stronger than ever.

As seen at the wwx2015, there are now many great frameworks you can use with haxe. I would like to outline what I believe are the core values of NME to help you, the developer, choose a framework for future developments, or entice you into trying out NME with your existing project.

  1. Stability. Stability is the #1 priority. If your code compiled with NME yesterday, then it will compile with NME tommorrow unless there is a very good reason. NME is older than a lot of programming languages and sometimes this could be considered a bad thing, however NME’s #2 core value means that stability translates into reliability, not obsolescence.
  2. Innovation. We love innovation in NME- we’re always looking at how new technology can be applied to existing problems. When NME was conceived, there were no iPhones and no Android phones, however once they came out, these were quickly targeted by NME. NME also tracks (and inspires) cutting-edge development in the haxe compiler and the hxcpp backend. This can be seen with the latest Cppia integration. This combination of stability and innovation means that you can try out these new technologies risk-free.
  3. Pragmatism. NME concentrates on practical solutions to real problems. For example, when it became apparent how painful it was to set up an Android project then the nme tool was developed to solve this problem. Later the android testing workflow was significantly improved with the Acadnme backend.
    NME is also very light (50% lighter than other “light” media engines), and includes everything you need to build an app – no “dependency hell”. The stable versioning and automated build process means that you can easily try side-by-side fully-working cutting-edge or historical versions simply by downloading a single haxelib-ready zip file from nmehost.
    The extent to which the flash Api is mirrored is also governed by practical considerations. NME is not emulous of the name or implementation of flash – NME will happily add functions like drawTiles to help out developers, or to add the OpenGLView to allow the default rendering to be mixed with or replaced by traditional OpenGL rendering.
    NME takes a pragmatic approach to licensing too – wherever possible (currently everywhere) libraries are chosen to avoid any potential GPL issues and workflows are designed to avoid any patent issues (particularly with audio).
    Pragmatism is also a consideration when choosing what targets are supported by NME. Some targets such as BlackBerry, Tizen and Windows sandbox apps are not supported due to what I perceive as their cost/benefit ratio. This can change in the future if the benefits go up – perhaps Windows 10 will see some more interesting applications.

Getting (re)acquainted with NME.

For existing haxe users, NME is super easy to get started with. First, install the latest release with “haxelib install nme”, then create an alias for running the nme tool with “haxelib run nme setup”. Now you are setup with the “nme” command. If you do not want to create the alias, you can simply use “haxelib run nme” in wherever the “nme” command is used in the following discussion.

You can run the demos without needing to actually clone any demo code, but you probably want to start in a clean directory since the intermediate and application files will be created in a subdirectory of the working directory. Use “nme demo” for a list of NME demos. This list includes the first few letters in square brackets – this shows the shortcut you can use run the demo. eg, “nme demo ca” will run the “Camera” demo, with the default target “neko”.

NME is still largely compatible with Openfl projects, including flixel projects. And, it can make guesses about demos from other haxelibs, eg: “nme demo flixel” will list the flixel demos (assuming you have them installed), and “nme demo flixel:mode” will run the demo with neko, or “nme demo flixel:neon cpp” will compile + run a demo with hxcpp.

If you have an existing Openfl project, you can simple try nme by running “nme” or “nme cpp” in your project directory.

If you want to start a simple test from scratch, you do not even need a project file. You can start with a single haxe file, like you might with flash. eg, start by typing in Test.hx:

class Test extends nme.display.Sprite {
    public function new() {

then just run “nme”.


One of the big innovations for this year is the cppia target and the Acadnme runtime.
Acadnme This scripting target has the advantages of the neko target – being lighting fast to compile – combined with the versatility of the cpp target – that is, run on almost any device.

Acadnme comes precompiled for desktops targets (currently, Windows and Mac) in the acadnme haxelib (haxelib install acadnme), and for Android via the Google Play store. Once installed, you can test your program with cppia with something like “nme demo flixel:neon cppia”. This particular demo has some trouble with neko due to Floats initializing to null, whereas the cppia script follow the hxcpp quirks extremely closely and initialize these variables (more usefully) to zero.

To test on Android you need to be on the same wireless network. First, install and run the app and take note of the ip address displayed on the startup page. Then provide this to the nme tool as the “deploy” variable, like: “nme demo Stage cppia deploy=”. Your app obviously needs to be “Android ready” for this to work well, that is, ready for touch input and screen scaling.

You can create your own cppia target to use instead of acadnme, with “nme set cppiaHost /path/to/your/host”. Have a look at the opensource acadnme project to get an idea of what might be involved.

Give it a go.

I understand that there is not a one-size-fits-all approach for app frameworks. Other projects place higher values on things such as console targets, modularity of multiple libraries, re-architecting to get things 100% right and either avoiding or embracing the flash Api. NME’s core values are stability, innovation and pragmatism. If you also value these things, then we would love to have you on the team. If you prefer another framework, you can still have NME installed to use for testing and comparison, or for quick scripting via Acadnme.

It is very easy try out NME to see if it suits you needs. See how you like it, and tweet me @gamehaxe.

Posted in acadnme, Blog, nme | 15 Comments


Hxcpp – state of the union enum.

Here are the slides from my wwx2015 talk.

You can compile them yourself with:

haxelib install nme
haxelib install gm2d
haxelib install acadnme
haxelib run nme demo gm2d:11 cppia

On linux, you may want to compile for neko or cpp, instead of cppia.

Posted in acadnme, hxcpp, nme, wwx | Leave a comment

Wwx2014 Talk – Hxcpp magic

Just got back from the haxe conference, wwx2014, after having a great time. Plenty of fantastic people to talk to.

The talk consists of some slides, with a demo at the end. To run the demo, you need to run the binary version on either mac or windows, but the slides can be viewed with flash.

Or, you can build it yourself for you preferred platform and explore the source code with:

haxelib install nme
haxelib install gm2d
haxelib run nme demo gm2d 10-wwx2014 cpp

Edit – must mention that you might need an dev version of haxe to build the demo.

Posted in hxcpp, wwx | Tagged , , | 9 Comments


The latest version of NME adds a new feature to the tool, “Three Letter Compiler” (TLC). This feature is designed to get you going quickly on a new or existing project.

First, make sure you have the latest NME installed. If are going to do some cpp compiling from windows, it is best to install the latest version of Microsoft Visual Studio (“Desktop” version). Now while you are waiting for that to download, you can have some fun with NME.

The haxe ‘haxelib’ utility allows you to run library scripts using the ‘run’ command, like:
haxelib run nme
NME has various features ‘commands’ that you can use, the first one is ‘help’, which prints a list of possible commands:
haxelib run nme help

The first thing you will notice that it is going to get pretty tedious typing “haxelib run nme …” every time you want to do something, so the best thing to do is to setup an alias. If you have your own admin style, you may choose to do this anyway you like (like the unix ‘alias’ command), but you can also achieve this using:
haxelib run nme setup
which will create a small shell script or batch file to save you some time. So now we can use the “nme” command instead of haxelib.

Now we have done some basic setup, it’s time to check out the demos. To see the list of installed demos, use the command:
nme demo
This should give you a list of 20-something demos to try. Listed on the left is the abbreviation you can use – just pass this command to the nme demo command (eg, ‘he’ is short for ‘HerokuShaders’). When you run a demo, the binary files will be created inside your current directory, so before we start, it is best to move to a ‘scratch’ area that we can delete later.

cd c:\temp
mkdir scratch
cd scratch

Has your compiler finished downloading yet? Maybe not, we will try it first with neko.

nme demo hero neko


This should get you up and running. You can explore the other demos like this. You will notice the last parameter, ‘neko’. This is the target – if you were not to specify anything here, nme would assume the default target, which is a ‘cpp’ compile. You can set the default target using the nme ‘set’ command:
nme set target neko
Now code will be compiled by default for neko. You can remove the set value with the ‘unset’ command. For advanced users, you can use the command:

nme set bin c:/temp/scratch

To set the output (binary) directory for all your nme projects. This will allow you to keep the generated output files outside your source file tree, separating the disposable files from the precious. You can also do this when compiling any project by using the ‘-bin path/to/output’ command line options. Don’t forget to “nme unset bin” to restore normal behaviour.

If you are every wondering what is going on under the hood, and where the files are actually going, you can use the ‘-v’ verbose flag to get nme to dump out a bunch of stuff as it goes. You can also set this on for always, using:

nme set verbose

Ok, the demos are working, but you want to get your own code going, and it is time to fire up the text editor. The main nme drawing APIs are based in the flash drawing APis. The easiest way to get started is to create a “DisplayObject” and draw something into it. If this is you “main” class, then NME will take care of getting your object onto the screen. The best place to start is in a empty directory with a “Sprite” class. So we can start with some code like this (in “MyGame.hx” – the name must match the class name):

class MyGame extends nme.display.Sprite
   public function new()

Now, in the directory you can issue the command:


With a little TLC, you can get your first game going (not very exciting mind you). When run on its own, the nme command will use the default target (which we have set to neko), it will run the default command (which is ‘test’, unless you use the ‘set’ command to say otherwise) and it will look for the best project file. If there is no project file, then it will look for a single “.hx” file. If the command gets confused by multiple files, it will simply just not work. You can also use this command in the demo directories since they have either a single project or single haxe file. Indeed, you can compile most projects like this.

There are also a few things you can do with source code using special comments to change some of the programs attributes. At some point it is worth switching over to a proper project file, but until then you can have some fun with ‘// nme:’:


// nme: background=0x000000
// nme: width=200
// nme: height=200
// nme: title=Cool Game

class MyGame extends nme.display.Sprite
   public function new()

Cool game indeed.

Nme also allows you to compile demos from some other projects. Just add the project name after the ‘demo’ command. Openfl is closely related to NME, so they are quite compatible. Firstly, install the openfl demos, using


haxelib install openfl-samples

Now you can get a list with:

nme demo openfl

And then run one with:

nme demo openfl pirate

Since HaxeFlixel uses openfl, it too shares compatibility with NME. However, you currently must be running the a version newer than 3.2.2 to get it to work. First install the flixel samples with:

haxelib install flixel-demos

Then get the development version by by cloning the repo from GitHub and using haxelib to point at this new download. This in turn needs the git version of flixel-addons, so updating requires:

haxelib git flixel
haxelib git flixel-addons
nme demo flixel

Will show the list. We can now try the classic bunnymark, but this time will will make sure we compile it for ‘cpp’ for maximum speed. (Has your download has finished yet?). Arrr!

nme demo flixel flxbunny cpp



Another project you can try is the “haxeui” project. With:

haxelib install haxeui
nme demo haxeui
nme demo haxeui accord

Hopefully the NME TLC can help your workflow a bit.

Posted in Blog | Leave a comment

NME 5.0 (re)Released!

It has been a little while coming, but I have released a new version of NME, v5.0, available on haxelib today (“haxelib install nme”). If you wish to do some development rebuilding the nme binaries, you can also download the companion project, nme-state, which provides the libraries required for developing NME.

The source code can be found on the official NME github repo. Bleeding edge/nightly builds can be found at the dedicated build site, and the discussion forum is the google NME host group.

NME is not Openfl. The NME and Openfl code bases forked towards the end of 2013 when new haxe code was placed in openfl-native. Later the run-tool, binaries and templates were also placed in separate projects. Today, NME is a lot closer to the lime project, which has the same native code DNA, but only has a subset of the haxe files. NME has also changed its remit to lessen its ties to the flash API, with “inspired by the flash API” probably being the closest description.

My main reason for resuming the NME project is focus. I have dropped support for all targets other than windows, mac, linux, iOS, android (x86 support has been added) and flash. I have also dropped support for opengles1 (fixed function) rendering, and everything is done with shaders now. There is currently no HTML target – this may change in the future – but also maybe not. Other binary platforms may be supported if there is significant and lasting support in the market. It has only been by reducing the scope of the project that I have managed to make forward progress, so I’m keen to keep this focus.

There are other reasons why a single, compact project is more appealing to me than several inter-dependent projects. That is, where the templates, binaries, haxe code and run tool are all 100% synchronized and edited as a whole. This coherence has made the additions of some new features much easier:

  • Android view. Instead of generating a full apk, NME can generate a self-contained fragment (jar, binary, java), which can be included in a larger project, which might contain native controls etc.
  • iOS view. NME can generate a library+header file that implements the UIViewController interface, which can then be linked into bigger iPhone applications.
  • StageVideo. The two mobile projects can use the new StageVideo API to play or stream videos, while using haxe for rendering controls or other graphics over the top.
  • Static linking. NME have moved to SDL2, which now has much more developer friendly licensing terms, so NME is now GPL free! This means that programs can be statically linked into a single exe with no external dependencies (if you so desire).

Some features have been disabled. Most notably, the extension system has been removed and not yet replaced. I’m hoping to implement a more simplified or integrated solution based on the static linking principles. Android audio has also been reverted to the Java based API while I’m hoping an opensl solution may be possible. Asset libraries have also been removed, to be replaced by using haxe code where appropriate.

There have been quite a few minor fixes to the code base, including:

  • Preemptive GC. This experimental feature can be enabled from the Stage and allows the garbage collection to happen in parallel with the render. Since the render uses native code (except for ‘direct renders’), the GC should be able to proceed with reduced impact on the frame rate.
  • Shader-based line anti-aliasing. This allows smooth edges on hardware accelerated lines without multi-sampling AA.
  • Vertex Buffers. Internally, hardware geometry is stored in vertex buffers, which should improve rendering performance.
  • Premultiplied alpha. Some initial support for premultiplied alpha (bitmap property) has been added to avoid halos on objects where alpha transitions to 0. There is still some work to be done on this feature.
  • JNI fixes. Improved android interaction with JNI and timing code
  • Integration with ‘waxe’ project. NME is forming a closer relationship with the waxe (wxWidgets native widget set) project.
  • Refactored run script. This is still a work-in-progress, with some features removed, but some features have been added – like three-letter-compiling (“nme”), and building without the need for a project file (for simple programs).

These changes are generally in the native code, which is similar to the code in the lime project. So it should be possible to port features from one project to the other.

I have also implemented an automated build system. This should mean more consistent and frequent releases – including nightly builds for the more experimental people, and makes supporting both static and dynamic linking easier. NME uses the companion developer project “nme-state” for rarely changing libraries. This project uses neko and the hxcpp compiler toolchain files to ensure that all builds have consistent compiler flags and all sub-targets are supported where required, while not requiring additional dependencies. Having automated builds also makes upgrading easier, such as upgrading libcurl to not block in ssl calls.

NME aims to be largely compatible with Openfl, so that developers can switch between the two implementations if required. There will be ongoing support for this.

So NME has gone through a few changes. Some things have been improved, some things may be still be in transition, and some things may be dropped. If you rely on one of the dropped features, I suggest you stick with Openfl. If you prefer this project structure, come and join the team.

Posted in Blog, nme | Tagged | 6 Comments

HXCPP Built-in Debugging

The latest version of hxcpp has increased control over how much debugging information is included in the compiled code, as well as some built-in debugging and profiling tools.

Debugging is normally added with the -debug flag to the compiler. This flag turns off compiler optimisation, adds source-line and call-stack tracking and adds additional runtime checks – eg for null pointer access. This is useful for locating problems, but it can make the code run several times slower.

Depending on the application, you may not need all these debugging features. For example, if you do not intend to attach a native debugger, then you probably do not need to turn off compiler optimisations. If you are trying to profile the code, you probably only want the call-stack symbols, but not the other runtime checks. These scenarios are now supported by HXCPP, via compiler defines.

Quick recap: To add a define to the haxe compiler, you can use the “-D DEFINE” syntax on the command line or hxml file, or <haxedef name="DEFINE" /> from an nmml file.

Now, for testing performance, I will be using BunnyMark, and add bunnies until there the a measurable drop in performance. In this case, I added 20000 bunnies to get a frame-rate of about 50fps on my macbook air.

Adding the -debug flag brings the frame-rate down to about 40fps. This is not a huge drop – I guess because the CPU is not that taxed, it the the GPU that is doing more work.

So instead of adding the -debug flag, we can add various defines to achieve the effect we are after.


Adding this define will keep the symbol tables in the final executable. This is probably not something you want to do in your final release build for external use, but it is something that that is good to have otherwise. This flag will allow you to get meaningful information if you attach your system debugger. Runtime performance hit: none.


This define will allow you to get a haxe stack trace that includes function names when an exception is thrown, or when the stack is queried directly. There is a small overhead incurred per function call when this is on. Runtime performance hit: very small.


This define implies DHXCPP_STACK_TRACE and adds line numbers to the function names. There is additional overhead per line of haxe code, however I had trouble measuring this overhead on the bunnymark – probably because it is not too CPU intensive. Runtime performance hit: small to medium.


This define explicitly checks pointers for null access and throws an informative exception, rather than just crashing. There is an overhead per member-access. Again, for the bunnymark, it was hard to measure a slowdown. Runtime performance hit: small.

So, these defines can be added without using the -debug flag, which removes the compiler optimisations, which accounts for over 80% of the performance drop.

Built-in Profiler

So where is all the time spent? To answer this question, you can use the built-in profiler. The profiler needs HXCPP_STACK_TRACE, and is started by calling cpp.vm.Profiler.start(filename) from the haxe thread you are interested in. You can call this, say, in response to a button-press, but in this example, I will call it from the mainline. You can provide a log filename, or you can just let it write to stdout. If you use a relative filename in an nme project, the file will be written into the executable directory.

  private static function create()
     Lib.current.addChild (new BunnyMark());

We can then analyse the log. The entries are sorted by total time (including child calls) spent in the routine. The first few are all about 100%, as they look for somewhere to delegate the actual work to. Soon enough, we get to this entry:

Stage::nmeRender 95.60%/0.07%
   DisplayObjectContainer::nmeBroadcast 39.6%
   extern::cffi 60.3%
   (internal) 0.1%

Here the Stage::nmeRender call is almost always active, and is broadcasting an event (the ENTER_FRAME event) for 40% of its time, and calling into cffi (nme c++ render code) for 60% of the time. So we have learnt something already.

You can trace ENTER_FRAME the calls though the event dispatcher until we get to the routine actually in bunnymark:

TileTest::enterFrame 36.55%/8.26%
   Lib::getTimer 0.1%
   Tilesheet::drawTiles 70.2%
   Graphics::clear 6.6%
   DisplayObject::nmeSetY 0.0%
   DisplayObject::nmeSetX 0.0%
   DisplayObject::nmeGetWidth 0.2%
   DisplayObject::nmeGetHeight 0.1%
   DisplayObject::nmeGetGraphics 0.2%
   GC::realloc 0.1%
   (internal) 22.6%

Here the “36.55%/8.26%” means that 35% of the total exe time is spent in this routine, including its children, but only 8% of the total time is spent internally (ie, not in child calls).

And, looking at the drawTiles call:

Tilesheet::drawTiles 25.65%/0.01%
   Graphics::drawTiles 99.9%
   (internal) 0.1%

We see 25% of the total time is spend in the NME Graphics::drawTiles call. So this app spends 60% of the time in drawing routine, and 25% of the time in preparing the draw call. From this, you can assume that it is pretty well optimised!

It is also interesting to note the GC entry:

GC::new 0.17%/0.15%
   GC::collect 12.0%
   (internal) 88.0%

Which is very small, indicating that techniques such as object pooling would have no effect here.

This technique is available on all platforms – you just might have to be a little bit careful about where you write your output file.

Built-in Debugger

The built-in debugger requires you to compile with the HXCPP_DEBUGGER define. Starting the debugger is similar to starting the profiler.

  private static function create()
     new hxcpp.DebugStdio(true);
     Lib.current.addChild (new BunnyMark());

You will note that the “DebugStdio” class is in the hxcpp project, so you will need to add the “-lib hxcpp” command, or via nmml: < haxelib name="hxcpp" />.

The “true” parameter means that the debugger stops as soon as you hit this line. This allows you to set breakpoints etc. Running the above, I get an empty display window (no draw calls have been made yet), and a prompt on the command line:

help  - print this message
break [file line] - pause execution of one thread [when at certain point]
breakpoints - list breakpoints
delete N - delete breakpoint N
cont  - continue execution
where - print call stack
files - print file list that may be used with breakpoints
vars - print local vars for frame
array limit N - show at most N array elements
mem - print memory usage
collect - run Gc collection
compact - reduce memory usage
exit  - exit programme
bye  - stop debugging, keep running

Entering “files” shows the input files with indexes 0 to 93. You can use either the filename or file number for setting breakpoints. For example:

debug>break BunnyMark.hx 41

Will set a breakpoint for line 41 on BunnyMark.hx. Currently, you are going to need to have the file handy to look up the line number. So now we want to go back to executing:

debug> c

And immediately, the debugger prints “stopped”. This is because the breakpoint has been hit. To find out where, you can use the “w” command:

Must break first.

The “you must break first” is a little bug in the debugger due to the fact that the break is async. But trying again seems to have fixed this. So here you can see the full call stack. Using the “vars” command shows the local variables, and using the “p” command (short for “print”) shows the values.

debug>p e
debug>p this
nmeChildren=2 elements
name=BunnyMark 3

So now we have a little fun (in a very nerdy sort of way):

debug>set fps.y = 200

And you can see that the fps counter has moved down the screen. The more astute reader will notice that the “y” member is actually a “property” and that setting this property has actually caused some haxe code to be run. You can also print the value of a function call to get arbitrary functions to run.
While the code is running you can use “break” to stop it wherever it happens to be.
Using the “exit” command is a quick way out – this can be useful on android when you want to make sure the process is actually dead.

You can jump into a different function on the stack via the "frame" command (notice the "*" has moved), and examine the vars there:

 => frame 17
 => where
 => vars
 => p title

The command line is all well and good for desktop apps, but it is not much use for mobile apps. To use the debugger on mobile, you can create a debug socket server - the mobile will then connect over WiFi. On your desktop, create and run the hxcpp DdebugTool:

haxe -main hxcpp.DebugTool -lib hxcpp -neko server.n
neko server.n
Waiting for connection on

You can see the ip:port that the server is waiting on. Then, back in your code, replace the "DebugStdio" line with a "DebugSocket" line, using the ip:port from the server:

private static function create()
   new hxcpp.DebugSocket("",8080,true);
   Lib.current.addChild (new BunnyMark());

And then the operation is exactly the same as before.

The debug code is designed to allow different protocols and backends. It should be possible to replace the code in the hxcpp library with code of your own to present the debug information in any way you like. The "worker" classes are in cpp.vm.Debugger, and you can build your own debugger on top of these.
As a final trick, you can call the debugger functions directly (eg, to always add a breakpoint), or cpp.vm.Debugger.breakBad() (I've been watching too much TV) to allow complex conditions to generate breakpoints, eg:

   if (items.length>0 && !found)
      Debugger.breakBad(); // WTF ?

Posted in hxcpp, nme | Tagged , , , | 22 Comments

WWX Conference

I recently got back from ‘WWX’, the World Wide Haxe conference. I had a fantastic time – got to meet a whole bunch of people from the community.

I gave a presentation about the architecture and some details of the hxcpp backend. The slide are in wwx2012-hxcpp.pdf.

Posted in hxcpp | Tagged , , | Leave a comment


I noticed that Google have a site that compares a few languages using a graph traversal benchmark.

So I thought it would be interesting to see how haxe/hxcpp fares. I started with a direct (ie, almost line-for-line) translation of their cpp implementation. I changed the list to arrays where appropriate, and implemented one “set” as an array because haxe does not have a native “dictionary” object for keying from generic objects.

The supplied cpp target was pretty easy to compile on windows by ignoring the makefile and using:
cl -O2 /EHsc

The haxe target uses (right-click, Save As) LoopTesterApp.hx and MaoLoops.hx, and compiles with:
haxe -main LoopTesterApp -cpp cpp

The java_pro code can be built and run (note that increased stack size is required when running) :

$JAVA_HOME/bin/jar -cvf LoopTesterApp.jar `find . -name \*.class`
java -Xss15500k LoopTesterApp

The java target requires additional stack space – and so does neko, however there is no option to increase the stack space with neko like there is in java, so neko just panics. I have not tried as3 – I guess there will be a script timeout which would need to be managed. Also a JS or v8 time would be very interesting.

The runtimes are:

cpp 19.2 seconds
cpp (claimed) 6 seconds ?
hxcpp 26.7 seconds
java (pro) 22.5 seconds

The google paper cites a 3x speedup with optimised c++ code, but there is no code for this.

The java implementation is the hand-optimized version. In hindsight, I should probably have ported this version, rather than the vanilla cpp version. It includes optimizations such as object pooling and task-specific containers. Also, this particular benchmark is designed to allow java to comfortably perform its JIT compiling.

So in conclusion, I would say I’m pretty happy with these results. I’m sure there are some micro-optimizations, such as the “unsafe get” on arrays, which does not check the bounds, and some higher-level stuff that profiling may reveal that could share quite a few percent from the hxcpp time. I’m not particularly interested in optmising for a particular benchmark, however a little bit of profiling here may help speed up the target as a whole, which would be a very good thing.

Posted in Blog | 10 Comments