Saturday, December 11, 2010

On the performance of F# on the Xbox 360

I recently mentioned on the App Hub (the gathering place for indies targeting Windows Phone 7 and the Xbox 360) that I was using F#. Someone asked what programming in F# was like.

Obviously, I enjoy it very much, and I sincerely hope that other XNA game programmers will decide to give F# a place in their development tools.

I bet there are programmers out there who wonder if all these nice features that F# supports don't come at too high a cost when it comes to performance. I managed to write a fairly computationally-intensive game for the Xbox 360 in F#, so obviously writing a game in F# is feasible. However, the question remains whether I could have done a significantly better job using C#?

In an attempt to answer this question, I am comparing in this article the performance of four implementations of an octree: Two in F# and two in C#.

Problem area


An octree is a data structure often used in 3d games for collision detection. There are different ways to design and implement an octree and its associated algorithms, depending on one's requirements. In my case, the requirements are, in increasing order of importance:

1. Fast querying. Building the octree should be reasonably fast, but need not be as fast as querying. Typically, octrees are built once (possibly at compile time) and queried every frame.
2. Generic. The implementation should be suitable for use with spheres, boxes, triangles, rays...
3. Maintainable source code. Note that this requirement has higher importance than efficiency. As an independent developer, I am not trying to compete with the big guns in number of polygons and objects. Moreover, my time available for programming games is very limited, and I don't want to get stuck with an implementation that may be fast now, but hard to maintain.

Data structure


With this in mind, I came up with the following design for the data structure in F#:

type Node<'TS> =
    { bbox : BoundingBox
      data : Data<'TS> }
and Data<'TS> =
    | Leaf of int * 'TS          // sequence length, sequence of items
    | Inner of Node<'TS> array   // Array of 8 items exactly

Six lines, all in all. Each node has a bounding box covering all volumes contained in this node and its descendants. Each node is either an internal node, in which case it has references to its 8 children, or a leaf, and it has references to the shapes it contains. The type TS denotes a sequence of shapes.

Construction


The function below is used to insert an item into a node of a tree (or into one of its descendants). In the case where the item to insert overlaps the boundaries of the bounding boxes, it will be inserted into multiple leaves. Some implementations avoid this by splitting shapes, but that's too advanced for my needs.

let rec insert max_count max_depth intersect node item depth =
    assert (
       intersect node.bbox item
    )
    
    match node.data with
    | Leaf(count, items) when (depth >= max_depth || count < max_count) ->
        { node with data = Leaf(count + 1, item :: items) }
    | Leaf(count, items) ->
        let node' = split intersect node
        insert max_count max_depth intersect node' item (depth + 1)
    | Inner(children) ->
        { node with
            data = Inner(
                    children
                    |> Array.map (fun child_node ->
                            if intersect child_node.bbox item
                            then insert max_count max_depth intersect child_node item (depth + 1)
                            else child_node
                        )
                    )
        }

The function above has quite a few arguments, but the first three are typically constant. One can therefore define an additional function using currying:

let myInsert = insert 20 5 myIntersectionPredicate

Using myInsert, the tired programmer is saved the effort of repeatedly passing the first three parameters.

Note that the function is limited to nodes which are of the type Node<'T list>. The Node type is immutable, and in this setting, F# lists are fairly efficient. An alternative would have been to use mutable nodes and resizeable arrays, an approach I picked for the implementations in C#.

There is another function named "split" which I am not showing in this article. Given a leaf node, it produces an inner node and 8 children with equivalent content.

My early measurements showed that lists might not be the most appropriate data structure to use once the octree is built. Arrays can be iterated over more efficiently. The function below translates an octree "under construction" into a "query-able fast octree".

let rec freeze (node : Node<'T list>) : Node<'T[]> =
    match node.data with
    | Leaf (count, ts) -> { bbox = node.bbox ; data = Leaf (count, Array.ofList ts) }
    | Inner (children) -> { bbox = node.bbox ; data = Inner (children |> Array.map freeze) }

Querying


Checking for overlap of items contained in the octree with another item is done using a recursive function, which goes into the relevant leaves, thus avoiding unnecessary item-to-item intersection checks.

let checkIntersection intersectBbox intersectSomeItem node =
    let rec work node =
        intersectBbox node.bbox
        &&
        match node.data with
        | Leaf(_, items) ->
            intersectSomeItem items
        | Inner(children) ->
            children |> Array.exists work
        
    work node

Something that might be puzzling about this function is that it seems it lacks an "item" parameter. Actually, it's hidden in the "intersectBbox" and "intersectSomeItem" predicates. Those are typically constructed using closures that capture the item of interest, as shown below:

let intersectBBox2 (spheres : BoundingSphere[]) (i : int) =
    let sph = spheres.[i]

    fun (box : BoundingBox) ->
        box.Intersects(sph)

let intersectBSphere (spheres1 : BoundingSphere[]) (spheres2 : BoundingSphere[]) (i1 : int) =
    let sph = spheres1.[i1]

    fun i2s ->
        i2s
        |> Array.exists (fun i2 -> sph.Intersects(spheres2.[i2]))

let checkIntersection_BoundingSpheres spheres_octree spheres2 node item =
    checkIntersection (intersectBBox2 spheres2 item) (intersectBSphere spheres2 spheres_octree item) node

Alternative implementations


The code showed above originates from an implementation which I labelled "Virtual". It makes heavy use of function values. Calls to the functions behind these function values are called "virtual calls", and are often slower than direct calls.

F# has a feature which other .NET languages lack, namely inlining. Inlining can be used to avoid virtual calls, which in theory allows to achieve the speed of direct calls without losing genericity. Moreover, it can enable compilation-time optimizations that are not available to the JIT at run-time. The down-side is that they increase the size of the generated code. Inlining can also lead to slower code when used inappropriately. I wrote a variant which uses inlining labelled "Inline". It uses the following replacement for Array.exists:

let inline exists (pred : 'T -> bool) (xs : 'T[]) : bool =
    let mutable i = 0
    while i < xs.Length && not (pred xs.[i]) do
        i <- i + 1
    i < xs.Length
The implementation of "insert" in this variant is also inlined, but "checkIntersection" isn't. Time measurements show that inlining clearly is beneficial in this case.

How do these implementations in F# fare compared to implementations in C#? I wrote two variants in C#. The first one is inspired from the F# code and maintains its level of genericity. It uses delegates where F# uses function values, and LINQ where F# uses functions from the Array and List modules. The second C# variant ("Specific") is based on the first one with genericity removed, thus allowing the removal of delegates (direct method calls are used instead). Hand-made loops (using foreach and for) are used instead of LINQ.

Benchmark


Three different octrees are used with different levels of space density. The number of shapes they contain are identical, but the size of each shapes is different. The larger the shapes, the denser the space. Three sets of external shapes are created, also with varying levels of density. The timings measure on the one hand the creation of each octree, the total amount of time to query each octree using each set of external shape. The queries are repeated in order to give significant run times, and avoid wide variation due to GC interruptions on the Xbox 360.

Results


Results vary depending on the specific sets of shapes, and vary from a run to the other. The results I'm listing here are qualitatively representative of the runs I made.

For the PC platform:

Virtual sparse: 0.016 / 1.185
Virtual: 0.011 / 1.136
Virtual dense: 0.096 / 1.029

Inline sparse: 0.016 / 0.777
Inline: 0.011 / 0.743
Inline dense: 0.093 / 0.651

Delegate sparse: 0.009 / 1.194
Delegate: 0.007 / 1.150
Delegate dense: 0.045 / 0.941

Specific sparse: 0.007 / 0.913
Specific: 0.007 / 0.890
Specific dense: 0.043 / 0.759 

For the Xbox platform:

Virtual sparse: 0.170 / 1.062
Virtual: 0.194 / 0.972
Virtual dense: 2.193 / 1.800

Inline sparse: 0.117 / 0.574
Inline: 0.146 / 0.558
Inline dense: 2.082 / 0.677

Delegate sparse: 0.082 / 1.605
Delegate: 0.098 / 1.543
Delegate dense: 0.752 / 2.172

Specific sparse: 0.068 / 0.568
Specific: 0.094 / 0.542
Specific dense: 0.656 / 0.423 

The pairs of numbers are the time to build and the time to query, in seconds. The datasets for the PC platform are much larger than those for the Xbox 360 platform. With identically sized datasets, the PC times were too low to be significant.

Observations


The time to build is always smaller for the C# implementation. This is not surprising, as I'm comparing two different algorithms, one using mutable structures (in the C# implementation), the other using immutable data structures requiring significantly more short-lived objects.

Using inlining improves the performance of the F# implementation, regardless of the platform. The generic C# implementation is the slowest, regardless of the platform. It's not showed here, but earlier measurements on the Xbox 360 showed that the problem was in LINQ.

The performance profile of F# on the Xbox 360 is not the same as on the PC. The F# does remarkably well on the PC, the inline implementation being the fastest. On the Xbox 360, the non-generic versions wins, but the F# versions holds its ground pretty well.

Conclusions


As far as I am concerned, F# is more than good enough when it comes to performance, compared to C#.

This does not mean that F# is better than C#, though. I spent my optimising efforts on the F# implementations, I can imagine a proficient C# programmer could do better than my variants in this language.

By the way, I used Red Gate's Reflector to analyse the code generated by the F# compiler and optimise it. I also used a profiler (Jet Brain's dotTrace Performance 4.0), but its measurements on my PC were not a good indication of the performance on the Xbox 360.

All source code is available on bitbucket.

Update: 2010 dec 14


I have updated the C# and F# implementations to match each other more closely:
- The C# implementation now uses List during construction of the octree, and arrays during querying. Apparently, iterating over an array is faster than iterating over List<t>.
- The F# implementation now uses mutable ResizeArray<t> (the F# synonym for List<t>, which I find more appropriate as the term "list" is coupled to "linked list" in my mind). Unlike the C# implementations, new nodes are still created when splitting leaves.

Updated results follow:

On the Xbox 360:


BatchBuild time (s)Query time (s)
Virtual sparse0,1550,954
Virtual normal0,2470,873
Virtual dense1,3311,472
Inline sparse0,1100,548
Inline normal0,1990,480
Inline dense1,2020,497
Delegate sparse0,0841,629
Delegate normal0,1651,537
Delegate dense0,9272,406
Specific sparse0,0680,503
Specific normal0,1490,404
Specific dense0,8550,405

On the PC:

BatchBuild time (s)Query time (s)
Virtual sparse0,0161,179
Virtual normal0,0170,994
Virtual dense0,0841,038
Inline sparse0,0140,809
Inline normal0,0160,671
Inline dense0,0820,685
Delegate sparse0,0101,199
Delegate normal0,0110,954
Delegate dense0,0530,955
Specific sparse0,0070,802
Specific normal0,0110,680
Specific dense0,0550,685

From these updated results, one can observe the following points:
- When it comes to querying, on the PC, the optimized F# and C# implementations (called Inline and Specific) are now running at more-or-less identical speeds. The non-optimized implementations are also very similar.
- On the Xbox 360, the C# optimized version is the fastest one, with the F# optimized close behind. The F# non-optimized version is noticeable faster than the C# non-optimized one. I think this is all due to the inefficiency of LINQ on the Xbox 360.
- Regarding build times, the F# versions are now closer to the C# versions, but still significantly slower. Although not a problem in practice (octrees are typically not re-built every frame), I wonder why.

The main conclusion from the first version of this article still holds: The performance of F# roughly matches that of C#.

I personally think that when it comes to beauty, the optimized F# version beats the C# version, but that's just my opinion :)