Obviously, I enjoy it very much, and I sincerely hope that other XNA game programmers will decide to give F# a place in their development tools.
I bet there are programmers out there who wonder whether all the nice features F# offers come at too high a cost in terms of performance. I managed to write a fairly computationally-intensive game for the Xbox 360 in F#, so writing a game in F# is clearly feasible. However, the question remains: could I have done a significantly better job using C#?
In an attempt to answer this question, this article compares the performance of four implementations of an octree: two in F# and two in C#.
Problem area
An octree is a data structure often used in 3d games for collision detection. There are different ways to design and implement an octree and its associated algorithms, depending on one's requirements. In my case, the requirements are, in increasing order of importance:
1. Fast querying. Building the octree should be reasonably fast, but need not be as fast as querying. Typically, octrees are built once (possibly at compile time) and queried every frame.
2. Generic. The implementation should be suitable for use with spheres, boxes, triangles, rays...
3. Maintainable source code. Note that this requirement has higher importance than efficiency. As an independent developer, I am not trying to compete with the big guns in number of polygons and objects. Moreover, my time available for programming games is very limited, and I don't want to get stuck with an implementation that may be fast now, but hard to maintain.
Data structure
With this in mind, I came up with the following design for the data structure in F#:
type Node<'TS> =
    { bbox : BoundingBox
      data : Data<'TS> }
and Data<'TS> =
    | Leaf of int * 'TS         // sequence length, sequence of items
    | Inner of Node<'TS> array  // Array of 8 items exactly
Six lines, all in all. Each node has a bounding box covering all volumes contained in the node and its descendants. Each node is either an inner node, in which case it holds references to its 8 children, or a leaf, in which case it holds references to the shapes it contains. The type 'TS denotes a sequence of shapes.
Construction
The function below is used to insert an item into a node of a tree (or into one of its descendants). In the case where the item to insert overlaps the boundaries of the bounding boxes, it will be inserted into multiple leaves. Some implementations avoid this by splitting shapes, but that's too advanced for my needs.
let rec insert max_count max_depth intersect node item depth =
    assert ( intersect node.bbox item )
    match node.data with
    | Leaf(count, items) when (depth >= max_depth || count < max_count) ->
        { node with data = Leaf(count + 1, item :: items) }
    | Leaf(count, items) ->
        let node' = split intersect node
        insert max_count max_depth intersect node' item (depth + 1)
    | Inner(children) ->
        { node with
            data = Inner(
                children
                |> Array.map (fun child_node ->
                    if intersect child_node.bbox item then
                        insert max_count max_depth intersect child_node item (depth + 1)
                    else
                        child_node ) ) }
The function above has quite a few arguments, but the first three are typically constant. Since the function is curried, one can define an additional function by partially applying it:
let myInsert = insert 20 5 myIntersectionPredicate
Using myInsert, the tired programmer is saved the effort of repeatedly passing the first three parameters.
Note that the function is limited to nodes which are of the type Node<'T list>. The Node type is immutable, and in this setting, F# lists are fairly efficient. An alternative would have been to use mutable nodes and resizeable arrays, an approach I picked for the implementations in C#.
There is another function named "split" which I am not showing in this article. Given a leaf node, it produces an inner node and 8 children with equivalent content.
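For illustration purposes, here is a rough sketch of what such a function could look like. This is not my actual implementation: the subBoxes helper and the filtering strategy are made up for this example.

// Illustrative sketch only. Splits a bounding box into its 8 octants.
let subBoxes (box : BoundingBox) : BoundingBox[] =
    let ctr = (box.Min + box.Max) * 0.5f
    [| for (x0, x1) in [ box.Min.X, ctr.X; ctr.X, box.Max.X ] do
        for (y0, y1) in [ box.Min.Y, ctr.Y; ctr.Y, box.Max.Y ] do
         for (z0, z1) in [ box.Min.Z, ctr.Z; ctr.Z, box.Max.Z ] do
          yield BoundingBox(Vector3(x0, y0, z0), Vector3(x1, y1, z1)) |]

// Turns a leaf into an inner node whose 8 children cover the same content,
// each item going into every octant it intersects.
let split intersect node =
    match node.data with
    | Leaf(_, items) ->
        let children =
            subBoxes node.bbox
            |> Array.map (fun childBox ->
                let childItems = items |> List.filter (intersect childBox)
                { bbox = childBox
                  data = Leaf(List.length childItems, childItems) })
        { node with data = Inner children }
    | Inner _ -> node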
My early measurements showed that lists might not be the most appropriate data structure to use once the octree is built. Arrays can be iterated over more efficiently. The function below translates an octree "under construction" into a "query-able fast octree".
let rec freeze (node : Node<'T list>) : Node<'T[]> =
    match node.data with
    | Leaf (count, ts) ->
        { bbox = node.bbox ; data = Leaf (count, Array.ofList ts) }
    | Inner (children) ->
        { bbox = node.bbox ; data = Inner (children |> Array.map freeze) }
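Putting insert and freeze together, building a query-able octree could look roughly like the sketch below. The names emptyRoot and items are placeholders, and each item is assumed to intersect the root's bounding box.

// Illustrative sketch: fold a set of items into an empty root
// (a Leaf with an empty list), then freeze the result for querying.
let buildFrozen (emptyRoot : Node<int list>) (items : int[]) : Node<int[]> =
    items
    |> Array.fold (fun root item -> myInsert root item 0) emptyRoot
    |> freeze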
Querying
Checking for overlap of items contained in the octree with another item is done using a recursive function, which goes into the relevant leaves, thus avoiding unnecessary item-to-item intersection checks.
let checkIntersection intersectBbox intersectSomeItem node =
    let rec work node =
        intersectBbox node.bbox
        && match node.data with
           | Leaf(_, items) -> intersectSomeItem items
           | Inner(children) -> children |> Array.exists work
    work node
Something that might be puzzling about this function is that it seems it lacks an "item" parameter. Actually, it's hidden in the "intersectBbox" and "intersectSomeItem" predicates. Those are typically constructed using closures that capture the item of interest, as shown below:
let intersectBBox2 (spheres : BoundingSphere[]) (i : int) =
    let sph = spheres.[i]
    fun (box : BoundingBox) -> box.Intersects(sph)

let intersectBSphere (spheres1 : BoundingSphere[]) (spheres2 : BoundingSphere[]) (i1 : int) =
    let sph = spheres1.[i1]
    fun i2s -> i2s |> Array.exists (fun i2 -> sph.Intersects(spheres2.[i2]))

let checkIntersection_BoundingSpheres spheres_octree spheres2 node item =
    checkIntersection
        (intersectBBox2 spheres2 item)
        (intersectBSphere spheres2 spheres_octree item)
        node
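As a usage example, here is roughly how one could test a set of moving spheres against an octree built over indices into the array of spheres it contains. The names frozenRoot, octreeSpheres and movingSpheres are hypothetical, not part of the code above.

// Illustrative sketch: does any of the moving spheres hit something in the octree?
let anyCollision (frozenRoot : Node<int[]>)
                 (octreeSpheres : BoundingSphere[])
                 (movingSpheres : BoundingSphere[]) =
    seq { 0 .. movingSpheres.Length - 1 }
    |> Seq.exists (fun item ->
        checkIntersection_BoundingSpheres octreeSpheres movingSpheres frozenRoot item)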
Alternative implementations
The code shown above originates from an implementation which I labelled "Virtual". It makes heavy use of function values. Calls to the functions behind these function values are called "virtual calls", and are often slower than direct calls.
F# has a feature which other .NET languages lack, namely inlining. Inlining can be used to avoid virtual calls, which in theory makes it possible to reach the speed of direct calls without losing genericity. Moreover, it can enable compile-time optimizations that are not available to the JIT at run-time. The downside is that inlining increases the size of the generated code, and it can even lead to slower code when used inappropriately. I wrote a variant which uses inlining, labelled "Inline". It uses the following replacement for Array.exists:
let inline exists (pred : 'T -> bool) (xs : 'T[]) : bool =
    let mutable i = 0
    while i < xs.Length && not (pred xs.[i]) do
        i <- i + 1
    i < xs.Length

The implementation of "insert" in this variant is also inlined, but "checkIntersection" isn't. Time measurements show that inlining clearly is beneficial in this case.
How do these implementations in F# fare compared to implementations in C#? I wrote two variants in C#. The first one is inspired by the F# code and maintains its level of genericity. It uses delegates where F# uses function values, and LINQ where F# uses functions from the Array and List modules. The second C# variant ("Specific") is based on the first one with genericity removed, which allows delegates to be replaced by direct method calls. Hand-written loops (using foreach and for) are used instead of LINQ.
Benchmark
Three different octrees are used, with different levels of space density. The number of shapes they contain is identical, but the sizes of the shapes differ: the larger the shapes, the denser the space. Three sets of external shapes are created, also with varying levels of density. The timings measure, on the one hand, the time to build each octree and, on the other hand, the total time needed to query each octree with each set of external shapes. The queries are repeated in order to produce significant run times and to avoid wide variations due to GC interruptions on the Xbox 360.
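To give an idea of the shape of the measurements (the actual benchmark code is available in the bitbucket repository; the names below are placeholders), the query timings are of this form:

open System.Diagnostics

// Illustrative sketch of the timing loop: repeat the queries enough times
// to get run times that are large compared to timer resolution and GC noise.
let timeQueries repetitions (queryOne : int -> bool) (externalShapes : int[]) =
    let watch = Stopwatch.StartNew()
    for _ in 1 .. repetitions do
        for i in 0 .. externalShapes.Length - 1 do
            queryOne externalShapes.[i] |> ignore
    watch.Stop()
    watch.Elapsed.TotalSeconds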
Results
Results vary depending on the specific sets of shapes, and also from one run to the next. The results listed here are qualitatively representative of the runs I made.
For the PC platform:

Batch | Build time (s) | Query time (s) |
Virtual sparse | 0.016 | 1.185 |
Virtual | 0.011 | 1.136 |
Virtual dense | 0.096 | 1.029 |
Inline sparse | 0.016 | 0.777 |
Inline | 0.011 | 0.743 |
Inline dense | 0.093 | 0.651 |
Delegate sparse | 0.009 | 1.194 |
Delegate | 0.007 | 1.150 |
Delegate dense | 0.045 | 0.941 |
Specific sparse | 0.007 | 0.913 |
Specific | 0.007 | 0.890 |
Specific dense | 0.043 | 0.759 |

For the Xbox 360 platform:

Batch | Build time (s) | Query time (s) |
Virtual sparse | 0.170 | 1.062 |
Virtual | 0.194 | 0.972 |
Virtual dense | 2.193 | 1.800 |
Inline sparse | 0.117 | 0.574 |
Inline | 0.146 | 0.558 |
Inline dense | 2.082 | 0.677 |
Delegate sparse | 0.082 | 1.605 |
Delegate | 0.098 | 1.543 |
Delegate dense | 0.752 | 2.172 |
Specific sparse | 0.068 | 0.568 |
Specific | 0.094 | 0.542 |
Specific dense | 0.656 | 0.423 |

The datasets for the PC platform are much larger than those for the Xbox 360 platform; with identically sized datasets, the PC times were too low to be significant.
Observations
The time to build is always smaller for the C# implementation. This is not surprising, as I'm comparing two different algorithms, one using mutable structures (in the C# implementation), the other using immutable data structures requiring significantly more short-lived objects.
Using inlining improves the performance of the F# implementation, regardless of the platform. The generic C# implementation is the slowest, regardless of the platform. It's not shown here, but earlier measurements on the Xbox 360 showed that the problem was in LINQ.
The performance profile of F# on the Xbox 360 is not the same as on the PC. F# does remarkably well on the PC, the inline implementation being the fastest. On the Xbox 360, the non-generic C# version wins, but the F# versions hold their ground pretty well.
Conclusions
As far as I am concerned, F# is more than good enough when it comes to performance, compared to C#.
This does not mean that F# is better than C#, though. I spent my optimising efforts on the F# implementations; I can imagine that a proficient C# programmer could do better than my variants in that language.
By the way, I used Red Gate's Reflector to analyse the code generated by the F# compiler and optimise it. I also used a profiler (JetBrains' dotTrace Performance 4.0), but its measurements on my PC were not a good indication of the performance on the Xbox 360.
All source code is available on bitbucket.
Update: 2010 Dec 14
I have updated the C# and F# implementations to match each other more closely:
- The C# implementation now uses List<T>
- The F# implementation now uses mutable ResizeArray<'T> (the F# alias for List<'T>, which I find more appropriate as the term "list" is coupled to "linked list" in my mind). Unlike the C# implementations, new nodes are still created when splitting leaves. A sketch of what the node type might look like with this change follows below.
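As an illustration of the change (the type below is a sketch with made-up names, not the actual updated code), the mutable node could look like this:

// Illustrative sketch: leaves now carry a ResizeArray that can grow in place,
// so appending an item no longer allocates a new node.
type MutableNode<'T> =
    { bbox : BoundingBox
      mutable data : MutableData<'T> }
and MutableData<'T> =
    | MutableLeaf of ResizeArray<'T>
    | MutableInner of MutableNode<'T>[]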
Updated results follow:
On the Xbox 360:
Batch | Build time (s) | Query time (s) |
Virtual sparse | 0.155 | 0.954 |
Virtual normal | 0.247 | 0.873 |
Virtual dense | 1.331 | 1.472 |
Inline sparse | 0.110 | 0.548 |
Inline normal | 0.199 | 0.480 |
Inline dense | 1.202 | 0.497 |
Delegate sparse | 0.084 | 1.629 |
Delegate normal | 0.165 | 1.537 |
Delegate dense | 0.927 | 2.406 |
Specific sparse | 0.068 | 0.503 |
Specific normal | 0.149 | 0.404 |
Specific dense | 0.855 | 0.405 |
On the PC:
Batch | Build time (s) | Query time (s) |
Virtual sparse | 0.016 | 1.179 |
Virtual normal | 0.017 | 0.994 |
Virtual dense | 0.084 | 1.038 |
Inline sparse | 0.014 | 0.809 |
Inline normal | 0.016 | 0.671 |
Inline dense | 0.082 | 0.685 |
Delegate sparse | 0.010 | 1.199 |
Delegate normal | 0.011 | 0.954 |
Delegate dense | 0.053 | 0.955 |
Specific sparse | 0.007 | 0.802 |
Specific normal | 0.011 | 0.680 |
Specific dense | 0.055 | 0.685 |
From these updated results, one can observe the following points:
- When it comes to querying, on the PC, the optimized F# and C# implementations (called Inline and Specific) are now running at more-or-less identical speeds. The non-optimized implementations are also very similar.
- On the Xbox 360, the optimized C# version is the fastest, with the optimized F# version close behind. The non-optimized F# version is noticeably faster than the non-optimized C# one. I think this is all due to the inefficiency of LINQ on the Xbox 360.
- Regarding build times, the F# versions are now closer to the C# versions, but still significantly slower. Although not a problem in practice (octrees are typically not re-built every frame), I wonder why.
The main conclusion from the first version of this article still holds: The performance of F# roughly matches that of C#.
I personally think that when it comes to beauty, the optimized F# version beats the C# version, but that's just my opinion :)