Friday, October 24, 2008

Tutorials moving to Google code

From now on, I will use the wiki on the Google code project for the tutorials. This has the advantage that readers will be able to see the history of corrections. Another good thing is that the wiki supports syntax highlighting. Blogger's software does not seem especially fit for posting code.

I think I will continue to use this blog for my thoughts and opinions about game programming and F#.

Sunday, October 19, 2008

Tutorial 11b: Efficient generic programming

Specialization is a programming technique used to improve the performance of generic code. It consists of providing dedicated implementations for one or several specific instantiations of the generic code.

Specialization using pattern matching

The "runge_kutta_4" function shown in these tutorials contains the following lines:

old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)

This is fine when dealing with floats, but with 3-dimensional vectors and larger types, it may be excessively slow due to the large number of intermediate values produced. Indeed, this code is the same as:

tmp1 = v2 + v3
tmp2 = 2.0 * tmp1
tmp3 = tmp2 + v4
tmp4 = v1 + tmp3
tmp5 = 1.0 / 6.0
tmp6 = tmp5 * tmp4
old + tmp6

Of these 6 intermediate values, the vector sums tmp1 to tmp4 could be replaced by a single one.

What if we had a function to add 6 vectors creating only one value, the final result?

let add6 ((v0: Vec3d), (v1: Vec3d), (v2: Vec3d), (v3: Vec3d), (v4: Vec3d), (v5: Vec3d)) =
    let x = v0.x + v1.x + v2.x + v3.x + v4.x + v5.x
    let y = v0.y + v1.y + v2.y + v3.y + v4.y + v5.y
    let z = v0.z + v1.z + v2.z + v3.z + v4.z + v5.z
    Vec3d(x, y, z)

Admittedly, it's not very pretty, and it would be nice if the user of "runge_kutta_4" were not forced to provide such a function.

F# has a parametric type called "option" which is useful for representing this kind of optional value. The C++ counterpart is to use a pointer and let the null value denote the case where no value is provided.
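As an illustration (made-up names, not code from the tutorial), here is how an optional value is declared and consumed:

```fsharp
// An "int option" either holds a value (Some 42) or nothing (None).
let withValue : int option = Some 42
let withoutValue : int option = None

// Pattern matching extracts the value safely.
let describe opt =
    match opt with
    | Some x -> sprintf "got %d" x
    | None -> "got nothing"
```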

The dictionary of operations is modified as follows:

type VecOps<'vector, 'scalar> = {
    up: 'vector
    add: 'vector * 'vector -> 'vector
    add6: Add6<'vector> option
    scale: 'scalar * 'vector -> 'vector
}

(By the way, an equivalent way of writing the declaration of add6 would be "add6: 'vector Add6 option".)
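The "Add6" abbreviation itself is not shown in this excerpt; presumably it is a type abbreviation along these lines (a sketch, the exact definition may differ):

```fsharp
// A function taking six vectors as a tuple and returning one vector.
type Add6<'vector> =
    'vector * 'vector * 'vector * 'vector * 'vector * 'vector -> 'vector

// For example, an instance at type float:
let sum6 : Add6<float> = fun (a, b, c, d, e, f) -> a + b + c + d + e + f
```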

An instance of the dictionary using add6 is created as follows:

let Vec3d_ops = {
    add = plus_Vec3d;
    scale = scale_Vec3d;
    add6 = Some add6;
    up = Vec3d(1.0)
}

Or, to leave out add6:

let Vec3d_ops = {
    add = plus_Vec3d;
    scale = scale_Vec3d;
    add6 = None;
    up = Vec3d(1.0)
}

The last part of the "runge_kutta_4" function is modified to check if the dictionary of operations contains an "add6":

match vec_ops.add6 with
// No add6 available, use pair-wise additions (will create many intermediate vectors).
| None ->
    old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)
// add6 available, use it.
| Some add ->
    let tmp = add (v1, v2, v2, v3, v3, v4)
    old + (1.0/6.0) * tmp

This is a new F# construct, called "pattern-matching". Basically, it means:

"Check the value of vec_ops.add6. If it is equal to None, then do the sum the usual way. Otherwise, if the value of vec_ops.add6 is of the form Some add, then do the sum using 'add'."

I think pattern matching is a very nice construct. It allows you to match a value with fairly complex expressions called patterns. I won't dig deeper into patterns yet, let me just say that even though they look complex, the code generated by the compiler is as optimized as possible.

Optimization troubles

Unfortunately, not everything runs as fine as I had hoped. If I look at the generated code, I see that the optimizations that the F# compiler performed in the first part of Tutorial 11 have now disappeared. The problem seems to be that the compiler no longer sees through the record used for the dictionary of operations, and is therefore unable to inline the addition and scale operations. It is also unable to see if vec_ops.add6 is constantly "None" or if it has a constant function value.

As it turns out, the problem has nothing to do with pattern matching. Indeed, if I remove the "up" field from the record, or use multiple parameters instead of a record, the generated code looks very nice, as shown below.

When an "add6" function is provided:

When no "add6" function is provided:

Was it worth it?
I get a 25% performance boost thanks to "add6" when using 3d vectors. On the other hand, I get a 65% performance drop when summing floats. This seems to be a clear case where tailoring generic code to specific types really helps to achieve the best performance.
There is a link to the full code at the bottom of this page, if you want to try it out for yourself. I have made a habit of messing up my benchmark measurements, so don't take my word for it!


Time to conclude the part of these Tutorials dedicated to generic programming and low-level optimization.

To summarize:
1) Make sure your computations on floats are correct. NaN values kill performance.
2) Avoid curried functions when performance is critical, especially for methods.
3) Inline generic functions to avoid indirect calls to operations. However, do not get into the habit of inlining all small functions; the compiler or the CLR does that for you.
4) Keep dictionaries of operations small, or avoid them altogether. Alternatives include static type constraints, shown in Tutorial 9, or function parameters.
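To illustrate the static type constraints mentioned in item 4 (a sketch, not the actual code from Tutorial 9): an inline function's operators are resolved at compile time against the concrete argument types, so no dictionary of operations is needed at all.

```fsharp
// "inline" lets the compiler resolve (+) and (*) statically for each
// concrete type this function is used with.
let inline scale_add k a b = a + k * b

// Works for floats; the same function also works for any vector type
// that defines (+) and (*).
let r = scale_add 2.0 1.0 3.0   // 1.0 + 2.0 * 3.0 = 7.0
```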

Regarding item 4), I hope this is a problem that will be fixed in later releases of the F# compiler.

Source code
The complete source code used in this tutorial is available on google code (with comments, for a change).

Wednesday, October 15, 2008

Tutorial 11: Efficient generic programming

In Tutorial 8 I introduced a method for generic programming which uses so-called dictionaries of operations (see also pp 110-111 in Expert F#). I concluded the article with the following judgment:

I'll close this up with a comment on performance. To put things bluntly, it's terrible. I am not quite sure if I have done something wrong, so I won't post scary numbers (yet), but I would say this method is not usable for real time physics simulations.

Bug fix

Well, it turned out that (1) I did do something wrong, and (2) there was a lot of untapped optimization potential in this code.

Before I go deeper into optimization techniques, let me correct a bug in the code:

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
    vec_ops.scale stiff pos

should have been

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
    vec_ops.scale (-stiff) pos

Without the minus sign, the simulated system quickly gains so much energy that the positions fall out of range. Apparently, floating point computations involving NaNs are very slow.

Faster, but still slow

With that problem fixed, the program now runs with the following execution times:

1D (struct)
Single (min, max, avg): 17.296000 17.382000 17.330000
Multi (min, max, avg): 8.244000 8.591000 8.412000
Single (min, max, avg): 78.617000 80.769000 79.340333
Multi (min, max, avg): 41.145000 41.741000 41.362333
2D (struct)
Single (min, max, avg): 18.918000 19.332000 19.158667
Multi (min, max, avg): 10.107000 10.411000 10.300667
Single (min, max, avg): 108.103000 110.466000 109.122333
Multi (min, max, avg): 56.108000 58.272000 57.140667
3D (struct)
Single (min, max, avg): 23.440000 23.683000 23.553667
Multi (min, max, avg): 12.671000 15.129000 13.726667
Single (min, max, avg): 112.160000 114.022000 113.266333
Multi (min, max, avg): 60.331000 65.190000 63.192667

Note that integration using 3-dimensional vectors is only ~1.5 times slower than when using 1-dimensional vectors, which is unexpected. The computations alone should really be 3 times slower. Where does the difference come from?

The time for the 1-dimensional computation is roughly fp + ovh, where fp is the time spent doing additions and multiplications on floats, and ovh is the overhead, i.e. time spent calling functions and allocating objects.
Then the time for the 3-dimensional computation should be 3fp + ovh. For the sake of simplicity I'm assuming the overhead does not depend on the type of vector involved.
I've done the math for single-threaded RK4, and it turns out that the overhead time is about 78% of the total execution time.
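The arithmetic can be sketched as follows, using the single-threaded RK4 averages from the table above:

```fsharp
// Measured averages (single-threaded RK4, from the table above):
let time1d = 79.340333   // fp + ovh
let time3d = 113.266333  // 3*fp + ovh

// Subtracting the two equations gives 2*fp; everything else follows.
let fp  = (time3d - time1d) / 2.0   // ~17.0
let ovh = time1d - fp               // ~62.4
let overheadShare = ovh / time1d    // ~0.79, i.e. roughly 78-79%
```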

It should be possible to do better, especially if we look at the performance of the code in Tutorial 9. Personally, I prefer the code in Tutorial 8, and would be glad if I could bring down the execution time to comparable levels.


A common (and valuable) piece of advice on optimization is to profile the code, using e.g. NProf. In the present case, however, NProf did not tell me anything I did not already know.
At this point, .NET reflector comes in handy as it allows one to look at the generated IL code.

Let us look at the code for "ef_Vec1d", the Euler integration function using 1-dimensional vectors. It's not trivial to find; there is a bit of detective work to do. In the code for "_main", we find:

ef_Vec1d is a partially evaluated function, which is compiled into an object mimicking a function. This kind of object holds the evaluated parameters as data members and has an "Invoke" method which is the function itself.
There is a bit of overhead associated with partially evaluated functions, which means one should avoid them in performance-critical sections of the code.
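A minimal illustration of partial application (made-up names, not from the tutorial):

```fsharp
let add3 a b c = a + b + c

// "add3 1 2" compiles to a function object that stores 1 and 2 as
// data members; each such partial application allocates a new object.
let partiallyApplied = add3 1 2
let result = partiallyApplied 3   // invokes the object's "Invoke" method
```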
If we look at "clo@64_1", we will see that it lacks any data member. Moreover, the "Invoke" method contains:

Notice how the compiler is accessing Vec1D_ops directly, instead of using a data member or a function parameter. Would it be possible to get the compiler to do the same kind of optimization inside "euler_implicit"? In order for this to be possible, "euler_implicit" must be declared "inline". The new "clo@64" becomes:

The generated code is starting to be hard to read, but we can see that the body of "euler_implicit" was indeed inlined, and is now using Vec1D_ops directly. That is a nice optimization, but it would be better if the compiler had gone further and called "Vec1d.scale" directly. It might have even inlined "Vec1d.scale", as this method is very short.
What can we do to help the compiler do that? I had to take a guess here, but I suspected using a record for VecOps instead of a class could be a good idea.

That was a pretty good intuition. For some reason, classes prevent the F# compiler from "seeing through" them and removing them during optimization. Records, on the other hand, can be removed.

This piece of code, which is much more easily readable than the previous one, exposes a new problem:

Note that "plus" is not a function of two Vec1d returning a Vec1d. It is really a function of a single Vec1d returning a FastFunc, which is itself an object mimicking a function. Every time "plus" is called, a new function object is created, which is bad performance-wise. Happily, the solution is easy: change the signature of "plus" to take two Vec1d and return a Vec1d. Instead of Vec1d -> Vec1d -> Vec1d

we want (Vec1d * Vec1d) -> Vec1d:
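The change can be sketched as follows (a simplified stand-in for Vec1d is used here for illustration):

```fsharp
type Vec1d = { x: float }

// Curried form: calling "plusCurried a" first builds an intermediate
// function object, which is then invoked with "b".
let plusCurried (a: Vec1d) (b: Vec1d) = { x = a.x + b.x }

// Tupled form: a single call with both arguments, no intermediate object.
let plusTupled (a: Vec1d, b: Vec1d) = { x = a.x + b.x }
```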

After applying the same change to "scale", we get the following performance:

1D (struct)
Single (min, max, avg): 1.959000 1.971000 1.964667
Multi (min, max, avg): 0.990000 1.004000 0.995000
Single (min, max, avg): 9.463000 9.509000 9.479667
Multi (min, max, avg): 4.939000 5.023000 4.975667
2D (struct)
Single (min, max, avg): 3.485000 3.522000 3.507333
Multi (min, max, avg): 1.783000 1.814000 1.794667
Single (min, max, avg): 20.306000 20.692000 20.468333
Multi (min, max, avg): 10.557000 10.657000 10.593000
3D (struct)
Single (min, max, avg): 6.089000 6.127000 6.104667
Multi (min, max, avg): 3.073000 3.201000 3.152333
Single (min, max, avg): 32.570000 32.616000 32.597333
Multi (min, max, avg): 16.424000 16.784000 16.657667
3D (record)
Single (min, max, avg): 6.910000 6.958000 6.940667
Multi (min, max, avg): 3.530000 3.575000 3.554667
Single (min, max, avg): 33.879000 34.373000 34.065000
Multi (min, max, avg): 17.662000 17.843000 17.758667

That was quite a performance improvement, wasn't it?
I have also implemented a 3d vector type using a record instead of a struct, in case it might magically improve performance. As you can see, it did not: there is no real difference from the struct.

These numbers could certainly be improved further by using the techniques described in Tutorial 7, but I'll stop here for now.

Complete source code

For the complete source code of the optimized version, see revision 8

To compare the changes with the code from Tutorial 8, see the diff.

Friday, October 10, 2008

Source code on Google code

Just a quick note to announce that I have created a google code project to host the source code used in this blog.
I have not yet added all tutorials, but the first seven are already there.