Wednesday, October 15, 2008

Tutorial 11: Efficient generic programming

In Tutorial 8 I introduced a method for generic programmings which uses so-called dictionaries of operations (see also pp 110-111 in Expert F#). I concluded the article with the following judgment:

I'll close this up with a comment on performance. To put things bluntly, it's terrible. I am not quite sure if I have done something wrong, so I won't post scary numbers (yet), but I would say this method is not usable for real time physics simulations.

Bug fix

Well, it turned out that (1) I did do something wrong, and (2) there was a lot of untapped optimization potential in this code.

Before I go deeper into optimization techniques, let me correct a bug in the code:

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
vec_ops.scale stiff pos

should have been

let spring (vec_ops: VecOps<'a, 'b>) stiff pos =
vec_ops.scale -stiff pos

Without the minus sign, the simulated system quickly gains so much energy that the positions fall out of range. Apparently, floating point computations involving NaNs are very slow.

Faster, but still slow

With that problem fixed, the program now runs with the following execution times:

1D (struct)
Single (min, max, avg): 17.296000 17.382000 17.330000
Multi (min, max, avg): 8.244000 8.591000 8.412000
Single (min, max, avg): 78.617000 80.769000 79.340333
Multi (min, max, avg): 41.145000 41.741000 41.362333
2D (struct)
Single (min, max, avg): 18.918000 19.332000 19.158667
Multi (min, max, avg): 10.107000 10.411000 10.300667
Single (min, max, avg): 108.103000 110.466000 109.122333
Multi (min, max, avg): 56.108000 58.272000 57.140667
3D (struct)
Single (min, max, avg): 23.440000 23.683000 23.553667
Multi (min, max, avg): 12.671000 15.129000 13.726667
Single (min, max, avg): 112.160000 114.022000 113.266333
Multi (min, max, avg): 60.331000 65.190000 63.192667

Note that integration using 3-dimensional vectors is only ~1.5 times slower than when using 1-dimensional vectors, which is unexpected. The computations alone should really be 3 times slower. Where does the difference come from?

The time for the 1-dimensional computation is roughly fp + ovh, where fp is the time spend doing additions and multiplications on floats, and ovh is the overhead, i.e. time spent in calling functions and allocating objects.
Then the time for the 3-dimensional computation should be 3fp + ovh. For the sake of simplicity I'm assuming the overhead does not depend on the type of vector involved.
I've done the math for single-threaded RK4, and it turns out that the overhead time is about 78% of the total execution time.

It should be possible to do better, especially if we look at the performance of the code in Tutorial 9. Personally, I prefer the code in Tutorial 8, and would be glad if I could bring down the execution time to comparable levels.


A common (and valuable) advice on optimization techniques is to profile the code, using e.g. NProf. In the present case, however, NProf did not tell me anything I did not know.
At this point, .NET reflector comes in handy as it allows one to look at the generated IL code.

Let us look at the code for "ef_Vec1d", the Euler integration function using 1-dimensional vectors. It's not trivial to find, there is a bit of detective work to do. In the code for "_main", we find:

ef_Vec1d is a partially evaluated function, which is compiled into an object mimicking a function. This kind of object holds the evaluated parameters as data members and has an "Invoke" method which is the function itself.
There is a bit of overhead associated to partially evaluated functions, which means one should avoid them in performance critical sections of the code.
If we look at "clo@64_1", we will see that it lacks any data member. Moreover, the "Invoke" method contains:

Notice how the compiler is accessing Vec1D_ops directly, instead of using a data member or a function parameter. Would it be possible to get the compiler to do the same kind of optimization inside "euler_implicit"? In order for this to be possible, "euler_implicit" must be declared "inline". The new "clo@64" becomes:

The generated code is starting to be hard to read, but we can see that the body of "euler_implicit" was indeed inlined, and is now using Vec1D_ops directly. That is a nice optimization, but it would be better if the compiler had gone further and called "Vec1d.scale" directly. It might have even inlined "Vec1d.scale", as this method is very short.
What can we do to help the compiler do that? I had to take a guess here, but I suspected using a record for VecOps instead of a class could be a good idea.

That was a pretty good intuition. For some reason, classes prevent the F# compiler to "see through" them and remove them during optimization. Records, on the other hand, can be removed.

This piece of code, which is much more easily readable than the previous one, exposes a new problem:

Note that "plus" is not a function of two Vec1d returning a Vec1d. What it really is is a function of a single Vec1d, returning a FastFunc, which is itself an object mimicking a function. Every time "plus" is called, a new function object is created, which is a bad thing, performance wise. Happily, the solution is easy: Change the signature of plus to take two Vec1d and return a Vec1d. Instead of Vec1d -> Vec1d -> Vec1d

we want (Vec1d * Vec1d) -> Vec1d:

After applying the same change to "scale", we get the following performance:

1D (struct)
Single (min, max, avg): 1.959000 1.971000 1.964667
Multi (min, max, avg): 0.990000 1.004000 0.995000
Single (min, max, avg): 9.463000 9.509000 9.479667
Multi (min, max, avg): 4.939000 5.023000 4.975667
2D (struct)
Single (min, max, avg): 3.485000 3.522000 3.507333
Multi (min, max, avg): 1.783000 1.814000 1.794667
Single (min, max, avg): 20.306000 20.692000 20.468333
Multi (min, max, avg): 10.557000 10.657000 10.593000
3D (struct)
Single (min, max, avg): 6.089000 6.127000 6.104667
Multi (min, max, avg): 3.073000 3.201000 3.152333
Single (min, max, avg): 32.570000 32.616000 32.597333
Multi (min, max, avg): 16.424000 16.784000 16.657667
3D (record)
Single (min, max, avg): 6.910000 6.958000 6.940667
Multi (min, max, avg): 3.530000 3.575000 3.554667
Single (min, max, avg): 33.879000 34.373000 34.065000
Multi (min, max, avg): 17.662000 17.843000 17.758667

That was quite a performance improvement, wasn't it?
I have also implemented a 3d vector type using a record instead of a struct, in case it might magically improve performance. As you can see, it did not, there is no real difference with the struct.

These numbers could certainly be improved further by using the techniques described in Tutorial 7, but I'll stop here for now.

Complete source code

For the complete source code of the optimized version, see revision 8

To compare the changes with the code from Tutorial 8, see the diff.

No comments: