Sunday, September 14, 2008

Tutorial 6: Inlining

As usual, code first:
#light

open System

let gravity =
    -9.81

let inline spring pos =
    let k = 100.0 in
        -k * pos

let inline drag k pos speed = -k * speed

let inline euler_implicit accel_func pos speed delta =
    let speed2 = speed + delta * accel_func pos speed in
        pos + delta * speed2, speed2

let inline adapt (f: 'num -> 'num -> 'num) =
    fun pos speed ->
        let accel = f pos speed in
            speed, accel

let inline runge_kutta_4 deriv_func pos speed delta =
    let half_delta = 0.5 * delta
    let s1, a1 = deriv_func pos speed
    let s2, a2 = deriv_func (pos + half_delta * s1) (speed + half_delta * a1)
    let s3, a3 = deriv_func (pos + half_delta * s2) (speed + half_delta * a2)
    let s4, a4 = deriv_func (pos + delta * s3) (speed + delta * a3)
    let pos1 = pos + delta/6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
    let speed1 = speed + delta/6.0 * (a1 + 2.0 * a2 + 2.0 * a3 + a4) in
        pos1, speed1

let inline simulate intg_func t0 t_fin delta pos0 speed0 =
    let rec repeat t0 pos0 speed0 =
        if t0 > t_fin then (pos0, speed0)
        else
            let t1 = t0 + delta
            let pos1, speed1 = intg_func pos0 speed0 delta
            repeat t1 pos1 speed1
    repeat t0 pos0 speed0

let initial_states = [| for s in [1..1000] -> (float s, 0.0) |] 
let t0 = 0.0
let t_fin = 1000.0
let delta = 0.025

let accel pos speed = (drag 1.0) pos speed + spring pos + gravity in

let time0 = System.DateTime.Now
let final_states = initial_states |>
Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
let diff0 = System.DateTime.Now - time0
printfn "Time: %f Result: %A" diff0.TotalSeconds final_states;

let time0 = System.DateTime.Now
let final_states = initial_states |>
Array.map (fun (x,y) -> simulate (runge_kutta_4 (adapt accel)) t0 t_fin delta x y)
let diff0 = System.DateTime.Now - time0
printfn "Time: %f Result: %A" diff0.TotalSeconds final_states

ignore(Console.ReadKey(true))


This code differs from the code in Tutorial 5 in the following ways:
- Some functions were declared "inline".
- The adapted function required by the RK4 integrator is constructed at the top level.
- The simulation is run on 1000 bodies instead of one.

The "inline" keyword in C++ instructs the compiler to emit the code of the function in place where it is called, instead of using a jump. I guess the effect is the same in F#. Which functions should be inlined is hard to tell. Small functions are obvious candidates. It also appears that F# automatically inlines some functions, even though they are not marked with "inline". For instance, "accel" was inlined inside the code for "runge_kutta_4". It was however not inlined inside "euler_implicit", which may explain why RK4 ended up faster than Euler (see the table further down).

The adapted function is now built once at the top level, instead of being repeatedly recreated inside "runge_kutta_4". It now seems obvious that this is the right way to do it, but if you look back to Tutorial 4 you may understand how I came to doing it the stupid way. Writing code in an incremental fashion, adding small bits of functionality after another often leads to suboptimal code, but that is hardly avoidable. This underlines how important it is for programming languages to support code refactoring in an easy and safe manner.

I modified the code to run a simulation of 1000 bodies instead of just one. My goal was to introduce concurrency, but I'll leave that for later. Here comes the explanations for the new code:
let initial_states = [| for s in [1..1000] -> (float s, 0.0) |]

If you are familiar with Python you will probably have recognized a comprehension. Comprehensions are also supported by F#. This code declares a variable "initial_states" which is an array of pairs of floats, where the first member denotes the initial position, and the second the initial speed. The new bodies are initially immobile and located one on top of another, separated by 1 meter.
Notice how the integer "s" must be explicitly converted to a float. In F#, integers cannot be multiplied by floats.

let final_states = initial_states |>
    Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
The array of initial states is passed to the Array.map function, which takes a function and applies it to each item in the array, effectively producing a new array.
These lines result in the application of the simulate function to each initial state.

The "|>" operator is just syntactical sugar. This code could have been written:
let final_states =
    Array.map (fun (x,y) -> simulate (euler_implicit accel) t0 t_fin (0.25*delta) x y)
    initial_states


At the end of Tutorial 5, we had managed to write code that was generic, concise and hopefully correct. Sadly, it was also rather slow. How much faster is it now? See the table below.


ImplementationPerformance EulerPerformance RK4
C++172413326530
F# Tutorial 53015632425
F# Tutorial 6 66225111731
OCaml6578971428



Numbers in the performance columns are (numbers of seconds simulated) x (number of bodies) / (running time). They are an indication of how many computations per second each implementation achieved.
Note that the simulation based on Euler uses a time step four times smaller than that of RK4. I think that is fair, since Euler evaluates the acceleration function once each step, instead of four times for RK4.

The C++ implementation is generic (using templates, no virtual method was used) and uses side-effects. I also had a non-generic version which was about twice as fast, but I deleted it (stupid me).
The OCaml implementation is almost identical to the code in Tutorial 5, which means it is both generic and pure (no side effects).
The good performance of RK4 in F# is somewhat artificial and probably untypical of a real application. Indeed, the acceleration function used in this example is excessively simple.

The C++ code is 186 lines long (including empty lines), the F# code is 60 lines long. I also find the F# code much more readable. Compare with the code of RK4 in C++
template<typename A>
class RK4
{
public:
    RK4(const A& accel):
       accel(accel)
    {
    }

    template<typename T>
    inline
    void operator()(T& pos, T& speed, const double delta) const
    {
        const double half_delta = 0.5 * delta;

        std::pair k1 = accel(pos, speed);
        std::pair k2 = accel(pos + half_delta * k1.first, speed + half_delta * k1.second);
        std::pair k3 = accel(pos + half_delta * k2.first, speed + half_delta * k2.second);
        std::pair k4 = accel(pos + delta * k3.first, speed + delta * k3.second);

        pos += delta / 6.0 * (k1.first + 2.0 * (k2.first + k3.first) + k4.first);
        speed += delta / 6.0 * (k1.second + 2.0 * (k2.second + k3.second) +     k4.second);
    }

private:

    const A& accel;
};


What should we conclude? I personally think that the 3x performance penalty of F# is acceptable, considering how uncompromising my F# code is regarding purity and genericity. It would be interesting to see how fast the F# code rewritten to use side effects would run.
This little experiment was enough to convince me to move over to F#. I stuck to C++ because of its support for generic programming (using templates), static type checking and for the large amount of code available. F# supports generic programming using a much cleaner syntax, its IDE performs type checks while the code is being typed in, and it has access to the .NET framework.
A last interesting fact to report: I wrote the generic C++ code after I wrote the F# code, without thinking very much. It was easy to translate the F# code to C++. It would have been considerably harder to write the C++ code from scratch. As a matter of fact, I had no clear idea how to implement RK4 in a generic way before I tackled the problem "thinking" in F#. UPDATE (may 2012): The numbers for C++ were actually for managed C++. After further optimizations (see tutorial 7b), F# went ahead of the managed C++ version. I later wrote a native C++ version which if I remember correctly was about 10 times faster.