While asking on the OCaml mailing list about the relative merits of our HLVM project, David McClain described a computation that he had been required to perform in the context of scientific computing in the past:
"I remember dealing with stacks of images taken from an infrared imaging sensor -- perhaps 256 images, each of which is, say 512 x 512 pixels. I needed to obtain median pixel values from the stack and produce one median image..."
This problem is easily solved in both OCaml and F# using idiomatic functional programming. For example, the following is a one-line solution written in OCaml:
Array.map (Array.map (fun gs -> Array.sort compare gs; gs.(m/2))) images
This simply sorts each pixel's sequence into non-descending order and extracts the middle (median) pixel, mapped across the columns and then the rows of the image.
This tiny implementation solves David's example with 226 pixels in 32 seconds, which is fast enough for many applications. However, the performance of this OCaml implementation is heavily degraded by the extensive use of polymorphism at every stage because the current OCaml implementation uses static compilation that can handle polymorphism at run-time and, consequently, requires a uniform run-time representation of all types. This leads to the tagging of integers (which is why OCaml has a 31-bit int type) and the boxing of other types such as float into separate heap-allocated values, both of which are extremely inefficient. This general execution model is always used even if type information is statically available. The only workaround is to inline the Array.sort and Array.map functions by hand.
Fortunately, this inefficiency can be overcome by using Just-In-Time (JIT) compilation instead of static compilation and partially specializing polymorphism away before JIT compilation. This is the intended design for polymorphism in HLVM and the inspiration was drawn from Microsoft's excellent implementation of the CLR.
Consequently, the equivalent F# program is 100× faster than the OCaml:
images |> Array2D.map (fun xs -> Array.sortInPlaceWith compare xs; xs.[m/2])
Moreover, the Task Parallel Library makes it easy to parallelize this implemention to improve the performance even further:
Parallel.For(0, n, fun y ->
for x=0 to n-1 do
Array.sortInPlaceWith compare images.[y, x])
images |> Array2D.map (fun xs -> xs.[m/2])
This parallel implementation is a whopping 821× faster than the OCaml and, in fact, is a superlinear speedup with respect to the 8 cores because it is able to split the data across more caches.