Many applications require the ability to churn through gigabytes of data. In such cases, high IO throughput is valuable. This article examines high-throughput IO in the Haskell and F# programming languages. The Haskell solutions use the ByteString library and the F# solutions use ordinary .NET 4 IO.
An eager Haskell solution that loads the entire file into memory may be written as follows:
import System import Bits import Data.List import Data.ByteString as B main = getArgs >>= B.readFile . Data.List.head >>= print . B.foldl xor 0
A lazy alternative that loads parts of the file on-demand may be written as follows:
import System import Bits import Data.List import Data.ByteString.Lazy as B main = getArgs >>= B.readFile . Data.List.head >>= print . B.foldl xor 0
An eager F# version can simply call the ReadAllBytes function to load the entire file:
System.IO.File.ReadAllBytes file |> Array.fold (^^^) 0uy
An on-demand alternative simply opens the file and calls ReadByte on a buffered stream in a loop until the end of the file is reached:
use stream = new System.IO.FileStream(file, System.IO.FileMode.Open) use stream = new System.IO.BufferedStream(stream) let rec loop n = let b = stream.ReadByte() if b >= 0 then loop (n ^^^ b) else n loop 0
Compiling the Haskell with GHC 6.12.3 using --make -O2 flags and running the F# interactively in Visual Studio 2010 on a Dell Precision T5400 running 32-bit Windows Vista with 4Gb of RAM gave the following results for the on-demand versions:
The eager F# failed due to lack of memory on all three files. The eager Haskell obtained the correct answer on the smallest file, crashed on the middle file and produced corrupted output on the largest file. These problems appear to stem from a signed 32-bit integer overflowing inside the ByteString library, causing it to either crash when apparently-negative ByteString lengths are encountered or to read only part of a file.
Several points of interest:
- The variations between the speed of the Haskell and F# solutions is less than 5% in each case.
- Wrapping the IO in an IEnumerable (seq in F#) severely degrades performance.
- On 64-bit machines, memory mapping have different trade-offs.
- Haskell's ByteString library includes some special functions (e.g. count) that use optimized kernels written in C and assembler to manipulate large quantities of data more efficiently than is currently possible from Haskell or F#.