## Sunday, 1 August 2010

### Parallel generic quicksort in Haskell

Haskell has a history of making easy problems difficult. Perhaps the most infamous example was the Sieve of Eratosthenes, which is easily implemented in any imperative language but was so difficult to write in Haskell that almost all of the solutions that had been taught in universities and used in research for the preceding 18 years had been wrong until Melissa O'Neill published a seminal paper The Genuine Sieve of Eratosthenes that gave a beautiful description of what they had been doing wrong and how it should be corrected. Melissa's solution was to use a priority queue to implement a rolling wheel of numbers. The correct solution turned out to be 10× longer than a much simpler F# solution and a whopping 100× longer than the original bastardized algorithm in Haskell.

Today, quicksort is the new Sieve of Eratosthenes. Again, the academics have addressed Haskell's failure by bastardizing the algorithm, trading orders of magnitude in performance for something that Haskell can express easily:

```qsort []     = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)```

This completely fails to capture the essence of the real quicksort algorithm (see Tony Hoare's original 1962 quicksort paper) that makes it so efficient. Specifically, the in-place partitioning using swaps.

Faced with the challenge of writing a parallel generic quicksort in Haskell, Jim Apple (who is doing a PhD on Haskell at UC Davis) kicked of proceedings with the following code:

```import Data.HashTable as H
import Data.Array.IO
import Control.Parallel.Strategies
import System

exch a i r =
do tmpi <- readArray a i
writeArray a i tmpr
writeArray a i tmpi

bool a b c = if c then a else b

quicksort arr l r =
if r <= l then return () else do
i <- loop (l-1) r =<< readArray arr r
exch arr i r
withStrategy rpar \$ quicksort arr l (i-1)
quicksort arr (i+1) r
where
loop i j v = do
(i', j') <- liftM2 (,) (find (>=v) (+1) (i+1)) (find (<=v) (subtract 1) (j-1))
if (i' < j') then exch arr i' j' >> loop i' j' v
else return i'
find p f i = if i == l then return i
else bool (return i) (find p f (f i)) . p =<< readArray arr i

main =
do [testSize] <- fmap (fmap read) getArgs
arr <- testPar testSize
ans <- readArray arr  (testSize `div` 2)
print ans

testPar testSize =
do x <- testArray testSize
quicksort x 0 (testSize - 1)
return x

testArray :: Int -> IO (IOArray Int Double)
testArray testSize =
do ans <- newListArray (0,testSize-1) [fromIntegral \$ H.hashString \$ show i | i <- [1..testSize]]
return ans```

This solution uses Haskell's parallel "strategies". This concept was introduced to give Haskell programmers more control over parallelization but the only available implementation was found to leak memory and nobody was able to get it to work in this case: Jim's solution contains a concurrency bug that causes it to return incorrect results almost every time it is called.

Peaker then posted his own Haskell solution:

```import Data.Array.IO
import Control.Concurrent

bool t _f True = t
bool _t f False = f

swap arr i j = do
writeArray arr i jv
writeArray arr j iv

parallel fg bg = do
m <- newEmptyMVar
forkIO (bg >> putMVar m ())
fg >> takeMVar m

sort arr left right = when (left < right) \$ do
loop pivot left (right - 1) (left - 1) right
where
sw = swap arr
find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
move op d i pivot = bool (return op)
(sw (d op) i >> return (d op)) =<<
loop pivot oi oj op oq = do
i <- find (+1) (const (<pivot)) oi
j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj
if i < j
then do
sw i j
p <- move op (+1) i pivot
q <- move oq (subtract 1) j pivot
loop pivot (i + 1) (j - 1) p q
else do
sw i right
forM_ (zip [left..op-1] [i-1,i-2..]) \$ uncurry sw
forM_ (zip [right-1,right-2..oq+1] [i+1..]) \$ uncurry sw
let ni = if left >= op then i + 1 else right + i - oq
nj = if right-1 <= oq then i - 1 else left + i - op
let thresh = 1024
strat = if nj - left < thresh || right - ni < thresh
then (>>)
else parallel
sort arr left nj `strat` sort arr ni right

main = do
arr <- newListArray (0, 5) [3,1,7,2,4,8]
getElems arr >>= print
sort (arr :: IOArray Int Int) 0 5
getElems arr >>= print```

This solution also turned out to be buggy. Firstly, it contains a more subtle concurrency bug causes it to return incorrect results only occassionally. Peaker corrected this bug to give the following code:

```import System.Time
import System.Random
import Data.Array.IO
import Control.Concurrent
import Control.Exception
import qualified Data.List as L

bool t _ True = t
bool _ f False = f

swap arr i j = do
writeArray arr i jv
writeArray arr j iv

m <- newEmptyMVar
return \$ takeMVar m

parallel fg bg = do
wait <- background bg
fg >> wait

sort arr left right = when (left < right) \$ do
loop pivot left (right - 1) (left - 1) right
where
sw = swap arr
find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
move op d i pivot = bool (return op)
(sw (d op) i >> return (d op)) =<<
swapRange px x nx y ny = if px x then sw x y >> swapRange px (nx x) nx (ny y) ny else return y
loop pivot oi oj op oq = do
i <- find (+1) (const (<pivot)) oi
j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj
if i < j
then do
sw i j
p <- move op (+1) i pivot
q <- move oq (subtract 1) j pivot
loop pivot (i + 1) (j - 1) p q
else do
sw i right
nj <- swapRange (<op) left (+1) (i-1) (subtract 1)
ni <- swapRange (>oq) (right-1) (subtract 1) (i+1) (+1)
let thresh = 1024000
strat = if nj - left < thresh || right - ni < thresh
then (>>)
else parallel
sort arr left nj `strat` sort arr ni right

timed act = do
TOD beforeSec beforeUSec <- getClockTime
x <- act
TOD afterSec afterUSec <- getClockTime
return (fromIntegral (afterSec - beforeSec) +
fromIntegral (afterUSec - beforeUSec) / 1000000000000, x)

main = do
let n = 1000000
putStrLn "Making rands"
arr <- newListArray (0, n-1) =<< replicateM n (randomRIO (0, 1000000) >>= evaluate)
elems <- getElems arr
putStrLn "Now starting sort"
(timing, _) <- timed \$ sort (arr :: IOArray Int Int) 0 (n-1)
print . (L.sort elems ==) =<< getElems arr
putStrLn \$ "Sort took " ++ show timing ++ " seconds"```

This solution does run correctly on small inputs but increasing the problem size to 1,000,000 elements results in a stack overflow. Two attempts were made to diagnose this bug, here and here, but both turned out to be wrong. The bug is actually in the getElems function of the Haskell standard library which stack overflows on long arrays.

Surprisingly, more bug fixing seems to have culminated in the world's first parallel generic quicksort written in Haskell. Furthermore, the resulting Haskell solution is only around 55% slower than the equivalent F# solution. Note that this requires the latest GHC that was only released in recent weeks.

Ganesh Sittampalam said...

Congratulations on finally learning how to fork then synchronise in Haskell!

Jon Harrop said...

@Ganesh: Congratulations on getting to test your theory that this was going to be "trivial"!

Alessandro said...

There is a implementation of the Sieve of Eratosthenes in haskell consisting of only 10 lines of code.

Sure, it is kinda imperative. But so is the F# version.

Jon Harrop said...

@Alessandro: Note that my F# code produces an arbitrarily-long sequence of numbers whereas that imperative Haskell code only precomputes up to a specific number.

faithless said...

It's strange that no one called the author a liar. Melissas sieve is 15 lines long and the unfaithful sieve is 1 line. Wich gives 15 times less code, not 100. And isn't F# version is just trial division one?

Jon Harrop said...

@faithless IIRC Melissa's code requires data structures that aren't required by the others so I included those lines of code too. And, no, my F# is not just trial division.

twelfthstrand said...

No stack overflow; 14 seconds for 10 million elements (2.1 GB RAM), AMD FX-8350, ghc 8.6.3

https://gist.github.com/tallpeak/4b434c004d04111dd160b02452c94282