Sunday, 1 August 2010

Parallel generic quicksort in Haskell

Haskell has a history of making easy problems difficult. Perhaps the most infamous example was the Sieve of Eratosthenes, which is easily implemented in any imperative language but was so difficult to write in Haskell that almost all of the solutions that had been taught in universities and used in research for the preceding 18 years had been wrong until Melissa O'Neill published a seminal paper The Genuine Sieve of Eratosthenes that gave a beautiful description of what they had been doing wrong and how it should be corrected. Melissa's solution was to use a priority queue to implement a rolling wheel of numbers. The correct solution turned out to be 10× longer than a much simpler F# solution and a whopping 100× longer than the original bastardized algorithm in Haskell.

Today, quicksort is the new Sieve of Eratosthenes. Again, the academics have addressed Haskell's failure by bastardizing the algorithm, trading orders of magnitude in performance for something that Haskell can express easily:

qsort [] = [] qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)

This completely fails to capture the essence of the real quicksort algorithm (see Tony Hoare's original 1962 quicksort paper) that makes it so efficient. Specifically, the in-place partitioning using swaps.

Faced with the challenge of writing a parallel generic quicksort in Haskell, Jim Apple (who is doing a PhD on Haskell at UC Davis) kicked of proceedings with the following code:

import Data.HashTable as H import Data.Array.IO import Control.Parallel.Strategies import Control.Monad import System exch a i r = do tmpi <- readArray a i tmpr <- readArray a r writeArray a i tmpr writeArray a i tmpi bool a b c = if c then a else b quicksort arr l r = if r <= l then return () else do i <- loop (l-1) r =<< readArray arr r exch arr i r withStrategy rpar $ quicksort arr l (i-1) quicksort arr (i+1) r where loop i j v = do (i', j') <- liftM2 (,) (find (>=v) (+1) (i+1)) (find (<=v) (subtract 1) (j-1)) if (i' < j') then exch arr i' j' >> loop i' j' v else return i' find p f i = if i == l then return i else bool (return i) (find p f (f i)) . p =<< readArray arr i main = do [testSize] <- fmap (fmap read) getArgs arr <- testPar testSize ans <- readArray arr (testSize `div` 2) print ans testPar testSize = do x <- testArray testSize quicksort x 0 (testSize - 1) return x testArray :: Int -> IO (IOArray Int Double) testArray testSize = do ans <- newListArray (0,testSize-1) [fromIntegral $ H.hashString $ show i | i <- [1..testSize]] return ans

This solution uses Haskell's parallel "strategies". This concept was introduced to give Haskell programmers more control over parallelization but the only available implementation was found to leak memory and nobody was able to get it to work in this case: Jim's solution contains a concurrency bug that causes it to return incorrect results almost every time it is called.

Peaker then posted his own Haskell solution:

import Data.Array.IO import Control.Monad import Control.Concurrent bool t _f True = t bool _t f False = f swap arr i j = do (iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j) writeArray arr i jv writeArray arr j iv parallel fg bg = do m <- newEmptyMVar forkIO (bg >> putMVar m ()) fg >> takeMVar m sort arr left right = when (left < right) $ do pivot <- read right loop pivot left (right - 1) (left - 1) right where read = readArray arr sw = swap arr find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i move op d i pivot = bool (return op) (sw (d op) i >> return (d op)) =<< liftM (/=pivot) (read i) loop pivot oi oj op oq = do i <- find (+1) (const (<pivot)) oi j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj if i < j then do sw i j p <- move op (+1) i pivot q <- move oq (subtract 1) j pivot loop pivot (i + 1) (j - 1) p q else do sw i right forM_ (zip [left..op-1] [i-1,i-2..]) $ uncurry sw forM_ (zip [right-1,right-2..oq+1] [i+1..]) $ uncurry sw let ni = if left >= op then i + 1 else right + i - oq nj = if right-1 <= oq then i - 1 else left + i - op let thresh = 1024 strat = if nj - left < thresh || right - ni < thresh then (>>) else parallel sort arr left nj `strat` sort arr ni right main = do arr <- newListArray (0, 5) [3,1,7,2,4,8] getElems arr >>= print sort (arr :: IOArray Int Int) 0 5 getElems arr >>= print

This solution also turned out to be buggy. Firstly, it contains a more subtle concurrency bug causes it to return incorrect results only occassionally. Peaker corrected this bug to give the following code:

import System.Time import System.Random import Data.Array.IO import Control.Monad import Control.Concurrent import Control.Exception import qualified Data.List as L bool t _ True = t bool _ f False = f swap arr i j = do (iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j) writeArray arr i jv writeArray arr j iv background task = do m <- newEmptyMVar forkIO (task >>= putMVar m) return $ takeMVar m parallel fg bg = do wait <- background bg fg >> wait sort arr left right = when (left < right) $ do pivot <- read right loop pivot left (right - 1) (left - 1) right where read = readArray arr sw = swap arr find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i move op d i pivot = bool (return op) (sw (d op) i >> return (d op)) =<< liftM (/=pivot) (read i) swapRange px x nx y ny = if px x then sw x y >> swapRange px (nx x) nx (ny y) ny else return y loop pivot oi oj op oq = do i <- find (+1) (const (<pivot)) oi j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj if i < j then do sw i j p <- move op (+1) i pivot q <- move oq (subtract 1) j pivot loop pivot (i + 1) (j - 1) p q else do sw i right nj <- swapRange (<op) left (+1) (i-1) (subtract 1) ni <- swapRange (>oq) (right-1) (subtract 1) (i+1) (+1) let thresh = 1024000 strat = if nj - left < thresh || right - ni < thresh then (>>) else parallel sort arr left nj `strat` sort arr ni right timed act = do TOD beforeSec beforeUSec <- getClockTime x <- act TOD afterSec afterUSec <- getClockTime return (fromIntegral (afterSec - beforeSec) + fromIntegral (afterUSec - beforeUSec) / 1000000000000, x) main = do let n = 1000000 putStrLn "Making rands" arr <- newListArray (0, n-1) =<< replicateM n (randomRIO (0, 1000000) >>= evaluate) elems <- getElems arr putStrLn "Now starting sort" (timing, _) <- timed $ sort (arr :: IOArray Int Int) 0 (n-1) print . (L.sort elems ==) =<< getElems arr putStrLn $ "Sort took " ++ show timing ++ " seconds"

This solution does run correctly on small inputs but increasing the problem size to 1,000,000 elements results in a stack overflow. Two attempts were made to diagnose this bug, here and here, but both turned out to be wrong. The bug is actually in the getElems function of the Haskell standard library which stack overflows on long arrays.

Surprisingly, more bug fixing seems to have culminated in the world's first parallel generic quicksort written in Haskell. Furthermore, the resulting Haskell solution is only around 55% slower than the equivalent F# solution. Note that this requires the latest GHC that was only released in recent weeks.

3 comments:

Ganesh Sittampalam said...

Congratulations on finally learning how to fork then synchronise in Haskell!

Flying Frog Consultancy Ltd. said...

@Ganesh: Congratulations on getting to test your theory that this was going to be "trivial"!

Alessandro said...

There is a implementation of the Sieve of Eratosthenes in haskell consisting of only 10 lines of code.

Sure, it is kinda imperative. But so is the F# version.