
Parallel quick sort on Haskell and how hard it was to write
- Transfer
Note perev .: This is a translation of the story of how hard it was to write parallel quicksort in Haskell. The original article was written in 2010, but it seems to me that it is still instructive and largely relevant.
There are many examples of how Haskell makes simple problems difficult. Probably the most famous of them is the sieve of Eratosthenes, which is easy to write in any imperative language, but so difficult to write in Haskell that almost all the solutions that were taught at universities and used in research for the past 18 years turned out to be wrong. Melissa O'Neill drew attention to their failure in her important scientific work " The Real Sieve of Eratosthenes". It provides an excellent description of what is wrong with the old approaches and how to fix them. Melissa’s decision was to use the priority queue to implement the sieve. The correct solution turned out to be 10 times longer than the much simpler solution F # and a whole 100 times longer than the original mutilated algorithm in Haskell.
Today quick sorting is the new sieve of Eratosthenes. And Haskell programmers have again overcome the inability of the language to express this algorithm by mutilating the latter . The new version is slower by p Ordinary, but it can be easily recorded on Haskell.
This code is completely inconsistent with the essence of the real quick sort algorithm, which makes it so efficient (see Tony Hoar's original article on fast sorting , 1962 ). Namely, partitioning an array without additional memory allocation [in-place partitioning using swaps].
Faced with writing a general-purpose parallel parallel quicksort in Haskell, Jim Apple (who is writing a Haskell Ph.D. at the University of California, Davis, UC Davis) started the case by writing the following code:
This algorithm uses Haskell’s parallel “ strategies ”. This concept was developed to give Haskell programmers more control over parallelization, but it turned out that the only available implementation was memory and no one was able to make it work in this code: Jim's solution contains a multithreading error [concurrency], due to which it returns incorrect results on almost every call.
Then Peaker proposed his solution on Haskell:
This solution also turned out to be with bugs. Firstly, it contains a more subtle concurrency bug, which leads to incorrect results only occasionally. Picker fixed this bug in the following code:
This solution does work on small input arrays, but increasing the size of the array to 1,000,000 elements causes the stack to overflow. Two attempts were made to analyze this bug, here and here , but both turned out to be wrong. In fact, this is a bug in the getElems function of the Haskell standard library, which overflows the stack on large arrays.
Oddly enough, the fixes to a few more bugs seem to have led to the implementation of the world's first parallel general purpose quicksort written in Haskell. Moreover, the final solution on Haskell is only about 55% slower than the equivalent solution on F #. Be careful, this solution requires the latest version of GHC, which was released a few weeks ago ( approx. Pere .: article 2010, so the reader has nothing to worry about ).
Ganesh Sittampalam:
Jon Harrop (original author):
There are many examples of how Haskell makes simple problems difficult. Probably the most famous of them is the sieve of Eratosthenes, which is easy to write in any imperative language, but so difficult to write in Haskell that almost all the solutions that were taught at universities and used in research for the past 18 years turned out to be wrong. Melissa O'Neill drew attention to their failure in her important scientific work " The Real Sieve of Eratosthenes". It provides an excellent description of what is wrong with the old approaches and how to fix them. Melissa’s decision was to use the priority queue to implement the sieve. The correct solution turned out to be 10 times longer than the much simpler solution F # and a whole 100 times longer than the original mutilated algorithm in Haskell.
Today quick sorting is the new sieve of Eratosthenes. And Haskell programmers have again overcome the inability of the language to express this algorithm by mutilating the latter . The new version is slower by p Ordinary, but it can be easily recorded on Haskell.
qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)
This code is completely inconsistent with the essence of the real quick sort algorithm, which makes it so efficient (see Tony Hoar's original article on fast sorting , 1962 ). Namely, partitioning an array without additional memory allocation [in-place partitioning using swaps].
Faced with writing a general-purpose parallel parallel quicksort in Haskell, Jim Apple (who is writing a Haskell Ph.D. at the University of California, Davis, UC Davis) started the case by writing the following code:
import Data.HashTable as H
import Data.Array.IO
import Control.Parallel.Strategies
import Control.Monad
import System
exch a i r =
do tmpi <- readArray a i
tmpr <- readArray a r
writeArray a i tmpr
writeArray a i tmpi
bool a b c = if c then a else b
quicksort arr l r =
if r <= l then return () else do
i <- loop (l-1) r =<< readArray arr r
exch arr i r
withStrategy rpar $ quicksort arr l (i-1)
quicksort arr (i+1) r
where
loop i j v = do
(i', j') <- liftM2 (,) (find (>=v) (+1) (i+1)) (find (<=v) (subtract 1) (j-1))
if (i' < j') then exch arr i' j' >> loop i' j' v
else return i'
find p f i = if i == l then return i
else bool (return i) (find p f (f i)) . p =<< readArray arr i
main =
do [testSize] <- fmap (fmap read) getArgs
arr <- testPar testSize
ans <- readArray arr (testSize `div` 2)
print ans
testPar testSize =
do x <- testArray testSize
quicksort x 0 (testSize - 1)
return x
testArray :: Int -> IO (IOArray Int Double)
testArray testSize =
do ans <- newListArray (0,testSize-1) [fromIntegral $ H.hashString $ show i | i <- [1..testSize]]
return ans
This algorithm uses Haskell’s parallel “ strategies ”. This concept was developed to give Haskell programmers more control over parallelization, but it turned out that the only available implementation was memory and no one was able to make it work in this code: Jim's solution contains a multithreading error [concurrency], due to which it returns incorrect results on almost every call.
Then Peaker proposed his solution on Haskell:
import Data.Array.IO
import Control.Monad
import Control.Concurrent
bool t _f True = t
bool _t f False = f
swap arr i j = do
(iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j)
writeArray arr i jv
writeArray arr j iv
parallel fg bg = do
m <- newEmptyMVar
forkIO (bg >> putMVar m ())
fg >> takeMVar m
sort arr left right = when (left < right) $ do
pivot <- read right
loop pivot left (right - 1) (left - 1) right
where
read = readArray arr
sw = swap arr
find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
move op d i pivot = bool (return op)
(sw (d op) i >> return (d op)) =<<
liftM (/=pivot) (read i)
loop pivot oi oj op oq = do
i <- find (+1) (const ( cell>pivot && idx/=left) oj
if i < j
then do
sw i j
p <- move op (+1) i pivot
q <- move oq (subtract 1) j pivot
loop pivot (i + 1) (j - 1) p q
else do
sw i right
forM_ (zip [left..op-1] [i-1,i-2..]) $ uncurry sw
forM_ (zip [right-1,right-2..oq+1] [i+1..]) $ uncurry sw
let ni = if left >= op then i + 1 else right + i - oq
nj = if right-1 <= oq then i - 1 else left + i - op
let thresh = 1024
strat = if nj - left < thresh || right - ni < thresh
then (>>)
else parallel
sort arr left nj `strat` sort arr ni right
main = do
arr <- newListArray (0, 5) [3,1,7,2,4,8]
getElems arr >>= print
sort (arr :: IOArray Int Int) 0 5
getElems arr >>= print
This solution also turned out to be with bugs. Firstly, it contains a more subtle concurrency bug, which leads to incorrect results only occasionally. Picker fixed this bug in the following code:
import System.Time
import System.Random
import Data.Array.IO
import Control.Monad
import Control.Concurrent
import Control.Exception
import qualified Data.List as L
bool t _ True = t
bool _ f False = f
swap arr i j = do
(iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j)
writeArray arr i jv
writeArray arr j iv
background task = do
m <- newEmptyMVar
forkIO (task >>= putMVar m)
return $ takeMVar m
parallel fg bg = do
wait <- background bg
fg >> wait
sort arr left right = when (left < right) $ do
pivot <- read right
loop pivot left (right - 1) (left - 1) right
where
read = readArray arr
sw = swap arr
find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
move op d i pivot = bool (return op)
(sw (d op) i >> return (d op)) =<<
liftM (/=pivot) (read i)
swapRange px x nx y ny = if px x then sw x y >> swapRange px (nx x) nx (ny y) ny else return y
loop pivot oi oj op oq = do
i <- find (+1) (const ( cell>pivot && idx/=left) oj
if i < j
then do
sw i j
p <- move op (+1) i pivot
q <- move oq (subtract 1) j pivot
loop pivot (i + 1) (j - 1) p q
else do
sw i right
nj <- swapRange (oq) (right-1) (subtract 1) (i+1) (+1)
let thresh = 1024000
strat = if nj - left < thresh || right - ni < thresh
then (>>)
else parallel
sort arr left nj `strat` sort arr ni right
timed act = do
TOD beforeSec beforeUSec <- getClockTime
x <- act
TOD afterSec afterUSec <- getClockTime
return (fromIntegral (afterSec - beforeSec) +
fromIntegral (afterUSec - beforeUSec) / 1000000000000, x)
main = do
let n = 1000000
putStrLn "Making rands"
arr <- newListArray (0, n-1) =<< replicateM n (randomRIO (0, 1000000) >>= evaluate)
elems <- getElems arr
putStrLn "Now starting sort"
(timing, _) <- timed $ sort (arr :: IOArray Int Int) 0 (n-1)
print . (L.sort elems ==) =<< getElems arr
putStrLn $ "Sort took " ++ show timing ++ " seconds"
This solution does work on small input arrays, but increasing the size of the array to 1,000,000 elements causes the stack to overflow. Two attempts were made to analyze this bug, here and here , but both turned out to be wrong. In fact, this is a bug in the getElems function of the Haskell standard library, which overflows the stack on large arrays.
Oddly enough, the fixes to a few more bugs seem to have led to the implementation of the world's first parallel general purpose quicksort written in Haskell. Moreover, the final solution on Haskell is only about 55% slower than the equivalent solution on F #. Be careful, this solution requires the latest version of GHC, which was released a few weeks ago ( approx. Pere .: article 2010, so the reader has nothing to worry about ).
First comments on the original article
Ganesh Sittampalam:
Congratulations on learning how to fork and synchronize in Haskell!
Jon Harrop (original author):
Congratulations on testing your theory that it will be "trivial" ...