Trying to make a PDF book from a web comic with Haskell, using xkcd as an example
After reading the article, I decided to check how suitable Haskell is for this task. I'll say right away that Haskell itself does quite well, but while browsing hackage.haskell.org I immediately ran into problems with the libraries for working with PDF, which put an end to a full implementation.
Nevertheless, I decided to do part of the work in order to show how the same task could be done in Haskell, if only ...
Get Comic Book Information
Since we will have to request information in the form of JSON, we will write a few helper functions right away:

    rethrow :: (Show e) => Exceptional e a -> IO a
    rethrow = switch (throwIO . userError . show) return

    jsonAt url = simpleHTTP (getRequest url) >>= getResponseBody >>= rethrow . decode_ json

    str s = member_ (literal s) JSON.string
    num n = member_ (literal n) JSON.number

The function rethrow throws an exception when parsing fails, while str and num make it easier to specify the required JSON fields. The code to get the number of the latest comic and to get a comic by its number then looks like this:

    comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap (fromIntegral . numerator) . rethrow . decode_ (num "num")
    comic n = jsonAt (concat ["http://xkcd.com/", show n, "/info.0.json"]) >>= rethrow . decode_ (str "img" <&> str "title")

The function decode_ parses the received data according to the given pattern. In the case of num "num" we get from the JSON a numeric member named num; in the case of str "img" <&> str "title" we get a tuple of two strings, for the picture and the title respectively. The code to get the picture by URL:

    image url = simpleHTTP (getRequest url) >>= getResponseBody
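The idea behind rethrow, turning a failed pure result into an IO exception, can be tried without any JSON library. The following is only a minimal sketch using base: it works on Either rather than on the Exceptional type used above, and the names rethrowEither, parseNum and demo are hypothetical and not part of the original program.

```haskell
import Control.Exception (IOException, throwIO, try)
import Text.Read (readEither)

-- | Turn a failed pure result into an IO exception, in the spirit of rethrow.
rethrowEither :: Show e => Either e a -> IO a
rethrowEither = either (throwIO . userError . show) return

-- | Hypothetical stand-in for a decoder: parse a comic number from text.
parseNum :: String -> Either String Int
parseNum = readEither

-- | One successful parse, then one failure that is caught and reported.
demo :: IO ()
demo = do
  n <- rethrowEither (parseNum "1234")
  print n
  r <- try (rethrowEither (parseNum "oops")) :: IO (Either IOException Int)
  putStrLn (either (const "parse failed") show r)
```

Loading this in GHCi and running demo shows both paths. The benefit is the same as in the article: the calling code can stay a plain chain of >>= steps and leave the error handling to whoever catches the exception.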
Download Thread
We write the download of a single comic as a separate function.

    retrieve ch l i = tryGet `onException` onFail where
        onFail = do
            writeChan ch (i, Nothing)
            writeLogger l Error $ "Comic " ++ show i ++ " failed to download"
        tryGet = do
            (imgUrl, title) <- comic i
            imgData <- image imgUrl
            jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either (throwIO . userError . show) return
            writeChan ch (i, Just (jpg, title))
            writeLogger l Info $ "Comic " ++ show i ++ " downloaded"
        fname = show i ++ ".jpg"

Here we use a channel ch (Control.Concurrent.Chan), to which we send the download results, as well as a thread-safe log l. Because of the clumsiness of the HPDF library, we first have to save the image to a file and then load it back from there. It is completely unclear to me why the author wrote JPEG parsing from scratch (and only from a file at that) instead of using a ready-made library.
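The reporting scheme used by retrieve, where every thread writes exactly one message to the channel whether it succeeds or fails, can be tried in isolation. This is only a sketch using base: fakeDownload is a hypothetical stand-in for the real HTTP and JPEG steps, and worker and collect are illustrative names that do not appear in the original code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Exception (onException, throwIO)
import Control.Monad (forM_, replicateM)
import Data.List (sortOn)

-- | Hypothetical stand-in for a download: odd numbers succeed, even ones fail.
fakeDownload :: Int -> IO String
fakeDownload i
  | odd i     = return ("image " ++ show i)
  | otherwise = throwIO (userError ("cannot fetch " ++ show i))

-- | Write either the result or Nothing to the channel, as retrieve does.
-- onException runs the handler and then rethrows, so a failing thread
-- still ends with an exception after it has reported Nothing.
worker :: Chan (Int, Maybe String) -> Int -> IO ()
worker ch i = tryGet `onException` writeChan ch (i, Nothing)
  where
    tryGet = do
      img <- fakeDownload i
      writeChan ch (i, Just img)

-- | Start one thread per index and wait for exactly one report from each.
collect :: Int -> IO [(Int, Maybe String)]
collect n = do
  ch <- newChan
  forM_ [1 .. n] (forkIO . worker ch)
  results <- replicateM n (readChan ch)
  return (sortOn fst results)
```

Because every worker is guaranteed to report once, the consumer can simply read n messages and never waits for a thread that has already died. The failing threads still print their exception to stderr, which is harmless in this sketch.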
PDF Generation
Now we need a function that generates the resulting PDF from a list of pictures. There is nothing particularly interesting in it: we simply call the corresponding library functions. The one important nuance is that the list of pictures is lazy, so the function starts working as soon as the first picture arrives.
    pdf imgs = runPdf "Xkcd.pdf" doc (PDFRect 0 0 800 600) $ forM_ imgs genPage where
        genPage (jpeg, title) = do
            img <- createPDFJpeg jpeg
            page <- addPage Nothing
            drawWithPage page (drawText (text (PDFFont Times_Roman 12) 0 0 (toPDFString title)) >> drawXObject img)
        doc = PDFDocumentInfo {
            author = toPDFString "voidex",
            subject = toPDFString "xkcd",
            pageMode = UseNone,
            pageLayout = OneColumn,
            viewerPreferences = standardViewerPrefs,
            compressed = False }
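The remark about laziness can be checked without HPDF. The sketch below, which uses only base, pairs a deliberately slow producer with a consumer that walks the lazy list returned by getChanContents. The names produce, consume and runDemo, the 100 ms delay and the "page" strings are all made up for illustration.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, getChanContents, newChan, writeChan)
import Control.Monad (forM_)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- | Hypothetical slow producer: emits one "page" every 100 ms.
produce :: Chan String -> Int -> IO ()
produce ch n = forM_ [1 .. n] $ \i -> do
  threadDelay 100000
  writeChan ch ("page " ++ show i)

-- | Consume the first n items, handling each one as soon as it arrives.
consume :: Chan a -> Int -> (a -> IO ()) -> IO ()
consume ch n act = do
  items <- getChanContents ch   -- lazy list, never ends by itself
  mapM_ act (take n items)      -- blocks only while waiting for the next item

-- | Run both sides together and return the pages in arrival order.
runDemo :: Int -> IO [String]
runDemo n = do
  ch <- newChan
  seen <- newIORef []
  _ <- forkIO (produce ch n)
  consume ch n $ \p -> do
    putStrLn ("got " ++ p)      -- printed as each page arrives
    modifyIORef seen (p :)
  reverse <$> readIORef seen
```

Running runDemo 3 in GHCi prints the lines one at a time rather than all at once at the end, which is the behaviour the pdf function relies on when it is handed a list that is still being filled.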
Putting It Together
In the main function, we initialize the log, create a channel into which the lightweight threads will write their results, and start generating the PDF from a lazy list of pictures taken from this channel.

    main = bracket (newLogger Console) closeLogger $ \l -> do
        n <- comics
        writeLogger l Info $ "Number of comics to download: " ++ show n
        ch <- newChan
        mapM_ (fork . retrieve ch l) [1..n]
        cts <- fmap (take n) $ getChanContents ch
        let imgs = catMaybes $ mapMaybe (`lookup` cts) [1..n]
        pdf imgs `onException` (writeLogger l Error "Unable to generate PDF")
        writeLogger l Info "PDF generated."

The function bracket is similar to using: it guarantees that the log is closed. With the line mapM_ (fork . retrieve ch l) [1..n] we create a download thread for each number, i.e. we call retrieve ch l i in a separate thread. fmap (take n) $ getChanContents ch returns a lazy list with the first n results. Taking the whole list makes no sense, since the channel is endless. Then we apply the function lookup to each index in order, from 1 to n. This gives us a list that is still lazy but in which the pictures come strictly in order. Thus we always write the pictures in the right order.
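The reordering step from main can be looked at separately, since it is a pure function of the list read from the channel. This is a small sketch using only Data.Maybe; inOrder, sample and demo are hypothetical names, and the sample data is made up.

```haskell
import Data.Maybe (catMaybes, mapMaybe)

-- | Reorder results that arrived in arbitrary order and drop the failed ones.
-- For each index, lookup scans the arrival list until that index is found.
inOrder :: Int -> [(Int, Maybe a)] -> [a]
inOrder n arrived = catMaybes (mapMaybe (`lookup` arrived) [1 .. n])

-- | Hypothetical arrival order: comic 3 came first, comic 2 failed.
sample :: [(Int, Maybe String)]
sample = [(3, Just "c"), (1, Just "a"), (2, Nothing), (4, Just "d")]

-- | The successful comics, sorted by index.
demo :: [String]
demo = inOrder 4 sample   -- ["a","c","d"]
```

Each lookup forces only as much of the arrival list as is needed to find its index, so the first page becomes available as soon as comic 1 has reported, whatever else is still downloading. The price is a linear scan per index, which seems acceptable for a list of this size but would be worth replacing with a map for much larger inputs.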
Full listing
    main = bracket (newLogger Console) closeLogger $ \l -> do
        n <- comics
        writeLogger l Info $ "Number of comics to download: " ++ show n
        ch <- newChan
        mapM_ (fork . retrieve ch l) [1..n]
        cts <- fmap (take n) $ getChanContents ch
        let imgs = catMaybes $ mapMaybe (`lookup` cts) [1..n]
        pdf imgs `onException` (writeLogger l Error "Unable to generate PDF")
        writeLogger l Info "PDF generated."

    retrieve ch l i = tryGet `onException` onFail where
        onFail = do
            writeChan ch (i, Nothing)
            writeLogger l Error $ "Comic " ++ show i ++ " failed to download"
        tryGet = do
            (imgUrl, title) <- comic i
            imgData <- image imgUrl
            jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either (throwIO . userError . show) return
            writeChan ch (i, Just (jpg, title))
            writeLogger l Info $ "Comic " ++ show i ++ " downloaded"
        fname = show i ++ ".jpg"

    pdf imgs = runPdf "Xkcd.pdf" doc (PDFRect 0 0 800 600) $ forM_ imgs genPage where
        genPage (jpeg, title) = do
            img <- createPDFJpeg jpeg
            page <- addPage Nothing
            drawWithPage page (drawText (text (PDFFont Times_Roman 12) 0 0 (toPDFString title)) >> drawXObject img)
        doc = PDFDocumentInfo {
            author = toPDFString "voidex",
            subject = toPDFString "xkcd",
            pageMode = UseNone,
            pageLayout = OneColumn,
            viewerPreferences = standardViewerPrefs,
            compressed = False }

    rethrow :: (Show e) => Exceptional e a -> IO a
    rethrow = switch (throwIO . userError . show) return

    jsonAt url = simpleHTTP (getRequest url) >>= getResponseBody >>= rethrow . decode_ json

    str s = member_ (literal s) JSON.string
    num n = member_ (literal n) JSON.number

    comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap (fromIntegral . numerator) . rethrow . decode_ (num "num")
    comic n = jsonAt (concat ["http://xkcd.com/", show n, "/info.0.json"]) >>= rethrow . decode_ (str "img" <&> str "title")

    image url = simpleHTTP (getRequest url) >>= getResponseBody

    writeBinaryFile fname str = withBinaryFile fname WriteMode (\h -> hPutStr h str)
Complaints
Unfortunately, for lack of a decent library for working with PDF, the result was a disappointment.
HPDF refuses to accept most of the images (thanks to yet another reinvented-wheel implementation of JPEG loading), and I never even figured out how to scale the images.
Praise
It was very convenient to test a request directly in GHCi, then parse one of the responses, then download and save a picture. All development was done there, and only afterwards was the code moved to a file. Multithreading was bolted on without adding interfaces or any extra code. Instead of returning the result, we simply write it to a channel, at the other end of which sits a handler, and we add fork to the function that should run asynchronously. In the general case not everything is so simple, of course, but from my own experience I can say that I have never had to change the architecture for this.
In general, take a look at hackage.haskell.org, search for the libraries you need, and if you find them, don't miss the chance to write everything in Haskell!