Trying to make a PDF book from a web comic with Haskell using xkcd as an example

    After reading the article, I decided to check how suitable Haskell is for this task. I’ll say right away that Haskell itself works quite well, but a look through hackage.haskell.org immediately turned up problems with the libraries for working with PDF, which put an end to a full implementation.
    Nevertheless, I decided to do part of the work to show how the same task could be done in Haskell, if only ...

    Get Comic Book Information

    Since we will have to request information in the form of JSON, we first write a few useful functions:

    -- Turn a failed Exceptional into an IO exception.
    rethrow :: (Show e) => Exceptional e a -> IO a
    rethrow = switch (throwIO . userError . show) return

    -- Fetch a URL and parse the response body as JSON.
    jsonAt url = simpleHTTP (getRequest url) >>= getResponseBody >>= rethrow . decode_ json

    -- Patterns for a string or numeric member with the given name.
    str s = member_ (literal s) JSON.string
    num n = member_ (literal n) JSON.number

    The function rethrow throws an exception when parsing fails, while str and num make it easier to specify the necessary fields in JSON. Then the code to get the number of the latest comic and a comic by its number looks like this:

    comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap (fromIntegral . numerator) . rethrow . decode_ (num "num")

    comic n = jsonAt (concat ["http://xkcd.com/", show n, "/info.0.json"]) >>= rethrow . decode_ (str "img" <&> str "title")

    The function decode_ parses the received data in accordance with the specified pattern. In the case of num "num" we get a numeric member named num from the JSON; in the case of str "img" <&> str "title" we get a tuple of two strings for the picture and the title, respectively. The code to get the picture by URL:

    image url = simpleHTTP (getRequest url) >>= getResponseBody
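
    The error-handling idea behind rethrow can be tried in GHCi without any extra packages. The following is only a sketch under assumptions: it swaps Exceptional for plain Either from base, and the names rethrowEither and comicUrl are hypothetical helpers that do not appear in the original code. It shows a failed parse being turned into an IO exception, and the comic URL being assembled the same way as in comic.

```haskell
import Control.Exception (IOException, throwIO, try)
import Text.Read (readEither)

-- Hypothetical stand-in for rethrow, using Either from base instead of
-- Exceptional: Left becomes an IO exception, Right is returned as is.
rethrowEither :: Show e => Either e a -> IO a
rethrowEither = either (throwIO . userError . show) return

-- Builds the per-comic URL the same way the comic function does.
comicUrl :: Int -> String
comicUrl n = concat ["http://xkcd.com/", show n, "/info.0.json"]

main :: IO ()
main = do
    putStrLn (comicUrl 614)
    -- A successful parse passes straight through.
    n <- rethrowEither (readEither "42" :: Either String Int)
    print n
    -- A failed parse surfaces as an IOException that try can catch.
    r <- try (rethrowEither (readEither "oops" :: Either String Int))
    case r of
        Left e  -> putStrLn ("parse failed: " ++ show (e :: IOException))
        Right v -> print v
```

    Throwing instead of returning an error value keeps the happy path of the download code free of case analysis; the price is that every caller has to remember that these functions can throw.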



    Download Thread

    We put the download of a single comic into a separate function:

    retrieve ch l i = tryGet `onException` onFail where
        -- On any failure, report Nothing for this comic and log the error.
        onFail = do
            writeChan ch (i, Nothing)
            writeLogger l Error $ "Comic " ++ show i ++ " failed to download"
        tryGet = do
            (imgUrl, title) <- comic i
            imgData <- image imgUrl
            -- HPDF can only read a JPEG from a file, so save it first.
            jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either (throwIO . userError . show) return
            writeChan ch (i, Just (jpg, title))
            writeLogger l Info $ "Comic " ++ show i ++ " downloaded"
        fname = show i ++ ".jpg"

    Here we use the channel ch (Control.Concurrent.Chan), to which we send the download results, as well as a thread-safe log l. Because of the clumsiness of the HPDF library, the image first has to be saved to a file and then loaded back from it. It is completely unclear to me why the author wrote JPEG parsing from scratch (and from a file only) instead of using a ready-made library.
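
    The channel pattern used by retrieve can be tried in isolation with base alone. This is a minimal sketch, not the article's code: fakeDownload, worker and runAll are hypothetical names, the download is simulated, and forkIO from Control.Concurrent is used in place of fork. Each worker writes exactly one message to the channel, Just on success and Nothing on failure.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Exception (SomeException, onException, throwIO, try)
import Control.Monad (forM_, replicateM)
import Data.List (sortOn)

-- Hypothetical stand-in for the real download: fails for multiples of 3.
fakeDownload :: Int -> IO String
fakeDownload i
    | i `mod` 3 == 0 = throwIO (userError ("no comic " ++ show i))
    | otherwise      = return ("image-" ++ show i)

-- Same shape as retrieve: report success or failure on the channel.
worker :: Chan (Int, Maybe String) -> Int -> IO ()
worker ch i = tryGet `onException` onFail
  where
    tryGet = fakeDownload i >>= \img -> writeChan ch (i, Just img)
    onFail = writeChan ch (i, Nothing)

-- Starts one thread per index and collects one message per thread.
runAll :: Int -> IO [(Int, Maybe String)]
runAll n = do
    ch <- newChan
    forM_ [1 .. n] $ \i -> forkIO $ do
        -- onException rethrows after onFail, so swallow the error here.
        _ <- try (worker ch i) :: IO (Either SomeException ())
        return ()
    -- Arrival order is not deterministic, so sort by index for display.
    sortOn fst <$> replicateM n (readChan ch)

main :: IO ()
main = runAll 6 >>= mapM_ print
```

    Reporting failures as Nothing matters: the consumer counts messages, so a worker that died silently would leave it waiting forever.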



    PDF Generation

    Now it is time to write a function that generates the resulting PDF from the list of pictures. There is nothing particularly interesting in it: we just call the corresponding library functions. The only important nuance is that the list of pictures is lazy, so the function starts working as soon as the first picture appears.
    pdf imgs = runPdf "Xkcd.pdf" doc (PDFRect 0 0 800 600) $ forM_ imgs genPage where
        genPage (jpeg, title) = do
            img <- createPDFJpeg jpeg
            page <- addPage Nothing
            drawWithPage page (drawText (text (PDFFont Times_Roman 12) 0 0 (toPDFString title)) >> drawXObject img)
        doc = PDFDocumentInfo {
            author = toPDFString "xkcd",
            subject = toPDFString "xkcd",
            pageMode = UseNone,
            pageLayout = OneColumn,
            viewerPreferences = standardViewerPrefs,
            compressed = False }
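
    The laziness mentioned above can be observed without HPDF. The sketch below is an illustration under assumptions: produceLazily is a hypothetical helper, the producer thread stands in for the downloads, and printing a line stands in for generating a page. It relies on getChanContents from Control.Concurrent.Chan returning its list lazily.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (getChanContents, newChan, writeChan)
import Control.Monad (forM_)

-- Starts a slow producer and immediately returns a lazy list of its output.
produceLazily :: Int -> IO [Int]
produceLazily n = do
    ch <- newChan
    -- Producer: one item every 0.2 s, standing in for downloads.
    _ <- forkIO $ forM_ [1 .. n] $ \i -> threadDelay 200000 >> writeChan ch i
    -- The list is not fully built here; elements appear as they arrive.
    take n <$> getChanContents ch

main :: IO ()
main = do
    items <- produceLazily 5
    -- The consumer (standing in for pdf) handles each item as soon as it
    -- is available instead of waiting for the producer to finish.
    forM_ items $ \i -> putStrLn ("page for item " ++ show i)
```

    Run in a terminal, the lines should appear one at a time rather than all at once at the end.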




    Putting It Together

    In the main function we initialize the log, create a channel into which the lightweight threads will write their results, and start PDF generation with a lazy list of pictures from this channel.

    main = bracket (newLogger Console) closeLogger $ \l -> do
        n <- comics
        writeLogger l Info $ "Number of comics to download: " ++ show n
        ch <- newChan
        mapM_ (fork . retrieve ch l) [1..n]
        cts <- fmap (take n) $ getChanContents ch
        let imgs = catMaybes $ mapMaybe (`lookup` cts) [1..n]
        pdf imgs `onException` (writeLogger l Error "Unable to generate PDF")
        writeLogger l Info "PDF generated."

    The function bracket is similar to using: it guarantees that the log is closed. With the line mapM_ (fork . retrieve ch l) [1..n] we create a download thread for each number, i.e. we call retrieve ch l i in a separate thread. fmap (take n) $ getChanContents ch returns a lazy list with the first n results. Taking the whole list makes no sense, since the channel is endless. Then we apply lookup with each index in order, from 1 to n. This is needed to get a list that is still lazy but in which the pictures go strictly in order. Thus we always write the pictures in the right order.
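
    The reordering step is pure, so it can be checked on its own. In this small sketch the name inOrder and the sample data are made up for illustration; the expression itself is the one from main.

```haskell
import Data.Maybe (catMaybes, mapMaybe)

-- Reorders (index, result) pairs by index 1..n and drops failed downloads.
-- lookup returns Maybe (Maybe a): the outer Maybe says whether the index
-- was found at all, the inner one whether that download succeeded.
inOrder :: Int -> [(Int, Maybe a)] -> [a]
inOrder n cts = catMaybes $ mapMaybe (`lookup` cts) [1 .. n]

main :: IO ()
main = do
    -- Results as they might arrive from the channel: out of order, one failure.
    let arrived = [(2, Just "b"), (3, Nothing), (1, Just "a"), (4, Just "d")]
    print (inOrder 4 arrived)  -- prints ["a","b","d"]
```

    Each lookup walks the list from the start, so the whole step is quadratic in the number of comics; with a lazy input list the first picture still becomes available as soon as index 1 has arrived.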

    Full listing

    main = bracket (newLogger Console) closeLogger $ \l -> do
        n <- comics
        writeLogger l Info $ "Number of comics to download: " ++ show n
        ch <- newChan
        mapM_ (fork . retrieve ch l) [1..n]
        cts <- fmap (take n) $ getChanContents ch
        let imgs = catMaybes $ mapMaybe (`lookup` cts) [1..n]
        pdf imgs `onException` (writeLogger l Error "Unable to generate PDF")
        writeLogger l Info "PDF generated."

    retrieve ch l i = tryGet `onException` onFail where
        onFail = do
            writeChan ch (i, Nothing)
            writeLogger l Error $ "Comic " ++ show i ++ " failed to download"
        tryGet = do
            (imgUrl, title) <- comic i
            imgData <- image imgUrl
            jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either (throwIO . userError . show) return
            writeChan ch (i, Just (jpg, title))
            writeLogger l Info $ "Comic " ++ show i ++ " downloaded"
        fname = show i ++ ".jpg"

    pdf imgs = runPdf "Xkcd.pdf" doc (PDFRect 0 0 800 600) $ forM_ imgs genPage where
        genPage (jpeg, title) = do
            img <- createPDFJpeg jpeg
            page <- addPage Nothing
            drawWithPage page (drawText (text (PDFFont Times_Roman 12) 0 0 (toPDFString title)) >> drawXObject img)
        doc = PDFDocumentInfo {
            author = toPDFString "voidex",
            subject = toPDFString "xkcd",
            pageMode = UseNone,
            pageLayout = OneColumn,
            viewerPreferences = standardViewerPrefs,
            compressed = False }

    rethrow :: (Show e) => Exceptional e a -> IO a
    rethrow = switch (throwIO . userError . show) return

    jsonAt url = simpleHTTP (getRequest url) >>= getResponseBody >>= rethrow . decode_ json

    str s = member_ (literal s) JSON.string
    num n = member_ (literal n) JSON.number

    comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap (fromIntegral . numerator) . rethrow . decode_ (num "num")

    comic n = jsonAt (concat ["http://xkcd.com/", show n, "/info.0.json"]) >>= rethrow . decode_ (str "img" <&> str "title")

    image url = simpleHTTP (getRequest url) >>= getResponseBody

    writeBinaryFile fname str = withBinaryFile fname WriteMode (\h -> hPutStr h str)



    Swearing

    Unfortunately, due to the lack of a decent library for working with PDF, the effort did not pay off.
    HPDF refuses to accept most of the images (thanks to yet another reinvented-wheel implementation of JPEG loading), and I did not even get around to figuring out image scaling.

    Praises

    It was very convenient to test a request directly in GHCi, then parse one of the responses, download and save a picture. All development was carried out there, and then the code was moved to a file. Multithreading was bolted on without adding interfaces or any extra code. Instead of returning the result, we simply write it to a channel with a handler on the other end, and call the function asynchronously by adding fork. In the general case not everything is so simple, of course, but in my own experience I have never had to change the architecture for this.

    In general, look at hackage.haskell.org, search for the libraries you need, and if you find them, don’t miss the chance to write everything in Haskell!
