kefirr February 1, 2011 at 16:52

Making a PDF book from a webcomic in C#, using xkcd as an example


While looking at the latest xkcd strip, I glanced at my newly purchased Sony PRS-650 e-reader and immediately thought: I want to read comics on it! xkcd is black and white, and the strips are usually small. A quick search turned up only an image collection on TPB and a bash script that is supposed to produce a PDF. So I decided to have a little programming fun and write a comic grabber in my favorite language, C#.

A console application would have been enough, but for clarity I built a simple WPF interface.

A full walkthrough of the code would be overkill, so I will only cover the key points. I recommend opening or downloading the complete application code from Google Code right away.

1. Get pictures, titles and alt-text from the site

On xkcd, comics are conveniently located at addresses of the form xkcd.com/n, where n = 1 ...
My first thought was to scrape the necessary data out of the page HTML, but it turned out that all the information is available as JSON at addresses of the form xkcd.com/{0}/info.0.json

For JSON, .NET provides DataContractJsonSerializer.
We create the corresponding DataContract:

   [DataContract]
   public class XkcdComic
   {
      #region Public properties and indexers
      [DataMember]
      public string img { get; set; }
      [DataMember]
      public string title { get; set; }
      [DataMember]
      public string month { get; set; }
      [DataMember]
      public string num { get; set; }
      [DataMember]
      public string link { get; set; }
      [DataMember]
      public string year { get; set; }
      [DataMember]
      public string news { get; set; }
      [DataMember]
      public string safe_title { get; set; }
      [DataMember]
      public string transcript { get; set; }
      [DataMember]
      public string day { get; set; }
      [DataMember]
      public string alt { get; set; }
      #endregion
   }

... and use:

      private static XkcdComic GetComic(string url)
      {
         // Dispose the client and the response stream once the JSON has been read
         using (var client = new WebClient())
         using (var stream = client.OpenRead(url))
         {
            if (stream == null) return null;
            var serializer = new DataContractJsonSerializer(typeof (XkcdComic));
            return serializer.ReadObject(stream) as XkcdComic;
         }
      }

The latest comic is available at xkcd.com/info.0.json, and its num field gives the total number of comics.
All that remains is to download the image itself, which is simple:

var imageBytes = WebRequest.Create(comicInfo.img).GetResponse().GetResponseStream().ToBytes();

... where comicInfo is our data from the JSON, and ToBytes() is a simple extension method that reads data from a stream into a byte array.
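
The ToBytes() helper is not shown in the article, so here is a minimal sketch of how such an extension method might look. The class name StreamExtensions, the 8192-byte chunk size, and the demo class are my assumptions; the real implementation lives in the project source and may differ.

```csharp
using System;
using System.IO;

public static class StreamExtensions
{
    // Hypothetical reconstruction of ToBytes(): copies the stream into a
    // MemoryStream chunk by chunk, so it also works for non-seekable
    // network streams whose length is not known in advance.
    public static byte[] ToBytes(this Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            var chunk = new byte[8192]; // assumed chunk size
            int read;
            while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
            }
            // ToArray() returns exactly the bytes that were written.
            return buffer.ToArray();
        }
    }
}

public static class ToBytesDemo
{
    public static void Main()
    {
        var input = new byte[] { 1, 2, 3, 4, 5 };
        byte[] output = new MemoryStream(input).ToBytes();
        Console.WriteLine(output.Length); // prints 5
    }
}
```

Reading in a loop matters because a single Read call on a network stream may return fewer bytes than requested.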

A single strip is represented by the Comic class. Because the downloaded bytes may not be a valid image (we could have downloaded the wrong thing, the server could have returned an error, and so on), the class constructor is private and a Create factory method returns null if decoding fails. Decoding is done with BitmapImage, which, on success, also serves as the thumbnail for previewing the result:

   public static Comic Create(byte[] imageBytes)
   {
     try
     {
      // Validate image bytes by trying to create a Thumbnail.
      return new Comic {ImageBytes = imageBytes};
     }
     catch
     {
      // Failure, cannot decode bytes
      return null;
     }
   }
   public byte[] ImageBytes
   {
     get { return _imageBytes; }
     private set
     {
      _imageBytes = value;
      var bmp = new BitmapImage();
      bmp.BeginInit();
      bmp.DecodePixelHeight = 100; // Do not store whole picture
      bmp.StreamSource = new MemoryStream(_imageBytes);
      bmp.EndInit();
      bmp.Freeze();
      Thumbnail = bmp;
     }
   }

Putting it all together, we get a method that downloads a comic strip by its number:

      protected override Comic GetComicByIndex(int index)
      {
         // Download comic JSON
         var comicInfo = GetComic(string.Format(UrlFormatString, index + 1));
         if (comicInfo == null) return null;
         // Download picture
         var imageStream = WebRequest.Create(comicInfo.img).GetResponse().GetResponseStream().ToMemoryStream();
         // ToArray() copies only the bytes written; GetBuffer() may include unused trailing capacity
         var comic = Comic.Create(imageStream.ToArray());
         if (comic == null) return null;
         comic.Description = comicInfo.alt;
         comic.Url = comicInfo.link;
         comic.Index = index + 1;
         comic.Title = comicInfo.title;
         // Auto-rotate for best fit
         var t = comic.Thumbnail;
         if (t.Width > t.Height)
         {
            comic.RotationDegrees = 90;
         }
         return comic;
      }

We now have the total number of comics and a method for getting a strip by index.
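
The count itself comes from the num field of the latest comic, as described earlier. Below is a rough sketch of that step. The LatestComicInfo class, the ReadCount name, and the sample JSON are made up for illustration (the sample is not real xkcd data), and num is declared as int here, whereas the XkcdComic contract above declares it as string.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Hypothetical trimmed-down contract: only the field needed for counting.
[DataContract]
public class LatestComicInfo
{
    [DataMember]
    public int num { get; set; }
}

public static class ComicCounter
{
    // Reads the latest comic's JSON and returns its number, which is also
    // the total number of comics because numbering starts at 1.
    public static int ReadCount(Stream json)
    {
        var serializer = new DataContractJsonSerializer(typeof(LatestComicInfo));
        var info = (LatestComicInfo)serializer.ReadObject(json);
        return info.num;
    }

    public static void Main()
    {
        // Made-up sample JSON; a real call would pass the stream returned by
        // new WebClient().OpenRead("http://xkcd.com/info.0.json") instead.
        var sample = "{\"num\": 42, \"title\": \"Example\"}";
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(sample)))
        {
            Console.WriteLine(ReadCount(stream)); // prints 42
        }
    }
}
```

Fields that are not declared in the contract, such as title in the sample, are simply ignored by the serializer.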

Parallelization of downloads

I will use the Task Parallel Library, which I had long wanted to try but never had a reason to. At first glance, everything is simple: instead of calling GetComicByIndex(i) directly in a loop, we do var task = Task.Factory.StartNew(() => GetComicByIndex(i)). We collect all the started tasks in a tasks array and call Task.WaitAll(tasks), after which we read each result from task.Result. But this approach does not let us track progress or show already downloaded strips to the user. To solve this, we use WaitAny and yield return to return each task's result as soon as it completes:

   public IEnumerable<Comic> GetComics()
   {
     var count = GetCount();
     var tasks = Enumerable.Range(0, count).Select(GetTask).ToList();
     while (tasks.Count > 0) // Iterate until all tasks complete
     {
      var task = tasks.WaitAnyAndPop();
      if (task.Result != null) yield return task.Result;
     }
   }

Here, the GetTask method returns a task that runs GetComicByIndex(i), plus error handling and caching (which are beyond the scope of this article). WaitAnyAndPop is an extension method that waits for one of the tasks to complete, removes it from the list, and returns it:

   public static Task<T> WaitAnyAndPop<T>(this List<Task<T>> taskList)
   {
     var array = taskList.ToArray();
     var task = array[Task.WaitAny(array)];
     taskList.Remove(task); 
     return task;
   }
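
To see the pattern in isolation, here is a small self-contained sketch that replaces the real download with a fake one. The GetTask shown here is only my guess at the error-handling part (it swallows exceptions and returns null, with no caching), and the names CompletionOrderDemo, GetAll, and the download delegate are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class TaskListExtensions
{
    // Waits for any task to finish, removes it from the list, and returns it.
    public static Task<T> WaitAnyAndPop<T>(this List<Task<T>> taskList)
    {
        var array = taskList.ToArray();
        var task = array[Task.WaitAny(array)];
        taskList.Remove(task);
        return task;
    }
}

public static class CompletionOrderDemo
{
    // Hypothetical stand-in for GetTask: runs the download on the thread pool
    // and turns any exception into a null result. The index is a method
    // parameter, so each lambda captures its own copy.
    public static Task<string> GetTask(int index, Func<int, string> download)
    {
        return Task.Factory.StartNew(() =>
        {
            try { return download(index); }
            catch (Exception) { return null; }
        });
    }

    // Yields each non-null result as soon as its task has completed.
    public static IEnumerable<string> GetAll(int count, Func<int, string> download)
    {
        var tasks = Enumerable.Range(0, count).Select(i => GetTask(i, download)).ToList();
        while (tasks.Count > 0)
        {
            var task = tasks.WaitAnyAndPop();
            if (task.Result != null) yield return task.Result;
        }
    }

    public static void Main()
    {
        // Fake download: index 1 fails, index 0 is slow, index 2 is fast.
        Func<int, string> fakeDownload = i =>
        {
            if (i == 1) throw new InvalidOperationException("simulated failure");
            Thread.Sleep(i == 0 ? 300 : 50);
            return "comic " + (i + 1);
        };
        foreach (var result in GetAll(3, fakeDownload))
        {
            Console.WriteLine(result);
        }
    }
}
```

On a multi-core machine the fast task usually finishes first, so results tend to arrive out of index order, and the failed download is skipped. The exact order depends on thread pool scheduling and is not guaranteed.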

I do not cover architectural questions in this article, but MVVM (Model-View-ViewModel) is the de facto standard for WPF applications, and the code for downloading, exporting, and so on is, of course, split into corresponding classes. In the ViewModel code, we can now iterate over the result of GetComics on a background thread and show the user the strips as they arrive:

   private Dispatcher _dispatcher; // Not readonly: it is assigned in StartGrabbing, not in a constructor
   private readonly ObservableCollection<Comic> _comics = new ObservableCollection<Comic>();
   private void StartGrabbing()
   {
     _dispatcher = Dispatcher.CurrentDispatcher;  // ObservableCollection modifications should be performed on the UI thread
     ThreadPool.QueueUserWorkItem(o => DoGrabbing());
   }
   private void DoGrabbing()
   {
     var grabber = new XkcdGrabber();
     foreach (var comic in grabber.GetComics())
     {
      var c = comic;
      _dispatcher.Invoke((Action) (() => Comics.Add( c )), DispatcherPriority.ApplicationIdle);
     }
   }

2. Display comics in WPF

In XAML, all that is left is to bind to our ObservableCollection and prepare a DataTemplate that shows the download progress and the comics themselves, with the alt text in a ToolTip.

3. Create a PDF book

PDF was chosen because of its popularity and good support on Sony e-readers. For working with PDF in .NET there is a convenient open-source library, iTextSharp (you will need to download it separately to build the project). Everything is fairly simple here. Omitting exception handling and the adjustment of image size and fonts, we get the following:

var document = new Document(PageSize.LETTER);
var wri = PdfWriter.GetInstance(document, new FileStream(fileName, FileMode.Create));
document.Open();
foreach (var comic in comics.OrderBy(c => c.Index).ToList())
{
  var image = Image.GetInstance(new MemoryStream(comic.ImageBytes));
  var title = new Paragraph(comic.Index + ". " + comic.Title, titleFont);
  title.SetAlignment("Center");
  document.Add(title);
  document.Add(image);
  document.Add(new Phrase(comic.Description, altFont));
  document.Add(Chunk.NEXTPAGE);
}
document.Close();
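
The image size adjustment omitted above comes down to one scale factor. Here is a library-independent sketch of that calculation, not the code from the project: the PageFit and FitScale names are hypothetical, as are the 36-point margins and the choice never to enlarge small images.

```csharp
using System;

public static class PageFit
{
    // Returns the factor by which both image dimensions should be multiplied
    // so the image fits inside the box while keeping its aspect ratio.
    // Assumption: small images are left at their original size (factor 1).
    public static double FitScale(double imageWidth, double imageHeight,
                                  double boxWidth, double boxHeight)
    {
        if (imageWidth <= 0 || imageHeight <= 0)
            throw new ArgumentException("Image dimensions must be positive.");
        // The tighter of the two constraints decides the scale.
        var scale = Math.Min(boxWidth / imageWidth, boxHeight / imageHeight);
        return Math.Min(scale, 1.0);
    }

    public static void Main()
    {
        // US Letter is 612 x 792 points; with assumed 36-point margins on
        // every side, the printable box is 540 x 720 points.
        const double boxWidth = 612 - 2 * 36;
        const double boxHeight = 792 - 2 * 36;

        // A wide 1080 x 360 strip is limited by the width and gets halved.
        Console.WriteLine(FitScale(1080, 360, boxWidth, boxHeight));
        // A small 200 x 100 image already fits and keeps factor 1.
        Console.WriteLine(FitScale(200, 100, boxWidth, boxHeight));
    }
}
```

For a strip that was rotated by 90 degrees for best fit, the width and height would be swapped before calling FitScale.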

Results

The result is an application that, in addition to exporting to PDF, lets you conveniently browse the comics:

Webcomic Grabber Screenshot

How the result looks on the e-reader can be seen in the first picture of the article.

What is left out of the article

Caching of downloaded data between application launches (done with IsolatedStorage).
Support for other webcomics (for this I extracted an IGrabber interface up front and moved part of the functionality into TaskParallelGrabber; while writing the article, I added grabbers for WhatTheDuck and Cyanide & Happiness).

References

Application code (C#): Google Code
Working with PDF on .NET: iTextSharp
Comics: xkcd

UPD:
Thanks to XHunter, who uploaded the resulting PDF and the compiled program!

UPD2:
I'll just leave here a link to a good "response" article that covers downloading comics with WCF in detail: http://darren-brown.com/?p=37
