Automating the cleanup of document photos with Sikuli

    Some time ago I was asked to expand one of my old comments into a full-fledged post. On its own I don't think the subject is interesting enough, but I had an idea: why not combine business with pleasure and get better acquainted with a curious tool whose announcement recently made the rounds of IT news sites.


    The main task in this post is preparing scans and photographs of written sources (books, lecture notes, etc.) for printing, compact storage, packing into DjVu, and so on.
    Photoshop and FineReader will not be considered. They offer a number of useful tools, but they cost money.
    With a scanner everything is usually simple: the images are good enough that minimal processing suffices.
    Photographs are more interesting: lighting problems and geometric distortions come into play. Alas, correcting geometric distortions is hard to automate, to say the least. Lighting and background, however, can be dealt with, and that is what we will do.


    Paint.NET is a raster graphics editor for Windows with support for layers and filters.
    Sikuli is essentially a tool for automating interaction with a graphical interface. It also has features for testing applications, but we do not touch them in this article. We will use Sikuli to compensate for the lack of full macro support in Paint.NET.
    Sikuli's main killer feature is supposed to be how visual and simple its scripts are, following the principle "what you see is how it works". The project's general immaturity spoils the impression somewhat, though. I worked with version 0.09. The recently released version 0.10 removes the worst pitfalls, but many familiar things, such as Undo in the editor, are still missing.
    By the way, I recently came across the QAliber project. It seems to have a number of advantages in how it interacts with the interface under test and in overall maturity. But as for being visual... you can see and feel the difference :) Still, I will probably try QAliber when the occasion arises.

    Sikuli's architecture consists of several layers written in different languages:
    • The top level is the Jython API. Sikuli scripts are essentially Python programs that call the functions provided by the Jython API. (Each project is stored in a %scriptname%.sikuli folder, which contains the %scriptname%.py file and PNG images.) The author mentions that the top level could be implemented in any other language that runs on the JVM. You can also work with the Sikuli Java API directly from your own program.
    • The middle level is the Java API. It drives the keyboard and mouse, and calls the OpenCV library to search for graphic patterns on the screen.
    • The lowest, platform-dependent level is the OpenCV library, implemented in C/C++.
    My description of the architecture differs a little from the author's, but it is enough to give an idea of the system.


    Since our task is essentially separating a useful signal from noise, two analogies help explain the idea: a band-pass filter and an active noise reduction system.

    A simple Threshold filter acts like a band-pass filter: it simply "cuts off" pixels whose brightness is below a given border (setting their brightness to 0, and to 255 for all the others). The more advanced Levels filter sets two borders, between which values change smoothly.
    If brightness varies widely within the image, a band-pass filter alone cannot cut off the noise without losing part of the useful signal. A trickier method is needed.
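
    To make the difference between the two filters concrete, here is a minimal sketch in plain Python. The function names, the sample brightness values and the borders (128, 100, 160) are illustrative assumptions; this is not how Paint.NET implements either filter, only the arithmetic described above.

```python
def threshold(value, border):
    """Return 0 for brightness below the border, 255 otherwise."""
    return 0 if value < border else 255


def levels(value, black_point, white_point):
    """Stretch black_point..white_point linearly onto 0..255, clamping outside."""
    if value <= black_point:
        return 0
    if value >= white_point:
        return 255
    return int(round(255.0 * (value - black_point) / (white_point - black_point)))


# Hypothetical brightness samples, from dark text to light paper.
row = [40, 120, 128, 136, 220]

print([threshold(v, 128) for v in row])    # [0, 0, 255, 255, 255]
print([levels(v, 100, 160) for v in row])  # [0, 85, 119, 153, 255]
```

    Threshold leaves only two values, while Levels keeps a smooth transition for everything that falls between its two borders.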

    The principle behind active noise reduction systems can be put in a nutshell as "(Signal + Noise) - (Noise) = (Signal)".
    (Signal + Noise) is our picture. (Noise) is the background, everything except the text. (Signal) is, accordingly, the text.
    At first we only have (Signal + Noise), but in our case it is easy to get (Noise) from it by using one property of the useful signal (the text): it consists of thin lines.
    We need a filter that gently "blurs away" the text so that the image looks like a blank sheet. Median Blur is suitable for this. In Paint.NET it is located in the Noise menu, as a tool for fighting noise. Well, we will use it for the opposite purpose, removing the useful signal :)
    With illustrations, admittedly, things may not go so smoothly, and they will have to be processed separately...

    The algorithm is as follows:
    1. Apply a Median Blur filter to the original image to get a clean background, without text;
    2. Calculate the difference between the original and the image obtained in step 1;
    3. Invert the image obtained in step 2 (we need dark text on a white background);
    4. Apply a Levels filter to even out the contrast and get rid of the slight noise left after steps 1-2.
    There could have been pretty diagrams and illustrations here, but I could not reconcile my perfectionism with my design abilities (or rather, the lack of them). I hope the idea is clear enough without pictures.
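
    The four steps can be sketched in plain Python on a tiny grayscale grid. This is a toy model under stated assumptions, not what Paint.NET does internally: the 7x7 test page, the blur radius of 2 pixels and all function names are hypothetical, and the "difference" is taken as the absolute per-pixel difference. The Levels borders 200 and 235 are the ones passed to DoLevels() later in the article.

```python
from statistics import median


def median_blur(img, radius):
    """Step 1: replace each pixel by the median of its square neighbourhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(int(median(window)))
        out.append(row)
    return out


def difference(a, b):
    """Step 2: absolute per-pixel difference of two images."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def invert(img):
    """Step 3: dark text on a white background."""
    return [[255 - p for p in row] for row in img]


def levels(img, black_point, white_point):
    """Step 4: stretch black_point..white_point onto 0..255, clamping outside."""
    span = float(white_point - black_point)
    return [[int(round(255 * min(1.0, max(0.0, (p - black_point) / span))))
             for p in row] for row in img]


def clean(img, radius=2, black_point=200, white_point=235):
    background = median_blur(img, radius)
    return levels(invert(difference(img, background)), black_point, white_point)


# Hypothetical page: lighting grows from left to right,
# with a one-pixel-wide dark stroke in column 3.
page = [[150 + 10 * x for x in range(7)] for _ in range(7)]
for y in range(7):
    page[y][3] = 30

result = clean(page)
print(result[0])  # [255, 255, 255, 0, 255, 255, 255]
```

    On this toy page the blur leaves a residue of 5-10 brightness units in the background after step 2, and it is step 4 that pushes it to pure white.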


    So, the automation task is to use Sikuli to open and process a set of images in Paint.NET one after another, following the algorithm described.
    I could not think of anything better than opening the folder with the images in advance and letting Sikuli walk through the icons, launching Paint.NET from the context menu...

    Open the Sikuli IDE and start a new script by declaring the necessary variables:
    patterns = [,,]
    openwith_img = 
    paintnet_img = 
    waitfor_img = 
    edited_text = "_edited"
    base_timeout = 30000
    negation_mode = 
    difference_mode = 

    • patterns: an array with the icons of the file formats that we will process;
    • openwith_img, paintnet_img: the context menu items that we will click;
    • waitfor_img: opening Paint.NET takes some time, and is considered complete when this fragment appears on the screen;
    • edited_text: the suffix to be added to the names of the processed files;
    • base_timeout: the base timeout for all resource-intensive operations (in milliseconds), so that timeouts do not have to be changed throughout the script if the need arises;
    • negation_mode, difference_mode: while writing the script I experimented with these two layer blending modes, so it was convenient to declare them as variables.
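
    In the Sikuli IDE the right-hand sides of these assignments are screenshots. In the .py file inside the project folder, as far as I can tell, each screenshot is an ordinary string holding the name of a PNG file. A sketch of how the declarations might look in text form follows; every file name here is a made-up placeholder, and the real images have to be captured on your own machine.

```python
# All PNG names below are hypothetical placeholders, not the author's files.
patterns = ["icon_jpg.png", "icon_png.png", "icon_bmp.png"]  # file icons to process
openwith_img = "menu_open_with.png"        # "Open With" context menu item
paintnet_img = "menu_paintnet.png"         # Paint.NET entry in that submenu
waitfor_img = "paintnet_ready.png"         # fragment shown once Paint.NET has loaded
edited_text = "_edited"                    # suffix for processed files
base_timeout = 30000                       # base timeout in milliseconds
negation_mode = "blend_negation.png"       # blending mode list items
difference_mode = "blend_difference.png"
```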

    Note a fundamental problem of the Sikuli approach here: the limited portability of scripts.
    Your icons for graphic formats are almost certainly different, so you will have to add them to the script yourself. The remaining images may depend on the OS and the visual style (VisualStyle) in use. In my case these are Windows XP and Opus OS by b0se.

    All the necessary functions follow.
    def OpenWith(x, y, w):
       wait(w, timeout=base_timeout*3)

    Opening a file through the context menu. The function receives three patterns: the file icon, the menu item for the required application (Paint.NET, for example), and the fragment whose appearance on the screen means that loading has finished.
    My apologies for the meaningless variable names.

    def SaveFile(suffix):
       type("f", KEY_ALT)
       type(Key.END + suffix)

    Saving a file in Paint.NET. Press Alt + F to get to the File menu. (The script does not use every available keyboard shortcut for menu navigation, although that would shorten it slightly and reduce the number of image fragments. I found that combinations with Ctrl + Shift did not always work in Sikuli, so I took the more reliable route.)
    After clicking the "Save As..." menu item, the input focus is in the file name field. We append the suffix to it. I could not come up with a reliable sign that saving has finished, so at the end of the function I inserted a pause that is long enough (7 seconds).

    def DoBlackWhite():
       type("a", KEY_ALT)
       wait(, timeout=base_timeout)

    The B/W filter is the first of the filters we need. Alt + A opens the Adjustments menu, where we select the desired item. The filter has no parameters. We wait until the corresponding entry appears in the History panel. (It turned out to be a very convenient panel.)

    def DoDuplicateLayer():
       type("l", KEY_ALT)
       wait(, timeout=base_timeout)

    Duplicating a layer. The process is similar. In our case there is no need to switch between layers, which is good: otherwise I would have had to tinker with the Layers panel.

    def DoInvertColors():
       type("a", KEY_ALT)
       wait(, timeout=base_timeout)

    The Negative filter. Similar to the previous ones.

    def DoOilPaint(a, b):
       type("c", KEY_ALT)
       type(a + Key.TAB + Key.TAB + Key.TAB + b + Key.ENTER)
       wait(, timeout=base_timeout*2)

    The Oil Painting filter. I used it at first, but eventually dropped it in favor of Median Blur. Still, I will keep it for the record :)
    (There is no point worrying about dead code in this case. It may come in handy for someone... In fact, all the functions for working with Paint.NET would belong in a separate file, if Sikuli supported that.)
    This is the first filter with a settings dialog. The function receives the pair of required parameters, which are typed into the corresponding fields of the form.

    def DoMedian(a, b):
       type("c", KEY_ALT)
       type(a + Key.TAB + Key.TAB + Key.TAB + b + Key.ENTER)
       wait(, timeout=base_timeout*2)

    The Median Blur filter is in the Effects > Noise menu. It is configured in the same way as the previous one, and this is the one we really need.

    def DoLayerBlend(mode):
       wait(, timeout=base_timeout)
       type("m", KEY_CTRL)
       wait(, timeout=base_timeout)

    Blending layers. F4 opens the layer properties dialog, where we select the desired blending mode (passed as a parameter). Then we merge the layers with Ctrl + M.

    def DoLevels(iwp, ibp, ogamma):
       k_del = Key.DELETE + Key.DELETE + Key.DELETE + Key.DELETE
       type("a", KEY_ALT)
       type(Key.TAB + Key.TAB)
       wait(, timeout=base_timeout)

    The Levels filter. The dialog lets you configure five parameters: Input White Point, Input Black Point, Output White Point, Output Black Point and Output Gamma. We need maximum contrast at the filter output, so we leave OWP and OBP alone and pass the rest as parameters.
    The input fields in this dialog behave differently from those in the other dialogs. They have to be cleared explicitly by simulating presses of Delete.

    def DoFilter():
       DoMedian("35", "50")
       DoLevels("235", "200", "1")

    Now we start putting the pieces together. In fact, the rest of the script exists only to support this function. It calls the sequence of filters with the required parameters.
    (It is advisable to tune the DoLevels() parameters for each set of images, although the examples at the end of the article were made in a single pass with the parameters shown...)

    def RunTaskOverImage(x):
       OpenWith(x, paintnet_img, waitfor_img)

    Opening, processing, saving and closing a single file. The parameter is the found region that contains the icon of the file to be processed (or a pattern).

    def main():
       for pat in patterns:
          find_regs = findAll(Pattern(pat).similar(0.95))
          if find_regs:
             for region in find_regs:

    Finding all the files on the screen and processing those found.
    setThrowException(): this function changes Sikuli's behavior in the case when findAll() does not find a single region that matches the pattern. Here it does not matter to us if some pattern is not found on the screen.
    Pattern(pat).similar(0.95): the pattern search allows some deviation. Where possible, this should compensate for differences in interface settings between machines. The default value of 0.7 is too lenient: all my icons were treated as identical, and the script tried to run three times in a loop (once per pattern in the array). Setting 1.0 is not a good idea either: OpenCV may then skip even the icons we need.
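
    The effect of the similarity value can be illustrated with a toy model. The scoring below (one minus the mean absolute brightness difference) is an assumption made purely for illustration and is not the measure Sikuli or OpenCV actually use; the 10-pixel "icons" are made up as well. They share the same outline and differ only in a small badge, and the on-screen copy of the first icon differs from the pattern by one brightness unit in a single pixel.

```python
def similarity(a, b):
    """Toy score in 0..1: 1 minus the mean absolute brightness difference."""
    total = sum(abs(p - q) for p, q in zip(a, b))
    return 1.0 - total / (255.0 * len(a))


def matches(pattern, on_screen, min_similarity):
    """Names of the on-screen icons that score at least min_similarity."""
    return [name for name, pixels in sorted(on_screen.items())
            if similarity(pattern, pixels) >= min_similarity]


# Hypothetical pattern captured for the JPG icon.
jpg_pattern = [255] * 8 + [200, 0]

# Hypothetical icons as they appear on the screen.
on_screen = {
    "jpg": [254] + [255] * 7 + [200, 0],  # same icon, rendered slightly differently
    "png": [255] * 8 + [0, 200],
    "bmp": [255] * 8 + [0, 0],
}

print(matches(jpg_pattern, on_screen, 0.7))   # ['bmp', 'jpg', 'png']
print(matches(jpg_pattern, on_screen, 0.95))  # ['jpg']
print(matches(jpg_pattern, on_screen, 1.0))   # []
```

    With a lenient value every icon passes for every pattern, and with an exact match required even the right icon is rejected because of a one-pixel rendering difference.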


    The final chord: we call main() and report that the script has finished.
    main() is split out as a separate function for ease of debugging. You can substitute a call to any of the functions described above in its place and debug that function on its own.

    Download archive with source code
    View source code in full


    The tests used: the picture from the comments that prompted this post; a couple of arbitrary pictures from my own archive; a random shot from the internet.


    Measurement of speed was carried out on a laptop with a Pentium M 2 GHz and 2 GB RAM. Script execution time over 4 test images:
    • Run 1: 6:32
    • Run 2: 6:57
    • Run 3: 6:47
    • Run 4: 6:38

    Average time: 6 minutes 43 seconds. Average processing time for one image: 1 minute 41 seconds.
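
    For anyone who wants to check the arithmetic, the averages can be recomputed from the four run times listed above. The helper name is hypothetical.

```python
runs = ["6:32", "6:57", "6:47", "6:38"]  # run times from the list above
images_per_run = 4


def to_seconds(text):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


average_run = sum(to_seconds(r) for r in runs) / float(len(runs))
per_image = average_run / images_per_run

print(average_run)  # 403.5 seconds, about 6 minutes 43 seconds
print(per_image)    # 100.875 seconds, about 1 minute 41 seconds
```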
    The filters eat up most of the time. Still, I think optimizing the script could save a dozen or so seconds per image...


    1. If a person can extract useful information from an incoming data stream (read a text, solve a captcha...), then an algorithm can be written for a computer to extract that information. How complex and how general that algorithm is remains a separate question. The more we want, the more details the algorithm has to take into account. The algorithm described here cleans up photographed text in harder cases than a simple Threshold filter can handle, but it has its limits too.
    2. It is hard to regard the Sikuli IDE as a serious tool today. And not because "programming with pictures" is a silly idea. It is just that using Computer Vision to work with an interface is not very reliable, while the available tooling is not very convenient and can add trouble even to the simplest tasks. Next time a similar task comes up, I will try QAliber.
    3. For a number of tasks, I think, the Sikuli Java API will come in handy as a convenient wrapper over OpenCV for use in your own testing tools and the like.


    Paint.NET official site
    Sikuli official site: download links, documentation, etc.
    Blog with announcements and sample scripts
    Sikuli Documentation version 0.10
    Sikuli page on LaunchPad

    PS: Thanks to free0u for the support. I apologize to those I kept waiting and to those for whom this article would have been more useful before the exam session than after it.

    UPD: Moved to "Algorithms". If there is a better place for it, let me know.
