Parse it!

    Some time ago I had to do a small piece of research at work: find the best PDF parser implemented in Java.

    A bit about the project. It implements a system for sending internal messages with file attachments. There is also a search that has to cover the contents of the attachments, and most of those attachments are PDFs.
    The mechanism itself is quite simple: when a message is sent, the attachment data is parsed and an index is built from it.
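
    The parse-then-index step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class name AttachmentIndex, the in-memory map and the tokenizing rule are hypothetical and are not the project's real indexer. The extracted text is simply whatever string the PDF parser returned.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory index: maps a lowercased word to the ids of the
// messages whose attachments contain it. Not the project's real indexer.
final class AttachmentIndex {
    private final Map<String, Set<Long>> postings = new HashMap<>();

    // Called when a message is sent; extractedText is the parser's output.
    void add(long messageId, String extractedText) {
        // Split on anything that is not a Unicode letter or digit, so umlauts survive.
        for (String token : extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(messageId);
            }
        }
    }

    // Returns the ids of all messages whose attachment text contains the word.
    Set<Long> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        AttachmentIndex index = new AttachmentIndex();
        index.add(1L, "Quarterly report: Ums\u00e4tze im Januar");
        index.add(2L, "Report on parser quality");
        System.out.println(index.search("report"));        // ids of both messages
        System.out.println(index.search("ums\u00e4tze"));  // id of the first message only
    }
}
```

    The sketch also shows why exact layout matters little for this use case: only the set of words per document reaches the index.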

    For a long time documents were parsed with the PDFBox library, and nobody was happy with it: it was slow and it failed often.
    In the end four libraries were selected for comparison: PDFBox, JPod, iText and Acrobat.

    Acrobat was ruled out almost immediately, since it turned out that the library had not been supported for several years. The statistics for it were collected anyway, so I will publish them too.
    The libraries had to be compared on two criteria: speed and quality of the results.
    I'll warn you right away: the libraries were tested on internal documents that fall under certain security rules, so I can't even give the file names. I can only say that the content of the files was very diverse: text, images, tables, scans, etc. File sizes also vary quite a bit, so the estimates should be reasonably objective.

    Time comparison:

    File size   PDFBox      Acrobat     JPod        iText
    74.1 KB     00:02.591   00:03.155   00:01.683   00:00.963
    257.5 KB    00:01.680   00:03.191   00:00.116   00:00.78
    1.6 MB      00:05.805   00:02.884   00:02.532   00:02.79
    28.1 MB     E           01:10.983   00:43.815   E
    13.6 MB     00:05.218   00:04.331   00:00.599   00:00.77
    1.9 MB      00:02.782   00:14.506   00:00.608   00:00.707
    1.6 MB      00:06.182   00:02.98    00:00.906   00:02.413
    8.9 MB      00:05.98    00:03.894   00:00.680   00:00.647
    2.4 MB      00:14.15    00:07.893   00:02.826   E
    604.7 KB    00:03.342   00:04.721   00:00.551   00:01.222
    100.6 KB    00:01.819   00:04.212   00:00.84    00:00.456
    1.6 MB      00:05.633   00:03.99    00:00.883   00:02.18
    10.3 MB     00:22.311   00:22.145   00:27.663   E
    1.9 MB      00:06.943   00:14.736   00:01.200   E
    2.1 MB      00:02.573   E           00:00.498   00:00.475
    111.0 KB    00:01.956   00:02.846   00:00.705   00:00.300
    814.3 KB    00:02.552   00:04.221   00:00.306   00:00.900
    2.0 MB      00:06.319   00:07.128   00:01.821   00:02.796
    338.7 KB    00:01.950   00:03.684   00:00.79    00:00.415
    12.9 MB     00:15.932   00:13.628   00:04.989   E
    7.3 MB      E           00:17.275   00:16.377   E
    97.2 MB     00:27.291   00:01.994   00:05.739   E
    5.2 MB      00:07.773   00:11.108   00:01.964   E

    Total:
                                PDFBox   Acrobat   JPod   iText
    Best                        0        2         14     7
    Middle                      12       7         8      8
    Worst (including errors)    11       14        1      8

    Errors                      2        1         0      8

    In the original table the worst and best times were highlighted in red and green, respectively. The letter "E" marks a run where the parser crashed because of a buffer overflow or some other error.
    On time, the clear winner was JPod. Its lack of parsing errors was a pleasant bonus.
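
    For readers who want to repeat a measurement like this, here is a minimal sketch of a timing harness. It is not the code used for the table above. The class ParseTimer and the idea of wrapping each library as a Function from PDF bytes to text are assumptions for illustration, and no real library API is called.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical harness: "parser" stands in for any of the four libraries,
// wrapped as a function from raw PDF bytes to extracted text.
final class ParseTimer {
    // Formats elapsed milliseconds as minutes:seconds.milliseconds.
    static String format(long millis) {
        return String.format(Locale.ROOT, "%02d:%02d.%03d",
                millis / 60_000, (millis / 1_000) % 60, millis % 1_000);
    }

    // Returns the formatted time, or "E" if the parser threw or exhausted memory or stack.
    static String time(Function<byte[], String> parser, byte[] pdf) {
        long start = System.nanoTime();
        try {
            parser.apply(pdf);
        } catch (Exception | OutOfMemoryError | StackOverflowError e) {
            return "E";
        }
        return format((System.nanoTime() - start) / 1_000_000);
    }

    public static void main(String[] args) {
        byte[] fakePdf = new byte[0];
        System.out.println(time(bytes -> "some text", fakePdf));
        System.out.println(time(bytes -> { throw new IllegalStateException("boom"); }, fakePdf)); // prints E
        System.out.println(format(70_983)); // prints 01:10.983
    }
}
```

    A single run per file, as sketched here, includes JVM warm-up, so repeated runs would give steadier numbers.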

    Quality control:


    The quality assessment was quite subjective and used only three categories: Best, Middle and Worst. There is also an Empty score, given when the parser crashed or simply found no text in the document.
    I assessed how closely the extracted text matched the original, but not very strictly, because the text was needed for the index, not for display.

    File size   PDFBox   Acrobat   JPod     iText
    74.1 KB     Best     Middle    Best     Best
    257.5 KB    Best     Middle    Best     Empty
    1.6 MB      Best     Empty     Empty    Worst
    28.1 MB     Empty    Middle    Best     Empty
    13.6 MB     Empty    Empty     Empty    Empty
    1.9 MB      Best     Middle    Worst    Middle
    1.6 MB      Best     Empty     Worst    Worst
    8.9 MB      Best     Middle    Middle   Best
    2.4 MB      Best     Middle    Worst    Empty
    604.7 KB    Best     Middle    Middle   Middle
    100.6 KB    Best     Middle    Best     Empty
    1.6 MB      Best     Empty     Worst    Worst
    10.3 MB     Best     Best      Best     Empty
    1.9 MB      Best     Best      Best     Empty
    2.1 MB      Best     Best      Best     Empty
    111.0 KB    Best     Best      Best     Best
    814.3 KB    Best     Worst     Best     Best
    2.0 MB      Best     Middle    Best     Best
    338.7 KB    Middle   Middle    Best     Empty
    12.9 MB     Best     Best      Best     Empty
    7.3 MB      Empty    Best      Best     Empty
    97.2 MB     Best     Best      Middle   Empty
    5.2 MB      Best     Best      Best     Empty

    Total:
                               PDFBox   Acrobat   JPod   iText
    Best                       19       8         14     5
    Middle                     1        10        3      2
    Worst (including empty)    3        5         6      16

    Empty                      3        4         2      13


    Characteristic features of parsing:
    Acrobat very often puts the whole text on one line. The spaces between words and sentences survive, so in principle this is not critical for indexing.
    iText does not understand non-English characters. The tests used documents in English, German and French, and all the umlauts were lost: I got question marks in place of such characters. Perhaps this can be configured somewhere, but the other libraries handled these characters without any extra effort.
    PDFBox gave no cause for complaint about quality.
    JPod is fine too, except for one quirk that took quite some time to sort out. In some cases a document is parsed, in whole or in part, as one letter per line, and for an index such output is useless.
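
    Output of the "one letter per line" kind is easy to flag automatically before it reaches the index. The following is only a sketch of one possible heuristic. The class LetterPerLineCheck and the 0.5 threshold are assumptions chosen for illustration and were not part of the original comparison.

```java
// Hypothetical heuristic for spotting "one letter per line" extractor output.
final class LetterPerLineCheck {
    // Fraction of non-blank lines that hold exactly one non-whitespace character.
    static double singleCharLineRatio(String text) {
        int nonBlank = 0;
        int single = 0;
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            nonBlank++;
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                single++;
            }
        }
        return nonBlank == 0 ? 0.0 : (double) single / nonBlank;
    }

    // The 0.5 threshold is an arbitrary assumption, not a measured value.
    static boolean looksBroken(String text) {
        return singleCharLineRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksBroken("H\ne\nl\nl\no"));          // true
        System.out.println(looksBroken("Hello world\nSecond line")); // false
    }
}
```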

    In the end JPod was declared the winner, despite its habit of parsing one letter per line.
    I had to deal with that.

    Part two. Inside JPod.


    Digging through the JPod source code took a lot of time, along with emails and the project forum. In the end it turned out that this parser behavior is caused by the orientation of the document pages: portrait is fine, landscape is not. Attempts to tweak the parameters did not work, and the class properties responsible for page orientation were useless.
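
    One plausible way to picture why orientation could matter is sketched below. It is an illustration under assumptions, not JPod's confirmed logic: it assumes that an extractor groups glyphs into lines by their transformed y coordinate, and the class TextMatrixDemo is hypothetical. A PDF matrix [a b c d e f] maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f), so under a 90-degree rotation glyphs that advance along x in text space end up at different y values.

```java
import java.util.Arrays;

// Illustration only: applies a PDF-style matrix [a b c d e f] to glyph origins.
final class TextMatrixDemo {
    // Maps (x, y) to (a*x + c*y + e, b*x + d*y + f), the PDF convention.
    static double[] apply(double a, double b, double c, double d, double e, double f,
                          double x, double y) {
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }

    // Counts distinct rounded y values for glyphs placed along the x axis in text space.
    // An extractor that groups by y (an assumption) would see that many "lines".
    static long distinctBaselines(double a, double b, double c, double d, double e, double f,
                                  double[] glyphXs) {
        return Arrays.stream(glyphXs)
                .mapToLong(x -> Math.round(apply(a, b, c, d, e, f, x, 0)[1]))
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        double[] glyphXs = { 0, 10, 20, 30 };
        // Unrotated page: all glyphs share one baseline.
        System.out.println(distinctBaselines(1, 0, 0, 1, 0, 0, glyphXs));  // prints 1
        // Page rotated by 90 degrees: every glyph gets its own y value.
        System.out.println(distinctBaselines(0, 1, -1, 0, 0, 0, glyphXs)); // prints 4
    }
}
```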
    At one point I decided to simply strip all the fonts out of the classes; they are not needed for indexing text anyway. It helped, because the blocks of text were being computed incorrectly, and the fonts were the cause.
    I would have stopped there, but Egiptyanin insisted that we had to see it through to the end. From then on I was barely involved.
    A solution was found, and this is the form in which it is used: the affine transformation matrix was overridden, so that a static matrix is set instead of a dynamic one. The CSPlainTextExtractor class is used instead of CSTextExtractor. The new class looks like this:

    import de.intarsys.pdf.content.text.CSTextExtractor;

    public class CSPlainTextExtractor extends CSTextExtractor {
        // Ignore the matrix coming from the page content and always set the identity matrix.
        @Override
        public void textSetTransform(float a, float b, float c, float d, float e, float f) {
            super.textSetTransform(1, 0, 0, 1, 0, 0);
        }
    }

    Of course, this is not a panacea: very rarely the parser fails to put line breaks in the right places, but this is not important for indexing.
    That's all. Thank you for your attention =)
    PS This is my first more or less serious article, so I hope for objective criticism.
    Upd Particularly attentive readers found inconsistencies in the tables; these have been fixed.







