Extracting data from a photo hosting service
Once I came across this post, and a thought occurred to me: since we have such a beautiful, completely open gallery of private data (Radikal.ru), why not try to extract that data from it in a form convenient for processing? That is:
- download the images;
- recognize the text on them;
- extract useful information from that text and classify it for further analysis.
And as a result, after a few evenings, a working prototype was ready. Plenty of technical details follow.
Everything was done in C# with ASP.NET MVC 5, simply because that is what I write in all the time and it is the most convenient for me.
Stage 1: Download the picture
After digging around in the source code of the gallery pages for a while, I found no listing to enumerate, which means every page has to be downloaded and the image link ripped out of its markup. The good news is that the address of an image's page can be generated automatically: it is simply a URL containing the image's serial number. OK, we take HtmlAgilityPack and write a parser; luckily the image page has enough CSS classes that pulling out the desired node is not difficult.
We extract the node, look inside, and there is no link. The link turns out to be generated by JavaScript, which we never executed. This is sad, because the scripts are obfuscated, and I did not have the patience to work out how they operate.
OK, there is another way: open the page in a browser, wait for the scripts to run, and take the link from the fully rendered page. Fortunately, there is the wonderful combination of Selenium and PhantomJS (a headless browser), since doing everything through, say, Firefox is both slower and less convenient. Unfortunately, even this is very slow; it is hard to imagine a slower way :( About one second per image.
Parser:
public static string Parse_Radikal_ImagePage(IWebDriver wd, string Url)
{
    // Open the page and give the obfuscated scripts time to insert the image
    wd.Navigate().GoToUrl(Url);
    new WebDriverWait(wd, TimeSpan.FromSeconds(3))
        .Until(d => d.FindElements(By.CssSelector("div.show_pict img")).Count > 0);
    // Parse the rendered markup with HtmlAgilityPack
    HtmlDocument html = new HtmlDocument();
    html.OptionOutputAsXml = true;
    html.LoadHtml(wd.PageSource);
    HtmlNodeCollection Blocks = html.DocumentNode.SelectNodes("//div[@class='show_pict']//div//a//img");
    if (Blocks == null) return null; // image deleted or hidden
    return Blocks[0].Attributes["src"].Value;
}
* All the code is greatly simplified, with non-critical details removed. See the source for the full version.
The controller that drives it:
IWebDriver wd = new PhantomJSDriver("C:\\PhantomJS");
// Walk the image serial numbers backwards from the starting code
for (var imageCode = data.imgCode; imageCode > data.imgCode - data.imgCount; imageCode--)
{
    // Skip images that are already in the database
    if (ParserResult.Processed(imageCode)) continue;
    var Url = "http://radikal.ru/Img/ShowGallery#aid=" + imageCode.ToString() + "&sm=true";
    var imageUrl = Parser.Parse_Radikal_ImagePage(wd, Url);
    if (imageUrl != null)
    {
        // Download the image and save it to disk for the recognition stage
        var image = Parser.GetImageFromUrl(imageUrl);
        var Filename = TempFilesRepository.TempFilesDirectory() + "Radikal_" + imageCode.ToString() + "." + Parser.GetImageFormat(image);
        image.Save(Filename);
    }
}
wd.Quit();
All of this needs to be stored and processed somewhere. The logical choice is the MS SQL Server instance already deployed: create a small database on it and record each image's link and the path to the downloaded file. Here is a small class for storing and persisting the result of parsing one image. Why not store the images themselves in the database? More on that below, in the section on recognition.
[Table(Name = "ParserResults")]
public class ParserResult
{
[Key]
[Column(Name = "id", IsPrimaryKey = true, IsDbGenerated=true)]
public long id { get; set; }
[Column(Name = "Url")]
public string Url { get; set; }
[Column(Name = "Code")]
public long Code { get; set; }
[Column(Name = "Filename")]
public string Filename { get; set; }
[Column(Name = "Date")]
public DateTime Date { get; set; }
[Column(Name = "Text")]
public string Text { get; set; }
[Column(Name = "Extracted")]
public bool Extracted { get; set; }
public ParserResult() { }
public ParserResult(string Url, long Code, string Filename, string Text)
{
this.Url = Url;
this.Code = Code;
this.Filename = Filename;
this.Date = DateTime.Now;
this.Text = Text;
this.Extracted = false;
DataContext Context = DataEngine.Context();
Context.GetTable<ParserResult>().InsertOnSubmit(this);
Context.SubmitChanges();
}
public static bool Processed(long imgCode)
{
return DataEngine.Data().Any(x => x.Code == imgCode);
}
}
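One detail elided from the controller listing above: nothing there actually creates a ParserResult record. Presumably the full source registers each downloaded file with something like the call below (my reconstruction, not a quote from the source):
// Hypothetical: record the downloaded image so that Processed(imageCode)
// skips it on the next run; Text stays null until the recognition stage
new ParserResult(imageUrl, imageCode, Filename, null);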
Stage 2: Recognize the text
This, too, would seem to be far from the hardest task. We take Tesseract (more precisely, a .NET wrapper around it), download the data files for the Russian language, and... a flop! As it turns out, Tesseract needs near-ideal conditions to handle Russian properly: an excellent-quality scan, not a photo of a document taken on a lousy phone. The recognition rate is good if it approaches 10 percent.
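For reference, the failed attempt looked roughly like this. The post does not say which wrapper was used, so this is a minimal sketch assuming the popular Tesseract NuGet package and rus.traineddata unpacked into ./tessdata:
// A sketch of OCR with the Tesseract .NET wrapper (an assumption, not the
// project's actual code); filename points to a downloaded image
using (var engine = new TesseractEngine(@"./tessdata", "rus", EngineMode.Default))
using (var pix = Pix.LoadFromFile(filename))
using (var page = engine.Process(pix))
{
    var text = page.GetText();                 // recognized text
    var confidence = page.GetMeanConfidence(); // 0..1 quality estimate
}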
In general, acceptable Cyrillic recognition is offered by only three products: CuneiForm, Tesseract, and FineReader. Reading forums and blogs reinforced the impression that CuneiForm is not worth trying (many write that its recognition quality is not far from Tesseract's), so I decided to go straight to FineReader. Its main drawback is that it is paid, very much paid. On top of that, I did not have the FineReader Engine (which exposes a recognition API) at hand, so I had to build a dreadful contraption: run ABBYY Hot Folder, which watches a specified folder, recognizes the images that appear there, and drops same-named text files next to them. So, after waiting a little once the images are downloaded, we can pick up the ready recognition results and load them into the database. Very slow, very much a hack, but the recognition quality, I hope, pays for these costs.
var data = DataEngine.Data().Where(x => x.Text == null && x.Filename != null).ToList();
foreach (var result in data)
{
    // Hot Folder puts a same-named .txt file next to each recognized image
    var textFilename = result.Filename.Replace(Path.GetExtension(result.Filename), ".txt");
    if (System.IO.File.Exists(textFilename))
    {
        result.Text = System.IO.File.ReadAllText(textFilename, Encoding.Default).Trim();
        result.Update();
    }
}
Incidentally, it is precisely because of this hack that the images are not stored in the database: ABBYY Hot Folder, unfortunately, cannot work with a database.
Stage 3: Extract information from the text
Surprisingly, this stage turned out to be the easiest, probably because I knew what to look for: a year ago I took the Natural Language Processing course on Coursera.org, so I had an idea of how such problems are solved and what the terminology is. Partly for that reason I decided not to reinvent any wheels, and after a quick googling I picked up the PullEnti library, which:
- is built specifically for the Russian language;
- comes wrapped for use from C# out of the box;
- is free for non-commercial use.
It turned out to be very simple to extract entities using it:
public static List<Referent> ExtractEntities(string source)
{
    // create a processor instance
    Processor processor = new Processor();
    // run it on the text
    AnalysisResult result = processor.Process(new SourceOfAnalysis(source));
    return result.Entities;
}
The extracted entities have to be stored and analyzed, so we write them into a simple table in the database: image ID / entity type / entity value. After parsing we get something like this:
DocId | EntityType         | Value
63    | Territorial entity | city of Ussuriysk
63    | Address            | Dzer street, 1; city of Ussuriysk
63    | Date               | November 17, 2014
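To populate this table, a loop like the following suffices. This is a sketch: SaveEntity is a hypothetical helper that writes one row, and TypeName / ToString() are what the PullEnti SDK appears to expose as an entity's type and normalized value:
// Flatten each extracted entity into an (image ID, type, value) row
foreach (Referent entity in ExtractEntities(result.Text))
{
    SaveEntity(result.id, entity.TypeName, entity.ToString()); // hypothetical helper
}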
PullEnti can extract quite a few entity types from text (correcting errors automatically along the way): bank details, territorial entity, street, address, URI, date, period, designation, amount, person, organization, and so on. Then you sit down with the resulting tables and think: select the documents for a specific city, search for a specific organization, and so forth. The main task is complete: the data has been extracted and prepared.
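For instance, "documents for a specific city" becomes a one-line query over that table. A minimal LINQ to SQL sketch, assuming a hypothetical ExtractedEntity mapping over the DocId / EntityType / Value columns (the names are illustrative, not taken from the project source):
[Table(Name = "ExtractedEntities")]
public class ExtractedEntity
{
    [Column(Name = "DocId")] public long DocId { get; set; }
    [Column(Name = "EntityType")] public string EntityType { get; set; }
    [Column(Name = "Value")] public string Value { get; set; }
}

// IDs of all documents that mention the given city
public static List<long> DocumentsForCity(DataContext context, string city)
{
    return context.GetTable<ExtractedEntity>()
        .Where(e => e.EntityType == "Territorial entity" && e.Value.Contains(city))
        .Select(e => e.DocId)
        .Distinct()
        .ToList();
}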
Results
Let's see what happened on a small test sample.
- Gallery pages processed: 2,263;
- Images obtained: 1,972 (on the remaining pages the images were deleted or hidden by privacy settings);
- Text extracted: 773 (in the other images FineReader found nothing suitable for recognition);
- Entities extracted from the text: 293.
The last figure is the honest measure of success, because an image rich in graphics quite often yields "text" like "^ЯА71 Г1/Г". So we end up with text suitable for analysis in roughly one out of every ten images. Not bad for such a messy data store!
And here, for example, is the list of cities recovered (the documents they come from are quite often photographs of passports): Ankara, Bobruisk, Warsaw, Zlatoust, Kazan, Kiev, Krasnoyarsk, Minsk, Moscow, Omsk, St. Petersburg, Sukhum, Tver, Ussuriysk, Ust-Kamenogorsk, Chelyabinsk, Shuya, Yaroslavl.
Summary
- The task is solvable; a working prototype has been built.
- The prototype's speed so far does not stand up to criticism :( One image per second is very slow.
- And, of course, a number of issues remain unresolved: for example, the process crashes once PhantomJS eats up all the memory.
Source code (a Visual Studio 2013 project): download.