Using optical character recognition in DeviceLock DLP to prevent document leaks

Published on December 03, 2018

    One of the basic tasks of a DLP system is to detect government-issued identity documents (passports, birth certificates, driver's licenses, etc.) in the stream of transmitted data and to prevent their unauthorized distribution.



    If the documents take the form of text data in spreadsheets, databases, and so on, this usually causes no problems, provided that the DLP system supports content filtering at all.


    But what do you do when the documents are scans?


    Using DeviceLock DLP as an example, I would like to show how to create a DLP policy that prohibits printing passports, sending them by e-mail (SMTP), and uploading them to cloud file storage.


    A distinctive feature of DeviceLock DLP is that optical character recognition (OCR) is performed directly on the user's computer by a resident OCR module that is part of the DLP agent. In other words, the built-in OCR extracts text from graphic files and checks it against content-analysis rules at the moment the user performs an action with those files, without sending them to a third-party OCR server. This architecture allows DeviceLock DLP to quickly decide whether to block or allow a user operation.
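    To make the idea concrete, here is a minimal Python sketch of an on-endpoint decision. It is not DeviceLock DLP code: the names `decide_locally`, `fake_ocr`, and `mentions_passport` are hypothetical, and the OCR step is a stand-in that simply decodes bytes so the example runs without an OCR engine. The point is only that recognition and rule checking happen in the same process, so the image never has to leave the machine.

```python
from typing import Callable

# Hypothetical types: an OCR engine is any function that maps image bytes to
# text, and a content rule is any function that maps text to True (match).
OcrEngine = Callable[[bytes], str]
ContentRule = Callable[[str], bool]


def decide_locally(image: bytes, ocr: OcrEngine, rule: ContentRule) -> str:
    """Return "block" or "allow" without the image leaving this process."""
    text = ocr(image)  # recognition runs in-process, on the endpoint
    return "block" if rule(text) else "allow"


# Stand-ins for illustration only; a real agent would call an actual OCR engine.
def fake_ocr(image: bytes) -> str:
    return image.decode("utf-8", errors="ignore")


def mentions_passport(text: str) -> bool:
    return "passport" in text.lower()


print(decide_locally(b"PASSPORT 4510 123456", fake_ocr, mentions_passport))
print(decide_locally(b"quarterly report", fake_ocr, mentions_passport))
```

    Passing the OCR engine and the rule in as plain functions keeps the decision logic independent of any particular recognition library.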


    Separately, I would like to note that the agent-based implementation of the DLP system eliminates the need to transfer user data outside the protected computer for any type of analysis, including OCR. This makes it possible to use DeviceLock DLP in countries with very strict legislation protecting workers' rights, such as Germany and France.


    As a test sample, we will use a scan of a Russian passport in JPG format.



    First, create a composite content filtering rule. We will "catch" passport scans by two criteria: words typical of a Russian passport, taken from the dictionary built into DeviceLock DLP, and numbers. Only graphic files are of interest to us (more than 30 graphic formats are supported).
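    The logic of such a composite rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's rule engine: the keyword list is my own small sample rather than the dictionary built into DeviceLock DLP, the threshold of two keyword hits is arbitrary, and the number pattern assumes a four-digit series followed by a six-digit number. The function and constant names are hypothetical.

```python
import re
from pathlib import PurePath

# Assumed subset of graphic formats; the product supports many more.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".gif"}

# Illustrative keywords only, not the built-in dictionary.
KEYWORDS = ("паспорт", "выдан", "код подразделения", "дата выдачи", "место рождения")
MIN_KEYWORD_HITS = 2  # arbitrary threshold chosen for this sketch

# Assumed layout: 4-digit series (optionally split as "45 10") + 6-digit number.
PASSPORT_NUMBER = re.compile(r"\b\d{2}\s?\d{2}\s?\d{6}\b")


def is_image(filename: str) -> bool:
    return PurePath(filename).suffix.lower() in IMAGE_EXTENSIONS


def keyword_hits(text: str) -> int:
    lowered = text.lower()
    return sum(1 for word in KEYWORDS if word in lowered)


def matches_passport_rule(filename: str, ocr_text: str) -> bool:
    """All three conditions must hold: image file, keywords, number pattern."""
    return (
        is_image(filename)
        and keyword_hits(ocr_text) >= MIN_KEYWORD_HITS
        and PASSPORT_NUMBER.search(ocr_text) is not None
    )


sample = "ПАСПОРТ ВЫДАН ОВД Код подразделения 770-001 45 10 123456"
print(matches_passport_rule("scan.jpg", sample))
print(matches_passport_rule("scan.docx", sample))
```

    Requiring all three conditions at once is what keeps false positives down: a ten-digit number on its own, or the word "паспорт" in an unrelated picture, does not trigger the rule.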



    Then apply the content filtering rule to the SMTP protocol, cloud storage, and printers. In line with the task above, we prohibit sending over the network and printing any files that match the rule. In addition, we enable logging of user actions so that attempts to transfer or print passport scans show up in the logs.
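    In DeviceLock DLP this is configured in the management console, but the effect of the policy can be modeled with a short Python sketch. Everything here is hypothetical: the channel names, the `AuditLog` class, and the `enforce` function are my own illustration of "deny on the three channels and write an audit record", not the product's API or log format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Channels covered by the policy in this article (names are illustrative).
PROTECTED_CHANNELS = frozenset({"smtp", "cloud_storage", "printer"})


@dataclass
class AuditLog:
    records: list = field(default_factory=list)

    def add(self, user: str, channel: str, filename: str, action: str) -> None:
        self.records.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "channel": channel,
            "file": filename,
            "action": action,
        })


def enforce(user: str, channel: str, filename: str,
            rule_matched: bool, log: AuditLog) -> bool:
    """Return True if the operation may proceed, False if it is denied."""
    if channel in PROTECTED_CHANNELS and rule_matched:
        log.add(user, channel, filename, "denied")  # every denial is audited
        return False
    return True


log = AuditLog()
for channel in ("cloud_storage", "printer", "smtp"):
    allowed = enforce("alice", channel, "passport.jpg", True, log)
    print(channel, "allowed" if allowed else "denied")
print(len(log.records), "audit records")
```

    Files that do not match the rule, and channels outside the policy, pass through without a denial record in this sketch.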




    Now let's try to upload the passport scan to Yandex.Disk.



    At the same time, the unsuccessful attempt is recorded in the audit log.



    When you try to print the passport scan, DeviceLock DLP stops the print job at the moment it is sent to the printer and displays this message.



    Sending the scan via SMTP fails as well.



    The audit log contains a record of every one of these attempts.



    In conclusion, I would add that DeviceLock DLP supports optical character recognition (OCR) for all major languages, including Russian, English, German, Chinese, and Japanese. Text can be extracted from scanned documents, from documents photographed at an angle of up to 90 degrees to the surface of the document, and from screenshots of documents.