ABBYYTeam October 28, 2010 at 10:37

Your own Google Search - now for document scans too

How do you make documents on company servers available for full-text search and still keep them confidential? How do you get the functionality of Google Search ~~without airing your dirty laundry in public~~ while keeping the documents inside the company’s network? Corporate search is yet another fast-growing, tasty pie.

The ~~tiny, little-known~~ company Google offers a solution in the form of a handsome yellow box that mounts in a standard 19-inch rack: the Google Search Appliance (GSA).

The scheme is as follows:

sign a contract
install your yellow box in the rack
assign it an IP address (a domain name will not hurt either)
the box crawls and indexes the documents on the network
anyone who opens that IP address in a browser sees exactly the same page as on www.google.com, where they can enter the same queries and get results
???
HAPPINESS

It is the same familiar search (so employees need almost no training), and the documents never leave the company’s network. One significant limitation: image files in file stores (for example, document scans) cannot be searched, because the GSA cannot extract text from them. Houston, we have a problem.

As often happens on this corporate blog, ~~Captain~~ optical character recognition comes to the rescue.

The Google Search Appliance can not only crawl sites on its own, but also accept so-called feeds (alas, no adequate Russian word for them has been found yet).

A feed is a special XML document that can carry (URL + text) pairs. An external program sends the feed to the GSA as a plain HTTP POST request to the appropriate port. The GSA accepts the feed, parses it, and records in its index that “the document at this URL contains this text”.
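To make the (URL + text) idea concrete, here is a minimal sketch of building such a feed and wrapping it in a POST request. The element names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `content`), the form field names, and the `:19900/xmlfeed` endpoint are assumptions recalled from Google's feeds protocol and are not given in this post, so check them against the GSA documentation. The host name, data source name, and file URL are hypothetical.

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET


def build_feed(datasource, records, feedtype="incremental"):
    """Build a feed XML string from (url, text) pairs.

    Element and attribute names are assumed, not taken from the post.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url, text in records:
        record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
        # ElementTree escapes characters such as < and & in the text for us.
        ET.SubElement(record, "content").text = text
    return ET.tostring(root, encoding="unicode")


def make_feed_request(endpoint, datasource, feed_xml, feedtype="incremental"):
    """Wrap the feed in a multipart/form-data POST request without sending it."""
    boundary = uuid.uuid4().hex
    fields = (("feedtype", feedtype), ("datasource", datasource), ("data", feed_xml))
    parts = [
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
        for name, value in fields
    ]
    body = ("".join(parts) + f"--{boundary}--\r\n").encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Hypothetical scan URL and recognized text.
feed = build_feed("scans", [("file://fileserver/scans/contract.tif", "recognized text")])
request = make_feed_request("http://gsa.example.local:19900/xmlfeed", "scans", feed)
# urllib.request.urlopen(request) would actually send the feed.
```

The request is only constructed here, so the sketch can be run without an appliance on the network.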

Later, when a user enters a matching search query, the document (a link plus the extracted text with the matches highlighted) appears in the search results. It is the same Google Search, except that the text is extracted and "embedded" by an external program.

Happiness is near. For text recognition we will, as usual, use ~~electrical tape~~ ABBYY Recognition Server. It includes a separate service that can crawl file stores, pass the files to Recognition Server for recognition, build feeds from the recognition results, and send those feeds to the Google Search Appliance.

The store can be crawled repeatedly. On each pass, changed files are recognized again and new feeds are sent for them, while deleted files get special feeds that tell the GSA to remove their URLs from the index. The service runs on the same machine as Recognition Server.
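The repeated-crawl logic can be sketched roughly as follows. This is not how the Recognition Server service is actually implemented (the post does not say). It is just one plausible way to decide which files need a new feed and which need a removal feed, by comparing content hashes between two passes. The function names and example paths are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map every file under root to a SHA-256 digest of its contents."""
    return {
        str(path): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(root).rglob("*"))
        if path.is_file()
    }


def diff_snapshots(previous, current):
    """Return (files to recognize and feed again, files to remove from the index)."""
    # New files have no previous digest; changed files have a different one.
    to_recognize = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    # Files seen on the last pass but missing now need a removal feed.
    to_delete = sorted(p for p in previous if p not in current)
    return to_recognize, to_delete


# Two hand-written snapshots standing in for consecutive crawls.
last_pass = {"scans/a.tif": "111", "scans/b.tif": "222", "scans/c.tif": "333"}
this_pass = {"scans/a.tif": "111", "scans/b.tif": "999", "scans/d.tif": "444"}
changed_or_new, deleted = diff_snapshots(last_pass, this_pass)
```

In this example `changed_or_new` contains `scans/b.tif` and `scans/d.tif`, and `deleted` contains `scans/c.tif`. Hashing every file is the simplest option. A real service might compare modification times first to avoid rereading a large archive.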

The feed mechanism makes it possible to separate recognition from the GSA completely. Because Recognition Server scales well, recognition can be fast even for a large number of documents. For example, if you want to get a large archive into the index quickly, you can install recognition stations on employees' machines via an SMS installation and configure the product so that the stations are used only on weekends or only at night.

Naturally, the same Recognition Server installation can be used for the rest of the organization’s business processes.

So here is one more scenario for Recognition Server: helping you get a slice of that fast-growing pie.

Dmitry Meshcheryakov
Data Entry Products Department
