alizar November 3, 2008 at 14:58

Google Adds an OCR Engine for PDF Indexing

Google has taken a significant step towards indexing the so-called Invisible Web, that is, the lion's share of web content that search engine crawlers still cannot reach. This consists mainly of password-protected sites and various databases, as well as huge collections of scanned documents in PDF format.

Google and many other search engines index a PDF without any problems if it has a text layer (text stored in a standard format inside the file container). But such “proper” PDFs are actually relatively rare. Far more documents are ordinary scanned copies, that is, page images simply saved as PDF. To index them, Google has now added an OCR engine. As a result, millions of previously inaccessible government reports, court decisions and academic papers will be included in the index. Here are some examples of the new engine at work.
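The distinction between a PDF with a text layer and a scanned PDF can be illustrated with a minimal sketch. It is not Google's method, which the source does not describe. It assumes that some PDF text extractor has already returned one string per page, and the names `pages_needing_ocr`, `indexing_route` and the threshold `MIN_TEXT_CHARS` are hypothetical, chosen only for this example.

```python
from typing import List

# Hypothetical threshold: pages with fewer visible characters than this are
# treated as scanned images. The value is an assumption for illustration.
MIN_TEXT_CHARS = 20


def pages_needing_ocr(page_texts: List[str], min_chars: int = MIN_TEXT_CHARS) -> List[int]:
    """Return zero-based indices of pages whose text layer is missing or nearly empty."""
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters, so a page made of blank
        # lines does not pass as having a text layer.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr


def indexing_route(page_texts: List[str], min_chars: int = MIN_TEXT_CHARS) -> str:
    """Classify a document as 'text', 'ocr', or 'mixed'."""
    if not page_texts:
        # Nothing was extracted at all, so only OCR could recover content.
        return "ocr"
    missing = pages_needing_ocr(page_texts, min_chars)
    if not missing:
        return "text"
    if len(missing) == len(page_texts):
        return "ocr"
    return "mixed"
```

A document classified as "text" could be indexed directly, while "ocr" and "mixed" documents would first need their image-only pages passed through an OCR step. A real pipeline would likely need a more careful test than a character count, for example to catch text layers that contain only garbage.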

Recall that in April Google learned to process drop-down menus and other HTML forms in various database interfaces. This is another important technology for indexing the Invisible Web.
