aleks_raiden August 4, 2008 at 22:16

Full-text search in web projects: Sphinx, Apache Lucene, Xapian

Full copyright from my blog . The original material was written specifically for Developers.org.ua.

Probably any modern web project is difficult to imagine without ... without content! Yes, it is content in its various manifestations that today “rules the ball” in various web projects. Not so important - created by users or obtained from other sources automatically - information is the main one of any (well, or almost any) project. And if so, then the question of finding the necessary information is very acute. And sharper every day, in view of the rapid expansion of the amount of this content itself, mainly due to user-generated content (these are forums, blogs, and now fashionable communities like Habrahabr.ru) Thus, any developer implementing any project today is faced with the need to implement a search in his web application. Moreover, the requirements for such a search are already much more complicated and wider than even a year or two ago. Of course, for some projects, a simple solution is quite suitable, for example, it is quite possible to use Custom Google Search. But the more complex the application, and the more complex the structure of the content, if you need special types of search and processing of the result, or just the amount or format of the data in your project is special, you will need your own search engine. It is its own system, its own search server or service, and not a third-party, even flexible and customizable one. But what to choose, and in general - which search projects are currently on the market that are ready for use in real projects, not research or scientific, but real business applications? Next, we briefly consider the various options for search solutions that are suitable for embedding in your web application or deploying on your own server.

General architecture and terms

And so, for a deeper understanding of the essence of the search, we will try to briefly familiarize ourselves with the concepts and terms used. By a search server (or simply “search engine”) we mean a library or component, in general, a software solution that independently maintains its own database (in fact, it can be a DBMS or just files and a distributed storage platform) of documents in which in fact, the search is in progress, it provides third-party applications to add, delete and update documents in this database. This process is called indexing, and can be implemented by a separate component or server (indexer). Another component, namely the search engine, accepts a search request and, processing the database created by the indexer, selects data that matches the request. Besides, it can calculate additional parameters for search results (to rank documents, calculate the degree of compliance with a search query, etc.). These are the most important search engine systems, and they can be both monolithically implemented in one library, or they can be independent servers, access to which is realized through various application protocols and APIs. Additionally, pre-processing of documents before indexing (for example, extracting text from files of various formats or databases) can be implemented in a search server, various APIs can also be implemented with additional components. The server itself can store its index data in a database (both built-in and using an external server, for example, MySQL),

Separately, I would single out the presence of a web search implementation module. That is, a built-in ability to receive documents from websites using the HTTP protocol and enter them into the index can be implemented in the search server. This module is usually called a "spider" or "crawler", and thus, the search engine may already look like a "real" and familiar to all searches like Google or Yandex. So you can implement your own search engine for the sites you need, for example, dedicated to one topic - just create a list of addresses and configure their periodic crawl. However, this is already a much more complex and serious task, both technically and organizationally, so we do not dwell on the details of its implementation. Among the projects that we will consider, there is one server that implements a web search engine, that is, it contains everything you need to create a "Yandex killer." Interesting?

What parameters are important?

When choosing a search engine, the following parameters should be considered:

indexing speed - that is, how quickly the search server “grinds” documents and puts them in its index, making search on them available. Usually measured in megabytes of clear text per second.
reindexing speed - in the process, documents are changed or new ones are added, so you have to reindex information. If the server supports incremental indexing, then we process only new documents, and leave the update of the entire index for later or even not at all. Other servers require a complete rebuild of the index when adding new information, or use additional indexes (delta index), which includes only new information
Supported APIs - if you use a search engine in conjunction with a web application, pay attention to the presence of a built-in API for your language or platform. Most search engines have APIs for all popular platforms - Java, PHP, Ruby, Phyton.
supported protocols - in addition to the API, access protocols are also important, in particular, if you want to access from another server or application that does not have a native API. Usually supported are XML-RPC (or varieties like JSON-RPC), SOAP, or access via http / socket.
database size and search speed - these parameters are very interconnected, and if you implement something unique and provide that you can have millions or more documents in the database that need to be instantly searched, then look at the well-known implementations of the selected platform. Although no one explicitly states restrictions on the number of documents in the databases, and in small collections (for example, several thousand or tens of thousands of documents), all search engines will be approximately the same, but if it comes to millions of documents, this can become a problem. By the way, this parameter in itself does not always matter; you need to look at the features of each system and the algorithms for working with search, as well as other parameters, for example, the speed of reindexing or the possible types of indexes and their storage systems.
supported document types - of course, any server supports plain text (although you should look at the ability to work with multilingual documents and UTF-8 encoding), but if you need to index different types of files, for example, HTML, XML, DOC or PDF, then it's worth looking to those solutions where there is a built-in component for indexing various formats and extracting text from them. Of course, all this can be done right in your application, but it’s better to search for ready-made solutions. This also includes support for indexing and searching for information stored in the DBMS - it’s no secret that storage is the most common for web applications, and it’s better for the search server to work directly with the database, without having to manually retrieve and “feed” documents to it for indexing.
work with different languages and stemming - for the correct search using different languages, native support is required not only for encodings, but also for working with the language features. All support the English language, which is quite simple for search and processing, but for Russian and other similar ones it is necessary to use automatic means to ensure morphology. The stemming module allows you to inflect and parse words in a search query for a more correct search. If the search in Russian is critical for you, pay attention to the presence of this module and its features (of course, manual stemming is better than automatic, but more difficult, although automatic are very different).
support for additional types of fields in documents - in addition to the text itself, which is indexed and in which the search is performed, it is necessary to be able to store an unlimited number of other fields in the document that can store meta-information about the document, which is necessary for further work with search results. It is highly desirable that the number and types of fields are not limited, and indexability of fields can be customized. For example: in one field the name is stored, in the second - abstract, in the third - keywords, in the fourth - the identifier of the document in your system. It is necessary to flexibly configure the search area (in each field or in the specified), as well as those fields that will be extracted from the search engine database and displayed in the search results.
platform and language are just as important, albeit to a lesser extent. If you are going to separate the search into a module or server separate from the application, or even move it to a separate server (hardware in the sense), then the role of the platform is not that big. This is usually either C ++ or Java, although there are options for other languages (usually solution ports in java).
the presence of built-in ranking and sorting mechanisms is especially good if the search engine can be expanded (and it is written in a language you know) and write the implementations of these functions you need, because there are so many different algorithms, and it’s not the fact that the default search engine suits you .

Of course, there are still a lot of different parameters, and the data search area itself is quite complex and serious, but for our applications this is quite enough. You do not compete with Google?

Now briefly talk about those search solutions that you should pay attention to, only you decide to start the question of search. I intentionally do not consider the solution built into your DBMS - FULLTEXT in MySQL and FTS in PostgreSQL (integrated into the database from the version, if I'm not mistaken, 8.3). In MySQL, the solution cannot be used for serious searches, especially for large amounts of data; searching in PostgreSQL is much better, but only if you already use this database. Although, as an option, install a separate database server and use only data storage and search there, too. Unfortunately, I do not have data on real applications on large volumes of data and complex queries (units and tens of GB of text).

Sphinx search engine

Type : separate server, MySQL storage engine
Platform : C ++ / Cross Platform
Index : monolithic + delta index, distributed search capability
Search capabilities : boolean search, phrase search, etc. with the ability to group, rank and sort the result
APIs and protocols : SQL DB (as well as native support for MySQL and PostgreSQL), native XML interface, built-in APIs for PHP, Ruby, Python, Java, Perl
Language support : built-in English and Russian stemming, soundex for morphology
Additional fields : yes, unlimited
Document types : text only or native XML format
Index size and search speed : very fast, indexing about 10 Mb / s (depending on CPU), search about 0.1 sec / ~ 2 - 4 GB index, supports index sizes of hundreds of GB and hundreds of millions of documents (if you do not use clustering) , however, there are examples of work on terabyte databases.
License : open source, GPL 2 or commercial.
URL : http://sphinxsearch.com

Sphinx is probably the most powerful and fastest of all open engines that we will consider. It is especially convenient because it has direct integration with popular databases and supports advanced search capabilities, including ranking and stemming for Russian and English. It seems that the project has excellent support for the Russian language because the author is our compatriot, Andrei Aksenov. Non-trivial features such as distributed search and clustering are supported, however, the proprietary feature is a very, very high indexing and search speed, as well as the ability to perfectly parallelize and utilize the resources of modern servers. Very serious installations are known that contain terabytes of data, so Sphinx can be recommended as a dedicated search server for projects of any level of complexity and volume of data. Transparent work with the most popular MySQL and PostgreSQL databases allows it to be used in a typical web development environment, and there is also an “out of the box” API for different languages, primarily for PHP without the use of additional modules and extension libraries. But the search engine itself needs to be compiled and installed separately, therefore, it is not applicable on ordinary hosting - only VDS or its own server, moreover, preferably more memory. The search engine’s index is monolithic, so you’ll have to “pervert” a little, setting up the delta index to work correctly when there are a lot of new or changed documents, although the huge indexing speed allows organizing the rebuild of the index on a schedule and this will not affect the work of the search itself.

SphinxSE is a version that functions as a data storage engine for MySQL (requires a patch and recompilation of the database), Ultrasphinx is a configurator and client for Ruby (except for the API present in the distribution kit), in addition there are plugins for many well-known CMS b blog platforms, wikis that replace the standard search (see the full list here: http://www.sphinxsearch.com/contribs.html )

Apache Lucene Family

Type : standalone server or servlet, embedded library
Platform : Java / cross-platform (there are ports for many languages and platforms)
Index : incremental index, but requiring the operation of merging segments (can be in parallel with the search)
Search features : boolean search, phrase search, fuzzy search, etc. with the ability to group, rank and sort the result
API and protocols : Java API
Language support : there is no morphology, there is stemming (Snowball) and analyzers for a number of languages (including Russian)
Additional fields : yes, unlimited
Types of documents : text, possibly indexing the database via JDBC
Index size and search speed : about 20 Mb / min, index files are limited to 2 GB (on a 32-bit OS). There are parallel search capabilities across multiple indexes and clustering (requires third-party platforms)
License : open source, Apache License 2.0
URL : http://lucene.apache.org/

Lucene- The most famous of the search engines, initially focused specifically on embedding in other programs. In particular, it is widely used in Eclipse (search in documentation) and even in IBM (products from the OmniFind series). The project's pluses include developed search capabilities, a good system for constructing and storing an index that can be simultaneously replenished, documents are deleted and optimization is carried out along with the search, as well as parallel search across multiple indexes with combining the results. The index itself is built from segments, but to improve the speed it is recommended to optimize it, which often means almost the same costs as for reindexing. Initially, there are analyzer options for different languages, including Russian with support for stemming (reduction of words to normal form). However, the downside is still the low indexing speed (especially in comparison with Sphinx), the complexity of working with databases and the lack of an API (except for native Java). Although Lucene can cluster and store indexes in a distributed file system or database to achieve serious performance, it requires third-party solutions, as well as for all other functions - for example, initially it can only index plain text. But it is precisely in terms of using Lucene “ahead of the rest” as part of third-party products - for no other engine there are so many ports for other languages and uses. One of the factors of such popularity is a very successful index file format, which is used by third-party solutions, so it’s quite possible to build solutions that work with the index and search,

Solr- The best Lucene-based solution, greatly expanding its capabilities. This is an independent enterprise-level server that provides extensive search capabilities as a web service. By default, Solr accepts documents using the HTTP protocol in XML format and returns the result also through HTTP (XML, JSON or another format). Clustering and replication to several servers is fully supported, support for additional fields in documents is expanded (unlike Lucene, they support various standard data types, which brings the index closer to databases), support for faceted search and filtering, advanced configuration and administration tools, and also features backup index in the process. Built-in caching also improves the speed of work. On the one hand, this is a standalone solution based on Lucene,

Nutch is the second most famous project based on Lucene. This is a web search engine (search engine + web spider for crawling sites) combined with a distributed storage system Hadoop. Out of the box, Nutch can work with remote sites on the network, indexes not only HTML, but also MS Word, PDF, RSS, PowerPoint and even MP3 files (meta tags, of course), in fact it is a full-fledged Google search killer. Just kidding, the payback for this is a significant reduction in functionality, even the basic one from Lucene, for example, Boolean operators in the search are not supported, and stemming is not used. If the task is to make a small local search engine for local resources or a pre-limited set of sites, while you need full control over all aspects of the search, or you are creating a research project to test new algorithms, then Nutch will be your best choice. However, consider its requirements for hardware and a wide channel - for a real web-search engine, traffic is counted in terabytes.

Do you think that no one uses Nutch “in an adult way”? You are mistaken - from the most famous projects that you could hear about, it is used by the search engine using the Krugle source codes ( http://krugle.com/ ).

But not only through add-on projects, Lucene is known and popular. Being the leader among open source solutions and embodying many excellent solutions, Lucene is the first candidate for porting to other platforms and languages. Now there are the following ports (I mean those that are more or less actively developing and the most complete ports):

Lucene.Net - the full Lucene port, fully algorithmically, by class and API, identical port to the MS.NET/Mono platform and C # language. While the project is in the incubator, and the last release is dated April 2007 (port of the final 2.0 version). Project page .
Ferret - Ruby port
CLucene is a C ++ version that promises to give a significant performance boost. According to some tests, it is 3-5 times faster than the original, and sometimes even more (on indexing, search is comparable or faster by only 5-10%). It turned out that this version is used by a large number of projects and companies - ht: // Dig, Flock, Kat (search engine for KDE), BitWeaver CMS and even companies such as Adobe (search for documentation) and Nero. Project page
Plucene - Perl implementation
PyLucene is an implementation for Python applications , but not complete and requires partially Java
Zend_Search_Lucene is the only PHP language port available as part of the Zend Framework. By the way, it is quite functional and as an independent solution, outside the framework, I did experiments and as a result of the selection, the entire search mechanism now occupies only 520 KB in a single PHP file. Project Homepage: http://framework.zend.com/manual/en/zend.search.lucene.htm

Xapian

Type : Embedded Library
Platform : C ++
Index : incremental index, transparently updated in parallel with the search, work with multiple indexes, in-memory indexes for small databases.
Search capabilities : Boolean search, phrase search, ranking search, mask search, synonym search, etc. with the ability to group, rank and sort the result
APIs and protocols : C ++, Perl API, Java JINI, Python, PHP, TCL, C # and Ruby, CGI interface with XML / CSV format
Language support : there is no morphology, there is stemming for a number of languages (including Russian), spell checking in search queries has been implemented
Additional fields : none
Document types : text, HTML, PHP, PDF, PostScript, OpenOffice / StarOffice, OpenDocument, Microsoft Word / Excel / Powerpoint / Works, Word Perfect, AbiWord, RTF, DVI, SQL database indexing via Perl DBI
Index size and search speed : running installations on the 1.5 TB index and the number of documents in the hundreds of millions are known.
License : open source, GPL
URL : http://xapian.org

While this is the only contender for competition with the dominance of Lucene and Sphinx, it compares favorably with the presence of a “live” index that does not require restructuring when adding documents, a very powerful query language, including built-in stemming and even spell checking, as well as support for synonyms. This library will be the best choice if you have a perl system or if you need advanced features for building search queries and you have a very frequent index update, while new documents should be immediately available for search. However, I did not find any information about the ability to add arbitrary additional fields to the documents and get them with the search results, so the connection of the search system with your own can be difficult. Package includes Omega- Add-on library, which is ready to be used as an independent search engine and it is just it that is responsible for indexing different types of documents and CGI interface.

Perhaps this is where our review can be completed. Although there are still many search engines, some of them are ports or add-ons for the ones already examined. For example, an industrial-level search server for eZ’s own CMS, ezFind is actually not a separate search engine, but an interface to standard Lucene Java and includes it in its delivery. The same goes for the Search component from their eZ Components package.- It provides a unified interface for accessing external search engines, in particular, it interacts with the Lucene server. And even such an interesting and powerful solution as Carrot and SearchBox are seriously modified versions of the same Lucene, significantly expanded and supplemented with new features. There are not so many independent search solutions with open source that fully implement indexing and search using their own algorithms in the market. And what decision to choose depends on you and features, often not at all obvious, of your project.

conclusions

Although you can finally decide whether or not a specific search engine fits your project, you can and often only after a detailed study and tests, however, some conclusions can be drawn now.

Sphinx is suitable for you if you need to index large volumes of data in the MySQL database and the speed of indexing and searching is important to you, however, specific search capabilities like “fuzzy search” are not required and you agree to allocate a separate server or even a cluster for this.

If you need to embed a search module in your application, it is best to search for ready-made ports for your language in the Lucene library- for all common languages they are, but they can realize far from all the possibilities of the original. If you are developing a Java application, Lucene is definitely the best choice. However, take into account the sufficient slow indexing and the need for frequent optimization of the index (and the demand for CPU and disk speed). For PHP, this, apparently, is the only acceptable option for the full implementation of the search without additional modules and extensions.

Xapian is a fairly good and high-quality product, but less common and flexible than the rest. For C ++ applications and the requirements for the wide possibilities of the query language, it will be the best choice, however, it requires manual refinement and modifications to be embedded in your own code or used as a separate search server.

Related Links

Sphinx search ( http://sphinxsearch.com/ )
Apache Nutch ( http://lucene.apache.org/nutch/ )
Apache Solr ( http://lucene.apache.org/solr/ )
Apache Lucene Java ( http://lucene.apache.org/ )
Apache Lucy ( http://lucene.apache.org/lucy/ )
Lucene.Net ( http://incubator.apache.org/lucene.net/ )
CLucene ( http://clucene.wiki.sourceforge.net/ )
Lucene Port List ( http://wiki.apache.org/jakarta-lucene/LuceneImplementations )
Zend Framework Lucene full PHP port ( http://framework.zend.com/manual/en/zend.search.lucene.html )
ezComponents Search ( http://ezcomponents.org/docs/tutorials/Search )
ez Find ( http://ez.no/ezfind )
Xapian ( http://xapian.org )
OpenFTS ( http://openfts.sourceforge.net/ )
List of Java-based search solutions ( http://www.manageability.org/blog/stuff/full-text-lucene-jxta-search-engine-java-xml )
DotLucene.net is closed ( http://www.dotlucene.net/ )
Lidia ( http://www.nttdata.co.jp/en/media/2006/101100.html )
Hyper Estraier ( http://hyperestraier.sourceforge.net/ )
Kneobase ( http://sourceforge.net/projects/kneobase/ )
Egothor ( http://egothor.sourceforge.net/ )
Ferret ( http://ferret.davebalmain.com/ )
OpenGrok ( http://www.opensolaris.org/os/project/opengrok/ )

Tags:

Full-text search in web projects: Sphinx, Apache Lucene, Xapian

Also popular now: