Searching three times faster: multi-queries and faceted search

    In today's article I'll tell you about a Sphinx feature called multi-queries: the optimizations built into it, how to implement so-called faceted search with it, and how, in general, it can sometimes make search three times faster.

    But first, 15 seconds of self-promotion (if you don't praise yourself, nobody will). This year Sphinx made it to the second round of the SourceForge Awards 2009 in the SysAdmins and Enterprise categories (rumor has it we fell just short in the Developers category). Voting runs for another week (until the 20th), and all you need is a working email address. Thanks in advance to everyone who supports us!

    And now back to development. What are multi-queries, and where does the promised threefold speedup come from?

    Multi-queries are a mechanism that lets you send several search queries to the server in a single packet.

    The API methods that implement the multi-query mechanism are called AddQuery() and RunQueries(). (By the way, the "regular" Query() method uses them internally: it calls AddQuery() once and then RunQueries() right away.) The AddQuery() method captures the current state of all query settings set by previous API calls, and memorizes the query. The settings of a stored query can no longer be changed, and no further API calls will affect them, so for subsequent queries you can freely use any other settings (a different sorting mode, different filters, etc.). The RunQueries() method actually sends all the memorized queries in a single packet and returns several result sets. There are no restrictions on the queries in a batch. The number of queries is, just in case, limited by the max_batch_queries directive (added in 0.9.10; previously it was a fixed 32), but that is essentially just a sanity check against broken packets.

    Why use multi-queries? Generally speaking, it all comes down to performance. First, by sending queries to searchd in a single packet we always save a little time and resources by making fewer network round trips. Second, and more importantly, searchd gets a chance to apply certain optimizations across the whole query batch. New optimizations are added over time, so it always makes sense to send queries in batches whenever possible: when you upgrade Sphinx, new batch optimizations will kick in automatically. If no batch optimization applies, the queries are simply processed one by one, with no visible difference to the application.

    Why (or rather, when) should you NOT use multi-queries? All queries in a batch must be independent, but sometimes they are not, and query B may depend on the results of query A. For example, we might want to show results from an additional index only when nothing was found in the main index. Or simply choose a different offset into the 2nd result set depending on the number of matches in the 1st set. In such cases you will have to use separate queries (or separate batches).
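    For instance, a fallback to an additional index cannot be batched, because the second query only runs when the first one returns nothing. Here is a minimal sketch using the PHP sphinxapi client (the index names "main" and "archive" are made up for illustration, and a running searchd is assumed):

```php
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);

// Query A: search the main index first.
$res = $cl->Query("ipod nano", "main");

// Query B depends on A's result, so it cannot go into the same batch:
// we only search the fallback index when the main one found nothing.
if ($res !== false && $res["total"] == 0) {
    $res = $cl->Query("ipod nano", "archive");
}
```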

    There are two important batch optimizations to be aware of: common query optimization (available since version 0.9.8) and common subtree optimization (available since version 0.9.10, currently in development).

    Common query optimization works like this: searchd picks out of the batch all queries that differ only in their sorting and grouping settings, while the full-text part, filters, etc. are identical, and performs the search just once. For example, if a batch contains 3 queries whose text part is "ipod nano" in each case, but the 1st query selects the 10 cheapest results, the 2nd groups results by store ID and sorts the stores by rating, and the 3rd simply selects the maximum price, the search for "ipod nano" will run only once, and 3 differently sorted and grouped result sets will be built from it.

    So-called faceted search is a special case to which this optimization applies. Indeed, it can be implemented by running several search queries with different settings: one for the main search results, and several more with the same search text but different grouping settings (top-3 authors, top-5 stores, etc.). As long as everything except the sorting and grouping is the same, the optimization kicks in and speed improves nicely (example below).
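    In terms of the PHP API, a faceted-search batch might be sketched like this (the "products" index and the attribute names author_id and store_id are hypothetical; note that grouping settings, like sorting ones, are captured per query by AddQuery()):

```php
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);

// Query 1: the main search results.
$cl->AddQuery("ipod nano", "products");

// Query 2: same text, grouped by author - the "top authors" facet.
$cl->SetGroupBy("author_id", SPH_GROUPBY_ATTR, "@count desc");
$cl->AddQuery("ipod nano", "products");

// Query 3: same text, grouped by store - the "top stores" facet.
$cl->SetGroupBy("store_id", SPH_GROUPBY_ATTR, "@count desc");
$cl->AddQuery("ipod nano", "products");

// The full-text part is identical everywhere, so searchd runs
// the actual search once and builds 3 result sets from it.
$results = $cl->RunQueries();
```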

    Common subtree optimization is even more interesting. It lets searchd exploit similarities between different queries in a batch. Internally, it identifies common parts across the separate (different!) full-text queries, and when there are any, intermediate computation results are cached and shared between the queries. For example, in this batch of 3 queries

    barack obama president
    barack obama john mccain
    barack obama speech
    


    there is a 2-word common part ("barack obama") that can be computed exactly once for all three queries and cached. That is precisely what common subtree optimization does. The maximum cache size per batch is strictly limited by the subtree_docs_cache and subtree_hits_cache directives, so if the common part "i am" occurs in a hundred million documents, the server will not suddenly run out of memory.
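    On the API side, triggering this optimization takes nothing special: just send the queries in one batch. A sketch for the batch above (the "news" index name is made up; subtree caching itself is controlled server-side by the subtree_docs_cache and subtree_hits_cache directives):

```php
<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);

// Three different queries sharing the "barack obama" subexpression;
// with subtree caching on, that part is computed once and reused.
$cl->AddQuery("barack obama president", "news");
$cl->AddQuery("barack obama john mccain", "news");
$cl->AddQuery("barack obama speech", "news");

$res = $cl->RunQueries();
```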

    Let's get back to common query optimization. Here is an example of code that runs the same query with three different sorting modes:

    require("sphinxapi.php");
    $cl = new SphinxClient();
    $cl->SetMatchMode(SPH_MATCH_EXTENDED2);
    $cl->SetSortMode(SPH_SORT_RELEVANCE);
    $cl->AddQuery("the", "lj");
    $cl->SetSortMode(SPH_SORT_EXTENDED, "published desc");
    $cl->AddQuery("the", "lj");
    $cl->SetSortMode(SPH_SORT_EXTENDED, "published asc");
    $cl->AddQuery("the", "lj");
    $res = $cl->RunQueries();
    


    How do you tell whether the optimization worked? If it did, the corresponding lines in the query log will contain a "multiplier" field showing how many queries were processed together:

    [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/rel 747541 (0,20)] [lj] the
    [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the
    [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the
    


    Note the "x3": that's it. It means the query was optimized and processed as part of a batch of 3 queries (including itself). For comparison, here is what the log looks like when the same queries are sent one at a time:

    [Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext2/0/rel 747541 (0,20)] [lj] the
    [Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext2/0/ext 747541 (0,20)] [lj] the
    [Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext2/0/ext 747541 (0,20)] [lj] the
    


    You can see that with the multi-query, per-query search time improved by a factor of 1.5 to 2.3, depending on the sorting mode. And this is not the limit: for both optimizations there are cases where speed improved 3x or more, and not in synthetic tests but in actual production. Common query optimization is a great fit for vertical product search and online stores; the common subtree cache suits data-mining queries well; but, of course, their applicability is not strictly limited to those areas. For example, you can run a query with no full-text part at all and build several different reports (with different sorting, grouping, etc.) over the same data in a single call.

    What other optimizations can you expect in the future? That depends on you. So far, the long-term plan includes an obvious optimization for identical queries with different sets of filters. Do you know another frequent pattern that could be cleverly optimized? Send it in!
