Site search is not only site search
Not every site needs a standalone search module. If a site has five pages, it needs no search at all. If the site is updated once a month, or every update shows up on the front page, an external site search via Google or Yandex will do. But some tasks cannot be solved by external search. This article is about the functions a built-in search module can perform, and some of those functions are not directly related to searching at all.
What Yandex cannot do
Large search engines give us the benefit of their many years of experience with relevance, search-spam detection, morphology and other aspects of ranking. But there are tasks an external search engine cannot cope with, precisely because of its "external" nature.
Instant reindexing
You add an article to the site, and it is immediately in the index and available for search. You delete an obscene comment, and no one will ever find it. If your site's search form queries a global search engine, you may have to wait weeks for reindexing. If the built-in search module can reindex a site fragment on an "add", "change" or "delete" event, it becomes a matter of minutes or even seconds.
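For illustration, here is a minimal sketch of such an event hook on top of Zend_Search_Lucene (the backend our module uses by default, as discussed at the end of the article). The hook signature and field layout are assumptions, not NetCat's actual API:

```php
<?php
// A minimal sketch of event-driven reindexing with Zend_Search_Lucene.
require_once 'Zend/Search/Lucene.php';

function onPageChanged($indexPath, $url, $title, $text, $deleted = false)
{
    $index = Zend_Search_Lucene::open($indexPath);

    // Drop any stale copy of this page from the index.
    $term = new Zend_Search_Lucene_Index_Term($url, 'url');
    foreach ($index->termDocs($term) as $docId) {
        $index->delete($docId);
    }

    // For a "delete" event we stop here; otherwise re-add the fresh version.
    if (!$deleted) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::keyword('url', $url));
        $doc->addField(Zend_Search_Lucene_Field::text('title', $title));
        $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $text));
        $index->addDocument($doc);
    }
    $index->commit();
}
```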

Synonyms
The name of our product is often typed in Russian in Cyrillic variants along the lines of «неткат», «нэткат» or «неткет». It would not occur to Yandex that such queries should return pages containing "NetCat" (with the exception of «неткат», which Yandex did recognize). And that is to say nothing of cases like "CMS" spelled «цмс», «цмска» or «сиэмэс». For the built-in search, we can define such synonyms explicitly.
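A toy sketch of what explicit synonym support can look like at query time; the dictionary entries below are illustrative examples, not our shipped configuration:

```php
<?php
// Query-time synonym expansion: every term is replaced by an OR-group
// of its known variants before the query reaches the index.
$synonyms = [
    'неткат' => ['netcat', 'неткат', 'нэткат', 'неткет'],
    'цмс'    => ['cms', 'цмс', 'цмска', 'сиэмэс'],
];

function expandQuery(array $synonyms, $query)
{
    $terms = preg_split('/\s+/u', mb_strtolower(trim($query)));
    $parts = [];
    foreach ($terms as $term) {
        $parts[] = isset($synonyms[$term])
            ? '(' . implode(' OR ', $synonyms[$term]) . ')'
            : $term;
    }
    return implode(' AND ', $parts);
}

echo expandQuery($synonyms, 'неткат документация');
// (netcat OR неткат OR нэткат OR неткет) AND документация
```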

Tag weight management for relevance calculation
Writing texts "for Yandex" is fundamentally different from writing texts for people. In the first case, we want Yandex to show our page above a million others "on the same topic". In the second, we want a person who has ALREADY come to the site to find the page he needs quickly and buy our product. So if a visitor searches for "pink elephant" on our site, we should show him not a long article with a perfectly tuned keyword density, but a page with a couple of photos and a Buy button. Given the ability to set weights for tags and individual page blocks (for example, by the class attribute) in the internal search engine, we can prepare the content so that the path from "typed a query" to "bought" takes the user a minimum of time.
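As a sketch, such weighting could be implemented by accumulating per-term scores from differently weighted XPath selectors; the selectors and weights below are illustrative assumptions:

```php
<?php
// Weighting page fragments for the internal index: text in "important"
// tags or blocks contributes more to a term's score.
$weights = [
    '//title'                         => 10.0,
    '//h1'                            => 5.0,
    '//*[contains(@class,"product")]' => 3.0,
    '//body'                          => 1.0,
];

function weightedTerms($html, array $weights)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // suppress warnings on messy real-world markup
    $xpath = new DOMXPath($dom);
    $score = [];
    foreach ($weights as $selector => $weight) {
        foreach ($xpath->query($selector) as $node) {
            $words = preg_split('/\W+/u', mb_strtolower($node->textContent),
                                -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $w) {
                // Nested matches accumulate: a word in <h1> also gets
                // the generic body weight.
                $score[$w] = ($score[$w] ?? 0) + $weight;
            }
        }
    }
    return $score;   // term => accumulated weight
}
```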

Flexible control over pages excluded from indexing
In robots.txt we can write Disallow directives that prevent external search engines from indexing certain parts of the site. As this summer's scandals with private information leaking into search engines have shown, this does not always help. Even setting that aside, the Disallow syntax is very primitive, and it would be much better to specify forbidden areas with a regular expression. For example, the page sloniki.html?action=add, meant for an administrator to add content to the corresponding page, may well end up in the index even though it contains nothing but the heading "Pink Elephants" and a login form. Why clutter the search results with it?
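A sketch of what regex-based exclusion rules might look like; the patterns are invented examples:

```php
<?php
// Regex-based indexing rules: unlike robots.txt Disallow prefixes,
// a pattern can match query strings, extensions, whole sections.
$forbidden = [
    '~\?action=(add|edit|delete)~',   // admin forms on any page
    '~^/private/~',                   // an entire section
    '~\.pdf$~i',                      // skip binary attachments
];

function shouldIndex($url, array $forbidden)
{
    foreach ($forbidden as $pattern) {
        if (preg_match($pattern, $url)) {
            return false;
        }
    }
    return true;
}

var_dump(shouldIndex('/sloniki.html?action=add', $forbidden)); // bool(false)
```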
Query suggestions as you type
Everyone knows the drop-down list that Yandex or Google shows as you type a query. But that hint is merely a list of the most popular queries. An internal search can suggest not only popular queries but also the titles of pages (that is, document names) matching the typed fragment. Start typing "pink" and you will see a list of title tags containing that fragment; click the "Pink Elephants" entry you need and you land on the page itself, bypassing the results list.
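A minimal sketch of such a suggest endpoint, assuming a hypothetical pages table that the indexer keeps filled with titles and URLs:

```php
<?php
// Title-based suggest endpoint: returns up to ten matching pages
// as JSON for the autocomplete widget.
$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');

$fragment = isset($_GET['q']) ? trim($_GET['q']) : '';
$stmt = $pdo->prepare(
    'SELECT title, url FROM pages WHERE title LIKE ? ORDER BY title LIMIT 10'
);
$stmt->execute(['%' . $fragment . '%']);

header('Content-Type: application/json; charset=utf-8');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC), JSON_UNESCAPED_UNICODE);
```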

Flexible reindexing schedule
If a site is large, its full reindexing can take a lot of time and resources. But if the "Forum" section needs reindexing every hour, "News" every day, and "About the Company" never, it would be great to do exactly that. An internal search can easily afford different schedules for different sections. Of course, a sitemap's changefreq attribute lets us suggest a recrawl frequency, but Yandex and Google are unlikely to treat our wishes as instructions.
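A sketch of per-section scheduling driven by a cron job; the section names, intervals and the reindexSection() call are assumptions:

```php
<?php
// Per-section reindexing schedules, meant to run from cron every
// few minutes; each section keeps a timestamp of its last pass.
$schedule = [
    'forum'   => 3600,          // every hour
    'news'    => 86400,         // every day
    'company' => PHP_INT_MAX,   // effectively never
];

function reindexSection($section)
{
    // Stand-in for the real call into the site's indexer.
    echo "reindexing {$section}\n";
}

foreach ($schedule as $section => $interval) {
    $stampFile = "/var/lib/search/{$section}.last";
    $last = file_exists($stampFile) ? (int)file_get_contents($stampFile) : 0;
    if (time() - $last >= $interval) {
        reindexSection($section);
        file_put_contents($stampFile, time());
    }
}
```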

And also:
- restricting the search scope (for example, searching everywhere or only on the forum)
- extracting additional attributes from objects and searching by them (all articles by a given author, all products in a given price range)
- sorting of search results not only by relevance but also by date, as on Habr (a sketch combining all three follows)
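Here is what such a query might look like with Zend_Search_Lucene, assuming the indexer stored section, author and date fields alongside the page text (the field names are assumptions):

```php
<?php
// Scoped, attribute-filtered, date-sorted search in one call.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/var/lib/search/index');

// All forum materials about elephants by one author, newest first.
$hits = $index->find(
    'contents:elephants AND section:forum AND author:ivanov',
    'date', SORT_REGULAR, SORT_DESC
);
foreach ($hits as $hit) {
    echo $hit->url, "\n";   // 'url' is a stored field, as in the sketch above
}
```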
... and not only search
An industrious search module can perform not only its direct duties but also other socially useful tasks. Here are some examples.
Automatic sitemap.xml generation
Who better to compile a complete list of a site's pages for external search engines than the local search robot? And at the site-structure level we can set changefreq and priority values that differ from section to section.
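A sketch of generating sitemap.xml from the robot's page list; getIndexedPages() is a hypothetical accessor standing in for the real crawler data:

```php
<?php
// Build sitemap.xml with per-section changefreq and priority values.
function getIndexedPages()
{
    // Hypothetical: in reality this would come from the search index
    // and the per-section settings.
    return [
        ['url' => 'https://example.com/',
         'changefreq' => 'daily',  'priority' => '1.0'],
        ['url' => 'https://example.com/sloniki.html',
         'changefreq' => 'weekly', 'priority' => '0.5'],
    ];
}

$xml = new XMLWriter();
$xml->openURI('sitemap.xml');
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
foreach (getIndexedPages() as $page) {
    $xml->startElement('url');
    $xml->writeElement('loc', $page['url']);
    $xml->writeElement('changefreq', $page['changefreq']);
    $xml->writeElement('priority', $page['priority']);
    $xml->endElement();
}
$xml->endElement();
$xml->endDocument();
$xml->flush();
```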

Search for broken links
There are at least two ways to find internal links to non-existent pages. The first is to write a 404 handler that sends an email with the page address and referrer (or logs a record in the site database) every time someone hits such a page. The second is to entrust this to the search robot, which is clearly the cleaner way.
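For completeness, a sketch of the first approach: a 404 handler that logs the dead link and its referrer (here into a SQLite table; sending an email via mail() would work the same way):

```php
<?php
// 404 handler that records every broken-link hit with its referrer.
http_response_code(404);

$pdo = new PDO('sqlite:/var/lib/site/broken_links.sqlite');
$pdo->exec('CREATE TABLE IF NOT EXISTS broken_links
            (url TEXT, referrer TEXT, hit_at TEXT)');
$stmt = $pdo->prepare('INSERT INTO broken_links VALUES (?, ?, ?)');
$stmt->execute([
    isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '',
    isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '(direct)',
    date('c'),
]);

echo 'Page not found';
```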
Query statistics collection
If the search engine collects statistics on queries and their results, this data can help us a great deal. First, we can see the queries for which users find nothing, and add the corresponding pages. Second, having spotted frequent typos, we can add them to the synonym dictionary. Third, if a page is searched for too often, it is evidently hard to reach without search, so perhaps it belongs in the menu. And so on.
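A sketch of mining such a log for empty queries, assuming a hypothetical search_log table where the module records each query with its result count:

```php
<?php
// Queries that found nothing: candidates for new pages or synonyms.
$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');

$empty = $pdo->query(
    'SELECT query, COUNT(*) AS asked
       FROM search_log
      WHERE results = 0
      GROUP BY query
      ORDER BY asked DESC
      LIMIT 20'
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($empty as $row) {
    printf("%s (asked %d times)\n", $row['query'], $row['asked']);
}
```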

Statistics on the queries of specific registered users deserve a separate mention, by the way. Just don't think I am urging you to spy on them :)
Keeping up with the big players
All these bells and whistles are worthwhile only if the search itself is genuinely convenient: it finds what is needed and is simple to use. So our search module must be able to do what the big search engines do. As well as they do it. Or almost as well.
Full morphology
Many local search engines make do with stemming to handle word forms. Stemming means cutting off the end of a word in an attempt to find its root and, through it, the word's other forms. Take the Russian word «розовый» (pink): stemming yields the stem «роз», and every word starting with this "root" is then treated as a word form. A query for «розовый» will thus also find «роза» (rose) and its kin. Stemming produces too many errors of this sort, and it fails entirely on irregular words whose forms share no common stem (like "go - went"). Morphological dictionaries give the most accurate word-form search, and for everyday or business vocabulary they are not that large: NetCat uses a free dictionary from aot.ru, and the Russian and English dictionaries together occupy only 15 megabytes, which is not much for modern hosting. Other dictionaries can be substituted, and dictionaries for additional languages can be added.

By the way, I want to take this opportunity to say many thanks to the author of the phpMorphy morphological analysis library, which has proved very useful for our tasks.
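A hedged sketch of dictionary-based word-form expansion with phpMorphy; the paths and dictionary setup are assumptions and vary by installation:

```php
<?php
// Dictionary-based morphology with phpMorphy (0.3.x-style API).
require_once 'phpmorphy/src/common.php';

$morphy = new phpMorphy('phpmorphy/dicts', 'ru_RU',
                        array('storage' => PHPMORPHY_STORAGE_FILE));

// phpMorphy expects words in upper case, in the dictionary's encoding.
$forms = $morphy->getAllForms(mb_strtoupper('розовый', 'UTF-8'));

// With the aot.ru dictionary this yields РОЗОВЫЙ, РОЗОВОГО, РОЗОВОМУ...
// but never РОЗА: unlike a stemmer, the dictionary knows the lemma.
print_r($forms);
```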
Fighting typos
There are two ways to deal with typos. The first is to find the most similar word, as Yandex does; the second is to use fuzzy search.

Correcting a wrong keyboard layout, on the other hand, is much simpler.
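A sketch of such a correction: characters typed with the wrong (Latin) layout are mapped back to the Cyrillic letters sitting on the same physical keys:

```php
<?php
// Wrong-layout correction via the standard QWERTY <-> ЙЦУКЕН key map.
$latin    = ['q','w','e','r','t','y','u','i','o','p','[',']',
             'a','s','d','f','g','h','j','k','l',';','\'',
             'z','x','c','v','b','n','m',',','.'];
$cyrillic = ['й','ц','у','к','е','н','г','ш','щ','з','х','ъ',
             'ф','ы','в','а','п','р','о','л','д','ж','э',
             'я','ч','с','м','и','т','ь','б','ю'];

function fixLayout($query, array $latin, array $cyrillic)
{
    return str_replace($latin, $cyrillic, mb_strtolower($query));
}

echo fixLayout('ckjybr', $latin, $cyrillic); // "слоник"
```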
Exotic cases
We did not include the features below in our search module because they are exotic or too complex, although in some cases they may be useful.
RTL languages and hieroglyphs
European languages (including Russian) are written left to right (LTR). Arabic script runs in the opposite direction. If your project targets such a language audience, be prepared to write (or plug in a ready-made) stemmer for it. Hieroglyphic writing is a separate case altogether; a simple stemmer will not get you far there.
Search for restricted areas
Access-control schemes in web projects vary widely, up to the utterly paranoid (I want to write a separate article about that). A complex example: publishing systems. A journalist may have the right to add an article (only in certain sections!) and to edit it until an editor has revised it. The editor may view, edit and enable or disable all materials within his subject area. The editor-in-chief may not edit commissioned articles without approval from the commercial department. From the standpoint of such a paranoid access scheme, the ideal search module would index all content posted on the project, check the user's rights at query time and show only the materials available to that user. It would also allow filtering by document status.
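A sketch of the "index everything, filter at display time" approach; the userCanSee() rule below is a stand-in for a real ACL call:

```php
<?php
// Each search hit passes the project's access check before display.
function userCanSee(array $user, array $doc)
{
    // Illustrative rules: drafts are visible to editors only, and a
    // section may carry its own access list.
    if ($doc['status'] !== 'published' && $user['role'] !== 'editor') {
        return false;
    }
    return empty($doc['section_acl'])
        || in_array($user['group'], $doc['section_acl'], true);
}

function filterHits(array $user, array $hits)
{
    return array_values(array_filter(
        $hits,
        function (array $doc) use ($user) {
            return userCanSee($user, $doc);
        }
    ));
}
```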
Automatically detect page topics
By analyzing the text of an indexed page, we can determine its topic with some margin of error. Useful applications of such a feature: automatic cataloguing of materials and tag-cloud building, analysis of the interests of a community or its individual members (for UGC projects), and lists of related materials (you have surely seen "see also" blocks). Most often, though, such analysis is used to target contextual advertising.
Also in this category: caching of search results, image search, indexing of pages generated by form submissions or AJAX, and detection of duplicate pages.
Another interesting application of a search engine occurred to me while writing this article. It suits, for example, collective blogs and media. By analyzing the texts of different authors, we can rank them by various parameters. The first that comes to mind is vocabulary size. Beyond that: a rating of choleric authors (who use exclamation marks more often than others), of lovers of lengthy reasoning (who favor question marks), of filler-word abusers, and so on. Maybe the Habr administration should try something similar? :) Though I have yet to come up with a commercial justification for such a toy.
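Purely as a toy, a "choleric score" could be the share of exclamation marks among sentence-ending punctuation:

```php
<?php
// Rank authors by how often they end sentences with an exclamation mark.
function cholericScore($text)
{
    $bangs = preg_match_all('/!/u', $text);
    $total = preg_match_all('/[.!?]/u', $text);
    return $total ? $bangs / $total : 0.0;
}

$byAuthor = [
    'ivanov' => 'Elephants! Pink ones! Buy now!',
    'petrov' => 'What is an elephant, really? One wonders.',
];
foreach ($byAuthor as $author => $text) {
    printf("%s: %.2f\n", $author, cholericScore($text));
}
```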
If you write your own search...
...you first need to answer the question of whether it is needed at all, or whether an existing solution such as Yandex.Server or Sphinx will do. The main advantage of your own search engine is the possibility of tight integration with the other CMS modules used on the site. This means not just embedding the management interface into the admin panel, but integrating with access control, structure management, user management and so on (I have already written about this).
As for the technology, there are plenty of flexible and powerful platforms. By default, the NetCat search engine uses Zend_Search_Lucene. The solution has drawbacks, such as relatively low speed. In our case this is justified: a NetCat site must run on any standard UNIX hosting without additional components, and Zend_Search_Lucene requires nothing but PHP. In fairness, we made the module extensible: you can replace not only the dictionaries but also the software components. If the project is large enough and the server is dedicated, the components responsible for storage and retrieval, indexing, base-form conversion and so on can be swapped out, for example, for Sphinx or Solr (or, if need be, Yandex.XML).
If you are developing not a universal CMS but a specific large project, choosing and configuring the optimal platform is not a problem. It is far more important to understand how to use its capabilities as effectively as possible.
All screenshots in this article were taken on our own website and in its administration system.