Sphinx 2.0.2 is out; a user meetup and 2.0.3 are coming soon
We have tested, built binary packages for, and uploaded Sphinx 2.0.2-beta (Sphinx is an open-source search server used on a bunch of websites). We plan to release Sphinx 2.0.3-release in mid-December (a planned date: a revolutionary change!), and we are also busy preparing a (free) Sphinx user meetup on December 04 in St. Petersburg. Please register for the meetup via the link above and submit a cool talk through our contact form. Details about the ~30 new features, and about the plans, dates and cycle for the next releases, follow below.
About features
Clearly, most of those 30 features are small conveniences: a new flag that makes it possible to replay logs when the clock has jumped sharply for some reason, support for snippets spread across the whole cluster (rather than copied to each machine), support for 256 text fields (up from 32), and so on. But, as usual, there are a few that we consider relatively large and important.
We added support for MVA64 attributes. “Classic” MVAs are sets of 32-bit unsigned values; the new MVA64s are, accordingly, sets of 64-bit signed values. Two obvious uses come to mind offhand: a) greatly reducing the chance of collisions if you used to store CRC32 hashes of strings in an MVA, and b) storing any additional data there. I am sure you will find many less obvious and more interesting uses. MVA64 is supported in both disk and RT indexes.
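To illustrate use case (a), here is a minimal Python sketch of the value ranges involved. It is not Sphinx code: the helper names and the choice of BLAKE2b as the 64-bit hash are assumptions made purely for illustration.

```python
import hashlib
import zlib

UINT32_MAX = 2**32 - 1                     # largest value a classic MVA entry can hold
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1  # range of an MVA64 entry


def crc32_id(text: str) -> int:
    """32-bit unsigned checksum of a string: fits a classic MVA value."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def hash64_id(text: str) -> int:
    """64-bit signed hash of a string: fits an MVA64 value.

    Hypothetical choice of hash: the first 8 bytes of BLAKE2b.
    """
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


tags = ["sphinx", "search", "rt-index"]
mva32 = sorted(crc32_id(t) for t in tags)   # what you could store before
mva64 = sorted(hash64_id(t) for t in tags)  # what MVA64 makes room for
```

With 64 bits per value, two different strings are far less likely to hash to the same number than with 32 bits, although a collision is still theoretically possible.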
By the way, MVA attributes are now supported in RT indexes too, as is index_exact_words. In general, we are gradually adding to RT all the features it previously lacked.
We added support for dict = keywords in RT indexes. This means RT indexes can now search by keyword prefixes (word*). We deliberately chose not to support the min_prefix_len and min_infix_len directives from disk indexes, which index all possible substrings in advance: that approach slows down indexing everywhere, and on top of that it costs a (relatively large) amount of disk space for disk indexes and, for RT, precious memory, which is always in short supply. I could somehow live with inflating disk requirements several times over for the sake of substring search, but not memory requirements. With dict = keywords, substring search is possible and the memory stays intact.
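The trade-off described above can be sketched in a few lines of Python. This is a toy model, not Sphinx's actual dictionary implementation: the function names and the sample keywords are assumptions. It contrasts looking up a prefix in a sorted list of whole keywords with storing every prefix in advance.

```python
from bisect import bisect_left


def prefix_matches(sorted_keywords, prefix):
    """Find all keywords starting with `prefix` in a sorted keyword list."""
    start = bisect_left(sorted_keywords, prefix)  # first candidate position
    matches = []
    for word in sorted_keywords[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches


def all_prefixes(keywords, min_len):
    """Toy model of pre-indexing: every prefix of every keyword is stored."""
    return {w[:n] for w in keywords for n in range(min_len, len(w) + 1)}


keywords = sorted(["search", "searchd", "server", "snippet", "sphinx"])
found = prefix_matches(keywords, "sea")      # the two keywords starting with "sea"
stored = all_prefixes(keywords, min_len=1)   # many more entries than keywords
```

Keeping only whole keywords means the dictionary grows with the number of distinct words, while pre-indexed prefixes grow with the total length of those words.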
Another interesting new feature is ATTACH INDEX. It lets you take a disk index full of data, define a new empty RT index, and convert the disk index to RT. After that, the data disappears from the disk index and appears in the RT index, which you can then work with as usual. This is convenient for a fast initial import of a large amount of data, or for quick RT recovery if (knock on wood) something bad happens to it: clearly, rebuilding a disk index in one go is much faster than inserting records into RT one at a time, or even in small batches. Physically, the operation boils down to renaming files, so it is very fast. Strictly speaking, what is implemented right now (a one-time conversion) would be more accurately called CONVERT. But we plan to develop it further so that data can be imported in large chunks into a non-empty RT index in the same way. So we claimed the ATTACH keyword right away, for the future.
The UPDATE statement now supports more complete WHERE conditions. You can now run queries like UPDATE myindex SET deleted = 0 WHERE MATCH('test'), or ... WHERE vendor = 123. In other words, marking a thousand records by condition has just become a thousand times easier. Like the earlier update of attribute values by ID, the new UPDATE works in both RT and disk indexes.
And finally, the last “big” feature on the list is the ability to write your own relevance formulas and set them on the fly (the expression-based ranker). In previous versions, the options for computing the relevance returned by WEIGHT() essentially boiled down to choosing one of several built-in rankers (PROXIMITY_BM25, SPH04, etc.). You could then use WEIGHT() in expressions and mix in other document attributes, but it was impossible to influence how WEIGHT() itself was computed, or to combine in some other way the ranking factors that are calculated per field rather than per document. And there were not many factors to begin with.
Now you can. The ranking formula can be set per individual query if you like. The number of available ranking factors has also grown significantly. All the built-in rankers can be emulated with the new “scriptable” ranker. The documentation has examples; here is one:
$client->SetRankingMode(SPH_RANK_EXPR, "sum(lcs*user_weight)*1000+bm25");
Surprisingly, it works much faster than I expected. I expected a slowdown of several times; in practice, on a small test collection of 1,000,000 blog posts, I see a slowdown of 1.1x to 1.3x, and that is compared with compiled C++ code that also computes far fewer factors. I think that is pretty good.
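To make the example formula sum(lcs*user_weight)*1000+bm25 more concrete, here is a rough Python sketch of how such an expression might be evaluated for one document. It assumes lcs and user_weight are per-field values summed over the matched fields and bm25 is a single per-document value. The function name and the numbers are made up for illustration; this is not how Sphinx is implemented internally.

```python
def expr_rank(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25 for one document.

    fields: list of (lcs, user_weight) pairs, one pair per matched field.
    bm25:   document-level BM25 score.
    """
    return sum(lcs * user_weight for lcs, user_weight in fields) * 1000 + bm25


# Hypothetical document: the title matched with lcs=2 and weight 10,
# the body matched with lcs=1 and weight 1.
weight = expr_rank([(2, 10), (1, 1)], bm25=650)
```

In this toy case the per-field part contributes (2*10 + 1*1) * 1000 and bm25 acts as a tie-breaker on top, which is why changing the field weights reorders documents far more than a change in bm25 does.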
About development plans
The 2.0.x branch is now frozen: there will be no new features there, only bug fixes and regular releases containing those fixes. The next one is scheduled in 1 month; after that, releases will follow either on a schedule at 1-2 month intervals (if enough fixes accumulate) or as fixes accumulate.
All new features from now on go into trunk; the next version is 2.1.1. Its release date has not been planned yet. But a number of features are already under active development, so we can tease them now. We are already implementing substring search (*word*), not just prefix search (word*), using dict = keywords. We may also add wildcard support for the same case. We are working on an interesting improvement for clusters with many agents, so that requests are sent to them in parallel (currently they are still serialized). Plus, secret work is underway on integrating a well-known library and improving support for Russian morphology.
About releases
Features aside, we have once again shaken up our internal processes for testing, building and rolling out releases. It seems to have worked, so the next version, 2.0.3-release, will not ship as usual, “when it's ready”, but on schedule, 1 month from now, in mid-December 2011. If your boss flatly forbids installing versions without such a tag, one will be available in a month.
You can also tell him that the current tag is, in fact, not so much beta as rc. That is, there are no known major bugs in 2.0.2-beta at the time of release. For previously existing functionality, the number of tests has, as usual, only grown, so for “plain search” it should be more stable than before. So in principle it could be called a Release Candidate, but I decided not to complicate the set of tags.
We have again added new features, and our policy is that in this case the release tag is held back until the version has been tested not only internally but also by real people from the community. So take the new version, try it, and be sure to report bugs to us if you run into any. In the morning in the bug tracker, in the evening in the trunk!
About the conference
Soon you will be able not only to read about everything new, the correct use of the old, and, I hope, a bunch of other things in rare blog posts, but also to hear about it live at a user conference. We are organizing it for the second time, and it is still free (the first time taught me nothing!!!), but for a change it is not in Moscow but in St. Petersburg, on Sunday, December 04. Readers, please register as early as possible; writers, do not be shy and send us proposals for talks and/or lightning talks.
Cheers, everyone: until the new releases and, hopefully, meeting in person at the conference.