How to get Django and Sphinx to work together?

Background


I needed to add search to a site. My first thought was to use the SQL server's own capabilities, but the search had to cover several tables, handle both single words and phrases, and support stemming as well. I realized that reinventing the wheel would not pay off.

So I went looking for ready-made solutions. Frankly, there were not many: django-haystack and django-sphinx. The advantages and disadvantages of both have already been listed, so I will not repeat them.

After spending some time reading blogs and forums, I decided to try django-sphinx because, as far as I know, Sphinx support in django-haystack is still not very good.

The author of django-sphinx abandoned the project long ago, but there are many forks, and people say it is quite usable. I picked the most recently updated fork and tried to plug it into my project.

Story


It turned out that things were in bad shape: lots of bugs, missing pieces, and problems with the Sphinx Python API.
At first I simply tried to fix the errors in the code and make it work. I even succeeded: I could search by a single word. (Experts will rightly point out that SPH_MATCH_ANY would have solved this problem as well, but I only found out about this flag a bit later, along with a lot of other things.)

In the comments to the post I referred to earlier, people criticized django-sphinx for the things it cannot do and does not support. I decided to add the missing features, and as a result yet another fork was born. After a while it could index MVA attributes and fields from related models (the Sphinx documentation seemed confusing in places, so it took me a long time to figure out what was going on). Many bugs were fixed and, as usually happens, no fewer were added.

And then I decided to read the documentation section on SphinxQL, and rewrote django-sphinx almost completely.

At the moment my fork talks to Sphinx through its SphinxQL dialect and offers:

  • support for Sphinx 2.0.1-beta and higher
  • quite a lot of flexibility in customization
  • automatic generation of the Sphinx configuration
  • searching one index or several at once
  • indexing MVA attributes and fields from one-to-one related models in a single index
  • support for creating snippets
  • binding documents from the index to objects of the corresponding models
  • search filtering methods similar to those of the Django ORM (including method chaining)


RealTime indexes are not supported yet, so there are no functions for working with them (INSERT, UPDATE, DELETE).
Search across related models is not supported either, and I am not sure it is needed at all. If anyone in the comments knows, please give examples of where and how it could be used.

Part of the code is already covered by tests (yes, I am learning to write unit tests along the way; I tried to start several times before but could not figure out how to approach it).

I have also started writing documentation. For now it is only a draft, but I hope it is understandable on the whole.

Here are some examples that I think may be interesting.

As a basis, I will take these models:

from django.db import models
# Module path as in the original django-sphinx; the fork may differ.
from djangosphinx.models import SphinxSearch


class Related(models.Model):
    name = models.CharField(max_length=10)

    def __unicode__(self):
        return self.name


class M2M(models.Model):
    name = models.CharField(max_length=10)

    def __unicode__(self):
        return self.name


class Search(models.Model):
    name = models.CharField(max_length=10)
    text = models.TextField()
    stored_string = models.CharField(max_length=100)
    datetime = models.DateTimeField()
    date = models.DateField()
    bool = models.BooleanField()
    uint = models.IntegerField()
    float = models.FloatField(default=1.0)
    related = models.ForeignKey(Related)
    m2m = models.ManyToManyField(M2M)

    # Search manager: index name plus the field lists used to build the config
    search = SphinxSearch(
        index='test_index',
        options={
            'included_fields': [
                'text',
                'datetime',
                'bool',
                'uint',
            ],
            'stored_attributes': [
                'stored_string',
            ],
            'stored_fields': [
                'name',
            ],
            'related_fields': [
                'related',
            ],
            'mva_fields': [
                'm2m',
            ]
        },
    )


First of all, a config is generated from the options dictionary passed to SphinxSearch. In this config:

  • all fields from included_fields are put into the index, and non-string fields become stored attributes
  • all fields from stored_attributes, as you might guess, also become stored attributes. This list is useful if you need a stored text field.
  • fields from stored_fields are stored but also remain available for full-text search
  • fields from related_fields are likewise declared as stored attributes. They hold the keys of the related models (I will explain why a little later)
  • finally, the purpose of mva_fields should be clear by now. Only the names of ManyToMany fields may be placed in this list.
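
To make the mapping above more concrete, here is a rough sketch of how such option lists could be turned into Sphinx source directives. It is not the fork's real generator: the function build_source_lines, the simplified type names, the related_id column name and the many-to-many table name are all assumptions made for illustration. Only the directive names (sql_attr_uint, sql_attr_bool, sql_attr_timestamp, sql_attr_float, sql_attr_string, sql_field_string, sql_attr_multi) come from the Sphinx configuration reference.

```python
# Hypothetical sketch: NOT the fork's real generator, only an illustration
# of how the option lists could map to Sphinx source directives.

# Maps a simplified field type name to a Sphinx attribute directive.
ATTR_DIRECTIVES = {
    "datetime": "sql_attr_timestamp",
    "bool": "sql_attr_bool",
    "uint": "sql_attr_uint",
    "float": "sql_attr_float",
}


def build_source_lines(field_types, options, mva_queries):
    """Return Sphinx source directives for the given option lists.

    field_types maps a field name to a simplified type name.
    mva_queries maps an MVA field name to a query returning (doc id, value).
    """
    lines = []
    for name in options.get("included_fields", []):
        directive = ATTR_DIRECTIVES.get(field_types[name])
        # String-like fields stay plain full-text fields: no directive needed.
        if directive:
            lines.append("%s = %s" % (directive, name))
    for name in options.get("stored_attributes", []):
        lines.append("sql_attr_string = %s" % name)
    for name in options.get("stored_fields", []):
        lines.append("sql_field_string = %s" % name)
    for name in options.get("related_fields", []):
        # Assumed column naming: the foreign key lives in <name>_id.
        lines.append("sql_attr_uint = %s_id" % name)
    for name in options.get("mva_fields", []):
        lines.append(
            "sql_attr_multi = uint %s from query; %s" % (name, mva_queries[name])
        )
    return lines


options = {
    "included_fields": ["text", "datetime", "bool", "uint"],
    "stored_attributes": ["stored_string"],
    "stored_fields": ["name"],
    "related_fields": ["related"],
    "mva_fields": ["m2m"],
}
field_types = {"text": "text", "datetime": "datetime", "bool": "bool", "uint": "uint"}
# Hypothetical table name for the many-to-many relation.
mva_queries = {"m2m": "SELECT search_id, m2m_id FROM myapp_search_m2m"}

lines = build_source_lines(field_types, options, mva_queries)
for line in lines:
    print(line)
```

In this sketch the text field produces no directive at all, because in Sphinx any selected column that is not declared as an attribute is treated as a full-text field.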


What does all this give us? Quite a lot of search options.

First, get a QuerySet for our model. There are two ways to do this:

    qs = Search.search.query('query')


or:

    qs = SphinxQuerySet(model=Search).query('query')


Both give a similar result, but in the second case the parameters passed to SphinxSearch in the model definition (other than the field lists) are not taken into account.

Now we can search for something:

    qs1 = qs.filter(bool=True, uint__gt=100, float__range=(1.0, 15.4)).group_by('date').order_by('-pk').group_order_by('-datetime')


Let me explain what this query does:
  • looks up the word 'query' in the index of the Search model
  • keeps only the results in which the bool field is True, the uint field is greater than 100, and the float field is in the range from 1.0 to 15.4
  • groups all results by date
  • sorts them by document id in descending order ('pk' is mapped to 'id' automatically)
  • inside each group, sorts the results by the datetime field, also in descending order
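
Since the fork works through SphinxQL, it may help to see roughly what such a chain could look like as a query string. The toy translator below is only a sketch under assumptions: build_query, condition and order_clause are hypothetical names, the statement the fork actually generates may differ, and real code would have to escape the search text. The clause order (WHERE, GROUP BY, WITHIN GROUP ORDER BY, ORDER BY) follows the SphinxQL SELECT syntax.

```python
# Hypothetical sketch: a toy translator, not the fork's real query builder.

# Django-style lookup suffixes mapped to SphinxQL comparison operators.
OPERATORS = {"gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def order_clause(field):
    """Turn '-pk' into 'id DESC' and 'name' into 'name ASC'."""
    direction = "DESC" if field.startswith("-") else "ASC"
    name = field.lstrip("-")
    if name == "pk":
        name = "id"  # the article notes that 'pk' is mapped to 'id'
    return "%s %s" % (name, direction)


def condition(lookup, value):
    """Turn one filter keyword argument into a SphinxQL condition."""
    field, _, suffix = lookup.partition("__")
    if suffix == "range":
        return "%s BETWEEN %s AND %s" % (field, value[0], value[1])
    if isinstance(value, bool):
        value = int(value)  # booleans are compared as 0 or 1
    return "%s %s %s" % (field, OPERATORS.get(suffix, "="), value)


def build_query(index, text, filters, group_by, order_by, group_order_by):
    """Assemble a SELECT statement; the search text is NOT escaped here."""
    where = ["MATCH('%s')" % text]
    where += [condition(lookup, value) for lookup, value in filters]
    return (
        "SELECT * FROM %s WHERE %s GROUP BY %s "
        "WITHIN GROUP ORDER BY %s ORDER BY %s"
        % (index, " AND ".join(where), group_by,
           order_clause(group_order_by), order_clause(order_by))
    )


sql = build_query(
    "test_index", "query",
    [("bool", True), ("uint__gt", 100), ("float__range", (1.0, 15.4))],
    group_by="date", order_by="-pk", group_order_by="-datetime",
)
print(sql)
```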


What else can be done?

For example, suppose the variable r holds a QuerySet with several Related objects, and m holds one with M2M objects (see the models above). Then you can do something like this:

    qs2 = qs.filter(related__in=r, m2m__in=m)
    # or
    qs3 = qs.filter(related=r[0])


That is, you do not need to prepare identifier lists yourself - django-sphinx will do it for you!
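
The idea behind this convenience can be sketched in a few lines. This is an illustration only, not the fork's code: FakeModel stands in for a Django model instance, and to_ids and in_condition are hypothetical helpers.

```python
# Hypothetical sketch of the idea, not the fork's real code.

class FakeModel(object):
    """Stand-in for a Django model instance; only the pk attribute matters."""

    def __init__(self, pk):
        self.pk = pk


def to_ids(value):
    """Accept a single object or an iterable of objects, return a list of pks."""
    if hasattr(value, "pk"):
        return [value.pk]
    return [obj.pk for obj in value]


def in_condition(field, value):
    """Build a SphinxQL IN condition from model objects."""
    ids = ", ".join(str(pk) for pk in to_ids(value))
    return "%s IN (%s)" % (field, ids)


r = [FakeModel(3), FakeModel(7), FakeModel(12)]
print(in_condition("related", r))     # several objects
print(in_condition("related", r[0]))  # a single object
```

This is also why related_fields stores the keys of the related models: the filter only needs to compare identifiers.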

And finally, SphinxQuerySet behaves like a list: it supports indexing and slicing.

    # you can take any result by index
    doc = qs[5]
    # or a slice
    docs = qs[3:20]
    docs = qs[:50]
    docs = qs[100:]
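
One plausible way to implement such indexing is to translate the index or slice into the offset and count of a LIMIT clause. The sketch below assumes this approach; slice_to_limit is a hypothetical helper, negative indices and slice steps are not handled, and the cap of 1000 mirrors Sphinx's default max_matches setting.

```python
# Hypothetical sketch: how an index or slice could map to "LIMIT offset, count".

def slice_to_limit(key, max_matches=1000):
    """Translate an int or a slice into an (offset, count) pair."""
    if isinstance(key, int):
        return key, 1
    start = key.start or 0
    # An open-ended slice is capped, since Sphinx never returns unlimited rows.
    stop = max_matches if key.stop is None else min(key.stop, max_matches)
    return start, max(stop - start, 0)


for key in (5, slice(3, 20), slice(None, 50), slice(100, None)):
    print("LIMIT %d, %d" % slice_to_limit(key))
```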


To get the values of stored attributes (if you need them for some reason) or of computed expressions, use the sphinx attribute of an object obtained from the SphinxQuerySet.

A few words about expressions.
Sphinx can evaluate formulas on the fly for each document (ranking works on the same principle) and lets you define your own:

    qs4 = qs.fields(expr1='uint*(float+100)')


The result of the calculation is available in the sphinx attribute of the returned objects.
In addition, Sphinx can sort the results not only by a specific field but also by such expressions, so this code works too:

    qs4 = qs.fields(expr1='uint*(float+100)').order_by('expr1')
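
In SphinxQL terms, a computed expression is an extra column in the select list with an alias, and the alias can then be used in ORDER BY. The sketch below shows what such a statement could look like. select_with_expressions is a hypothetical helper, not part of the fork, and the sample values for uint and float are made up.

```python
# Hypothetical sketch: expressions as aliased columns in a SphinxQL SELECT.

def select_with_expressions(index, text, expressions, order_by=None):
    """Build a SELECT with computed columns; the search text is NOT escaped."""
    columns = ["*"]
    for alias, expr in sorted(expressions.items()):
        columns.append("%s AS %s" % (expr, alias))
    sql = "SELECT %s FROM %s WHERE MATCH('%s')" % (", ".join(columns), index, text)
    if order_by:
        sql += " ORDER BY %s ASC" % order_by
    return sql


sql = select_with_expressions(
    "test_index", "query", {"expr1": "uint*(float+100)"}, order_by="expr1"
)
print(sql)

# What expr1 would evaluate to for a made-up document with uint=150, float=2.5:
expr1 = 150 * (2.5 + 100)
print(expr1)
```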


So what am I getting at?

I hope Habr readers will give me useful advice (or throw rotten tomatoes, if I deserve it) and point out where django-sphinx should go next.

Thank you all for your attention! I meant to write a short article, but it turned out the way it turned out.
