Conditional indexing. Optimizing the full-text search process



In this article I want to talk about integrating Apache Lucene with Hibernate Search. More precisely, about one Hibernate Search mechanism that can noticeably improve performance on a project with full-text search.

Anyone who has worked with these technologies knows that full-text search requires indexing. In other words, when records are added to or changed in the database, the corresponding index entries must be added or changed as well, because the full-text search runs against them. Apache Lucene is responsible for this process. Here is how we tell Lucene that an entity needs to be indexed:

@Entity
@Indexed
public class SomeEntity {
    @Id
    @GeneratedValue
    private Integer id;
    @Field
    private String indexedField;
    private String unindexedField;
    //getters and setters
}

In the class above, the @Indexed annotation indicates that the entity is indexed by Lucene. The @Field annotation marks the fields that will be indexed. Because @Field is placed only on indexedField, full-text search is possible only on that field.

Note. Lucene needs other settings besides these annotations to work properly. Since this article is about optimizing the indexing process rather than setting up Lucene as a whole, we omit those details.

Now let's look at an example of indexing an entity. Suppose we have a classified ads site. Here is our entity:

@Entity
public class Ad {
    @Id
    @GeneratedValue
    private Integer id;
    private String text;
    private AdStatus status;
    //getters and setters
}

We want to let our users run a full-text search over all ads on the site. To do this, we add the appropriate annotations:

@Entity
@Indexed
public class Ad {
    @Id
    @GeneratedValue
    private Integer id;
    @Field
    private String text;
    private AdStatus status;
    //getters and setters
}

Now is the time to mention that an ad can have one of the following statuses: DRAFT, ACTIVE, ARCHIVE. After some thought, we decide that search results should show users only ads in the ACTIVE status. Consider two ways to solve this problem.

The first is the brute-force one: add the @Field annotation to the status field and, on every search, add a predicate that specifies the required status. The drawbacks of this solution are a noticeable drop in performance when there are many ads in the ARCHIVE and DRAFT statuses, and redundant indexing of entities that will never be searched again.
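To make the cost of this first option concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene or Hibernate Search code: the class NaiveAdSearch, its methods, and the map standing in for the index are all made up for illustration. It only shows that every ad lands in the index and that every search has to carry the status condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a sorted map plays the role of the full-text index.
class NaiveAdSearch {
    enum AdStatus { DRAFT, ACTIVE, ARCHIVE }

    // Hypothetical index entry: the text is stored together with the status.
    static final class IndexedAd {
        final String text;
        final AdStatus status;

        IndexedAd(String text, AdStatus status) {
            this.text = text;
            this.status = status;
        }
    }

    private final Map<Integer, IndexedAd> index = new TreeMap<>();

    // Every ad is indexed, whatever its status.
    void index(int id, String text, AdStatus status) {
        index.put(id, new IndexedAd(text.toLowerCase(), status));
    }

    int indexSize() {
        return index.size();
    }

    // Each search repeats the status condition and still walks over
    // the DRAFT and ARCHIVE entries.
    List<Integer> searchActive(String term) {
        List<Integer> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map.Entry<Integer, IndexedAd> entry : index.entrySet()) {
            IndexedAd ad = entry.getValue();
            if (ad.status == AdStatus.ACTIVE && ad.text.contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

If three ads are indexed and only one of them is ACTIVE, indexSize() reports three entries while searchActive finds at most one; the other two entries only take up space and search time.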

Another solution comes to mind right away: do not index ads in any status other than ACTIVE, and delete the index entries that already exist for them. A mechanism called interceptors will help us here. First, let's state the problem: whenever an entity changes, indexing should depend on the new status of the ad. Now for the implementation. Create an AdIndexInterceptor class that implements the EntityIndexingInterceptor interface:

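import org.hibernate.search.indexes.interceptor.EntityIndexingInterceptor;
import org.hibernate.search.indexes.interceptor.IndexingOverride;

public class AdIndexInterceptor implements EntityIndexingInterceptor<Ad> {
    // New ad: index it only if it is already ACTIVE.
    @Override
    public IndexingOverride onAdd(Ad entity) {
        if (entity.getStatus() == AdStatus.ACTIVE) {
            return IndexingOverride.APPLY_DEFAULT;
        }
        return IndexingOverride.SKIP;
    }
    // Changed ad: keep it in the index while ACTIVE, otherwise remove it.
    @Override
    public IndexingOverride onUpdate(Ad entity) {
        if (entity.getStatus() == AdStatus.ACTIVE) {
            return IndexingOverride.UPDATE;
        }
        return IndexingOverride.REMOVE;
    }
    // Deleted ad: let the default removal happen.
    @Override
    public IndexingOverride onDelete(Ad entity) {
        return IndexingOverride.APPLY_DEFAULT;
    }
    // A collection of the ad changed: same rule as for a regular update.
    @Override
    public IndexingOverride onCollectionUpdate(Ad entity) {
        return onUpdate(entity);
    }
}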

As you can see, the class must implement four methods, which are called when an entity is added, updated, or deleted, and when a collection belonging to the entity is updated, respectively. Each method returns one of the values of the IndexingOverride enum. There are four such values; here is what happens when each of them is returned:

  • APPLY_DEFAULT - indexing proceeds exactly as it would without an interceptor.
  • SKIP - no indexing takes place.
  • UPDATE - the existing index entry is updated.
  • REMOVE - the existing index entry is deleted and no new one is created.
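To see how the four outcomes listed above play out over the life of one ad, here is a small plain-Java sketch. It is a toy model under stated assumptions: ToyConditionalIndex, Decision and Event are hypothetical names, Decision merely mirrors the IndexingOverride values, and a map stands in for the real index. UPDATE is modeled as "write the entry whether or not it already exists", which is an assumption about how an update is treated when nothing was indexed before.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these types come from Hibernate Search.
class ToyConditionalIndex {
    enum AdStatus { DRAFT, ACTIVE, ARCHIVE }
    enum Decision { APPLY_DEFAULT, SKIP, UPDATE, REMOVE } // stand-in for IndexingOverride
    enum Event { ADD, UPDATE, DELETE }

    private final Map<Integer, String> index = new TreeMap<>();

    // Same rules as the interceptor in the article: only ACTIVE ads are indexed.
    static Decision decide(Event event, AdStatus status) {
        switch (event) {
            case ADD:
                return status == AdStatus.ACTIVE ? Decision.APPLY_DEFAULT : Decision.SKIP;
            case UPDATE:
                return status == AdStatus.ACTIVE ? Decision.UPDATE : Decision.REMOVE;
            default:
                return Decision.APPLY_DEFAULT; // deletions are never overridden
        }
    }

    // Applies the decision to the toy index.
    void onEvent(Event event, int id, String text, AdStatus status) {
        switch (decide(event, status)) {
            case SKIP:
                break;                      // nothing is written
            case REMOVE:
                index.remove(id);           // drop the entry if there is one
                break;
            case UPDATE:
                index.put(id, text);        // assumption: update also creates a missing entry
                break;
            case APPLY_DEFAULT:
                if (event == Event.DELETE) {
                    index.remove(id);       // default behaviour for a delete
                } else {
                    index.put(id, text);    // default behaviour for add/update
                }
                break;
        }
    }

    boolean contains(int id) {
        return index.containsKey(id);
    }
}
```

In this model an ad created as DRAFT never reaches the index, appears there once it is updated to ACTIVE, and disappears again when it is moved to ARCHIVE.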

Now back to the entity class. For the interceptor methods to be called before indexing, we add the interceptor attribute to the @Indexed annotation on the entity:

@Entity
@Indexed(interceptor = AdIndexInterceptor.class)
public class Ad {
    @Id
    @GeneratedValue
    private Integer id;
    @Field
    private String text;
    private AdStatus status;
    //getters and setters
}

All that remains is to document the use of this interceptor properly, so that the indexing behavior does not surprise your teammates.

P.S. The official documentation notes that this feature is experimental and that its behavior may change based on user feedback.

Link to official documentation.
