Setting up document processing on FAST

    One of the tasks when integrating a third-party search engine into a system is configuring how source documents are processed (roughly speaking, indexed). How complex this is depends on the functional requirements of the search system and the capabilities of the engine. Configuration may take just a couple of clicks in the engine's admin panel, or it may require writing your own procedures, scripts, and so on. We are used to trusting the standard capabilities of a system (especially when its code cannot be modified), but for our own scripts we would like to have tests, and the engine does not always provide a way to write them.

    We needed to implement search on the MS FAST ESP 5.3 platform. This serious engine has impressive capabilities for customizing document processing, some of which we used in our project. Here we want to share our way of testing custom stages on this engine.

    The documentation describes the process of creating stages quite well. We will not retell it and will limit ourselves to the information needed to follow the rest of the article.
    In FAST ESP terminology, the entire sequence of actions performed when indexing a single document is called a pipeline, and each individual action is called a stage. A stage runs in a specific context that it can interact with, and the document is one element of that context. For example, a stage can read and write attributes of the document being processed. Schematically, the whole process of document processing looks like this:
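    In code terms, the same idea can be approximated as a list of callables applied to one document in order. The sketch below is a toy model, not the FAST ESP API: ToyDocument, run_pipeline and both stage functions are hypothetical names chosen for illustration.

```python
# Toy model of the pipeline/stage idea; this is NOT the FAST ESP API.
# ToyDocument, run_pipeline and the two stage functions are hypothetical.

class ToyDocument(object):
    """Holds a document's attributes as a plain dictionary."""
    def __init__(self, **attributes):
        self.attributes = dict(attributes)


def lowercase_title(document):
    """Stage 1: normalise the 'title' attribute."""
    document.attributes['title'] = document.attributes.get('title', '').lower()


def count_words(document):
    """Stage 2: derive a new attribute from an existing one."""
    document.attributes['words'] = len(document.attributes.get('title', '').split())


def run_pipeline(stages, document):
    """Apply every stage to the document, in the given order."""
    for stage in stages:
        stage(document)
    return document


doc = run_pipeline([lowercase_title, count_words],
                   ToyDocument(title='Hello FAST World'))
print(doc.attributes['title'])
print(doc.attributes['words'])
```

    Each stage sees the attributes left behind by the stages before it, which is why the order of stages in a pipeline matters.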



    A stage consists of two files: an XML specification and an implementation (in FAST ESP 5.3, the implementation is written in Python 2.3).

    Below is an example of a stage that writes 500 to the quality field if the document's hotornew attribute is true.
    Set high rank to documents, which have HotOrNew = yes
      

    (The specification is not directly involved in the tests; it is given for completeness.)

    from docproc import Processor, DocumentException, ProcessorStatus

    class SetOnEqual(Processor.Processor):
        def ConfigurationChanged(self, attributes):
            # Cache the stage parameters once, when the configuration is (re)loaded.
            self.input = self.GetParameter('Input')
            self.output = self.GetParameter('Output')
            self.inputfieldvalue = self.GetParameter('InputFieldValue')
            self.outputfieldvalue = self.GetParameter('OutputFieldValue')

        def Process(self, docid, document):
            # Compare as strings so the check does not depend on the attribute's type.
            testField = str(document.GetValue(self.input, None))
            if testField == str(self.inputfieldvalue):
                output = int(self.outputfieldvalue)
                document.Set(self.output, output)
            else:
                # No match (or the attribute is missing): reset the output field.
                document.Set(self.output, 0)
            return ProcessorStatus.OK
    

    To verify the stage in its native document-processing context, you have to perform the sequence of actions given in the documentation:
    1. Put the specification and implementation files in the designated directories;
    2. Restart the document processing service, the Document Processor (procserver), which compiles the stage code when it starts;
    3. Include the new stage in the pipeline;
    4. Index a test document;
    5. Look at the result of processing the document (you can write it to a log file, or you can “find” the new document through the standard frontend and see all of its attributes).

    If we did not get the expected result (for example, we set hotornew = true in the document, but the value of the Quality field did not change), we have to debug, which means:
    - Looking for the error in the procserver log;
    - Checking whether our stage did what was asked of it by placing Spy stages before and after the stage under test (Spy dumps the document together with its attributes to a file on disk);
    - Looking for the error in the stage code.
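    The before-and-after comparison that the two Spy stages provide can be imitated in plain Python. The sketch below is only an illustration under stated assumptions: Document is a tiny attribute-bag stand-in, snapshot and changed_attributes are hypothetical helpers, and the real Spy stage writes its dump to disk rather than returning a dictionary.

```python
# Hypothetical stand-ins; the real Spy stage is part of FAST ESP and
# dumps the document to a file instead of returning a dictionary.

class Document(object):
    """Attribute-bag stand-in for the processed document."""
    def GetValue(self, name, default):
        return getattr(self, name, default)

    def Set(self, field, value):
        setattr(self, field, str(value))


def snapshot(document):
    """Return a copy of the attributes currently set on the document."""
    return dict(vars(document))


def changed_attributes(before, after):
    """Return {name: (old, new)} for attributes that were added or modified."""
    changes = {}
    for name, new_value in after.items():
        old_value = before.get(name)
        if old_value != new_value:
            changes[name] = (old_value, new_value)
    return changes


doc = Document()
doc.Set('hotornew', 'true')
before = snapshot(doc)        # plays the role of the Spy stage placed before
doc.Set('quality', 500)       # stands in for running the stage under test
after = snapshot(doc)         # plays the role of the Spy stage placed after
diff = changed_attributes(before, after)
print(diff)
```

    Comparing the two snapshots shows exactly which attributes the stage touched, which is the question the two Spy dumps answer on the real engine.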

    After the error is fixed, you need to check again, i.e. perform steps 1, 2, 4 and 5 again.
    This is tedious. It is more convenient to debug the stage code “as is” with the usual methods, for example in a unit test:



    Therefore, to “recreate” the context, we made primitive mock classes for the objects that the stage works with:

    class Document(object):
        """
        Mock of the Document entity
        """
        def GetValue(self, name, default):
            return getattr(self, name, default)
        def Set(self, field, value):
            setattr(self, field, str(value))

    class Processor(object):
        """
        Mock of the Processor entity
        """
        def GetParameter(self, name):
            return getattr(self, name)
        def Set(self, field, value):
            setattr(self, field, str(value))
    

    In reality, the Document entity provides other methods as well; we limited ourselves to those we use.
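    The stage module also imports DocumentException and ProcessorStatus from docproc, so the mock package presumably needs stand-ins for those names as well. A minimal sketch follows, assuming they can live next to the classes above; the value of OK is an arbitrary sentinel chosen here, not the engine's real constant, and process_stub is a hypothetical helper used only to show the return convention.

```python
# Assumed additions to the mock docproc package. The names mirror the
# stage's import line; the values are guesses, not FAST ESP definitions.

class DocumentException(Exception):
    """Stand-in for the exception type that the stage module imports."""
    pass


class ProcessorStatus(object):
    """Stand-in exposing only the status constant that the stage returns."""
    OK = 'OK'  # arbitrary sentinel; the real value is not known here


def process_stub(docid, document):
    """Hypothetical helper with the same shape as a stage's Process method."""
    return ProcessorStatus.OK


status = process_stub('docid', None)
print(status == ProcessorStatus.OK)
```

    Because the tests only compare the returned status with ProcessorStatus.OK, any distinct constant is sufficient for the mock.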
    Now you can write, debug and test a stage without a search engine:

    import unittest
    import docproc.Processor as proclib  # the mock module, not the engine's
    from docproc import ProcessorStatus
    import SetOnEqual

    class testSetOnEqual(unittest.TestCase):
        def setUp(self):
            # Parameters shared by all tests are set through the mock's Set().
            self.stage = SetOnEqual.SetOnEqual()
            self.stage.Set('Input', 'hot')
            self.stage.Set('Output', 'quality')
            self.assertEqual(self.stage.GetParameter('Input'), 'hot')
            self.assertEqual(self.stage.GetParameter('Output'), 'quality')

        def test_true(self):
            # The attribute matches InputFieldValue: the output field gets OutputFieldValue.
            self.stage.Set('InputFieldValue', 'true')
            self.stage.Set('OutputFieldValue', '600')
            doc = proclib.Document()
            doc.Set('hot', 'true')
            self.stage.ConfigurationChanged('')
            status = self.stage.Process("docid", doc)
            self.assertEqual(status, ProcessorStatus.OK)
            self.assertEqual(doc.GetValue('quality', ""), '600')

        def test_false(self):
            # The attribute does not match: the output field is reset to 0.
            self.stage.Set('InputFieldValue', 'true')
            self.stage.Set('OutputFieldValue', '600')
            doc = proclib.Document()
            doc.Set('hot', 'no')
            self.stage.ConfigurationChanged('')
            status = self.stage.Process("docid", doc)
            self.assertEqual(status, ProcessorStatus.OK)
            self.assertEqual(doc.GetValue('quality', ""), '0')

    def suite():
        suite = unittest.TestSuite()
        suite.addTest(unittest.makeSuite(testSetOnEqual))
        return suite

    if __name__ == "__main__":
        unittest.main()
    

    The complete code is in the archive.
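    For experimenting without any particular file layout, the pieces can also be combined in a single file. The sketch below is one possible arrangement, not necessarily how the archive is organised: it registers an in-memory fake docproc package in sys.modules so that the stage's own import line resolves. The Mock* class names, the module wiring and the value of OK are assumptions made for this sketch, and the third test covers a case the tests above leave out, a document without the input attribute.

```python
import sys
import types
import unittest

# In-memory stand-in for the docproc package (assumed layout and values).
class MockDocument(object):
    def GetValue(self, name, default):
        return getattr(self, name, default)

    def Set(self, field, value):
        setattr(self, field, str(value))


class MockProcessor(object):
    def GetParameter(self, name):
        return getattr(self, name)

    def Set(self, field, value):
        setattr(self, field, str(value))


class MockProcessorStatus(object):
    OK = 'OK'  # arbitrary sentinel, not the engine's real constant


class MockDocumentException(Exception):
    pass


processor_module = types.ModuleType('docproc.Processor')
processor_module.Processor = MockProcessor
processor_module.Document = MockDocument

docproc_module = types.ModuleType('docproc')
docproc_module.Processor = processor_module
docproc_module.ProcessorStatus = MockProcessorStatus
docproc_module.DocumentException = MockDocumentException

# Register the fakes before the stage's import statement runs.
sys.modules['docproc'] = docproc_module
sys.modules['docproc.Processor'] = processor_module

# Same import line as in the stage implementation.
from docproc import Processor, DocumentException, ProcessorStatus


class SetOnEqual(Processor.Processor):
    def ConfigurationChanged(self, attributes):
        self.input = self.GetParameter('Input')
        self.output = self.GetParameter('Output')
        self.inputfieldvalue = self.GetParameter('InputFieldValue')
        self.outputfieldvalue = self.GetParameter('OutputFieldValue')

    def Process(self, docid, document):
        testField = str(document.GetValue(self.input, None))
        if testField == str(self.inputfieldvalue):
            document.Set(self.output, int(self.outputfieldvalue))
        else:
            document.Set(self.output, 0)
        return ProcessorStatus.OK


class SetOnEqualTest(unittest.TestCase):
    def run_stage(self, doc):
        """Configure a fresh stage and process one document with it."""
        stage = SetOnEqual()
        for name, value in [('Input', 'hot'), ('Output', 'quality'),
                            ('InputFieldValue', 'true'),
                            ('OutputFieldValue', '600')]:
            stage.Set(name, value)
        stage.ConfigurationChanged('')
        return stage.Process('docid', doc)

    def test_matching_value(self):
        doc = Processor.Document()
        doc.Set('hot', 'true')
        self.assertEqual(self.run_stage(doc), ProcessorStatus.OK)
        self.assertEqual(doc.GetValue('quality', ''), '600')

    def test_other_value(self):
        doc = Processor.Document()
        doc.Set('hot', 'no')
        self.assertEqual(self.run_stage(doc), ProcessorStatus.OK)
        self.assertEqual(doc.GetValue('quality', ''), '0')

    def test_missing_attribute(self):
        # GetValue returns None, str(None) is 'None', so the else branch runs.
        doc = Processor.Document()
        self.assertEqual(self.run_stage(doc), ProcessorStatus.OK)
        self.assertEqual(doc.GetValue('quality', ''), '0')


suite = unittest.TestLoader().loadTestsFromTestCase(SetOnEqualTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

    Registering the fake modules in sys.modules keeps the stage source untouched: the same file that is deployed to the engine can be imported by the tests.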

    What this approach gave us:
    - An easier life for ourselves (it saves time and lets us keep practising unit testing);
    - An easier life for the tester.

    Posted by
    Leah Shabakaeva
    Lead Developer, Softline Development Department
