InterSystems iKnow. Loading data from Vkontakte

This article continues the series of posts (one, two) about the main scenarios for using iKnow, the Natural Language Processing tool from the InterSystems technology stack.
Previous posts on this topic were mainly devoted to working with data after it had been placed in a domain (the place where all the text analysis happens). This article is about how to load information into iKnow correctly and conveniently. As an example, we will load information about Vkontakte users: their personal data, posts, and so on.
The article assumes some basic background in InterSystems technologies (in particular, Caché ObjectScript).

Long road to the domain



According to the official documentation, there are two scenarios for loading data into an existing domain:
  1. An instance of the %iKnow.Source.Loader class is created. It is bound to a specific domain (the one whose id was passed to the constructor). Then an instance of a class that implements the lister interface is created. This instance calls the AddListToBatch method with arguments that specify the information to load, which adds a new list of sources to the domain's current batch. This can be done several times. To load the current batch into the domain, call the loader's ProcessBatch method. This option is better for loading large volumes.
  2. An instance of a class that implements the lister interface is created, and its ProcessList method is called with arguments that specify the information to load. The data is loaded straight into the domain. This option is better for loading small volumes.
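
The two scenarios above can be sketched roughly as follows. This is only an illustration that uses the built-in file lister; the method names LoadBatch and LoadDirect, the directory argument and the "txt" extension filter are assumptions made for the example, and the domain is assumed to exist already.

```objectscript
/// Scenario 1 (sketch): queue one or more lists into a batch, then process the batch.
ClassMethod LoadBatch(domainId As %Integer, dir As %String) As %Status
{
    set loader = ##class(%iKnow.Source.Loader).%New(domainId)
    set lister = ##class(%iKnow.Source.File.Lister).%New(domainId)
    // Attach the lister to the loader before queuing lists
    set tSC = loader.SetLister(lister)
    quit:$$$ISERR(tSC) tSC
    // Arguments of the file lister: path, extensions, recursive flag, filter
    set tSC = lister.AddListToBatch(dir, $lb("txt"), 0, "")
    quit:$$$ISERR(tSC) tSC
    // AddListToBatch could be called again here before processing
    quit loader.ProcessBatch()
}

/// Scenario 2 (sketch): load a single list straight into the domain.
ClassMethod LoadDirect(domainId As %Integer, dir As %String) As %Status
{
    set lister = ##class(%iKnow.Source.File.Lister).%New(domainId)
    quit lister.ProcessList(dir, $lb("txt"), 0, "")
}
```

Both methods take the same lister arguments; the only difference is whether the list is queued for a later ProcessBatch call or processed immediately.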

List customization


The standard library offers many ready-made lister implementations (RSS lister, file lister, global lister). However, you can also write your own implementation to suit your needs.
Before writing a lister for Vkontakte posts, I wrote a wrapper in COS (Caché ObjectScript) for some Vkontakte API methods that operate on publicly accessible data. All the code is available on github in the VKReader package.
I decided it would be interesting if the lister could download the latest posts for a given keyword, plus a few other parameters. This turned out to be not at all difficult. The documentation chapter dedicated to customization says that to create your own lister you need to inherit from a system class and override several methods.
So, in the same package I created a class VKReader.Lister that inherits from %iKnow.Source.Lister. If you write your own lister, it must also inherit from this class.
Each lister should be assigned a unique short name (alias) by which the iKnow system methods refer to it. If this name is not specified, the full class name of the lister is used instead.
To specify the alias, simply override the GetAlias class method in your class. For our Vkontakte lister, I did it like this:

ClassMethod GetAlias() As %String
{
    Quit "VKAPI"
}

Every data source submitted for loading has an external id, which consists of the short name of the lister and a full reference. The full reference, in turn, consists of the name of the source group and a local reference.
For the lister to work, you need to override the class methods BuildFullRef and SplitFullRef, which respectively assemble the full reference from the group name and local reference and split it back into these two parts.
The external id in our case looks like this:
VKAPI:searchQuery:::vkPostId
Here VKAPI is the short name of our lister, the search query plays the role of the source group name, and the id of the Vkontakte post is the local reference.
The code of the BuildFullRef and SplitFullRef methods:

ClassMethod SplitFullRef(domainId As %Integer, fullRef As %String, Output groupName As %String, Output localRef As %String) As %Status [ Private ]
{
    set delim = ":::"
    // The local reference is the last delimited piece; everything before it is the group name
    set localRef = $piece(fullRef, delim, $l(fullRef, delim))
    set groupName = $e(fullRef, 1, *-$l(localRef)-$l(delim))
    Quit $$$OK
}

ClassMethod BuildFullRef(domainId As %Integer, groupName As %String, localRef As %String) As %String [ Private ]
{
    // Must use exactly the same delimiter as SplitFullRef
    quit groupName_":::"_localRef
}
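
To see why SplitFullRef takes the last delimited piece rather than the second one, here is a small standalone sketch of the same string logic. The method name DemoFullRef and the sample values are made up for illustration; the group name deliberately contains the delimiter itself.

```objectscript
/// Sketch: splitting a full reference whose group name contains ":::".
ClassMethod DemoFullRef()
{
    set delim = ":::"
    // Hypothetical search query "cats:::dogs" and local reference "1#2#3"
    set fullRef = "cats:::dogs"_delim_"1#2#3"
    // Three delimited pieces; the last one is the local reference
    set localRef = $piece(fullRef, delim, $length(fullRef, delim))
    // Cut off the local reference and one delimiter from the end
    set groupName = $extract(fullRef, 1, *-$length(localRef)-$length(delim))
    write "group: ", groupName, !   // group: cats:::dogs
    write "local: ", localRef, !    // local: 1#2#3
}
```

Taking the last piece keeps the round trip intact even when a search query happens to contain ":::" , as long as the local reference itself never does.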

You also need to specify which Processor will be the default for this lister. In iKnow, a Processor is the object that handles the loaded data itself. There are several types of Processors, but since in our case the data is only held in memory, I decided to use the processor for temporary storage. The processor is also specified through an override.

ClassMethod DefaultProcessor() As %String
{
    Quit "%iKnow.Source.Temp.Processor"
}

All the main loading work happens in another overridden method with the self-explanatory name ExpandList. This method expands the list of sources to be loaded into the domain. The arguments of the ProcessList and AddListToBatch methods are exactly the ones you define in ExpandList.
First, here is the complete code of the method for our case.
It takes the following arguments (in order): the query word used to search for posts; the number of posts; a flag indicating whether to check if a source with the same local reference already exists; and restrictions on the publication time of the posts.

Method ExpandList(listparams As %List) As %Status
{
    // Unpack the lister arguments; only query and count are mandatory
    set query = $li(listparams, 1)
    set count = $li(listparams, 2)
    set checkExists = +$lg(listparams, 3, 1)
    set startDate = $lg(listparams, 4)
    set startTime = $lg(listparams, 5)
    set endDate = $lg(listparams, 6)
    set endTime = $lg(listparams, 7)
    
    // Search for posts through the API wrapper
    #dim response As %ListOfObjects
    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).NewsfeedSearch(.response, query, count, , , startDate, startTime, endDate, endTime)
    quit:$$$ISERR(tSC) tSC
    
    do ..RegisterMetadataKeys($lb("PostDate", "PostTime", "AuthorID", "AuthorCity", "AuthorCountry", "AuthorDOB", "AuthorSex"))
    
    set userIds = "1"
    set groupIds = "1"
    
    // Collect author ids: negative ids are groups, positive ids are users
    for i=1:1:response.Count() {
        if (response.GetAt(i).FromID < 0) {
            set groupIds = groupIds_","_(-(response.GetAt(i).FromID))
        } else {
            set userIds = userIds_","_response.GetAt(i).FromID
        }
    }
    
    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).UsersGet(.responseUsers, userIds, "sex,city,bdate,country")
    quit:$$$ISERR(tSC) tSC
    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).GroupsGetById(.responseGroups, groupIds, "city,country")
    quit:$$$ISERR(tSC) tSC
    
    for i=1:1:response.Count() {
        set tPostDate = response.GetAt(i).Date
        set tPostTime = response.GetAt(i).Time
        set tOwnerID = response.GetAt(i).OwnerID
        set tFromID = response.GetAt(i).FromID
        set tID = response.GetAt(i).ID
        #dim tTextStream As %GlobalCharacterStream
        set tTextStream = response.GetAt(i).Text
        if (tFromID < 0) {
            set tAuthorCity = responseGroups.GetAt(-tFromID).City
            set tAuthorCountry = responseGroups.GetAt(-tFromID).Country
            set tAuthorDOB = ""
            set tAuthorSex = ""
        } else {
            set tAuthorCity = responseUsers.GetAt(tFromID).City
            set tAuthorCountry = responseUsers.GetAt(tFromID).Country
            set tAuthorDOB = responseUsers.GetAt(tFromID).DOB
            set tAuthorSex = responseUsers.GetAt(tFromID).Sex
        }
        
        set tLocalRef = tOwnerID_"#"_tFromID_"#"_tID
        
        if (checkExists) {
            continue:..RefExists(query, tLocalRef, checkExists - 1)
        }
        
        set tRef = $lb(i%ListerClassId, ..AddGroup(query), tLocalRef)
        do tTextStream.Rewind()
        if (tTextStream.Size = 0) {
            continue
        }
        
        // Store the post text in chunks of up to 32000 characters
        set len = 32000
        while (len = 32000) {
            do ..StoreTemp(tRef, tTextStream.Read(.len))
        }
        
        do ..SetMetadataValues(tRef, $lb(tPostDate, tPostTime, tFromID, tAuthorCity, tAuthorCountry, tAuthorDOB, tAuthorSex))
    }
    quit $$$OK
}
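
With ExpandList defined this way, a call to the lister could look roughly like the sketch below. The helper name LoadPosts is made up for the example, the domain is assumed to exist already, and the four optional date and time arguments are simply left out.

```objectscript
/// Hypothetical helper: loads up to `count` posts matching `query` into the domain.
ClassMethod LoadPosts(domainId As %Integer, query As %String, count As %Integer = 10) As %Status
{
    set lister = ##class(VKReader.Lister).%New(domainId)
    // The arguments arrive in ExpandList as listparams: query, count, checkExists
    quit lister.ProcessList(query, count, 1)
}
```

The same three arguments could be passed to AddListToBatch instead when several queries should be queued and loaded in one batch.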

Let's go through the code in more detail.
First, we extract the arguments.

    set query = $li(listparams, 1)
    set count = $li(listparams, 2)
    set checkExists = +$lg(listparams, 3, 1)
    set startDate = $lg(listparams, 4)
    set startTime = $lg(listparams, 5)
    set endDate = $lg(listparams, 6)
    set endTime = $lg(listparams, 7)

Next, we make a request to the Vkontakte API through our wrapper method. The result is a list of VKReader.Data.Post objects, a class that contains the fields specific to a Vkontakte post.

    #dim response As %ListOfObjects
    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).NewsfeedSearch(.response, query, count, , , startDate, startTime, endDate, endTime)
    quit:$$$ISERR(tSC) tSC

We register the metadata keys so that the meta-information can be saved easily later. In the metadata we want to store the date and time the post was published, as well as the id, city, country, date of birth and sex of the author.

    do ..RegisterMetadataKeys($lb("PostDate", "PostTime", "AuthorID", "AuthorCity", "AuthorCountry", "AuthorDOB", "AuthorSex"))

We collect the ids of the users and groups who authored the found posts into comma-separated lists. As in the Vkontakte API, group ids are negative integers and user ids are positive.

    set userIds = "1"
    set groupIds = "1"
    
    for i=1:1:response.Count() {
        if (response.GetAt(i).FromID < 0) {
            set groupIds = groupIds_","_(-(response.GetAt(i).FromID))
        } else {
            set userIds = userIds_","_response.GetAt(i).FromID
        }
    }

We get information about these users and groups using the wrapper methods. They return lists of objects of types VKReader.Data.User and VKReader.Data.Group, which contain fields typical of Vkontakte users and groups (city, country, and so on).

    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).UsersGet(.responseUsers, userIds, "sex,city,bdate,country")
    quit:$$$ISERR(tSC) tSC
    set tSC = ##class(VKReader.Requests.APIPublicMethodsCaller).GroupsGetById(.responseGroups, groupIds, "city,country")
    quit:$$$ISERR(tSC) tSC

In a loop, we process all the found posts. First, we copy all the received meta-information into local variables.

        set tPostDate = response.GetAt(i).Date
        set tPostTime = response.GetAt(i).Time
        set tOwnerID = response.GetAt(i).OwnerID
        set tFromID = response.GetAt(i).FromID
        set tID = response.GetAt(i).ID
        #dim tTextStream As %GlobalCharacterStream
        set tTextStream = response.GetAt(i).Text
        if (tFromID < 0) {
            set tAuthorCity = responseGroups.GetAt(-tFromID).City
            set tAuthorCountry = responseGroups.GetAt(-tFromID).Country
            set tAuthorDOB = ""
            set tAuthorSex = ""
        } else {
            set tAuthorCity = responseUsers.GetAt(tFromID).City
            set tAuthorCountry = responseUsers.GetAt(tFromID).Country
            set tAuthorDOB = responseUsers.GetAt(tFromID).DOB
            set tAuthorSex = responseUsers.GetAt(tFromID).Sex
        }

The local reference consists of the wall owner id, the sender id, and the post id, separated by hash signs (#).

        set tLocalRef = tOwnerID _ "#" _ tFromID _ "#" _ tID
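
Because the three ids are joined with a fixed separator, a local reference can later be taken apart again with $piece. A minimal sketch, with a made-up method name and sample ids:

```objectscript
/// Sketch: recovering the three ids from a local reference.
ClassMethod DemoLocalRef()
{
    // Hypothetical post 42 written by group 100 on its own wall
    set tLocalRef = "-100#-100#42"
    set ownerId = $piece(tLocalRef, "#", 1)
    set fromId = $piece(tLocalRef, "#", 2)
    set postId = $piece(tLocalRef, "#", 3)
    // A negative sender id means the author is a group
    write "owner=", ownerId, " from=", fromId, " post=", postId, !
}
```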

If necessary, check to see if there are sources with the same local reference.

        if (checkExists) {
            continue:..RefExists(query, tLocalRef, checkExists - 1)
        }

The following code would be different if another processor had been selected. I use the processor for temporary storage, so I need to expand the list using the StoreTemp method (for details on each processor, see its documentation page). I also need to set the collected values for the metadata fields.

        set tRef = $lb(i%ListerClassId, ..AddGroup(query), tLocalRef)
        do tTextStream.Rewind()
        if (tTextStream.Size = 0) {
            continue
        }
        
        set len = 32000
        while (len = 32000) {
            do ..StoreTemp(tRef, tTextStream.Read(.len))
        }
        
        do ..SetMetadataValues(tRef, $lb(tPostDate, tPostTime, tFromID, tAuthorCity, tAuthorCountry, tAuthorDOB, tAuthorSex))
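
The while loop relies on the fact that Read updates its by-reference argument with the number of characters actually read. The standalone sketch below shows the same pattern with a deliberately tiny chunk size so that the loop runs more than once; the method name DemoChunks, the sample text and the chunk size of 4 are assumptions for illustration only.

```objectscript
/// Sketch: reading a character stream in fixed-size chunks.
ClassMethod DemoChunks()
{
    set stream = ##class(%GlobalCharacterStream).%New()
    do stream.Write("abcdefghij")
    do stream.Rewind()
    set chunkSize = 4
    set len = chunkSize
    // Keep reading while the previous read filled a whole chunk
    while (len = chunkSize) {
        set len = chunkSize
        set chunk = stream.Read(.len)
        write "chunk: ", chunk, !
    }
}
```

The loop stops as soon as a read comes back shorter than the requested size, which is what happens on the last piece of the post text in ExpandList.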

That's it. The lister is written!
Let's test it.

Testing the Lister


I wrote a small web application that uses the lister we implemented and lets you browse entries in the domain, search for similar ones, add new ones on demand, and delete them. Here are some screenshots:

Initially an empty domain.

Click the plus sign to add new posts.
In the form that appears, fill in the fields and click the button to add entries.

After a short wait, the entries are added.

For users and groups that have made information about themselves public, our lister saves it in the metadata fields, and this small demo displays it in a not-too-elegant table.
Out of the box, iKnow can show similar entries: click the button with a target icon next to a post to see that it works.


Summary


In this article, we looked at how loading data into a domain works, discussed in detail how a typical lister is organized, and saw how to write a working lister of your own. We wrote a lister for Vkontakte data and made sure that it really works, with the caveat that the domain and configuration were created somewhere behind the scenes.
If you want to look behind those scenes, all the code presented, used or mentioned in the article can be found on the project page on github.
