My address is not a house or a street, my address is the Soviet Union?

    microBIGDATA, or FIAS in your pocket


    Pieter Brueghel the Younger, Payment of Tax, 1640

    Last time the conversation was about objects. We continue the reconnaissance in force. Today let's talk about something hard. Suppose the data is not quite BIG DATA, but it is already inconvenient to work with: the volumes are fairly large. Not all of it fits into RAM at once, and some of it will not even fit on disk (too little space, too much junk). Our patient is the FIAS DB, the database of the Federal Address Information System. The archive is 5.5 GB, and that is compressed XML. Unpacked, it takes a full 53 GB (set aside 110 GB for unpacking). And once you start parsing and converting it, 110 GB will not be enough either. The required amount of RAM is discussed further on.
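Before unpacking anything of this size, it may be worth checking that the target disk has room. Below is a minimal sketch in Python; the helper name `has_room` is hypothetical, and the 110 GB figure is simply the reserve suggested above.

```python
import shutil

GB = 10**9  # decimal gigabytes, the way archive sizes are usually quoted


def has_room(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


# 110 GB is the reserve the text suggests for unpacking the archive.
print("enough room for unpacking:", has_room(".", 110))
```

The check only looks at free space at the moment of the call, so it is a sanity test rather than a guarantee.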

    That would be all, but you can dig further. There is an international open project for collecting and systematizing address data, OpenAddresses. So there will be more databases. Its current coverage of the planet has many blank spots; Russia, for example, is almost absent. Archive size: 10 GB.


    Or take the database of the fairly well-known OpenStreetMap project. It is built by volunteers on the Wikipedia principle. It is quite detailed and multilingual. The full archive of compressed XML is currently 74 GB.
    Since we are talking about addresses: unexpected news arrived from DuckDuckGo, the best of the safe search engines today, about its switch to Apple maps, more precisely to Apple MapKit JS. The most interesting feature in our context is the “improved address search”. Is Apple painstakingly collecting and storing our data? We will have to keep an eye on that...
    So, the task: how to put all this wealth into a storage that is pleasant to use, leave room to dream of a relaxed API (in Python, of course), and not let our own hardware choke on an unbearable load. Let's call it MicroBigData, µBD for short :-)

    In the household of every second (and even every first) developer, an address directory, which is also a directory of place names, is a very necessary thing. And when it is also normative, prepared by the proper authority, cleaned and well documented, it is simply a fairy tale. To give credit where it is due, the Russian tax service handles its digital production well. As well as it can. There are probably some flaws inside, and data cleaning continues. How to resolve that is for the statesmen to think about; let them decide for themselves and for the benefit of us all. By the way, one typo from FIAS shows up in the example below. It does not affect the result, so I did not fix it. Can you find it?

    I don’t know how relevant address data is in your projects: all those regions, cities and streets. But it seems that no project made for people can do without them. It may be the address where a person can be found or where to send them a parcel. Or the details of a passport or another document that have to be stored. Or the address of an office, or of sights recommended for a visit. So what to do? Where to get the data?

    The simplest solution, ignoring errors and duplicates, is primitive objects containing simple string literals (also known as string constants, also known as strings). Let users type the next entries into them. And objects know how to save themselves; we have already covered that.

    Objects such as the one described in the class below, for example. It comes straight from a tutorial, an American one, but adjusted to our Russian reality: instead of their ZIP there is our postalCode. I would also make the postal code a number, but for uniformity I will leave it a string. Anyone who recognized the language right away (it is ObjectScript) has earned a like.

    Class Soviet.Address Extends %Persistent {
        Property streetName As %String;
        Property cityName As %String; 
        Property areaName As %String; 
        Property postalCode As %String;
    }

    Of course, many will be indignant that everything (the literals) is sticking out of the object's pockets. Since when does an object expose its fields in public?! Let us leave it as is for now; the example is just too eloquent and understandable to any schoolchild.

    In fact, this is all that is needed. Fill in the fields. Put the object in storage. Hand it over to other objects to work with. Let someone inherit from it. Everything works. And is stored!
    But a few words on why you should not do this are in order. What is our Address object? Why can't it simply be a group of text strings? The most obvious objections come from the context: who uses this Address, in what form and for what purpose? Try to set aside your programmer's thinking and imagine how a “foreign tourist”, a “historian”, a “tax inspector”, a “lawyer” and so on would think.

    I suppose a mass of additional clarifying questions arises immediately: which language to use, which encoding to store and deliver in, which era to count from, which documents put the name into effect, is the address legal or postal? Is a city a named settlement or something else? Even a street can be a boulevard, a lane, an avenue or something else. How do we deal with all these important implementation details?

    Take a real-life example. Google is now run by Sundar Pichai. He is from India, from the city of Chennai. Or from Madras? In 1996, the Indians decided that the city's name sounded too Portuguese and renamed the capital of the state of Tamil Nadu from Madras to Chennai. So what should Sundar and 72 million of his fellow countrymen write in their electronic documents?

    In general, an entire science deals with this: applied toponymy.
    And a few follow-up questions suggest themselves. How do you handle time and dates? Is money obvious? Are geographic coordinates simple? And how is this implemented in your code? Can you move it to your chosen database without lowering the level of abstraction? How do you avoid sliding into atomic machine data types and constantly thinking about reconstructing them? This is where to look for the source of a primitive or, on the contrary, a solid API. Think about it at your leisure.

    In short, context matters most. And the object model lets us use it directly by encapsulating “machine data” and implementing context-dependent “live” behavior. Quite unlike low-level tuples laid out in tables ;-)
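To make "context-dependent behavior" a little more concrete, here is a minimal Python sketch of a place whose name depends on the date you ask about. The class names `NamePeriod` and `Place` are hypothetical, and the exact cut-over day is an assumption: the text above only gives the year 1996.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class NamePeriod:
    name: str
    valid_from: Optional[date]  # None: no known start
    valid_to: Optional[date]    # None: still in force

    def covers(self, day: date) -> bool:
        after_start = self.valid_from is None or self.valid_from <= day
        before_end = self.valid_to is None or day < self.valid_to
        return after_start and before_end


@dataclass
class Place:
    periods: List[NamePeriod] = field(default_factory=list)

    def name_on(self, day: date) -> str:
        """Return the name that was in force on the given day."""
        for period in self.periods:
            if period.covers(day):
                return period.name
        raise LookupError(f"no name recorded for {day}")


# Assumed cut-over day, for illustration only.
RENAMED = date(1996, 7, 1)
city = Place([
    NamePeriod("Madras", None, RENAMED),
    NamePeriod("Chennai", RENAMED, None),
])

print(city.name_on(date(1980, 1, 1)))
print(city.name_on(date(2019, 1, 1)))
```

The caller never touches the raw strings; it asks a question ("what was this place called then?") and the object answers in context.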

    For now, let us return to the “primitive” implementation and complicate our life. To begin with, let us eliminate errors and duplicates. That is, we will look for a way to write addresses correctly right away. Along the way, we will help UI developers provide hints to users as they fill in input fields.
    When two things meet in one place, texts and the InterSystems IRIS data platform, the developer gets a real chance to go all out without leaving the keyboard, for example by using the built-in object components iKnow and iFind. These are components for working with unstructured data and for full-text search, respectively. Russian is supported out of the box.
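As an illustration of what "hints while typing" means, here is a minimal Python sketch of prefix lookup over a sorted list of names. It is not iFind and does not use any IRIS API; it is a plain binary search, and the function names and the sample list are made up for the example.

```python
from bisect import bisect_left
from typing import List


def build_index(names: List[str]) -> List[str]:
    """Sorted, case-folded list of names without duplicates."""
    return sorted({name.casefold() for name in names})


def hints(index: List[str], prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` names from the index that start with `prefix`."""
    key = prefix.casefold()
    i = bisect_left(index, key)  # first position where the prefix could occur
    found = []
    while i < len(index) and len(found) < limit and index[i].startswith(key):
        found.append(index[i])
        i += 1
    return found


# Illustrative sample, not taken from FIAS.
index = build_index(["Kazan", "Kaluga", "Kaliningrad", "Moskva", "Murmansk", "kazan"])
print(hints(index, "Ka"))
print(hints(index, "kal"))
```

Because the names are stored once in a controlled list, the user picks an existing entry instead of typing a new variant, which is exactly how duplicates and typos are kept out.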
    First of all, we teach the Address to read the necessary data from the original source. Fortunately, the Federal Tax Service data set comes with ready-made descriptions of the structure of its XML documents. According to the description attached to the data on the FIAS website, we need the ADDROBJ data set, which in my case corresponds to the file AS_ADDROBJ_2_250_01_04_01_01.xsd.

    To turn the schema into classes, we use the %XML.Utils.SchemaReader class. The percent sign at the beginning simply means that it is a class from the system library. Details of use are in the documentation. We will run the operations in the terminal.

    set xmlScheme = ##class(%XML.Utils.SchemaReader).%New()
    do xmlScheme.Process("http://localhost/AS_ADDROBJ_2_250_01_04_01_01.xsd")

    The same can be done in the Atelier IDE (Tools > Add-Ins > XML Schema Wizard) or by similar calls to objects directly from program code.



    Since we did not specify any parameters, namely the name of the package for the generated classes, they ended up in the Test package. As the second command shows, I served the schema file through my local Python web server:

    python3 -m http.server 80

    You can use any other HTTP server you like. Or upload the file to your IRIS server and specify the path to it.

    As a result, we have two classes that completely reflect the structure of our address XML:

    Test.AddressObjects
    /// Состав и структура файла с информацией классификатора адресообразующих элементов БД ФИАС
    Class Test.AddressObjects Extends (%Persistent, %XML.Adaptor) [ ProcedureBlock ] {
        Parameter XMLNAME = "AddressObjects";
        Parameter XMLSEQUENCE = 1;
        /// Классификатор адресообразующих элементов
        Relationship Object As Test.Object(XMLNAME = "Object", XMLPROJECTION = "ELEMENT") [ Cardinality = many, Inverse = AddressObjects ];
    }

    Test.Object
    /// Создано из: http://localhost:28869/AS_ADDROBJ_2_250_01_04_01_01.xsd
    Class Test.Object Extends (%Persistent, %XML.Adaptor) [ ProcedureBlock ] {
        Parameter XMLNAME = "Object";
        Parameter XMLSEQUENCE = 1;
        /// Глобальный уникальный идентификатор адресного объекта
        Property AOGUID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "AOGUID", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Формализованное наименование
        Property FORMALNAME As %String(MAXLEN = 120, MINLEN = 1, XMLNAME = "FORMALNAME", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код региона
        Property REGIONCODE As %String(MAXLEN = 2, MINLEN = 2, XMLNAME = "REGIONCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код автономии
        Property AUTOCODE As %String(MAXLEN = 1, MINLEN = 1, XMLNAME = "AUTOCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код района
        Property AREACODE As %String(MAXLEN = 3, MINLEN = 3, XMLNAME = "AREACODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код города
        Property CITYCODE As %String(MAXLEN = 3, MINLEN = 3, XMLNAME = "CITYCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код внутригородского района
        Property CTARCODE As %String(MAXLEN = 3, MINLEN = 3, XMLNAME = "CTARCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код населенного пункта
        Property PLACECODE As %String(MAXLEN = 3, MINLEN = 3, XMLNAME = "PLACECODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код элемента планировочной структуры
        Property PLANCODE As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "PLANCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код улицы
        Property STREETCODE As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "STREETCODE", XMLPROJECTION = "ATTRIBUTE");
        /// Код дополнительного адресообразующего элемента
        Property EXTRCODE As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "EXTRCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Код подчиненного дополнительного адресообразующего элемента
        Property SEXTCODE As %String(MAXLEN = 3, MINLEN = 3, XMLNAME = "SEXTCODE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Официальное наименование
        Property OFFNAME As %String(MAXLEN = 120, MINLEN = 1, XMLNAME = "OFFNAME", XMLPROJECTION = "ATTRIBUTE");
        /// Почтовый индекс
        Property POSTALCODE As %String(MAXLEN = 6, MINLEN = 6, XMLNAME = "POSTALCODE", XMLPROJECTION = "ATTRIBUTE");
        /// Код ИФНС ФЛ
        Property IFNSFL As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "IFNSFL", XMLPROJECTION = "ATTRIBUTE");
        /// Код территориального участка ИФНС ФЛ
        Property TERRIFNSFL As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "TERRIFNSFL", XMLPROJECTION = "ATTRIBUTE");
        /// Код ИФНС ЮЛ
        Property IFNSUL As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "IFNSUL", XMLPROJECTION = "ATTRIBUTE");
        /// Код территориального участка ИФНС ЮЛ
        Property TERRIFNSUL As %String(MAXLEN = 4, MINLEN = 4, XMLNAME = "TERRIFNSUL", XMLPROJECTION = "ATTRIBUTE");
        /// OKATO
        Property OKATO As %String(MAXLEN = 11, MINLEN = 11, XMLNAME = "OKATO", XMLPROJECTION = "ATTRIBUTE");
        /// OKTMO
        Property OKTMO As %String(MAXLEN = 11, MINLEN = 8, XMLNAME = "OKTMO", XMLPROJECTION = "ATTRIBUTE");
        /// Дата  внесения записи
        Property UPDATEDATE As %Date(XMLNAME = "UPDATEDATE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Краткое наименование типа объекта
        Property SHORTNAME As %String(MAXLEN = 10, MINLEN = 1, XMLNAME = "SHORTNAME", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Уровень адресного объекта
        Property AOLEVEL As %Integer(XMLNAME = "AOLEVEL", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Идентификатор объекта родительского объекта
        Property PARENTGUID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "PARENTGUID", XMLPROJECTION = "ATTRIBUTE");
        /// Уникальный идентификатор записи. Ключевое поле.
        Property AOID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "AOID", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Идентификатор записи связывания с предыдушей исторической записью
        Property PREVID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "PREVID", XMLPROJECTION = "ATTRIBUTE");
        /// Идентификатор записи  связывания с последующей исторической записью
        Property NEXTID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "NEXTID", XMLPROJECTION = "ATTRIBUTE");
        /// Код адресного объекта одной строкой с признаком актуальности из КЛАДР 4.0.
        Property CODE As %String(MAXLEN = 17, MINLEN = 0, XMLNAME = "CODE", XMLPROJECTION = "ATTRIBUTE");
        /// Код адресного объекта из КЛАДР 4.0 одной строкой без признака актуальности (последних двух цифр)
        Property PLAINCODE As %String(MAXLEN = 15, MINLEN = 0, XMLNAME = "PLAINCODE", XMLPROJECTION = "ATTRIBUTE");
        /// Статус актуальности адресного объекта ФИАС. Актуальный адрес на текущую дату. Обычно последняя запись об адресном объекте.
        /// 0 – Не актуальный
        /// 1 - Актуальный
        Property ACTSTATUS As %Integer(XMLNAME = "ACTSTATUS", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Статус центра
        Property CENTSTATUS As %Integer(XMLNAME = "CENTSTATUS", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Статус действия над записью – причина появления записи (см. описание таблицы OperationStatus):
        /// 01 – Инициация;
        /// 10 – Добавление;
        /// 20 – Изменение;
        /// 21 – Групповое изменение;
        /// 30 – Удаление;
        /// 31 - Удаление вследствие удаления вышестоящего объекта;
        /// 40 – Присоединение адресного объекта (слияние);
        /// 41 – Переподчинение вследствие слияния вышестоящего объекта;
        /// 42 - Прекращение существования вследствие присоединения к другому адресному объекту;
        /// 43 - Создание нового адресного объекта в результате слияния адресных объектов;
        /// 50 – Переподчинение;
        /// 51 – Переподчинение вследствие переподчинения вышестоящего объекта;
        /// 60 – Прекращение существования вследствие дробления;
        /// 61 – Создание нового адресного объекта в результате дробления
        Property OPERSTATUS As %Integer(XMLNAME = "OPERSTATUS", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Статус актуальности КЛАДР 4 (последние две цифры в коде)
        Property CURRSTATUS As %Integer(XMLNAME = "CURRSTATUS", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Начало действия записи
        Property STARTDATE As %Date(XMLNAME = "STARTDATE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Окончание действия записи
        Property ENDDATE As %Date(XMLNAME = "ENDDATE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Внешний ключ на нормативный документ
        Property NORMDOC As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "NORMDOC", XMLPROJECTION = "ATTRIBUTE");
        /// Признак действующего адресного объекта
        Property LIVESTATUS As %xsd.byte(VALUELIST = ",0,1", XMLNAME = "LIVESTATUS", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Тип адресации:
        /// 0 - не определено
        /// 1 - муниципальный;
        /// 2 - административно-территориальный
        Property DIVTYPE As %xsd.int(VALUELIST = ",0,1,2", XMLNAME = "DIVTYPE", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        Relationship AddressObjects As Test.AddressObjects(XMLPROJECTION = "NONE") [ Cardinality = one, Inverse = Object ];
    }


    Of the entire list of XML files in FIAS, we will use only the file with the names of regions, cities and streets. At the time of publication, mine was this:
    AS_ADDROBJ_20190106_90809714-fe22-45b2-929c-52bd950963e0.XML

    The file is no more and no less than 3 GB. Ordinary text editors cannot open it; they choke on that size.
    By the way, the maximum length of a string (type String) in InterSystems IRIS is 3,641,144 characters, so loading the contents of the file or URL directly into a string will not work. Other limits are listed in the documentation. To work with large volumes of data, you can use streams, which have no such length limit.
    Let's see what we get.

    We are cooking FIAS-stuffed peppers. This is only a base for a great future. First we take the minimal starter set. We need only these ingredients:

    Class FIAS.AddressObject Extends (%Persistent, %XML.Adaptor) [ ProcedureBlock ] {
        Parameter XMLNAME = "Object";
        Parameter XMLSEQUENCE = 1;
        /// Globally unique identifier of the address object
        Property AOGUID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "AOGUID", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Official name
        Property OFFNAME As %String(MAXLEN = 120, MINLEN = 1, XMLNAME = "OFFNAME", XMLPROJECTION = "ATTRIBUTE");
        /// Postal code
        Property POSTALCODE As %String(MAXLEN = 6, MINLEN = 6, XMLNAME = "POSTALCODE", XMLPROJECTION = "ATTRIBUTE");
        /// Short name of the object type
        Property SHORTNAME As %String(MAXLEN = 10, MINLEN = 1, XMLNAME = "SHORTNAME", XMLPROJECTION = "ATTRIBUTE") [ Required ];
        /// Level of the address object
        Property AOLEVEL As %Integer(XMLNAME = "AOLEVEL", XMLPROJECTION = "ATTRIBUTE", XMLTotalDigits = 10) [ Required ];
        /// Identifier of the parent object
        Property PARENTGUID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "PARENTGUID", XMLPROJECTION = "ATTRIBUTE");
        /// Unique identifier of the record. Key field.
        Property AOID As %String(MAXLEN = 36, MINLEN = 36, XMLNAME = "AOID", XMLPROJECTION = "ATTRIBUTE") [ Required ];
    }
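Why these particular fields? One plausible reason is that AOGUID and PARENTGUID are enough to walk from a street up to its region and assemble a readable address. Below is a rough Python sketch of that walk. It assumes one current record per AOGUID; the short identifiers and names in the sample are invented for illustration (real FIAS identifiers are 36-character GUIDs).

```python
from typing import Dict, NamedTuple, Optional


class AddressObject(NamedTuple):
    aoguid: str
    parentguid: Optional[str]
    shortname: str
    offname: str


def full_address(index: Dict[str, AddressObject], aoguid: str) -> str:
    """Walk the PARENTGUID chain upward and join the names from the top down."""
    parts = []
    seen = set()  # guards against accidental cycles in the data
    current: Optional[str] = aoguid
    while current is not None and current not in seen:
        seen.add(current)
        obj = index.get(current)
        if obj is None:  # parent missing from the index: stop here
            break
        parts.append(f"{obj.shortname} {obj.offname}")
        current = obj.parentguid
    return ", ".join(reversed(parts))


# Hypothetical rows, not real FIAS data.
rows = [
    AddressObject("a", None, "obl", "Sample"),
    AddressObject("b", "a", "g", "Testgrad"),
    AddressObject("c", "b", "ul", "Primernaya"),
]
index = {row.aoguid: row for row in rows}
print(full_address(index, "c"))
```

In the real data several historical records can share one AOGUID, so a production version would first have to pick the current record for each object.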

    Next comes the loading. Create an object that understands XML natively, using the %XML.Reader class from the system library:

    set reader = ##class(%XML.Reader).%New()

    And we tell it what to take, ignoring the rest. We will take one portion at a time:

    do reader.Correlate("Object","FIAS.AddressObject")

    Then there are variations on how to obtain the original µBD file. If convenient, you can put it next to the storage, locally in the file system of the IRIS server. Or, as in my example, have it served over HTTP. There is an even more universal option, about which a few words below.

    set url="http://localhost/AS_ADDROBJ_20190106_90809714-fe22-45b2-929c-52bd950963e0.XML"
    write reader.OpenURL(url)

    Important! At this point, most of those who try this example themselves will be in for a nasty surprise. Instead of a joyful "1" (everything is fine), the system will return something starting with "0" and mentioning STORE. That is not pleasant. It means that the file that seemed to be a µBD turned out to be not so micro and does not fit into our object: the memory allocated for it was not enough. Solvable? Of course. The IRIS data platform allows you to create objects of up to 4 TB in RAM. So what went wrong? By default, the system is configured for 256 MB per process, and we need much more. And remember, these are RAM requirements. Does your computer or server have enough in stock?
    How much memory this giant needs has to be established empirically: almost 10 GB. Specify it in the settings (Menu> Configure Memory> Maximum memory capacity per process (KB)) or through the $ZSTORAGE system variable (in kilobytes):

    set $ZSTORAGE=10000000

    Have you started a new process with the required memory settings? Then the rest is simple: we read and save.

    There is also an alternative (and probably preferable) option: use the UsePPGHandler property of the %XML.Reader class, which avoids holding the XML in memory and works with the standard memory settings.

    set reader = ##class(%XML.Reader).%New()
    set reader.UsePPGHandler = 1

    and then ... Correlate / Read, etc. ...

    do reader.Next(.object)
    do object.%Save()

    And so on, 3,722,548 times for each operation :-)
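For readers who want to see the same "read one element, process it, forget it" idea outside IRIS, here is a Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. It is an analogy, not the mechanism behind UsePPGHandler. The function name `iter_objects` and the tiny sample document are assumptions; only the element and attribute names come from the schema above.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Iterator

# The attributes kept in the trimmed FIAS.AddressObject class.
WANTED = ("AOGUID", "OFFNAME", "POSTALCODE", "SHORTNAME", "AOLEVEL", "PARENTGUID", "AOID")


def iter_objects(source: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per <Object> element without building the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Object":
            yield {key: elem.attrib[key] for key in WANTED if key in elem.attrib}
            root.clear()  # drop processed children so memory use stays flat


# Made-up two-record sample; real files hold millions of such elements.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<AddressObjects>
<Object AOID="id-1" AOGUID="g-1" OFFNAME="Sample" SHORTNAME="obl" AOLEVEL="1" ACTSTATUS="1"/>
<Object AOID="id-2" AOGUID="g-2" PARENTGUID="g-1" OFFNAME="Testgrad" SHORTNAME="g" AOLEVEL="4" ACTSTATUS="1"/>
</AddressObjects>"""

records = list(iter_objects(io.BytesIO(SAMPLE)))
print(len(records), records[1]["PARENTGUID"])
```

For a real file you would pass an open binary file object instead of the in-memory sample and consume the generator record by record rather than collecting it into a list.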

    This is exhausting. So let's add an import method to our FIAS.AddressObject class, based on the commands just shown:

    ClassMethod Import() {
        // Create an object for reading XML
        Set reader = ##class(%XML.Reader).%New()
        // Open the source XML for parsing; stop if it cannot be opened
        Set status = reader.OpenURL("http://localhost/AS_ADDROBJ_20190106_90809714-fe22-45b2-929c-52bd950963e0.XML")
        If $$$ISERR(status) {
            Do $System.Status.DisplayError(status)
            Return
        }
        // Correlate the Object element with the target class
        Do reader.Correlate("Object","FIAS.AddressObject")
        // Read objects one by one and save them to storage
        While (reader.Next(.object,.status)) {
            Set saveStatus = object.%Save()
            If $$$ISERR(saveStatus) {Do $System.Status.DisplayError(saveStatus)}
        }
        // If an error occurred during parsing, show the message
        If $$$ISERR(status) {Do $System.Status.DisplayError(status)}
    }

    Let's use the power of our computer exocortex with just one command in the terminal:

    do ##class(FIAS.AddressObject).Import()



    Everyone to the table, please. We had a µBD, and now the finished dish is ready: a global with the verified names of Russian cities and villages.



    Finally, a couple of words about what to do when 4 TB is not enough. In that case, we turn to streams. The documentation lays everything out neatly. Streams can be binary or character. Storing them in a global is not forbidden either. The recipe is as follows: take the stream, cut it into parts and feed them to the objects that need them.
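The "cut the stream into parts" recipe looks roughly like this in Python. This is a generic sketch over any binary file-like object, not the IRIS stream classes; the helper name `chunks` and the block sizes are arbitrary choices for the example.

```python
import io
from typing import BinaryIO, Iterator


def chunks(stream: BinaryIO, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield consecutive blocks of at most `size` bytes until the stream ends."""
    while True:
        block = stream.read(size)
        if not block:
            return
        yield block


# Small in-memory stand-in for a file that would not fit in RAM.
data = io.BytesIO(b"x" * 2500)
sizes = [len(block) for block in chunks(data, size=1000)]
print(sizes)
```

Only one block is held in memory at a time, so the total size of the source stops mattering; what matters is that each consumer can work on a part.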

    The part about beautiful ObjectScript address objects and a Python API did not fit here. That will be a separate story.
    A nice note: Gartner has just completed its annual collection of real user ratings and reviews in the DBMS category and, on that basis, published its ranking of the best DBMSs of 2019. InterSystems Caché and InterSystems IRIS Data Platform received the highest rating, "Customers' Choice". You can see for yourself who did the choosing and how they rated the products.
    Best Operational Database Management Systems of 2019 Reviewed by Customers

