Automating the retrieval of data from the Unified State Register of Legal Entities (EGRUL) with Free Pascal

Published on August 03, 2018


    In my (legal) work I am ready to automate anything that lends itself to it. But until the neural-network-powered robots from German Gref's utopia arrive and take over all the work of ordinary lawyers, routine will remain our main companion for a long time. Automating that routine is something I have been doing on and off for years, whether with numerous Excel spreadsheets stuffed with formulas that let you quickly print hundreds of same-type mailing documents in Word, or with automatically generated reports. But some things cannot be done with simple formulas and substitutions. This is where programming comes to the rescue. I have been fond of it since childhood, and it so happened that I started with Delphi. Even now it is easier for me to put a project together quickly in Lazarus with Free Pascal than in C# or Python, which I started learning only recently. And yes, I quite seriously believe the capabilities of this environment are more than sufficient. So, as you have guessed, that is what we will use to automate access to the register.

    A lawyer at a consulting firm serving dozens of legal entities, a freelance corporate lawyer, and any other lawyer responsible for keeping organizations running all know how easily dozens or hundreds of names, TIN and OGRN numbers get mixed up, and how easy it is to forget who manages which company, when the manager's term of office is up for renewal, and whether there are problems with shares in an LLC or with payment of the share capital. On top of that, having to quickly produce a document that includes many constantly changing details leads to periodic errors and typos. To automate exactly these processes, I needed a solution with a database that can generate documents from templates, keep various registers, track changes and not miss any deadlines. One of its data sources is the site of the Federal Tax Service. Of course, nobody claims that using the site directly is slow or difficult, but you will agree that clicking one button without leaving the application is much more fun, and you can do it without interrupting a phone call (or a cup of coffee).

    So, first let us decide what we want to get. The site lets you search the official register database by a unique OGRN or TIN number and returns one matching result: a brief summary of the person and a link to download a PDF file with the extract. The search can also be fuzzy, by name, with an optional filter by region (subject of the Russian Federation). In that case the site produces a table of all matching persons with the same set of data, including the PDF links.

    This means that in the specific case the finished function must return the PDF as a file (or better, a stream), taking the person's OGRN or TIN as input. But for generality and room for future extension we will not neglect the site's other capabilities, and will also write a fuzzy search function that returns the set of records found by organization name, with or without a region filter. Let us try to describe the interfaces of these functions:

      IEGRULstreamer = interface
        procedure GetExtractByOGRN(OGRN: string; ХХХХХХ; isLegal: boolean; var Extract: TStream);
        procedure GetLegalsListByName(Name, Region: string; ХХХХХХ; var LegalsList: TCollection);
      end;
    

    To understand what the mysterious parameter X is, and what kind of collection the second function returns, let us look at how the site executes a request.

    1. The site contains a form with input fields for search identifiers and captcha checks:



    2. The captcha is tied to a pre-generated hidden field called captchaToken; JavaScript on the page uses this token to request the captcha image.

    3. After the "find" button is clicked, a POST request is sent to the server, which responds with JSON containing an array of objects. Another script uses this JSON response to fill the table we see in the search results.

    So the first snag is the captcha check. To avoid burdening the methods that talk to the site with unrelated functionality, we move captcha handling into a separate function. X then becomes a parameter for a callback method that takes a stream with the captcha image as input and returns a string with the recognized captcha:

    TCapthcaRecognizeFunc = function(Captha: TStream): string of object;
    ...
    procedure GetExtractByOGRN(OGRN: string; CaptchaFunc: TCapthcaRecognizeFunc;
          isLegal: boolean; var Extract: TStream);
    

    The captcha handling function can do its job in any way it likes: let the user enter the text manually, send the image to a paid automatic recognition service, or recognize it on its own with some unique know-how algorithm. To keep things simple, and since I do not expect captchas on an industrial scale, we choose the first option:

    function TForm1.RecognizeFunc(captcha: TStream): string;
    begin
      CaptchaImg.Picture.LoadFromStream(captcha);
      Result := InputBox('Captcha', 'Enter the captcha text from the image', '');
    end;
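
    Because the callback is an ordinary method pointer, any object method with the same signature can be plugged in. The sketch below is a minimal illustration of that idea, not part of the original library: TFixedRecognizer and AskForCaptcha are hypothetical names, and the stub simply returns a preset answer, which could be handy when testing without a form.

```pascal
program CaptchaCallbackSketch;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

type
  // Same shape as the callback type declared in the article.
  TCapthcaRecognizeFunc = function(Captha: TStream): string of object;

  // Hypothetical stub: ignores the image and returns a preset answer.
  TFixedRecognizer = class
  public
    Answer: string;
    function Recognize(Captha: TStream): string;
  end;

function TFixedRecognizer.Recognize(Captha: TStream): string;
begin
  Result := Answer;
end;

// Hypothetical stand-in for a caller that only knows the callback type.
function AskForCaptcha(Image: TStream; Func: TCapthcaRecognizeFunc): string;
begin
  if not Assigned(Func) then
    raise Exception.Create('No captcha callback supplied');
  Image.Position := 0;           // let the callback read the image from the start
  Result := Trim(Func(Image));   // drop accidental spaces around the answer
end;

var
  Stub: TFixedRecognizer;
  Image: TMemoryStream;
begin
  Stub := TFixedRecognizer.Create;
  Image := TMemoryStream.Create;
  try
    Stub.Answer := ' 382915 ';
    WriteLn(AskForCaptcha(Image, @Stub.Recognize));
  finally
    Image.Free;
    Stub.Free;
  end;
end.
```

    The form method TForm1.RecognizeFunc would be passed in the same way, as @Form1.RecognizeFunc.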
    

    The second question is the contents of the server's JSON response. Here is an example of what it contains:

    Server response as formatted JSON
    {
    "query":
    	{"captcha":"382915",
    	"ogrninnfl":null,
    	"fam":null,
    	"nam":null,
    	"otch":null,
    	"region":null,
    	"ogrninnul":null,
    	"namul":"правительство",
    	"regionul":"73",
    	"kind":"ul",
    	"ul":true,
    	"searchByOgrn":false,
    	"nameEq":false,
    	"searchByOgrnip":true},
    "rows":
    	[
    		{"T":"ED346E713D4A1AC851F9B589C6D2AECD1D809D5B6B5D1B98E697B6E0FD873E137B828AC59A60D159BB2894F11D00AB5639E2ACEE4E2ED5B7AC7A6EFE28FD987BC288B93C4D3D3EC1008DA0F128BA7E5E",
    		"INN":"7325001144",
    		"NAME":"ПРАВИТЕЛЬСТВО УЛЬЯНОВСКОЙ ОБЛАСТИ",
    		"OGRN":"1027301175110",
    		"ADRESTEXT":"432017, ОБЛАСТЬ УЛЬЯНОВСКАЯ, ГОРОД УЛЬЯНОВСК, ПЛОЩАДЬ СОБОРНАЯ, 1",
    		"CNT":"4",
    		"DTREG":"03.12.2002",
    		"KPP":"732501001"},
    		{"T":"2ECB284C7682E5F1D1129AA3074FABB4B74BB28EA426AF79C091CEDEA0D9E391CA26FF405A7C9742466E19C78FBE5A59BDCBCD21268FFD8AFD3A8509CCA84541",
    		"INN":"7303007375",
    		"NAME":"СПЕЦИАЛИЗИРОВАННОЕ ГОСУДАРСТВЕННОЕ УЧРЕЖДЕНИЕ ПРИ ПРАВИТЕЛЬСТВЕ ОБЛАСТИ \"ФОНД ИМУЩЕСТВА УЛЬЯНОВСКОЙ ОБЛАСТИ\"",
    		"OGRN":"1027301173283",
    		"ADRESTEXT":"432063, ОБЛАСТЬ УЛЬЯНОВСКАЯ, ГОРОД УЛЬЯНОВСК, УЛИЦА ДМИТРИЯ УЛЬЯНОВА, 7",
    		"CNT":"4",
    		"DTREG":"27.11.2002",
    		"KPP":"732501001",
    		"DTEND":"01.09.2010"}
    	]
    }
    


    As you can see, the response contains a "query" object with the original search parameters (so that they stay in the form fields for reuse) and a "rows" array of objects. JavaScript builds the link to the PDF file by concatenating the string
    "https://egrul.nalog.ru/download/"
    with the value of the object's "T" key. The generated PDF file lives for only a few minutes.

    The two main difficulties I ran into when building the HTTP request were getting the header values right and assembling the string of POST parameters. But a simple look at the page with the browser's built-in developer tools (in Chrome, opened with F12) gave me everything I needed. Here is an example of headers with which the server returns a correct response instead of 400 Bad Request:

    POST / HTTP/1.1
    Host: egrul.nalog.ru
    Connection: keep-alive
    Accept: application/json, text/javascript, */*; q=0.01
    Origin: https://egrul.nalog.ru
    X-Requested-With: XMLHttpRequest
    User-Agent: Chrome/67.0.3396.99 Safari/537.36
    Content-Type: application/x-www-form-urlencoded
    Referer: https://egrul.nalog.ru/
    Accept-Encoding: gzip, deflate, br
    Accept-Language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
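
    A minimal sketch of how the extra headers from this capture could be collected in code. AddAjaxHeaders is a hypothetical helper, not the author's PrepareHeaders. It fills any TStrings, so with Synapse it could presumably be pointed at the Headers list of THTTPSend, while Content-Type and User-Agent would go through the MimeType and UserAgent properties. Accept-Encoding is left out on the assumption that the HTTP client does not decompress responses by itself.

```pascal
program HeadersSketch;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

const
  EGRUL_URL = 'https://egrul.nalog.ru/';

// Appends the extra request headers from the browser capture to any TStrings.
procedure AddAjaxHeaders(Headers: TStrings);
begin
  Headers.Add('Accept: application/json, text/javascript, */*; q=0.01');
  Headers.Add('Origin: https://egrul.nalog.ru');
  Headers.Add('X-Requested-With: XMLHttpRequest');
  Headers.Add('Referer: ' + EGRUL_URL);
  Headers.Add('Accept-Language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7');
end;

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    AddAjaxHeaders(List);
    Write(List.Text);  // one "Name: value" pair per line
  finally
    List.Free;
  end;
end.
```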
    

    And here is the parameter string (a single line, wrapped here for display):

    kind=ul&srchUl=name&ogrninnul=7716819629&namul=%D0%BF%D1%80%D0%B0%D0%B2%
    D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D1%81%D1%82%D0%B2%D0%BE&regionul=73
    &srchFl=ogrn&ogrninnfl=&fam=&nam=&otch=&region=&captcha=449023&captchaToken=DAEDA
    7504CACAC82CF09E08319B68DF5F9BD62B2F44D33DD679DDE55B5CF58B17FEC84E78CEEB9639
    84D2B2BD8C3AA15
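
    Note that the organization name in this string is percent-encoded UTF-8. The sketch below shows one way such a string could be assembled. PercentEncode and BuildNameSearchParams are hypothetical helpers written for illustration; the field order simply mirrors the captured string, and whether the server cares about that order is not something I have verified.

```pascal
program ParamsSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Percent-encodes every byte that is not an RFC 3986 unreserved character.
function PercentEncode(const S: string): string;
const
  Unreserved = ['A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'];
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    if S[i] in Unreserved then
      Result := Result + S[i]
    else
      Result := Result + '%' + IntToHex(Ord(S[i]), 2);
end;

// Builds the POST body for a search of legal entities by name.
function BuildNameSearchParams(const Name, Region, Captcha, Token: string): string;
begin
  Result := 'kind=ul&srchUl=name&ogrninnul=&namul=' + PercentEncode(Name)
    + '&regionul=' + PercentEncode(Region)
    + '&srchFl=ogrn&ogrninnfl=&fam=&nam=&otch=&region='
    + '&captcha=' + PercentEncode(Captcha)
    + '&captchaToken=' + PercentEncode(Token);
end;

begin
  // #$D0#$BF#$D1#$80 are the UTF-8 bytes of the first two letters of the
  // sample query; 'TOKEN' is a placeholder for a real captchaToken value.
  WriteLn(BuildNameSearchParams(#$D0#$BF#$D1#$80, '73', '449023', 'TOKEN'));
end.
```

    Writing the bytes out explicitly keeps the example independent of the source file encoding; in a real project the name would come from a UTF-8 string entered by the user.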

    Armed with this input, we can move on to the implementation. I will use the following libraries for Free Pascal:

    Synapse is a very convenient library with an extremely simple function for sending HTTP requests to a server. It also works over SSL, but that requires the OpenSSL libraries to be present in the project folder or on the system, plus one additional unit. For our project it is enough to use these units of the library: httpsend, ssl_openssl, synautil.

    The built-in fcl-json library provides the units fpjson and fpjsonrtti, which make processing the objects returned in JSON as convenient as possible.

    Individual units of the built-in fcl-xml library: some functions need to work with parts of the HTML as DOM objects, so we will use the units SAX_HTML, DOM_HTML, DOM.

    Here are the types and classes I ended up with:

    TEGRULItem = class(TCollectionItem)
      private
        fT, fINN, fNAME, fOGRN, fADRESTEXT, fCNT, fDTREG, fDTEND, fKPP: string;
      public
        function GetPdfLink: string;
      published
        property T: string read fT write fT;
        property INN: string read fINN write fINN;
        property NAME: string read fNAME write fNAME;
        property OGRN: string read fOGRN write fOGRN;
        property ADRESTEXT: string read fADRESTEXT write fADRESTEXT;
        property CNT: string read fCNT write fCNT;
        property DTREG: string read fDTREG write fDTREG;
        property DTEND: string read fDTEND write fDTEND;
        property KPP: string read fKPP write fKPP;
      end;
    

    This class holds the objects returned in the rows array of the server's JSON response. We will read them with JSONToCollection, which requires each object to be a collection item and all the relevant properties to be declared as published: RTTI functions in Free Pascal (as in Delphi) can see property names only when they are declared in that section. JSONToCollection from the fpjsonrtti unit is exactly such an RTTI function: it matches the key names of a JSON object to the property names of the class.

    The class also has a public function, GetPdfLink, which returns the link for downloading the PDF file with the extract from the Unified State Register of Legal Entities by concatenating the web address with the value of the T property.
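
    To make the mechanism concrete, here is a minimal sketch of reading a "rows" array into a collection. It is not the author's GetLegalsList: TRowItem is a trimmed-down, hypothetical version of the item class with only three properties, LoadRows is a hypothetical helper, and the sample JSON is shortened with a placeholder value for "T".

```pascal
program RowsToCollectionSketch;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, fpjson, jsonparser, fpjsonrtti;

const
  DOWNLOAD_URL = 'https://egrul.nalog.ru/download/';
  // Shortened sample response; "ABC" stands in for a real "T" key.
  SAMPLE = '{"rows":[{"T":"ABC","INN":"7325001144","OGRN":"1027301175110"}]}';

type
  // Hypothetical trimmed-down item: published names must match the JSON keys.
  TRowItem = class(TCollectionItem)
  private
    fT, fINN, fOGRN: string;
  public
    function GetPdfLink: string;
  published
    property T: string read fT write fT;
    property INN: string read fINN write fINN;
    property OGRN: string read fOGRN write fOGRN;
  end;

function TRowItem.GetPdfLink: string;
begin
  Result := DOWNLOAD_URL + fT;
end;

// Parses a server response and fills Rows from its "rows" array.
procedure LoadRows(const Response: string; Rows: TCollection);
var
  Data: TJSONData;
  Reader: TJSONDeStreamer;
begin
  Data := GetJSON(Response);  // needs the jsonparser unit in the uses clause
  Reader := TJSONDeStreamer.Create(nil);
  try
    Reader.JSONToCollection((Data as TJSONObject).Arrays['rows'], Rows);
  finally
    Reader.Free;
    Data.Free;
  end;
end;

var
  Rows: TCollection;
begin
  Rows := TCollection.Create(TRowItem);
  try
    LoadRows(SAMPLE, Rows);
    if Rows.Count = 0 then
      WriteLn('Nothing found')
    else
      WriteLn(TRowItem(Rows.Items[0]).GetPdfLink);
  finally
    Rows.Free;
  end;
end.
```

    Checking Count before touching Items[0] is a small precaution for the case when the search returns no rows.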


    The main class implementing the interface declared above will be as follows:

      TEGRULStreamer = class(TInterfacedObject, IEGRULStreamer)
      private
        HTTPSender: THTTPSend;
        Doc: THTMLDocument;
        Inputs: TDOMNodeList;
        captchaURL, captchaToken, captcha, Params: string;
        function GetCaptchaToken: string;
        function GetLegalsList: TCollection;
        procedure PrepareHeaders;
        procedure ProcessCaptcha(CaptchaFunc: TCapthcaRecognizeFunc);
      public
        procedure GetExtractByOGRN(OGRN: string; CaptchaFunc: TCapthcaRecognizeFunc;
          isLegal: boolean; var Extract: TStream);
        procedure GetLegalsListByName(Name, Region: string; CaptchaFunc: TCapthcaRecognizeFunc;
          var LegalsList: TCollection);
        destructor Destroy; override;
      end; 
    


    As you can see, apart from the two main interface methods, all other fields and methods of the class are private and are needed only for the internal implementation. They could all have been folded into the main methods, but we have already learned our lessons about duplicated code, visibility and refactoring in general.

    With the preparatory steps encapsulated, the main methods differ only in how they build the parameter string of the HTTP request and in the type of data they return.

    Code of the TEGRULStreamer.GetExtractByOGRN method
    procedure TEGRULStreamer.GetExtractByOGRN(OGRN: string;
      CaptchaFunc: TCapthcaRecognizeFunc; isLegal: boolean; var Extract: TStream);
    begin
      ProcessCaptcha(CaptchaFunc);
      // kind=ul searches legal entities, kind=fl searches entrepreneurs
      if isLegal then Params := 'kind=ul' else Params := 'kind=fl';
      Params += '&srchUl=ogrn&srchFl=ogrn&ogrninnul=';
      if isLegal then Params += OGRN;
      Params += '&namul=&regionul=&ogrninnfl=';
      if not isLegal then Params += OGRN;
      Params += '&fam=&nam=&otch=&region=&captcha=' + captcha + '&captchaToken=' + captchaToken;
      WriteStrToStream(HTTPSender.Document, Params);
      if not HTTPSender.HTTPMethod('POST', EGRUL_URL) then
        raise Exception.Create('The Federal Tax Service site cannot be opened');
      HTTPSender.Headers.Clear;
      // A search by OGRN or TIN yields a single row, so take the first item
      if HTTPSender.HTTPMethod('GET', TEGRULItem(GetLegalsList.Items[0]).GetPdfLink) then
        Extract := HTTPSender.Document
      else
        Extract := nil;
    end;
    


    As you can see, the method also takes the boolean parameter isLegal; if it is not true, the search runs against the register of individual entrepreneurs instead of legal entities.

    Code of the TEGRULStreamer.GetLegalsListByName method
    procedure TEGRULStreamer.GetLegalsListByName(Name, Region: string;
      CaptchaFunc: TCapthcaRecognizeFunc; var LegalsList: TCollection);
    begin
      ProcessCaptcha(CaptchaFunc);
      Params := 'kind=ul&srchUl=name&srchFl=ogrn&ogrninnul=&namul=';
      Params += Name + '&regionul=' + Region + '&ogrninnfl=&fam=&nam=&otch=&region=';
      Params += '&captcha=' + captcha + '&captchaToken=' + captchaToken;
      WriteStrToStream(HTTPSender.Document, Params);
      if not HTTPSender.HTTPMethod('POST', EGRUL_URL) then
        raise Exception.Create('The Federal Tax Service site cannot be opened');
      LegalsList := GetLegalsList;
    end;
    


    The helper methods do the following:

    ProcessCaptcha - loads the start HTML page of the FTS service, finds the captcha token, downloads the image generated for this token and passes it to the callback method for recognition. At the end, the method also sets the correct headers for the subsequent POST request.

    GetCaptchaToken - loads all input fields from the page into a DOM structure, finds the hidden field with the identifier captchaToken and returns its value.

    GetLegalsList - uses the RTTI function JSONToCollection to return a collection of TEGRULItem objects, described above.

    GetPdfLink - a search by OGRN or TIN, when the number is valid, always returns exactly one result, so GetExtractByOGRN calls this function on the first item of the collection.
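
    As an illustration of the DOM step, here is a minimal sketch of pulling a hidden field out of a page with the fcl-xml units. It is not the author's GetCaptchaToken: ExtractCaptchaToken is a hypothetical helper, the sample markup is invented, and matching the field by its name attribute is an assumption about the page.

```pascal
program CaptchaTokenSketch;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Invented sample markup; the real page is of course much larger.
  SAMPLE_PAGE = '<html><head><title>t</title></head><body><form>' +
    '<input type="text" name="ogrninnul">' +
    '<input type="hidden" name="captchaToken" value="DAEDA7504C">' +
    '</form></body></html>';

// Returns the value of the input named captchaToken, or '' if it is absent.
function ExtractCaptchaToken(const Html: string): string;
var
  Source: TStringStream;
  Doc: THTMLDocument;
  Inputs: TDOMNodeList;
  Node: TDOMElement;
  i: Integer;
begin
  Result := '';
  Doc := nil;
  Source := TStringStream.Create(Html);
  try
    ReadHTMLFile(Doc, Source);
    Inputs := Doc.GetElementsByTagName('input');
    try
      for i := 0 to Integer(Inputs.Count) - 1 do
      begin
        Node := TDOMElement(Inputs[i]);
        // Assumption: the hidden field is identified by its name attribute.
        if Node.GetAttribute('name') = 'captchaToken' then
        begin
          Result := UTF8Encode(Node.GetAttribute('value'));
          Break;
        end;
      end;
    finally
      Inputs.Free;
    end;
  finally
    Doc.Free;
    Source.Free;
  end;
end;

begin
  WriteLn(ExtractCaptchaToken(SAMPLE_PAGE));
end.
```

    In the real class the HTML would come from the HTTP response stream rather than a string constant.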

    Since this is my first experience with networking in Free Pascal, I am very glad that everything turned out exactly as I intended. The library was in working shape in less than a day (thanks to the members of the freepascal.ru forum, who told me about Synapse).

    The archive with the code of the resulting library and a test project is here.

    As always, I will be glad to hear any constructive criticism of both the project and the implementation. I understand there are many factors still to be taken into account: a slow response to an HTTP request, which will make the application hang; malformed HTTP responses; and other situations.

    In the future I plan to connect the library to the FIAS address database and to add generation of completed application forms, which are normally edited in the official program for preparing documents for state registration.


    P.S. Sorry, Sberbank, for playing the guinea pig and having your extract downloaded hundreds of times. All in the name of science, of course.