Checking register extracts for inaccuracy records. Merging PDFs in Python

    The tax authority's power to exclude a company from the Unified State Register of Legal Entities after merely “revealing” so-called inaccurate information about it remains a pressing topic. According to statistics, since September 2018 the Federal Tax Service has excluded 90,000 organizations from the Unified State Register of Legal Entities that carried a record of inaccurate information about the head, founder, or address of the legal entity. The only way to discover that such a record exists for a company is to look at an extract from the register.

    It looks like this:



    The problem is aggravated by the fact that an inaccuracy record can appear both at the request of an interested party and “on its own”, as a result of the tax authority's actions. To protect yourself from a sudden removal from the register, you need to obtain extracts regularly. How to do this quickly and painlessly when the holding includes a large number of companies was covered in a previous post.

    This time we'll look at how to find inaccuracy records in extracts from the register of legal entities.

    We assume that we have some number of extracts downloaded from the FTS website. The extracts have the .pdf extension and arbitrary names.

    All that is required of us is to search each PDF file for the word “inaccuracy” (“недостоверность” in the Russian-language extracts).

    Opening each PDF extract and searching it by hand is not our method: it can take an excessively long time. You can merge all the files in ABBYY FineReader, but that also takes quite a while.

    We will write a program that merges all the PDF files into one. Python lets you do this in seconds!

    Afterwards, we can open the resulting file and search for the desired word across all the register extracts at once.

    Let's get started.

    * The extracts from the USRLE are located in the C:\1 directory.
    In a new Python file, we import the modules for working with PDF files and with the operating system:

    import PyPDF2, os

    Next, we create an empty list and change to the directory C:\1, where all our extracts are located.

    This directory may also contain other files. The program processes only files that have the .pdf extension:

    pdfFiles = []
    os.chdir('C:\\1')
    for filename in os.listdir('.'):
        if filename.endswith('.pdf'):
            pdfFiles.append(filename)
    pdfFiles.sort()
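
    One detail worth noting: `endswith('.pdf')` is case-sensitive, so a file saved as `EXTRACT.PDF` would be skipped. Below is a minimal sketch of a case-insensitive variant. The helper name `list_pdfs` and the demo file names are our own illustration, not part of the original program.

```python
import os
import tempfile


def list_pdfs(directory):
    """Return sorted names of regular files in `directory` ending in .pdf (any case)."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.lower().endswith('.pdf')
        and os.path.isfile(os.path.join(directory, name))
    )


# Demonstration in a throwaway directory with made-up (hypothetical) file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('b.pdf', 'A.PDF', 'notes.txt'):
        open(os.path.join(demo_dir, name), 'wb').close()
    demo_result = list_pdfs(demo_dir)
    print(demo_result)
```

    In the program itself, `pdfFiles = list_pdfs('.')` could stand in for the loop, assuming the working directory has already been changed.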
    

    The next block merges the extracts, appending each subsequent extract to the end:

    # PdfWriter/PdfReader are the current PyPDF2 class names; older releases
    # called them PdfFileWriter/PdfFileReader.
    pdfWriter = PyPDF2.PdfWriter()
    # Loop through all the PDF files.
    for filename in pdfFiles:
        # The source files must stay open until the merged file is written.
        pdfFileObj = open(filename, 'rb')
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # Loop through all the pages and add them.
        for pageObj in pdfReader.pages:
            pdfWriter.add_page(pageObj)
    

    It remains only to save the result:

    # The with-block closes the output file even if writing fails.
    with open('all.pdf', 'wb') as pdfOutput:
        pdfWriter.write(pdfOutput)

    After the program finishes, we have the all.pdf file, which can be searched for the inaccuracy records we need.

    The program for merging PDFs into one can be downloaded here.

    Update from 11/08/2019


    We trim each register extract, keeping only its first 4 pages.
    Information about the inaccuracy of a legal entity's data appears in different parts of the extract.
    The end of the extract contains inaccuracy records that the tax authority has already canceled.
    So it is hardly advisable to run the search over entire extracts from the Unified State Register of Legal Entities: it would also find these obsolete entries.
    Therefore, we will use Python to trim the downloaded USRLE extracts, saving the first 4 pages of each. As a rule, these pages are enough to find signs of inaccuracy in the address or the sole executive body.
    Move all the extracts you downloaded earlier (PDF files) to a folder, here 'C:\1\2', and run the Python code:
    #! python3
    import PyPDF2, os
    from datetime import datetime
    start = datetime.now()
    os.chdir('C:\\1\\2')
    pdfFiles = []
    for filename in os.listdir('.'):
        if filename.endswith('.pdf'):
            pdfFiles.append(filename)
    pdfFiles.sort()
    # PdfWriter/PdfReader are the current PyPDF2 class names; older releases
    # called them PdfFileWriter/PdfFileReader.
    pdfWriter = PyPDF2.PdfWriter()
    # Loop through all the PDF files.
    for filename in pdfFiles:
        # The source files must stay open until the merged file is written.
        pdfFileObj = open(filename, 'rb')
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # Add the first 4 pages, or fewer if the extract is shorter.
        for pageNum in range(min(4, len(pdfReader.pages))):
            pdfWriter.add_page(pdfReader.pages[pageNum])
    # Save the resulting PDF to a file.
    with open('all-small.pdf', 'wb') as pdfOutput:
        pdfWriter.write(pdfOutput)
    # Print how long the whole run took.
    print(datetime.now() - start)
    

    As output, we get the register extracts merged into a single PDF file, “all-small.pdf”, containing only the first 4 pages of each extract.

    Now let's search “all-small.pdf” for the stem “недостов” (the beginning of the Russian word for “inaccuracy”):
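    #!/usr/bin/python
    import fitz  # PyMuPDF
    filename = "all-small.pdf"
    # Stem of the Russian word for "inaccuracy"; it matches the inflected forms too.
    search_term = "недостов"
    pdf_document = fitz.open(filename)
    for current_page in range(len(pdf_document)):
        page = pdf_document.load_page(current_page)
        # search_for returns a list of hit rectangles; an empty list means no match.
        if page.search_for(search_term):
            print("%s found on page %i" % (search_term, current_page + 1))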
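
    The search term is a stem rather than a full word because Russian words change their endings. The sketch below shows the idea on plain strings. The helper `mentions_inaccuracy` and the sample phrases are hypothetical illustrations, and the text is assumed to have already been extracted from the PDF.

```python
SEARCH_STEM = "недостов"


def mentions_inaccuracy(text, stem=SEARCH_STEM):
    """Return True if `text` contains the stem, ignoring letter case."""
    return stem.lower() in text.lower()


# Made-up sample phrases: two inflected forms and one that must not match.
samples = [
    "сведения недостоверны",
    "Недостоверность сведений об адресе",
    "сведения достоверны",
]
flags = [mentions_inaccuracy(sample) for sample in samples]
print(flags)
```

    The last phrase says the data is accurate (“достоверны”), so it is correctly left out: the stem includes the negating prefix “не”.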
    


    The program works much faster than searching the merged PDF file in Acrobat Reader, and it prints to the terminal the pages on which an inaccuracy record was found.
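
    The page numbers printed by the search refer to the merged file, not to the individual extracts. If we record how many pages were kept from each extract while merging, a merged page number can be traced back to its source file. Here is a minimal sketch of that bookkeeping. The helpers `build_offsets` and `source_of` and the file names are assumptions for illustration, not part of the original program.

```python
from bisect import bisect_right


def build_offsets(kept_pages):
    """kept_pages: list of (filename, pages_kept) in merge order.

    Returns the 1-based merged page number at which each file starts.
    """
    starts = []
    next_page = 1
    for _, count in kept_pages:
        starts.append(next_page)
        next_page += count
    return starts


def source_of(page_number, kept_pages, starts):
    """Return the name of the file that contributed the given merged page."""
    total = sum(count for _, count in kept_pages)
    if not 1 <= page_number <= total:
        raise ValueError("page %d is outside the merged file" % page_number)
    # The rightmost start that is <= page_number belongs to the source file.
    index = bisect_right(starts, page_number) - 1
    return kept_pages[index][0]


# Hypothetical example: three extracts, the second one only 2 pages long.
kept = [("alpha.pdf", 4), ("beta.pdf", 2), ("gamma.pdf", 4)]
starts = build_offsets(kept)
print(source_of(5, kept, starts))
```

    In the trimming program, the `kept` list could be filled inside the merge loop by appending the file name together with the number of pages actually added.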
