Forced PostgreSQL Query Optimization

    What do you do when a closed-source application does not access the database in the most optimal way? How do you tune its queries without changing the application, and ideally without changing the database itself?

    If you have never asked yourself such questions, you are a very lucky and strict DBA.

    Well, if you have, let me share some suffering and experience.

    The need to store more data, or the problem statement

    You can safely skip this section if you are not interested in the history of the issue.

    Initially, we had a proprietary system that parsed its data from a closed format into a PostgreSQL database, from which we read, analyzed and processed that data.

    In addition, the system's own tooling also used this database for certain operations, so abandoning it and creating a copy with our own structure seemed hopeless.

    By default, the system automatically deleted records older than a week, so there were no performance problems in that setup.

    However, we needed to store the data for much longer, for as long as disk space on the server allowed. And it was highly desirable not to lose access to this data and to keep using the system's built-in tooling, even for old records.

    Therefore, the obvious solution was partitioning plus a trigger on the INSERT operation. The trick is quite simple and effective: incoming data is inserted into the appropriate partition, deletion of old records is disabled, and everything seems fine.
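    As a sketch, this scheme can look like the following, using PostgreSQL inheritance-based partitioning (the table and column names match the later examples; the partition bounds and the monthly granularity are illustrative assumptions):

```sql
-- Illustrative sketch: a parent table, a monthly child partition,
-- and a trigger that routes INSERTs into the right partition.
CREATE TABLE tablename (
    id   integer,
    col3 timestamptz NOT NULL
);

CREATE TABLE tablename_2018_12 (
    CHECK (col3 >= '2018-12-01' AND col3 < '2019-01-01')
) INHERITS (tablename);

CREATE OR REPLACE FUNCTION tablename_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.col3 >= '2018-12-01' AND NEW.col3 < '2019-01-01' THEN
        INSERT INTO tablename_2018_12 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.col3;
    END IF;
    RETURN NULL;  -- the row has already been stored in the child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_trigger
    BEFORE INSERT ON tablename
    FOR EACH ROW EXECUTE PROCEDURE tablename_insert_trigger();
```

    With CHECK constraints like these in place, the planner can skip partitions entirely, but only when the query filters on col3 with a plan-time constant.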

    Until a couple of years passed and a decent amount of data accumulated.

    Then it “suddenly” turned out that the queries used by the system's tooling do not limit the selection by date (more precisely, they filter not on the field that the table is partitioned by). That is, whenever we search for something, the search scans all partitions. UPDATE operations also began to slow down, since their conditions used only the ID.

    As a result, a query runs for a long time, drags down all other queries, and the load grows rapidly.

    Of course, the first thing that comes to mind is to contact the developer.

    However, in most cases the developer is either no longer reachable, or will quote the price of another such system for changing a few lines.

    So the idea arose that there might already be some kind of proxy that could help us.

    We need a proxy

    Quick googling gave no clear answer on how to rewrite an incoming query on the PostgreSQL side or with some third-party software.

    Therefore (well, just for fun too, of course) a fairly simple piece of software was written that accepts client connections and proxies them to PostgreSQL. Along the way it reads the incoming SQL queries and, if necessary, replaces them.

    I am sharing the link to GitHub.

    I have not made any binary packages yet; I have not gotten around to it. But the build is quite simple. Everything is written in C++/Qt, because that is what I have been writing in for a long time now...

    The config is quite simple. First, we indicate which interface and port to listen on:
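    For example, something like this (the exact key names here are my assumption; the example config in the repository documents the real ones):

    listen_address = 0.0.0.0
    listen_port = 5433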


    Then we point the offending software at this address instead of letting it connect to the PostgreSQL server directly.

    We specify where to forward connections (in this example, the proxy runs on the same machine as the PostgreSQL server):
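    With assumed key names again (the repository's example config documents the real ones), something like:

    target_address = 127.0.0.1
    target_port = 5432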


    We set a regular expression to catch the desired query:

    query = SELECT \* FROM tablename WHERE (.+)

    We say that it needs to be rewritten:

    action = rewrite

    We say how to rewrite:

    rewrite = SELECT * FROM tablename WHERE (col3 >= '$(now-1M)') AND $(1)

    In this example, we add a filter on the date column to the query, saying that we are only interested in records from the last month.

    One could write like this:

    rewrite = SELECT * FROM tablename WHERE (col3 >= now() - interval '1 month') AND $(1)

    But then the query would still not be optimal: because of the now() function, the search would still scan all partitions. To search only the necessary ones, a constant value must be supplied. That is why, for the $(now-1M) construct, our proxy substitutes a concrete timestamp, already shifted back by a month.
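    To illustrate how such a rule works, here is a simplified re-implementation sketch of the rewrite step in C++. This is not the proxy's actual code: the pattern, table and column names come from the example above, and the "one month" shift is approximated as 30 days.

```cpp
#include <cassert>
#include <ctime>
#include <regex>
#include <string>

// Simplified sketch of a rewrite rule: if the query matches the
// pattern, substitute the captured tail for $(1) and a constant
// timestamp (now minus ~1 month) for $(now-1M), so the planner can
// exclude partitions at plan time.
std::string rewrite(const std::string& query) {
    static const std::regex rule(R"(SELECT \* FROM tablename WHERE (.+))");
    std::smatch m;
    if (!std::regex_match(query, m, rule))
        return query;  // no rule matched: pass the query through unchanged

    // Render "now minus one month" as a constant literal (30 days here).
    std::time_t t = std::time(nullptr) - 30L * 24 * 3600;
    char buf[32];
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S+00", std::gmtime(&t));

    return "SELECT * FROM tablename WHERE (col3 >= '" + std::string(buf) +
           "') AND " + m[1].str();
}
```

    A real rule set is of course read from the config; the point is only that the replacement contains a constant, plan-time-known timestamp instead of a function call.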

    Result (from the log):

    ORIGINAL query: SELECT * FROM tablename WHERE id=1;
    MODIFIED query (rule 1): SELECT * FROM tablename WHERE (col3 >= '2018-11-12 11:25:23.0+00') AND id=1;

    Thus, in principle, any query can be rewritten. Responses from the server are not modified and are passed to the client as is, which minimizes the added latency. Besides, the application usually expects a response in a particular format, so changing the set of columns in the query (and therefore in the response) is undesirable.

    You can also easily write all queries of interest to the log:

    query = .+
    action = log

    The repository has a config with examples and a more detailed description.

    By the way, this makes it easy to see how correctly the developer works with the database. For example, if you see a query like the following executed frequently, it is time for someone to go read the manuals.

    INSERT INTO tablename (col1, col2, col3) VALUES('value1', 1, '2018-12-31')

    It should look like this:

    INSERT INTO tablename (col1, col2, col3) VALUES($1::varchar, $2::integer, $3::date)
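    At the SQL level, the parameterized form corresponds to a server-side prepared statement. A minimal illustration, reusing the table and types from the example above:

```sql
-- Prepare the statement once; the server parses and plans it a single time.
PREPARE ins (varchar, integer, date) AS
    INSERT INTO tablename (col1, col2, col3) VALUES ($1, $2, $3);

-- Execute it many times with different parameter values.
EXECUTE ins('value1', 1, '2018-12-31');
```

    Client libraries normally do this through the extended query protocol rather than literal PREPARE/EXECUTE, but the effect is the same: the query text stays constant and only the parameters vary.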

    Unfortunately, our proxy cannot do that yet :/ but it would not be hard to add. Perhaps in the future it will be able to rewrite the first query into the second.

    One important point: SSL is not supported yet, so all connections from clients to the proxy are unencrypted.

    I would welcome comments and suggestions.

    If there is active interest from users, I will probably develop the project further.

    For example, support for other databases could be added.
