7 simple optimizations that reduced CPU load from 80% to 27%

    For more than three years, our team has been developing PCRF, an important component of an operator's network. Policy and Charging Rules Function (PCRF) is a solution for managing subscriber service policies in LTE (3GPP) networks. It assigns a particular policy in real time, taking into account the services connected for the subscriber, their location, the current network quality at that location, the time of day, the volume of consumed traffic, and so on. In this context, a policy means the subset of services available to a subscriber together with QoS (quality of service) parameters. After analyzing the price-quality ratio of the various products in this area from a variety of vendors, we decided to develop our own. For more than two years now, our PCRF has been successfully operating on the Yota commercial network. The solution is entirely software-based and can be installed even on ordinary virtual servers; in commercial deployment it runs on Red Hat Linux.

    Of all the features of our PCRF, the following proved the most valuable:
    • a flexible, Lua-based tool for making policy decisions for subscribers, which lets the operations team easily change the policy assignment algorithm on the fly;
    • support for a wide range of PCEF (Policy and Charging Enforcement Function, the component that directly enforces policies for subscribers), DPI (Deep Packet Inspection, a component that analyzes traffic packets and, in particular, allows counting consumed traffic by category), and AF (Application Function, a component that describes service data flows and reports the resources a service requires). All of these network nodes can be installed in any quantity, and multiple sessions per subscriber from different network components are supported. We have conducted interoperability tests (IOTs) with many major manufacturers of such equipment;
    • a whole family of external interfaces for other systems on the network, plus a monitoring system that covers all the processes occurring in the system;
    • scalability and performance.

    In the rest of the article, I will focus on just one of the many criteria of that last item.

    We have a resource where, six months ago, we posted a test image available to everyone under the appropriate license, a list of equipment vendors with whom IOTs were conducted, a package of product documentation, and several articles in English about our development experience (about the Lua-based engine, for example, or about various kinds of testing).

    When it comes to performance, there are many criteria by which it can be evaluated. The article on testing on our resource describes in detail the load tests and tools we used. Here I would like to dwell on one parameter: CPU usage.
    I will compare the results obtained in a test with 3000 transactions per second and scenarios of the following form:
    1. CCR-I - establishing a subscriber session;
    2. CCR-U - updating the session with information about the amount of traffic consumed by the subscriber;
    3. CCR-T - terminating the session with information about the amount of traffic consumed by the subscriber.

    In version 3.5.2, which we released in the first quarter of last year, the CPU load in this scenario was quite high: 80%. We managed to lower it to 35% in version 3.6.0, which currently runs on the commercial network, and to 27% in version 3.6.1, which is now in the stabilization stage. Despite such a huge difference, we did not perform any miracle; we simply carried out 7 simple optimizations, which I describe below. Perhaps you can apply one of them to make your own product better in terms of CPU usage.


    First of all, I want to say that most of the optimizations concerned the interaction between the database and the application logic. More thoughtful use of queries and caching of information is, perhaps, the main thing we did. To analyze how long queries take in the database, we had to write our own utility. Initially the application used the Oracle TimesTen database, which has no well-developed built-in monitoring tools, and after introducing PostgreSQL we decided it was only fair to compare the two databases with a single tool, so we kept our utility. In addition, it does not have to collect data constantly: it can be turned on and off as needed, for example on a commercial network, at the cost of a slight increase in CPU load but with the ability to analyze, right in production, which query is causing problems at the moment.
    The utility calls tt_perf_info and simply measures the time spent at the different stages of a query: fetch, the execution itself, the number of calls per second, and the share of total time. Times are shown in microseconds. The top 15 queries in versions 3.5.2 and 3.6.1 can be seen in the tables at the links:
    3.5.2 top 15
    3.6.1 top 15 (empty cells correspond to the value 0 in this version)
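
    The measurement idea itself is simple. Below is a minimal, self-contained sketch of how per-query timing could be accumulated; the driver calls are passed in as function pointers because the real utility works on top of tt_perf_info and the TimesTen/PostgreSQL drivers, whose exact API is not shown here.

    #include <stdint.h>
    #include <time.h>

    /* Accumulated statistics for one query, in microseconds. */
    struct query_stats {
        uint64_t exec_us;   /* time spent in execution */
        uint64_t fetch_us;  /* time spent fetching rows */
        uint64_t calls;     /* number of calls */
    };

    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)(ts.tv_nsec / 1000);
    }

    /* Wrap one execute+fetch cycle and record how long each stage took.
     * exec_fn/fetch_fn stand in for the driver calls; fetch_fn is assumed
     * to return non-zero once there are no more rows. */
    static void timed_query(struct query_stats *st,
                            int (*exec_fn)(void *), int (*fetch_fn)(void *),
                            void *stmt)
    {
        uint64_t t0 = now_us();
        exec_fn(stmt);
        uint64_t t1 = now_us();
        while (fetch_fn(stmt) == 0)
            ;                          /* drain the result set */
        uint64_t t2 = now_us();

        st->exec_us  += t1 - t0;
        st->fetch_us += t2 - t1;
        st->calls++;
    }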

    Optimization 1: reducing the number of commits

    If you look carefully at the tt_perf_info output for the two versions, you can see that the number of pcrf.commit calls dropped from 12006 per second to 1199, that is, by a factor of 10! The obvious solution that came to mind was to check whether anything in the database had actually changed, and to commit only if the answer was yes. For example, for an UPDATE query, PCRF checks the number of records that changed; if it is 0, no commit is issued. The same applies to DELETE.
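
    As a minimal illustration, here is what such a check could look like with PostgreSQL's libpq (the TimesTen/ODBC path can use SQLRowCount the same way); this is a sketch, not the actual PCRF code, and it assumes an explicit transaction is already open.

    #include <stdlib.h>
    #include <libpq-fe.h>

    /* Run an UPDATE and issue a COMMIT only if at least one row changed. */
    static void update_and_maybe_commit(PGconn *conn, const char *update_sql)
    {
        PGresult *res = PQexec(conn, update_sql);
        long changed = atol(PQcmdTuples(res));   /* number of affected rows */
        PQclear(res);

        if (changed == 0)
            return;                              /* nothing changed: skip COMMIT */

        PQclear(PQexec(conn, "COMMIT"));         /* commit only real changes */
    }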

    Optimization 2: removing the MERGE query

    While working with Oracle TimesTen, we noticed that a MERGE query locks the entire table, which, with processes constantly competing for the same tables, led to obvious problems. So we simply replaced all MERGE queries with a GET-UPDATE-INSERT combination: if the record exists, it is updated; if not, a new one is inserted. We did not even wrap all this in a transaction; instead, the function calls itself recursively in case of failure. In pseudocode, it looks something like this:
    our_db_merge_function() {
        if (db_get() == OK) {                   /* the record already exists */
            if (db_update() == OK) {
                return OK;
            } else {
                /* a competing process deleted the record: retry */
                return our_db_merge_function();
            }
        } else {
            if (db_insert() == OK) {
                return OK;
            } else {
                /* a competing process inserted the record first: retry */
                return our_db_merge_function();
            }
        }
    }
    

    In practice, this almost always completes without a recursive call, since conflicts on a single record are rare.

    Optimization 3: caching the configuration for counting subscribers' consumed traffic

    The algorithm for calculating the volume of consumed traffic according to the 3GPP specification is rather complicated. In version 3.5.2, the entire configuration was stored in the database and consisted of monitoring-key and accumulator tables with a many-to-many relationship. The system also supported aggregating traffic accumulators from different external systems into a single value on the PCRF, and this setting was stored in the database as well. As a result, every time new data about accumulated volume arrived, a complex query had to be run against the database.
    In 3.6.1, most of the configuration was moved out into an XML file; processes are notified when this file changes, and a checksum is calculated over the configuration data. In addition, the current traffic monitoring subscription information is stored in a blob attached to each user session. Reading and writing a blob is undoubtedly a faster and less resource-intensive operation than a huge select over tables with many-to-many relationships.
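
    A minimal sketch of the change-notification part, assuming Linux inotify is used for it (the actual mechanism in PCRF may differ); reload_monitoring_config() is a hypothetical stand-in for the real XML parser, and watching the containing directory would be more robust if the file is replaced atomically.

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    /* Block until the configuration file is rewritten, then rebuild the cache. */
    static void watch_config(const char *path)
    {
        char buf[4096];
        int fd = inotify_init();
        inotify_add_watch(fd, path, IN_CLOSE_WRITE | IN_MODIFY);

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));   /* blocks until a change */
            if (n <= 0)
                break;
            /* reload_monitoring_config(path); -- re-parse the XML, recompute
             * the checksum and swap the in-memory cache if it differs */
            printf("configuration file changed\n");
        }
    }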

    Optimization 4: reducing the number of Lua engine invocations

    The Lua engine is called for every CCR-I, CCR-U, and RAR request processed by PCRF and executes the Lua script that describes the policy selection algorithm, since processing such a request is likely to change the subscriber's policy. The checksum idea found a use here as well. In version 3.6.1, we collect all the information that a real policy change can depend on into a separate structure and calculate a checksum over it. As a result, the engine is now invoked only when something has really changed.
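
    A minimal sketch of this idea, with illustrative field names and FNV-1a standing in for whatever checksum the product actually uses; run_lua_policy_script() is a hypothetical call into the engine.

    #include <stddef.h>
    #include <stdint.h>

    /* Everything a policy decision can depend on, gathered in one place.
     * Instances should be zero-initialized so padding bytes do not affect
     * the checksum. */
    struct policy_input {
        char     location[16];      /* cell / tracking area */
        uint64_t consumed_bytes;    /* reported traffic volume */
        uint32_t service_flags;     /* subscribed services */
    };

    struct session {
        uint64_t policy_checksum;   /* checksum of the last policy input */
    };

    static uint64_t fnv1a(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 0xcbf29ce484222325ULL;
        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    static void maybe_run_policy(struct session *sess,
                                 const struct policy_input *in)
    {
        uint64_t sum = fnv1a(in, sizeof(*in));
        if (sum == sess->policy_checksum)
            return;                 /* nothing relevant changed: skip the engine */
        /* run_lua_policy_script(sess, in); -- hypothetical engine call */
        sess->policy_checksum = sum;
    }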

    Optimization 5: removing the network configuration from the database

    The network configuration had also been stored in the database since the earliest versions of PCRF. In release 3.5.2, the application logic and the network part overlapped heavily on the tables with network settings: the logic module regularly read connection parameters from the database, and the network part used the database as the repository of all network information. In version 3.6.1, the information for the network part was moved into shared memory, and periodic jobs were added to the main logic to update it when changes appear in the database. This reduced the number of locks on shared tables in the database.
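
    A minimal sketch with POSIX shared memory; the segment name, the structure fields and load_network_settings_from_db() are illustrative assumptions, not the actual PCRF layout.

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct net_config {
        uint32_t version;           /* bumped on every refresh from the DB */
        uint32_t peer_count;        /* connection parameters, peers, ... */
    };

    /* Map the shared segment that the network part reads its settings from. */
    static struct net_config *map_net_config(void)
    {
        int fd = shm_open("/pcrf_net_cfg", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(struct net_config)) != 0)
            return NULL;
        void *p = mmap(NULL, sizeof(struct net_config),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Called periodically by the main logic when changes appear in the DB. */
    static void refresh_net_config(struct net_config *cfg)
    {
        /* load_network_settings_from_db(cfg); -- hypothetical DB read */
        cfg->version++;             /* readers see that the data was updated */
    }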

    Optimization 6: selective Diameter parsing

    PCRF communicates with external systems over the Diameter protocol, analyzing and parsing many commands per unit of time. These commands usually contain many fields (AVPs), but not every component needs all of them. Often only a few fields from the first (header) part of a command are used, such as Destination/Origin Host/Realm, or the fields that identify the subscriber or session, i.e. the ids (which also tend to be located near the beginning). Only one or two of the main processes use all of a message's fields. Therefore, in version 3.6.1, we introduced masks that describe which fields should be read for a given component. We also removed almost all memory copy operations: in effect, only the original message remains in memory, and all processes work with structures holding pointers to the parts they need.
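
    Here is a minimal sketch of the mask-plus-pointers idea over the AVP layout defined in RFC 6733; the mask constants and structure names are illustrative, and only Session-Id (AVP code 263) is shown.

    #include <stddef.h>
    #include <stdint.h>

    #define WANT_SESSION_ID  (1u << 0)
    #define WANT_ORIGIN_HOST (1u << 1)

    struct avp_ref {
        const uint8_t *data;        /* points into the original message */
        uint32_t       len;
    };

    struct parsed_msg {
        struct avp_ref session_id;
        struct avp_ref origin_host;
    };

    static uint32_t be32(const uint8_t *p)
    {
        return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
               (uint32_t)p[2] << 8  | (uint32_t)p[3];
    }

    /* 'buf'/'len' cover the AVP area of a Diameter message. Only the AVPs
     * requested in 'mask' are recorded, as pointer/length pairs, no memcpy. */
    static void parse_avps(const uint8_t *buf, size_t len,
                           uint32_t mask, struct parsed_msg *out)
    {
        size_t off = 0;
        while (off + 8 <= len) {
            uint32_t code    = be32(buf + off);
            uint32_t avp_len = be32(buf + off + 4) & 0x00ffffffu;
            size_t   hdr     = (buf[off + 4] & 0x80) ? 12 : 8;  /* Vendor-Id? */
            if (avp_len < hdr || off + avp_len > len)
                break;                                          /* malformed AVP */

            if (code == 263 && (mask & WANT_SESSION_ID)) {      /* Session-Id */
                out->session_id.data = buf + off + hdr;
                out->session_id.len  = avp_len - hdr;
            }
            /* ...other AVPs requested in the mask are handled the same way... */

            off += (avp_len + 3u) & ~3u;                        /* 4-byte padding */
        }
    }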

    Optimization 7: time caching

    When PCRF began to process more than 10,000 transactions per second, it became noticeable that logging was taking a significant share of time and CPU. It is sometimes tempting to sacrifice logs in favor of higher throughput, but the operator must be able to reconstruct the whole picture of what is happening on the network and on a specific component. So we sat down to analyze the logs and found that the most frequent element of a log entry is the date and time stamp; of course, it is present in every entry. Then, limiting the precision of the time to one second, we simply began to cache the string with the current time and regenerate it only when the next second starts.
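
    A minimal sketch of this caching (the real logger's format and thread handling are not shown; in a multi-threaded logger the cache would be per-thread or protected by a lock):

    #include <time.h>

    /* Return the formatted timestamp, re-rendering it only once per second. */
    static const char *log_timestamp(void)
    {
        static time_t cached_sec;
        static char   cached_str[32];

        time_t now = time(NULL);
        if (now != cached_sec) {               /* a new second has started */
            struct tm tm;
            localtime_r(&now, &tm);
            strftime(cached_str, sizeof(cached_str), "%Y-%m-%d %H:%M:%S", &tm);
            cached_sec = now;
        }
        return cached_str;                     /* reused within the same second */
    }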

    All seven of these optimizations will surely seem simple and obvious to an experienced developer of high-performance systems. They seemed that way to us too, but only once we had recognized and implemented them. The best solution often lies on the surface, yet it is also the hardest one to see. To summarize:
    1. Check that the data has really changed;
    2. Try to minimize the number of locks on entire tables;
    3. Cache configuration data and move it out of the database;
    4. Do only the work that is really needed, even when it seems easier to process the whole list.

