On implementing persistent processes in real-time control systems (part 1)

    Recently, persistence has become another fashionable term in information technology. Articles on persistent data appear regularly, dzavalishin is building an entire persistent operating system, and we, for a change, will share the material of a recent talk on persistent processes.

    Persistence, put simply, means independence from the state of the surrounding environment. In our opinion it is therefore quite legitimate to speak of the persistence of processes: their ability to keep running regardless of the state of the environment that spawned them, including failures at the lower levels. That, generally speaking, is one of the most important tasks in the development of real-time automatic control systems.

    The article offers a classification of the main levels at which the functions of a fault-tolerant control system are implemented, examines the failure modes characteristic of each level, and reviews specific technical solutions used at each level to ensure persistence.

    Depending on how the control system is implemented, its hierarchical model can be organized in various ways.

    For example, like this (scheme 1):
    Computing processes
    Specialized redundant equipment
    Resource communication environment
    External resources

    or like this (scheme 2):
    Computing processes
    Clustered system services and operating environments
    Host operating system
    Hardware and firmware
    Resource communication environment
    External resources

    or, theoretically, even like this (scheme 3):
    Computing processes
    Clustered application server
    System services and operating system
    Hardware and firmware
    Resource communication environment
    External resources

    If you confidently feel yourself the father of computing architectures, have an abundance (relative to the functional complexity of the task) of programmer and electronics-engineer manpower and creative potential, and, God forbid, bear serious legal responsibility for the results of using your system, then the first of these paths is meant for you: building a redundant hardware-software complex with a specialized architecture. This path has its roots in embedded systems and is a wonderful springboard for the careers of hardware engineers and low-level interface programmers. The author will try to cover this direction in more detail in one of the following articles (how we developed the transputer); here we restrict ourselves to the remark that, unfortunately,

    Let us immediately state our reservations about using an application server in high-availability control systems, as illustrated by the third scheme. For all its outward attractiveness to minds raised on information-system development and inexperienced in automatic-control tasks, this approach suffers from a number of hard-to-eliminate shortcomings. The main goal of modern application servers is load balancing and higher processing throughput, which traditionally conflicts with minimizing the reaction time (latency) that real-time systems require. Moreover, such application servers are highly complex and are themselves a vulnerable link from the standpoint of fault tolerance. Finally,

    Thus, in this article we will settle on the architecture of a cluster of virtual machines, illustrated above by scheme 2, and consider its main levels in more detail, moving from bottom to top.

    1. External resources

    Novice developers sometimes lose sight of the fact that the most vulnerable link in the control loop may well be the managed resources themselves, or other external objects. The situation is perfectly illustrated by an old joke:

    "I am the smartest!" said Wikipedia.
    "I will find anything!" said Google.
    "I am everything!" said the Internet.
    "Well, well," said Electricity, and... blinked.

    Taking the joke literally: if you have not provided the facility with power from two independent supply lines, or, say, arranged delivery of diesel fuel to the standby generator no slower than the UPS batteries run down, then all your successes in redundant server hardware are, from a fault-tolerance standpoint, purely cosmetic.

    Taking it less literally: always check whether your magnificent duplicated control circuit terminates in a single actuator or a single source of some resource, and if it does, decide what to do about it.

    The most advanced automatic control systems allow, when some mechanism of the managed system fails, an attempt to perform part of its functions using the remaining mechanisms, even ones intended for other purposes. For example, the terminal guidance system of a space rocket can compensate for a premature shutdown of the third-stage engine with additional burn time of the upper stage.
    This should not be understood to mean that the missile's terminal guidance system contains a special code branch "operation of the upper stage in case of a third-stage malfunction". Rather, the control loop is designed so that the capabilities of the various controlled subsystems overlap, and each of them tries to do the maximum possible toward the ultimate goal from the situation in which it actually finds itself.

    2. Communication environment with resources

    Besides the resources themselves, the communication environment between them is of fundamental importance. For us the most important media are, first of all, the facility's power supply system and the data transmission network.

    When designing the power supply of a high-availability facility, it is necessary to provide at least two physically separate runs of the supply network and to connect critical equipment to each supply line, either by duplicating the equipment or by fitting it with duplicated power supply units able to work from different circuits. These points seem obvious; nevertheless, in real life the author has seen an automation facility solving important tasks whose power was taken from two independent substations in such a way that the measuring equipment was powered entirely from one of them, and the computer complex controlling it from the other.

    Hot standby of data networks raises a number of problems that receive varying degrees of public attention.

    Alternative packet routes over redundant links are well supported by ordinary intelligent network equipment, except where non-standard lower-level protocols are used.

    Moving up the protocol stack, we must raise the question of data transfer protocols that are resistant to full or partial failures. Part of this question is the well-known TCP vs UDP flame war.

    The advantages of using TCP in control systems include:
    - automatic integrity and delivery control;
    - arbitrary size of the transmitted data.

    The advantages of using UDP in control systems include:
    - statelessness;
    - the ability to work over a half-duplex link;
    - quick return from calls*;
    - quick diagnosis of problems at the stack level and return of an error code.
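    The statelessness and "quick return" points can be seen in a tiny sketch (the message text is, of course, illustrative): a UDP socket needs no handshake, sendto returns immediately, and any wait is bounded only by a timeout the caller sets. Here the socket sends a datagram to its own address so the exchange is self-contained:

```python
import socket

# UDP keeps no connection state: a socket can send and receive datagrams
# immediately, and every wait is bounded explicitly by the caller.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))      # let the OS pick a free port
addr = sock.getsockname()

sock.settimeout(0.5)             # we, not the stack, bound the wait
sock.sendto(b"STATUS?", addr)    # returns at once; nothing is queued for retransmission

data, peer = sock.recvfrom(1024) # here: our own datagram, looped back
sock.close()
```

    Had the peer been silent, recvfrom would have raised socket.timeout after 0.5 s, letting the control loop react without waiting on any stack-level retry machinery.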

    Using TCP in real-time systems requires the developer to become familiar with the stack settings, primarily the tcp_keepalive family of parameters. Using UDP requires a clear understanding of how the ARP protocol is implemented (the caveat marked * above is connected with this). Using either protocol calls for creative command of the receive buffer size settings.
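    On Linux, the tcp_keepalive sysctls can be overridden per socket; a minimal sketch (the specific timings are illustrative, not a recommendation):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific per-socket overrides of the tcp_keepalive_* sysctls:
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # idle seconds before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before the connection is declared dead

# Worst case, a silently dead peer is detected in about 5 + 3 * 1 = 8 seconds,
# instead of the kernel defaults (~2 hours of idle plus 9 probes).
```

    Without such tuning, a TCP connection to a crashed peer can look healthy for hours, which is unacceptable in a control loop.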

    The statelessness of UDP becomes important when one of the parties to the exchange is restarted, including a restart on physically different equipment (a standby server).

    Separately, we must touch on the rarely covered half-duplex issue. Some common network environments are implemented such that, after a physical or logical violation of link integrity, data can still be transmitted from A to B but no longer from B to A. TCP cannot function under such conditions. UDP is able to maintain one-way communication across a one-way break (provided the underlying network equipment works correctly, and setting aside the ARP issues that arise when establishing the exchange).

    All in all, in the author's view, UDP with delivery confirmation or unconditional retransmission organized at the application level is better suited to carrying short control messages over IP networks in a fault-tolerant system. For transferring large volumes of data, TCP is appropriate, coordinated at the supervisory level, with connections opened only for a short time.
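    A minimal sketch of the favored scheme (function name, message format, and timings are all illustrative, not the author's actual protocol): a short command is sent over UDP and retransmitted unconditionally until the peer acknowledges it or the attempt budget runs out.

```python
import socket

def send_with_ack(sock, msg, addr, retries=3, timeout=0.2):
    """Send msg as a UDP datagram, retransmitting until an ACK arrives.

    Returns True on acknowledgement, False when all attempts fail.
    """
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(msg, addr)
        try:
            reply, _ = sock.recvfrom(1024)
            if reply == b"ACK " + msg:
                return True
        except socket.timeout:
            continue  # no reply in time: retransmit unconditionally
    return False
```

    The caller learns within retries * timeout whether the command got through, and a restarted (or standby) peer can acknowledge immediately, since no connection state has to be re-established first.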

    Continued: Part 2
