Space error: $370,000,000 for an integer overflow

    Liftoff.
    37 seconds of flight ... boom!
    10 years and $7 billion spent on development.
    Four one-and-a-half-ton satellites of the Cluster scientific program (studying the interaction of solar radiation with the Earth's magnetic field) and the Ariane 5 launch vehicle turned into confetti on June 4, 1996.
    And the programmers got the blame.



    The previous model, the Ariane 4 rocket, had been launched successfully more than 100 times. So what went wrong?

    To storm the heavens, you need to know the language of Ada well (in the Russian original, "Ada" also reads as "of Hell").


    Dossier


    Ariane 5 is a European expendable launch vehicle, part of the Ariane family (whose first launch took place in 1979). It is used to put medium or heavy spacecraft into orbit and can launch two or three satellites at once, together with up to eight microsatellites.

    Project History
    Developed in 1984-1995 by the European Space Agency (ESA); the main developer is the French National Center for Space Research (CNES). Ten European countries take part in the program, and the project cost 7 billion US dollars (France contributed 46.2%).

    About a thousand industrial firms took part in building the rocket. The main contractor is the European company Airbus Defence and Space (part of Airbus Group, Paris). Ariane 5 is marketed by the French company Arianespace (Evry), with which ESA signed an agreement on November 25, 1997.

    Features
    The Ariane 5 is a two-stage heavy-class launch vehicle. Length: 52-53 m; maximum diameter: 5.4 m; launch mass: 775-780 tons (depending on configuration).

    The first stage is equipped with a Vulcain 2 liquid-propellant rocket engine (Vulcain was used in the first three versions of the rocket); the second stage uses an HM7B (in the Ariane 5 ECA version) or an Aestus (in the Ariane 5 ES). Vulcain 2 and HM7B run on hydrogen and oxygen and are manufactured by the French company Snecma (part of the Safran group, Paris).

    Aestus uses storable propellants: nitrogen tetroxide and monomethylhydrazine. The engine was developed by the German company DaimlerChrysler Aerospace AG (DASA, Munich).

    In addition, two solid-fuel boosters are attached to the first stage; they provide more than 90% of the thrust at liftoff. They are manufactured by Europropulsion (Suresnes, France), a joint venture of the Safran group and the Italian company Avio. In the Ariane 5 ES variant, the second stage may be omitted when the payload is delivered to a low Earth orbit.


    On-board computers
    www.ruag.com/space/products/digital-electronics-for-satellites-launchers/on-board-computers

    Investigation


    The day after the disaster, the Director General of the European Space Agency (ESA) and the Chairman of the Board of the French National Center for Space Research (CNES) ordered the formation of an independent Commission to investigate the circumstances and causes of the failure. It included well-known experts and scientists from all the European countries involved.

    The Commission began work on June 13, 1996, and its comprehensive report (PDF) was published as early as July 19 and immediately became available on the Web.

    The Commission had telemetry data, trajectory data, and recordings of optical observations of the flight.
    The explosion occurred at an altitude of about 4 km, and the fragments were scattered over about 12 square kilometers of savannah and swamp. The testimony of numerous experts was heard, and the production and operational documentation was studied.


    Technical details of the accident


    The position and orientation of the launch vehicle in space were measured by the Inertial Reference System (IRS). Its core is an embedded computer that calculates angles and velocities from data supplied by the onboard Inertial Platform, which is equipped with laser gyroscopes and accelerometers. Data from the IRS was sent over a dedicated bus to the On-Board Computer (OBC), which supplied the information needed to execute the flight program and directly controlled, through hydraulic and servo drives, the solid-fuel boosters and the Vulcain cryogenic engine.



    To make the Flight Management System reliable, the equipment was duplicated. Two IRS units (one active, the other its hot standby) with identical hardware and software ran in parallel. As soon as the OBC detected that the “active” IRS had left its normal mode, it was to switch immediately to the other one. There were two on-board computers as well.
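
    A minimal sketch of this hot-standby arrangement, in Python rather than the language of the flight software. The names `InertialUnit`, `OnBoardComputer` and `read_attitude`, as well as the input limit of 100.0, are hypothetical and chosen only for illustration. The model is simplified: the standby is polled only after the active unit fails, whereas real hot-standby units run in parallel. It shows the switchover logic, and why duplication covers a hardware fault in one unit but not an input that the identical software in both units cannot handle.

```python
class UnitFailed(Exception):
    """Raised when an inertial unit is no longer in its normal mode."""


class InertialUnit:
    """Hypothetical stand-in for one IRS; both units run the same software."""

    def __init__(self, name, hardware_ok=True):
        self.name = name
        self.hardware_ok = hardware_ok
        self.running = True

    def read_attitude(self, sensor_value):
        # A unit that has stopped, or whose hardware is broken, stays stopped.
        if not self.running or not self.hardware_ok:
            self.running = False
            raise UnitFailed(self.name)
        # Identical software: the same out-of-range input stops every unit.
        # The limit of 100.0 is an arbitrary illustrative value.
        if abs(sensor_value) > 100.0:
            self.running = False
            raise UnitFailed(self.name)
        return sensor_value


class OnBoardComputer:
    """Reads the active unit and switches to the standby if it fails."""

    def __init__(self, active, standby):
        self.units = [active, standby]

    def navigate(self, sensor_value):
        for unit in self.units:
            try:
                return unit.name, unit.read_attitude(sensor_value)
            except UnitFailed:
                continue  # try the next unit
        return None, None  # no healthy unit left


# A hardware fault in the active unit is covered by the standby.
obc = OnBoardComputer(InertialUnit("IRS 2", hardware_ok=False), InertialUnit("IRS 1"))
hardware_case = obc.navigate(1.5)

# An input the shared software cannot handle stops both units.
obc = OnBoardComputer(InertialUnit("IRS 2"), InertialUnit("IRS 1"))
software_case = obc.navigate(500.0)
```

    In the first case the computer ends up reading IRS 1; in the second there is nothing left to switch to.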

    Key phases of the sequence of events





    Seven minutes before the scheduled launch, a violation of the “visibility criterion” was recorded, so the launch was postponed by an hour.

    At H0 = 9 h 33 min 59 s local time, the “launch window” was “caught” again and the launch finally took place. It proceeded normally until H0 + 37 seconds.

    In the following seconds, the rocket deviated sharply from its planned trajectory, and the flight ended in an explosion.

    At H0 + 39 seconds, high aerodynamic loads caused by an “angle of attack” of more than 20 degrees tore the boosters away from the main stage, which in turn triggered the rocket's self-destruct system.

    The change in the angle of attack was caused by abnormal deflection of the solid-fuel booster nozzles. At H0 + 37 seconds the nozzles were driven away from their correct orientation by a command that the On-Board Computer issued on the basis of information from the active Navigation System (IRS 2).

    Part of this information was fundamentally wrong: what was interpreted as flight data was in fact diagnostic information from the IRS 2 embedded computer. The IRS 2 embedded computer transmitted the wrong data because it had diagnosed an abnormal situation after catching an exception thrown by one of the software modules.

    At the same time, the On-Board Computer could not switch to the backup IRS 1, because it had already stopped working during the previous cycle (a cycle takes 72 milliseconds), for the same reason as IRS 2.

    The exception thrown by one of the IRS programs was the result of converting data from a 64-bit floating-point format to a 16-bit signed integer, which led to an Operand Error.
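
    To make the failing conversion concrete, here is a small Python sketch (not the flight code). The function name `to_int16` and the sample values are assumptions for illustration, and `OverflowError` stands in for the Operand Error. A 16-bit signed integer can only hold values from -32768 to 32767, so a 64-bit floating-point value outside that range has no valid representation.

```python
INT16_MIN = -(2 ** 15)      # -32768
INT16_MAX = 2 ** 15 - 1     #  32767


def to_int16(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising on overflow."""
    # int() truncates toward zero; the rounding rule of the real code
    # is not modelled here.
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


in_range = to_int16(12345.6)   # fits: becomes 12345

try:
    to_int16(40000.0)          # larger than 32767
    overflowed = False
except OverflowError:
    overflowed = True
```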

    The error occurred in a software component designed solely to perform the “adjustment” (alignment) of the Inertial Platform. Moreover, this software module produces meaningful results only up to H0 + 7 seconds, the moment the rocket leaves the launch pad. Once the rocket had lifted off, this module could have no effect on the flight.

    The “adjustment function” was in fact required by its specification to keep running for another 50 seconds after “flight mode” was initiated on the Navigation System bus (at H0 - 3 seconds), and it did so.
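
    The timing can be checked with simple arithmetic. The sketch below uses only the figures quoted above (flight mode at H0 - 3 s, a 50-second run-on, lift-off at H0 + 7 s, the faulty command at H0 + 37 s); the variable names are illustrative.

```python
# All times are in seconds relative to H0.
FLIGHT_MODE_START = -3   # "flight mode" initiated on the bus
RUN_ON = 50              # required run-on of the adjustment function
LIFT_OFF = 7             # rocket leaves the launch pad
FAILURE = 37             # faulty command issued to the nozzles

adjustment_stops = FLIGHT_MODE_START + RUN_ON
useless_run_time = adjustment_stops - LIFT_OFF
still_running_at_failure = FLIGHT_MODE_START <= FAILURE < adjustment_stops

print(adjustment_stops)          # 47
print(useless_run_time)          # 40
print(still_running_at_failure)  # True
```

    In other words, the function kept running for 40 seconds after its results had stopped mattering, and the failure fell inside that interval.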

    The Operand Error was caused by an unexpectedly large value of BH (Horizontal Bias), which an internal function calculated from the “horizontal velocity” measured by the sensors on the Platform.

    The BH value served as an indicator of how accurately the Platform was positioned. It turned out to be much larger than expected because the early part of the Ariane 5 trajectory differed significantly from that of the Ariane 4 (where this program module had been used before), which resulted in a much higher “horizontal velocity”.

    The final action, which had fatal consequences, was the shutdown of the processor. As a result, the entire Navigation System stopped working, and it was technically impossible to restart it.

    This chain of events was fully reproduced using computer modeling, which - together with materials from other studies and experiments - made it possible to conclude that the causes and circumstances of the disaster were fully identified.


    Causes and origins of the accident


    The original requirement to continue the adjustment operation after lift-off was laid down more than 10 years before the fateful event, when the earlier models of the Ariane series were being designed.
    In an unlikely scenario, the launch could be cancelled just a few seconds before lift-off, for example in the interval between H0 - 9 seconds, when “flight mode” started on the IRS, and H0 - 5 seconds, when a command was issued to perform certain operations on the rocket's equipment.

    If the launch was cancelled unexpectedly, it was necessary to return quickly to “countdown” mode without repeating all the setup operations from scratch, including returning the Inertial Platform to its initial position (an operation that takes 45 minutes, long enough to lose the “launch window”).

    It was reasoned that, if a launch was cancelled, a period of 50 seconds after H0 - 9 would be enough for the ground equipment to regain full control of the Inertial Platform without losing information. During this time the Platform would stop the movement it had begun, and the corresponding software module would record information about its state, which would help return it quickly to its initial position (this applies when the rocket is still on the launch pad). This feature was used successfully once, in 1989, on Ariane 4 flight number 33.



    However, Ariane 5, unlike the previous model, had a fundamentally different pre-flight sequence, so different that running the fateful software module after lift-off made no sense at all. Nevertheless, the module was reused without any modifications.

    Ada language




    The investigation showed that this program module contained as many as seven variables involved in type conversion operations. It turned out that the developers had analyzed every operation that could potentially throw an exception.

    They made a fully conscious decision to add proper protection to four of the variables and to leave three, including BH, unprotected. The decision rested on the belief that an overflow was impossible in principle for these three variables.

    This confidence was reinforced by calculations showing that the expected range of the physical flight parameters from which these variables are derived could not lead to an undesirable situation. And this was true, but only for the trajectory calculated for the Ariane 4.

    The new-generation Ariane 5 rocket, however, flew a completely different trajectory, for which no such estimates had been made. That trajectory, combined with the high initial acceleration, produced a “horizontal velocity” more than five times greater than the value calculated for the Ariane 4.
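
    The difference between a protected and an unprotected conversion can be sketched as follows, again in Python for illustration only. The function names, the clamping strategy used here as “protection”, and the numeric values are all assumptions; the text gives only the more-than-fivefold ratio, not the actual velocities or scaling.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def unprotected_convert(value: float) -> int:
    """Conversion that trusts the range analysis: raises if it was wrong."""
    result = round(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError("operand error")
    return result


def protected_convert(value: float) -> int:
    """Conversion with an explicit guard: out-of-range values are clamped."""
    return max(INT16_MIN, min(INT16_MAX, round(value)))


# Hypothetical figure: a value that fits comfortably for the old trajectory.
old_trajectory_value = 10_000.0
# The same quantity when the input is five times larger.
new_trajectory_value = 5 * old_trajectory_value

assert unprotected_convert(old_trajectory_value) == 10_000
assert protected_convert(new_trajectory_value) == INT16_MAX

try:
    unprotected_convert(new_trajectory_value)
    raised = False
except OverflowError:
    raised = True
```

    The guard adds extra comparisons to every call, and the unprotected version is only as safe as the range analysis behind it.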

    Protection was not provided for all seven variables (including BH) because a maximum workload of 80% had been declared for the IRS computer. The developers had to look for ways to cut unnecessary computational overhead, and they weakened protection where, in theory, an undesirable situation could not arise. When it did arise, an exception handling mechanism came into effect that turned out to be completely inadequate.

    This mechanism included the following three main actions.

    • Information about the failure was to be sent over the bus to the on-board computer (OBC).
    • In parallel, this information, together with the entire context, was to be written to reprogrammable EEPROM memory (which was recovered during the investigation so that its contents could be read).
    • The IRS processor was to be shut down.

    The last action turned out to be fatal. It was triggered in a situation that was actually normal (apart from the software exception raised by the unprotected overflow), and it led to disaster.
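
    The three actions above can be sketched as a single handler. This is an illustrative Python model, not the real flight code: `Processor`, `handle_exception`, `bus` and `eeprom` are hypothetical names, and plain lists stand in for the bus and the EEPROM.

```python
class Processor:
    """Hypothetical model of the IRS processor."""

    def __init__(self):
        self.halted = False

    def halt(self):
        self.halted = True


def handle_exception(error, context, bus, eeprom, processor):
    """Model of the specified reaction to an unhandled exception."""
    # 1. Report the failure to the on-board computer over the bus.
    bus.append({"type": "diagnostic", "error": str(error)})
    # 2. Store the error together with its context in non-volatile memory.
    eeprom.append({"error": str(error), "context": dict(context)})
    # 3. Stop the processor; no further navigation data is produced.
    processor.halt()


bus, eeprom, cpu = [], [], Processor()

try:
    raise OverflowError("operand error in BH conversion")
except OverflowError as exc:
    handle_exception(exc, {"variable": "BH"}, bus, eeprom, cpu)
```

    Step 3 runs whether or not the failing computation still matters to the flight, which is the behavior identified above as fatal.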



    Conclusions


    The Ariane 5 failure did not have a single cause. Throughout development and testing there were many stages at which the defect could have been identified.

    • The software module was reused in a new environment whose operating conditions differed from those the module had been specified for. These requirements were not revised.
    • The system detected and recognized the error. Unfortunately, the specification of the error handling mechanism was inadequate and caused the final destruction.
    • The faulty module was never properly tested in the new environment, either at the equipment level or at the system integration level. As a result, the design and implementation error went undetected.




    From the Commission's report:
    The main emphasis in the development of Ariane 5 was on mitigating random failures. The exception that occurred was due not to a random failure but to a design error. The exception was detected but handled inappropriately, because the view had been taken that software should be considered correct until it is shown to be at fault. The Commission takes the opposite view: software should be assumed to be faulty until the application of currently accepted best practices demonstrates that it is correct.


    A happy ending





    Despite the failure, four more Cluster II satellites were built and put into orbit by Soyuz-U/Fregat rockets in 2000.

    The launch failure drew the attention of the public, politicians and heads of organizations to the high risks of using complex computing systems, which helped increase investment in research on improving the reliability of safety-critical systems. The subsequent automated analysis of the Ariane code (written in Ada) was the first use of static analysis based on abstract interpretation in a large project.
