Phobos-Grunt. Lessons for Those Remaining on Earth



    Let me recall the background. On November 9 of last year, after almost 15 years of development, several project suspensions and launch postponements, a Zenit-2SB launch vehicle carrying the new Russian Phobos-Grunt spacecraft lifted off from Baikonur. The goals were very ambitious: to send an automatic station to Mars, reach its moon Phobos, and take soil samples that would then be returned to Earth. This would have been the first extraterrestrial material physically delivered to researchers since the lunar missions of the last century (pedants will point to the Japanese Hayabusa, which, thanks to the constant delays of Phobos, delivered a few microscopic dust particles several years before our spacecraft could). And since, according to current theory, Phobos is an asteroid captured by Mars, that is, a sample of the primordial material from which all the planets of the Solar System formed (the Moon, after all, is a “piece” of the Earth chipped off in the past, not a real “planet”), the expedition also had unprecedented scientific significance. It would also have been the first return of a spacecraft from Mars and its moon.
    Another important aspect was prestige: Russia's open return to deep space, to interplanetary research, which had ceased back in the Soviet era.

    Alas, the expedition ended very quickly. Immediately after launch it turned out that the spacecraft was stuck in its near-Earth parking orbit: it did not respond to commands and sat there in a “frozen” state, not executing its program. On November 24, attempts to restore it were officially stopped, and in January of this year the spacecraft made an uncontrolled entry into the dense layers of the atmosphere and fell into the ocean, fortunately without hitting anyone on the way down.

    A brief official report was published in February on the website of Roscosmos. Here is what it essentially says:

    The main provisions of the Conclusion of the Interdepartmental Commission for the analysis of the causes of an emergency that arose during the flight tests of the Phobos-Grunt spacecraft, formed in accordance with the order of the head of Roscosmos dated December 9, 2011 No. 206

    Source: http://www.roscosmos.ru/main.php?id=2&nid=18647

    [various possible causes considered and their sources are listed]
    An analysis by the commission's experts of possible failures of these systems and assemblies showed (taking into account their condition and the telemetry [TMI] at the time the off-nominal situation arose) that they could not have been its root cause.
    2.2. The cause of the off-nominal situation [NShS] was a restart of both half-sets of the TsVM22 unit of the BVK [onboard computing complex] (a double “restart”), which were controlling the Phobos-Grunt spacecraft during this flight segment. After the restart, in accordance with the operating logic of the BKU [onboard control complex], the nominal flight sequence of the Phobos-Grunt spacecraft was interrupted, and the spacecraft switched to a mode of maintaining constant Sun orientation and waiting for commands from Earth in the X-band, as the design provided for the cruise trajectory. [...]
    2.3. The most likely root cause of the double “restart” is a local impact of heavy charged particles [TZCh] of outer space, which caused a fault in the RAM of the computing modules of the TsVM22 sets during the second orbit of the Phobos-Grunt flight.
    The RAM fault could have been caused by short-term inoperability of electronic components [ERI] due to the impact of heavy charged particles on the cells of the TsVM22 computing modules, which contain two microcircuits of the same type, WS512K32V20G24M (the cells of the computing modules are located in a single housing, parallel to each other). The impact corrupted the program code and triggered the watchdog timer, which caused the “restart” of both TsVM22 half-sets. The model of such interaction between heavy charged particles and the electronic component base [EKB] is not regulated by normative and technical documents. The Commission considers it necessary to develop and introduce, in the organizations of the rocket and space industry [RKP], normative and technical documents containing modern models of ionizing radiation in outer space and guidelines for their use.
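
    The watchdog mechanism the report mentions can be sketched roughly as follows. This is a toy model, not the TsVM22 logic: the class, the tick-based timing and the three-tick timeout are all assumptions made for illustration. Healthy software "kicks" the timer every cycle; once the program is corrupted and stops kicking, the timer expires and forces a restart.

```python
class Watchdog:
    """Counts ticks since the last kick and reports expiry (hypothetical model)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.elapsed = 0

    def kick(self):
        # Healthy software calls this regularly to prove it is still running.
        self.elapsed = 0

    def tick(self):
        # Advance time by one tick; return True when the timeout has expired.
        self.elapsed += 1
        return self.elapsed >= self.timeout_ticks


def first_forced_restart(corrupted_at, total_ticks, timeout_ticks=3):
    """Return the tick at which the watchdog forces a restart, or None.

    The program kicks the watchdog on every tick before `corrupted_at`
    and never again afterwards (a stand-in for corrupted program code).
    """
    watchdog = Watchdog(timeout_ticks)
    for t in range(total_ticks):
        if t < corrupted_at:
            watchdog.kick()
        if watchdog.tick():
            return t
    return None


print(first_forced_restart(corrupted_at=5, total_ticks=20))
```

    If both half-sets are hit by the same event, both watchdogs expire at about the same time, which matches the double "restart" described in section 2.2.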


    From scattered and fragmentary information about how the onboard computers of Russian spacecraft are designed, one could gather that Phobos-Grunt was to use the new onboard computing complex, the BVK TsVM22, manufactured by Tekhkom, a division of KB Argon. It was the switch to the TsVM22 that accounted for the last delay and the slip of the launch from the previous launch window to the current one. For about two years Phobos was (among other things) being converted to the new, compact BVK, built on modern microelectronics and weighing not 30 kg like its predecessor but only 1.5 kg. And in space every gram, never mind every kilogram, is worth its weight in gold (putting a kilogram of cargo into even the lowest Earth orbit costs roughly 3000-4000 USD)! And a flight to Mars is a good deal more than insertion into low Earth orbit.
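
    As a rough back-of-the-envelope check of the figures above, the mass saved by the new BVK can be multiplied by the quoted launch cost. This is only a sketch: it uses the article's own approximate numbers (30 kg, 1.5 kg, 3000-4000 USD per kilogram to low Earth orbit) and ignores the extra cost of sending that mass onward to Mars. The function name is made up for illustration.

```python
def leo_launch_saving_usd(old_mass_kg, new_mass_kg, cost_per_kg_usd):
    """Launch cost avoided by the lighter unit, counting low Earth orbit only."""
    return (old_mass_kg - new_mass_kg) * cost_per_kg_usd


# Article figures: 30 kg old BVK, 1.5 kg new BVK, 3000-4000 USD per kg.
low = leo_launch_saving_usd(30.0, 1.5, 3000)
high = leo_launch_saving_usd(30.0, 1.5, 4000)
print(f"{low:.0f}-{high:.0f} USD")
```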
    It is not surprising that taking advantage of such savings was very tempting.

    On board Phobos there were two independent TsVM22 modules working in parallel and providing hot redundancy in case either module of the pair failed. Such duplication is common practice in aviation and space technology.

    In the wake of the general annoyance at the regular failures in the Russian space program of late, one even heard rather insulting and ridiculous rumors that mass-produced "Chinese" electronics had allegedly been used in Phobos, and that this is what failed. This is not the case.
    Here is what James Hamilton writes about the chip on his blog, in an article on the effect of memory failures on server hardware:

    These SRAMS are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512k memory by 32 bit wide access, “V” is the improvement mark, “20” for 20ns memory access time, “G24” is the package type, and “M” indicates it is a military grade part.

    (Note: SRAM, static RAM, is a memory chip whose cell, unlike that of ordinary DRAM, dynamic RAM, retains its state without being accessed and needs no refresh. It is widely used in industrial electronics. “512K32” means 512K words of 32 bits each.)

    Source: http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx
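
    The decoding Hamilton describes can be written down as a small parser. This is a minimal sketch that handles only the pattern of this one part number, WS512K32V20G24M. It is not a decoder for the vendor's whole catalogue, and reading "512K" as 512 × 1024 words is an assumption.

```python
import re

# Field layout follows Hamilton's description of WS512K32V20G24M only.
PART_RE = re.compile(
    r"^(?P<vendor>W)(?P<family>S)(?P<words>\d+)K(?P<width>\d+)"
    r"(?P<mark>V)(?P<access_ns>\d+)(?P<package>G\d+)(?P<grade>M)$"
)


def decode_part(part_number):
    """Split the part number into the fields named in the quoted article."""
    m = PART_RE.match(part_number)
    if m is None:
        raise ValueError(f"unrecognized part number: {part_number!r}")
    words = int(m["words"]) * 1024  # assumption: K means 1024
    width_bits = int(m["width"])
    return {
        "vendor": "White Electronic Design",
        "type": "SRAM",
        "words": words,
        "width_bits": width_bits,
        "total_bits": words * width_bits,
        "access_ns": int(m["access_ns"]),
        "package": m["package"],
        "military_grade": True,
    }


info = decode_part("WS512K32V20G24M")
print(info["total_bits"] // (8 * 1024 * 1024), "MiB,", info["access_ns"], "ns")
```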

    However, alas, even using genuine “white” American military-grade microelectronics was not enough.

    What we have here is the classic problem of insufficient design analysis and, taken more broadly, apparently low engineering competence overall. Laying out the two BVK boards so that their memory chips sat close enough for a single particle to hit both, causing the simultaneous (!) failure of both redundant computers, is an obvious top-level design flaw.

    This, apparently, is the classic “who made the suit?” problem from the famous Zhvanetsky-Raikin monologue: "Do you have any complaints about the buttons?" The computing complex may well be wonderful in itself. It is just that no one considered that placing two microcircuits side by side increases the likelihood of a damaging simultaneous radiation hit on both. No one looked at the overall assembly from that angle. Or, as the official report dryly puts it: "The model of such interaction between the TZCh [heavy charged particles ("cosmic rays")] and the EKB [electronic component base] is not regulated by normative and technical documents."
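
    Why proximity matters so much can be illustrated with the beta-factor model from reliability engineering, in which a fraction beta of each unit's failures comes from a cause shared with its twin. This is a sketch with purely illustrative numbers (the per-unit failure probability and the beta values are assumptions, not data from the report).

```python
def p_both_fail(p_unit, beta=0.0):
    """Probability that both redundant units fail, beta-factor model.

    p_unit: failure probability of one unit over the period of interest.
    beta:   assumed fraction of those failures that are common-cause,
            e.g. one particle hitting both adjacent memory chips.
    """
    common = beta * p_unit
    independent = (1.0 - beta) * p_unit
    # Either the shared cause strikes, or both units fail independently.
    return common + (1.0 - common) * independent ** 2


p = 1e-3  # illustrative only
print(p_both_fail(p))            # fully independent units: about p squared
print(p_both_fail(p, beta=0.1))  # 10% common-cause share: roughly 100x worse
```

    With fully independent units, redundancy turns a one-in-a-thousand event into a one-in-a-million one. Even a modest common-cause share brings the pair most of the way back to the reliability of a single unit.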

    But this, alas, is not all. Things look even worse where design competence is concerned.
    Surprising but true: back in 2005, the “Radiation Effects Data Workshop” proceedings published by the IEEE, devoted to radiation exposure and the effect of heavy charged particles on electronic components, explicitly noted:

    Recent SEE testing of 1M and 4M monolithic SRAMs at Brookhaven National Laboratories has shown an extreme sensitivity to single-event latchup (SEL). We have observed SEL at the minimum heavy-ion LET available at Brookhaven, 0.375 MeV-cm2 / mg.

    In other words, recent testing of 1M and 4M monolithic SRAM chips at Brookhaven National Laboratory showed them to be extremely sensitive to the latchup effect, which was observed even at the lowest heavy-ion linear energy transfer (LET) available at the Brookhaven accelerator, 0.375 MeV·cm²/mg.
    Source http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?reload=true&arnumber=1532657

    But these are the very chips that Tekhkom chose for the TsVM22! And this behavior has been known since at least 2005.

    Phobos, it seems, was doomed from the start. Sooner or later, in the rather harsh radiation environment of open interplanetary space, this effect would have struck. But while a latchup can in principle, with luck, be cured by a “cold reset” of the complex from the backup system, the simultaneous failure of both complexes (caused primarily by the layout, as the report indicates) proved fatal. As the report indicates, the failure was so “unanimous” that Phobos did not even send a failure message, and the Control Center, with foreign help, managed to obtain only rather fragmentary telemetry (apparently from the “dumb” automation), which indicated only the almost complete inaction of the onboard computer and the failure of the entire complex.

    A few words of explanation about the latchup effect mentioned above. It is a specific effect that causes a kind of “freezing” of an SRAM memory cell (as shown above, it occurs when a heavy charged cosmic-ray particle passes through). As a rule, the SRAM module must be powered off and on completely to restore its function, and sometimes the failure is irreversible.

    In the article “Did Bad Memory Chips Down Russian Mars Probe? Moscow blames radiation wreckage on an SRAM chip, but does it add up?”, published in the online magazine IEEE Spectrum, Steven McClure, head of the Radiation Effects Group at NASA's Jet Propulsion Laboratory (JPL, NASA's oldest space engineering division), states explicitly that NASA does not consider such SRAM chips for space hardware because of their low radiation tolerance, which is well known to specialists.

    Source: http://spectrum.ieee.org/aerospace/space-flight/did-bad-memory-chips-down-russias-mars-probe

    “The WS512K32 chip is well known and widely used in military and aviation equipment, but not in space hardware,” says McClure. “Neither its manufacturer nor the commercial vendors using this chip have conducted radiation testing or published standards and specifications for such effects on it. It could perhaps be used in space for short-duration tasks, in orbital vehicles and in non-critical roles, but not as a component of the main control computer of an interplanetary station that must operate in outer space for several years.”

    The article also notes that, for some unknown reason, the Phobos algorithms did not allow for a failure like this one occurring in near-Earth orbit, which is exactly where the misfortune happened. In the event of such a failure the spacecraft switches to so-called safe mode, in which it uses dumb, non-computer automation to point the solar panels at the Sun so the batteries do not discharge, and turns on the command radio link to receive commands from Earth (“gives a console”) with which the system can be restored.

    The automation worked: the spacecraft oriented itself correctly and turned on the radio on the emergency channel. However, the algorithm did not provide for a failure (and, accordingly, for receiving commands over the emergency channel) during the orbit-insertion stage. A failure, and the corresponding intervention from Earth, was provided for only from the moment of departure onto the cruise trajectory.
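
    The gap described above can be illustrated with a toy model of the safe-mode logic. Everything here is hypothetical: the phase names, the action strings and the `accept_in_all_phases` switch are invented for illustration and do not come from the flight software. The sketch only shows how a recovery path gated on flight phase leaves the parking orbit uncovered.

```python
from enum import Enum, auto


class Phase(Enum):
    PARKING_ORBIT = auto()  # near-Earth orbit, before departure
    CRUISE = auto()         # interplanetary cruise trajectory


def safe_mode_actions(phase, accept_in_all_phases=False):
    """Return the actions taken on entering safe mode in the given phase."""
    # The simple automation runs regardless of phase.
    actions = ["orient_solar_panels_to_sun", "enable_emergency_radio"]
    # As the article describes it, ground intervention was provided for
    # only once the spacecraft was on the cruise trajectory.
    if phase is Phase.CRUISE or accept_in_all_phases:
        actions.append("accept_ground_commands")
    return actions


print(safe_mode_actions(Phase.PARKING_ORBIT))
print(safe_mode_actions(Phase.PARKING_ORBIT, accept_in_all_phases=True))
```

    In this model the spacecraft in parking orbit orients itself and switches on the radio but never reaches the step that would let ground control restore it, which is the behavior the article describes.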

    The article cited puts it rather harshly: “The release of the official accident investigation results on February 3 served only to further rumors of fundamental hardware and software design flaws, and of blatant violations of safety standards.”
    Source: http://spectrum.ieee.org/aerospace/space-flight/did-bad-memory-chips-down-russias-mars-probe

    That is, the official report of February 3 only fed further rumors of fundamental errors in the hardware and software and of gross violations of safety standards (during development).

    James Hamilton comments on how such an absurd, on the face of it, design error comes about:

    “This mistake is striking. Reasonable people, it would seem, could never have made it; the error is obvious and lies on the surface. Nevertheless, such errors are made in large systems again and again. Experts, each in their own field, do a good job, but the interactions between such “vertical” segments (separately: designing the computing complex, placing it in the spacecraft, programming it, and developing the “cyclogram”, the sequence of operations performed at launch and in flight. Author's note) turn out to be complex, and if the overall understanding of the product and of the “cross-vertical” relations within it is not deep enough, such design flaws can go unnoticed (see above on the problem of “who made the suit?”. Author's note). Good specialists create good components, but when all the components are joined into an integrated system, problems show up here and there between the components and in their interaction.

    Often, good “vertical” specialists do not see the product as a whole, knowing only their own component well. There are two solutions: 1) well-defined and well-documented interfaces (in the broad sense) between components, whether hardware or software, and 2) dedicated, experienced and knowledgeable engineers who deal specifically with the interaction of components and the operation of the system as a whole. Appointing a technically unqualified manager to such a position, as sometimes happens, is often ineffective.

    The problems and errors caused by “complexity blindness” are often very serious and, at the same time, depressingly obvious in hindsight, as in the example considered above.”


    PS. A few years ago I had a chance to talk with a graduate of the Moscow Aviation Institute who had done his pre-diploma internship at the Tupolev Design Bureau. He spoke enthusiastically about the specialists he had met there. “The old-timers are real bison, with off-the-charts experience, walking reference books and encyclopedias, but they are all pensioners, and they are simply dying out. The average age in the design bureau is close to 60. Everyone is either about to retire or a working pensioner. Anyone younger is a rare enthusiast, yesterday's student, who lasts two or three years and then flees those salaries and that hopelessness for business or management. And the kind of students MAI is turning out now... In the “middle” there is no one.”
    I think the situation in the space industry is not much different. And stories like this one are the result.

    PPS. I thought for a long time about whether Habr needed an article like this, and where to post it at all. But then the occasion came up: people were discussing the Phobos story, retelling rumors from TV and cursing “crappy Russia” as usual, and it seemed to me that someone would be interested in how it ended and how things really were.

    PPPS. I deliberately wanted to confine myself to facts and do without the hysteria already familiar on Habr about “embezzlement”, “blame”, “Skolkovo” and “the enemies of Russia pump poison gas into our sockets and shoot down our Mars stations with their CIA radars”. Only facts and the direct words of specialists.

    UPD: The comments provided a link to an open letter from a former leading specialist at NPO Lavochkin, our main space engineering center, where, among other things, Phobos was designed and built.
    open-letter.ru/letter/26645
    Everything is much as one would expect from what was said in the article above:
    “I want to mention the artificial division of the design bureau. It is divided into Centers, each of which has its own director, deputies, planning departments and so on. And this organization has led to real disunity in the once unified design bureau.
    [...]
    Hence, on the one hand, duplication of services (for example, several departments work on gearboxes, each in its own way), and on the other hand, the same drive is designed in three Centers: the control unit in one, the electrical part in another, and the mechanics in a third. And none of these parts wants to understand the others.”


    The letter was addressed to Deputy Prime Minister S. B. Ivanov in March 2011.
