[updated] Curiosity problems: causes and current situation
- Translation
As many of you have probably heard, last week the Curiosity rover, which was busy analyzing samples it had drilled, ran into problems with its main on-board computer. Let's see what exactly happened and how JPL specialists plan to solve the problem.
According to NASA experts, the memory damage on board Curiosity may have been caused by cosmic radiation. As a reminder, last Thursday engineers had to switch Curiosity to its spare computer, for reasons that experts attribute to damage to an area of the rover's memory.
The rover team is now checking telemetry data and running diagnostic tests to understand what exactly went wrong and how to return the system to working condition.
“We were in a rather strange situation - our software was working, but only partially, so we decided to switch to a “clean” version of the on-board software running on “clean” hardware,” said Curiosity project manager Richard Cook. “The easiest way to do this is to just start using the spare computer.”
Curiosity is equipped with two computers with the uncomplicated names A and B, each of which can be used to control the rover. Computer B was used during the flight to Mars, and after landing, the rover switched to computer A, and has been using it ever since.
[Curiosity's on-board computers are RAD750s: radiation-hardened single-board computers built around the processor of the same name. The processors are manufactured using a 250- or 150-nm process, can withstand radiation doses of up to 1,000,000 rad, operate at temperatures from -55 to 125 degrees Celsius, and consume about 5 watts. A complete system consisting of the processor and the motherboard can withstand up to 100,000 rad and temperatures from -55 to 70 degrees. Each computer has 256 kilobytes of EEPROM, 256 megabytes of RAM, and 2 gigabytes of flash memory. Of course, this is not very impressive in 2013, but compared with the hardware of the previous generation of rovers the increase in performance is very large - translator's note.]
The switch from the primary to the backup computer occurred around 5:30 pm EST (GMT-5) last Thursday. After that, the rover went into so-called “safe mode”. Over the next few days, engineers will continue connecting computer B to all on-board systems and restoring the rover's normal operation.
This is the most serious problem Curiosity has faced since landing.
“Most likely, we will soon return to normal operation,” Cook said. “And yet, this is not the most pleasant experience - you see, the rover is an extremely complex device. It takes only a trifle for something to go wrong, and we have to take this into account all the time.”
The problem first appeared on Wednesday morning. Control center staff noticed data that appeared to indicate damage to the rover's flash memory. The on-board software was not writing any new data to memory and refused to transmit data recorded earlier. The only information that could be obtained from the rover was real-time telemetry.
Later the same day, during a communication session via the MRO satellite, telemetry showed that the memory corruption was still present. It also turned out that the computer had skipped some pre-programmed actions - it was supposed to go into sleep mode for an hour and then wake up for the next communication window with the Odyssey satellite.
MRO (left) and Odyssey (right) satellites
“During the second pass, we received some information that briefly boiled down to the following: ‘Hey guys, the memory is still damaged, and besides, I did not go to sleep when I was supposed to, I was awake all this time!’” said Cook.
The next communication window was between 22:30 and midnight on the same day (in the time zone of the JPL control center). The rover computer was still working, and the engineers decided to switch to system B.
At the same time, Cook noted that the rover's memory was designed from the start to be resistant to errors caused by cosmic rays or radiation. However, everything indicated that the most sensitive area of memory had been damaged - the directory that records where particular data is located.
“Without going into details, we have several layers of protection. The memory itself is self-correcting, and the software is designed to be tolerant of data corruption. We believe that we were extremely unlucky - we got errors in precisely those areas of memory that are most sensitive to them.”
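To make the idea of "self-correcting" memory more concrete, here is a minimal sketch of a Hamming(7,4) code, a textbook scheme that stores three parity bits alongside every four data bits and can repair any single flipped bit. This is only an illustration: the article does not say which error-correcting code Curiosity's memory actually uses, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is the 1-based position of the bit that was flipped
    back, or 0 if the parity checks found nothing wrong.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # repair the single flipped bit
    return [c[2], c[4], c[5], c[6]], error_position


# Simulate a single "cosmic ray" hit on a stored word.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # flip the bit at position 5
recovered, position = hamming74_decode(stored)
print(recovered, position)
```

A code like this repairs one flipped bit per word, but two flips in the same word are silently "corrected" into wrong data. That is one way to picture how protected memory can still end up corrupted when the hits land in an unlucky place.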
[As a reminder, the rover software itself has several levels of response to an emergency. In case of especially serious problems, the rover usually goes into “safe mode”: it stops all activities and waits for the next communication window to report the problem to the control center and receive further instructions - translator's note.]
“Thus, we simply lost the information about which data is located where. I repeat - in theory, the rover software should be tolerant of errors of this kind, but we were in a situation where some of the software worked as expected, and some began to fail while waiting for data in memory to change - the software simply could not work out where to read this data from.”
Cook noted that the chances of cosmic rays causing this kind of problem are extremely low, but this has happened before.
“Imagine an address book that is full of entries. Instead of damaging one of these entries, cosmic radiation damages the table of contents. This is an extremely rare occurrence, but - alas - such things sometimes happen.”
If this hunch is correct, restarting the main computer should fix the problem. However, the engineers are in no rush - they are conducting a detailed analysis of the situation to be sure of the cause before taking any action.
“Of course, we can use computer B, and it is every bit as capable as the main one. So in the coming week we will configure the software of the second computer to make sure that all systems work as they should.”
“In the end, we plan to return to the main computer. If the problem is really memory corruption, it will disappear by itself at boot, as the on-board software will rewrite the partition table from scratch.”
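Cook's address-book analogy and the idea of rewriting the table "from scratch" can be sketched in a few lines. The example below is a toy model, not a description of Curiosity's flight software: it assumes a directory (the "table of contents") protected by a CRC32 checksum, and a boot routine that rescans the records and rebuilds the directory when the checksum does not match. All names and the storage layout are hypothetical.

```python
import json
import zlib


def pack_directory(index):
    """Serialize a name -> record-number index and compute its CRC32."""
    payload = json.dumps(index, sort_keys=True).encode()
    return payload, zlib.crc32(payload)


def directory_is_valid(payload, checksum):
    """Check the stored directory bytes against the stored checksum."""
    return zlib.crc32(payload) == checksum


def rebuild_directory(records):
    """Rescan every record and build a fresh index from scratch."""
    return {name: number for number, (name, _data) in enumerate(records)}


def boot(records, payload, checksum):
    """Use the stored directory if it is intact, otherwise rebuild it."""
    if directory_is_valid(payload, checksum):
        return json.loads(payload.decode())
    return rebuild_directory(records)


# The records themselves are intact; only the directory gets damaged.
records = [("telemetry", b"\x01\x02"), ("images", b"\x03\x04")]
payload, checksum = pack_directory(rebuild_directory(records))

damaged = bytearray(payload)
damaged[5] ^= 0x01  # a single flipped bit in the "table of contents"

print(directory_is_valid(bytes(damaged), checksum))
print(boot(records, bytes(damaged), checksum))
```

In this model the records are never lost. Only the map that says where they are is damaged, which is why a clean boot that regenerates the map is enough to recover.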
NASA experts expect Curiosity to be able to continue its scientific research over the next few days.
Update
Today (03/03/2013) NASA announced that Curiosity is again in “active” mode. It is expected to recover fully and resume scientific research next week.
The rover exited safe mode on Saturday, and on Sunday it resumed using its HGA (high-gain antenna) to communicate with Earth.
“The recovery process is going well,” said Richard Cook, already familiar to us. “It consists of two parts. First, we want to understand exactly what happened to computer A, and second, we need to carry out a number of operations with computer B, for example, telling it about the state of the rover - the current position of the arm, the mast, and so on.”
However, the exact cause of the memory failure is still being determined.
Please report any errors and typos by private message!
As usual, many thanks to Zelenyikot for the material found.