Chip aging accelerates
As chips built on advanced process nodes move into cars, and as usage models change in data centers, new questions are arising about their reliability.
Reliability is becoming a key differentiator for new chips entering markets such as automotive, cloud computing and the industrial Internet of Things, but it is getting harder and harder to prove that a chip will work as expected over its intended lifetime.
In the past, reliability was usually considered the foundry's problem. Chips designed for computers and phones were expected to run at peak performance for an average of two to four years of normal use. After that, their functionality began to degrade, and users upgraded to the next revision of the product, which boasted new features, higher speed and longer battery life. But as chips are developed for new markets, or for markets that previously used less sophisticated electronics (automotive, machine learning, the Internet of Things, the industrial Internet of Things, virtual and augmented reality, home automation, cloud, cryptocurrency mining), reliability is no longer just one item on a long checklist.
Each of these end markets has unique needs and characteristics that determine how, and under what conditions, chips are used. That, in turn, has a serious impact on aging, safety and other factors. Consider the following:
- Reliability is no longer measured simply in years. Usage patterns are changing dramatically. A car today may sit idle 90-95% of the time, while an autonomous vehicle may sit idle only 5-10% of the time. This affects how the electronics are designed, as well as the underlying business model for developing the technology.
- Definitions of what counts as “functional” or “good enough” are changing as electronics become more complex. In the past, a cracked or dirty camera on a drone or robot was simply replaced. With more sophisticated electronics in the device, the effect of a cracked lens can be compensated for so that the device stays within functionally adequate operation. On the other hand, what was acceptable in less complex systems may now be unacceptable because system tolerances have tightened.
- Modeling degradation and quality depends on far more factors than before, some of which may not be obvious when the chip is designed. For example, a chip known to be good may behave differently when combined with other chips or devices on a printed circuit board.
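The idle-time figures in the first point can be turned into operating hours with simple arithmetic. The sketch below is only an illustration: the 15-year service life and the function name `powered_on_hours` are assumptions made for the example, not figures or terms from any manufacturer.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def powered_on_hours(idle_fraction: float, years: float) -> float:
    """Hours the electronics are active over `years` of service.

    idle_fraction is the share of time the vehicle sits unused (0.0 to 1.0).
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return (1.0 - idle_fraction) * years * HOURS_PER_YEAR


# Assumption for illustration: both vehicles stay in service for 15 years.
conventional = powered_on_hours(idle_fraction=0.95, years=15)
autonomous = powered_on_hours(idle_fraction=0.05, years=15)

print(f"conventional car:   {conventional:,.0f} h")
print(f"autonomous vehicle: {autonomous:,.0f} h")
print(f"ratio:              {autonomous / conventional:.0f}x")
```

Under these assumptions the conventional car accumulates about 6,570 operating hours and the autonomous vehicle about 124,830, a factor of 19, even though both are the same age in calendar years.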
The way electronics are used is changing. This is happening even in data centers, which historically have been the most conservative in adopting new technologies and methodologies.
“Aging depends on clock speed and power, but in the past servers were switched on to do a job and then spent most of their time in standby,” said Simon Segars, CEO of ARM. “With the move to the cloud, we need to change the design criteria, because the cloud is based on continuous use. That raises a lot of questions about how to properly design a chip for long-term operation.”
At the start of the millennium, average server utilization was 5-15%, a level that had persisted since the 1990s because IT staff, fearing equipment failures, were reluctant to run more than one or two applications on a single server. Two things changed that. First, the cost of energy began to rise. Second, and more important, companies reorganized so that IT departments, rather than facilities departments, became responsible for the cost of the energy used. Both factors drove sales of virtualization software, which increased server utilization and reduced the number of servers that had to be powered and cooled.
The cloud takes operational efficiency to a new level. The goal is to maximize utilization by balancing compute jobs across the entire data center. That makes it possible to raise utilization for all servers, not just those in the same rack, and to quickly power down those that are not needed at the moment. This approach is energy-efficient, but it has a serious impact on the degradation and aging of electronic circuits.
“We are seeing aging accelerate to the point where the chip fails completely,” said Magdi Abadir, vice president of marketing at Helic. “Chips start missing a beat, or their jitter increases. Or there is dielectric breakdown. And every time something breaks down, there is a whole avalanche of other things that need to be taken care of. Many aging models were developed at a time when electronics were used only intermittently. Now the chips are on all the time. Blocks heat up inside the chip, which accelerates aging. As a result, you can run into all kinds of strange phenomena. Many companies have not updated their aging models. They assumed their devices would last three to four years, but failure can come sooner. Deviations from the original design may be small at first, but aging makes them worse.”
The trend toward higher utilization is spreading to cars, and will continue until fully autonomous vehicles replace human drivers. Autonomous vehicles process more and more information, part of which streams in from sensors such as radar, LiDAR and cameras. All of this data has to be processed faster than before and with greater accuracy, which puts a heavy load on the electronics.
“The minimum lifetime for ADAS (advanced driver assistance systems) is 15 years, which is far longer than the 2-5 years for earlier modules,” said Norman Chen, chief technology officer at ANSYS. “Aging is not just about operating hours. There is also NBTI (negative bias temperature instability, which shifts the threshold voltage), electromigration, which can be temperature-related, ESD (electrostatic discharge) and thermal coupling.”
Figure: Temperature simulation for chip and package.
Many automotive suppliers already make chips that can withstand extreme temperatures, mechanical vibration and various levels of noise, but CMOS chips built on advanced process nodes have never been subjected to such stresses over long periods. Numerous industry sources confirm that, to process all of this data, automakers are developing chips at 10/7nm so that their designs do not become outdated too quickly, because those designs often have to serve several successive generations of vehicles. The problem is that there is not enough real-world data on how reliable these devices are over long periods of environmental exposure.
“You have to design differently,” said Segars. “One argument is that we will ultimately need fewer cars, because they will hardly ever sit idle. But the flip side is that autonomous vehicles will be used more and wear out faster. Everything wears out eventually. The challenge is to make sure the electronics do not wear out faster than the mechanical parts, and that requires a different design. You have to consider everything from more careful treatment of noise to minimizing surges.”
Thinner insulation, thinner substrates
One of the ironies of improving reliability is that it runs counter to five decades of progress aimed at shrinking features every couple of years for economic reasons. Shrinking usually means thinner dielectrics and wires, along with higher dynamic power. Increasingly, it also means thinner substrates. At the most advanced nodes, the result has been more current leakage, more noise, more electromigration and other effects.
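The link between thinner wires and electromigration can be illustrated with Black's equation, a widely used empirical model in which median time to failure scales as MTTF = A · J^(-n) · exp(Ea / kT), where J is current density and T is absolute temperature. The sketch below compares two operating points so that the constant A cancels out. The exponent `n = 2` and the activation energy of 0.9 eV are assumed, illustrative values; real values depend on the metal stack and have to come from foundry data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(j_ratio: float, temp_ref_c: float, temp_new_c: float,
                      n: float = 2.0, activation_energy_ev: float = 0.9) -> float:
    """Return MTTF_new / MTTF_ref according to Black's equation.

    j_ratio is J_new / J_ref (current density relative to the reference).
    n and activation_energy_ev are assumed, illustrative parameters.
    """
    if j_ratio <= 0:
        raise ValueError("j_ratio must be positive")
    t_ref = temp_ref_c + 273.15
    t_new = temp_new_c + 273.15
    thermal = math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                       * (1.0 / t_new - 1.0 / t_ref))
    return j_ratio ** -n * thermal


# Same current through a wire with half the cross-section doubles J.
print(em_lifetime_ratio(j_ratio=2.0, temp_ref_c=100.0, temp_new_c=100.0))
# The same wire running 10 degrees hotter as well.
print(em_lifetime_ratio(j_ratio=2.0, temp_ref_c=100.0, temp_new_c=110.0))
```

With `n = 2`, halving a wire's cross-section at constant current cuts the modeled lifetime to a quarter before any temperature rise is taken into account, which is why thinner wires and hotter blocks compound each other.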
“From the circuit point of view, you have to deal with process variation somehow,” said Andre Lang, Quality and Reliability Manager at Fraunhofer EAS. “But from the system design point of view, you need to consider how the system will cope with known defects. Take autonomous vehicles: they have a central processor that has to decide which information from which sensor to use. One of the sensors may get dirty or fail.”
This makes degradation modeling harder, because it has to be done in the context of the entire system. “Most parts of the system contribute to the degradation of electronic circuits, whether through NBTI, rising defect density or process variation,” Lang said. He noted that another big challenge is finding the cause of a defect without processing all of the available data, because the volume may be excessive.
Figure: An example of what can go wrong.
Process variation increases with each new node. Over the past decade, smartphones set the pace (the iPhone appeared in 2007). Today, the biggest users of advanced nodes are servers for data mining, machine learning, AI and the cloud.
The relationship between process variation and reliability is well documented, but variation makes it harder to model aging effects accurately. As a result, several different approaches to the problem have emerged, ranging from complex statistical modeling and simulation to placing sensors on the die or inside the package.
“You need to track the temperature rise at the heat source using a random-walk approach that works both locally and globally,” said Ralph Iverson, chief engineer for research and development at Synopsys. “With a random walk, the voltage is averaged, so the delta is zero.”
This helps build models, but resistivity at 5nm and below does not always stay constant, Iverson said. Surface effects play a role, and the data is not always representative of a copper contact, so more localized data is required. This is where a hybrid approach is starting to emerge, because that level of uncertainty is hard to describe abstractly.
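The article does not describe how the random-walk approach mentioned above is implemented, so the following is only a textbook illustration of the general idea, not the method used in any commercial tool. It estimates the steady-state temperature at one point of a square grid by launching random walks from that point and averaging the fixed edge temperatures they end on, which approximates the solution of the discrete Laplace equation. The grid size, the edge temperatures and the names `estimate_temperature` and `hot_left_edge` are all assumptions made for the example.

```python
import random


def estimate_temperature(start, size, edge_temp, walks=4000, seed=0):
    """Monte Carlo estimate of the steady-state temperature at `start`.

    The domain is a (size + 1) x (size + 1) grid whose outer edges are held
    at the temperatures returned by edge_temp(x, y). Each walk steps to a
    random neighbor until it reaches an edge; the mean edge temperature over
    all walks is the estimate.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    total = 0.0
    for _ in range(walks):
        x, y = start
        while 0 < x < size and 0 < y < size:
            dx, dy = rng.choice(steps)
            x, y = x + dx, y + dy
        total += edge_temp(x, y)
    return total / walks


def hot_left_edge(x, y):
    # Hypothetical boundary: heat source at 100 C on the left edge,
    # the other three edges held at 25 C.
    return 100.0 if x == 0 else 25.0


center = estimate_temperature(start=(5, 5), size=10, edge_temp=hot_left_edge)
print(f"estimated temperature at the center: {center:.1f} C")
```

By symmetry, a walk started at the center of the square is equally likely to end on any of the four edges, so the estimate should settle near (100 + 3 × 25) / 4 = 43.75 °C, with some statistical noise. One attraction of the approach is that it gives the value at a single point of interest without solving for the whole grid.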
“In the automotive world, everything works well at BiCMOS, but now there are requests for advanced CMOS,” said Mick Tegetof, marketing director at Mentor, a Siemens Business. “We are seeing growing interest from manufacturers, and EDA companies are already simulating chip aging under stress. Is that enough? Any model is only an attempt to approximate the real world. You run the simulation and do everything you can to create a chip that has to work for a long time, but then you need to go back to physical testing and, for example, put the chips in an oven to apply physical stress. We are seeing more and more electronics being tested.”
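Oven testing of the kind described above is commonly interpreted with the Arrhenius model, in which thermally activated failure mechanisms speed up by a factor AF = exp[(Ea / k) · (1 / T_use − 1 / T_stress)], with temperatures in kelvin. The sketch below is a minimal illustration under assumed values: the 0.7 eV activation energy and the 55 °C and 125 °C temperatures are placeholders, and the model ignores voltage and other stress factors.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(use_temp_c: float, stress_temp_c: float,
                        activation_energy_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor of the oven relative to field use.

    activation_energy_ev is an assumed value; it differs per failure mechanism.
    """
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / t_use - 1.0 / t_stress))


def equivalent_field_hours(oven_hours: float, use_temp_c: float,
                           stress_temp_c: float,
                           activation_energy_ev: float = 0.7) -> float:
    """Field operating hours represented by `oven_hours` at the stress temperature."""
    return oven_hours * acceleration_factor(use_temp_c, stress_temp_c,
                                            activation_energy_ev)


# Illustrative numbers only: 1,000 oven hours at 125 C, 55 C in the field.
af = acceleration_factor(use_temp_c=55.0, stress_temp_c=125.0)
print(f"acceleration factor: {af:.1f}")
print(f"equivalent field hours: {equivalent_field_hours(1000.0, 55.0, 125.0):,.0f}")
```

The result is very sensitive to the assumed activation energy, which is one reason simulation and physical testing have to be checked against each other.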
Analog versus digital
So far, aging modeling has focused on digital circuits. Analog adds a whole new dimension to aging.
“Companies are well versed in aging and process variation for chips located close to the engine compartment, so they are not flying blind,” said Oliver King, CTO of Moortec. “But analog circuits are much more variable. A digital chip will simply stop working. Analog can start to work a little worse, a little less accurately, so you have to adapt to that. Analog designers traditionally have not pushed geometries as aggressively as digital designers. Electromigration is still a problem, as is current density, but the aging effects are less pronounced. Even so, chips need to be designed more proactively, taking into account the chip's condition and whether some action needs to be taken.”
Frank Ferro, senior director of product management at Rambus, concurs. “The main problem for PHYs is ambient temperature. As it rises, performance starts to drift, so recalibration is required. On the consumer side there is the so-called ‘Christmas test’. That is when a PlayStation or other device has been stored in a garage in cold weather, you switch it on on Christmas morning, and it has to go from cold to operating mode instantly. The same applies to memory systems in cars and base stations. Aging affects these systems, and they have to be recalibrated to cancel out the negative impact.”
Ferro said PHYs go through the same tests as digital components, including forced-failure tests and voltage and temperature variation. But PHYs are designed to adapt to those variations, which is quite difficult to build into digital circuits, especially at advanced nodes where variation affects power and speed.
Analog circuits are often designed around so-called “mission profiles”. A specific function in an autonomous vehicle, for example, defines the mission profile for an integrated circuit designed for that vehicle.
“One of the big problems we have run into is that these devices can be used in different ways,” said Art Schaldenbrand, chief marketer at IC and PCB Group. “A device can fail in many ways. We choose various stresses designed to make it fail. Temperature instability may cause 10% of devices to fail, but that is the worst case. We need better ways to express chip degradation. For finFETs, the stresses will differ from those for planar devices, so we have to model different phenomena.”
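One simple way to make a mission profile concrete is linear damage accumulation (often called Miner's rule): each operating phase uses up lifetime in proportion to the hours spent in it divided by the lifetime the part would have if it stayed in that phase continuously. The sketch below is an assumption-laden illustration, not the method of any vendor quoted here. The phase names, hours and per-phase lifetimes are invented, and real wear-out mechanisms do not always add up linearly.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24


@dataclass(frozen=True)
class Phase:
    name: str
    hours_per_year: float
    # Hypothetical lifetime if the part ran in this phase continuously.
    lifetime_hours: float


def damage_per_year(profile) -> float:
    """Fraction of total life consumed per year under linear damage accumulation."""
    total_hours = sum(p.hours_per_year for p in profile)
    if total_hours > HOURS_PER_YEAR:
        raise ValueError("profile covers more hours than a year contains")
    return sum(p.hours_per_year / p.lifetime_hours for p in profile)


def years_to_wearout(profile) -> float:
    """Years until the accumulated damage fraction reaches 1.0."""
    return 1.0 / damage_per_year(profile)


# Invented numbers, for illustration only.
profile = [
    Phase("hot, full load", hours_per_year=876.0, lifetime_hours=8_760.0),
    Phase("normal driving", hours_per_year=4_380.0, lifetime_hours=87_600.0),
    Phase("parked, standby", hours_per_year=3_504.0, lifetime_hours=876_000.0),
]

print(f"damage per year:  {damage_per_year(profile):.3f}")
print(f"years to wearout: {years_to_wearout(profile):.1f}")
```

In this invented profile the short hot phase consumes twice as much life per year as normal driving, although it accounts for only a fifth as many hours, which shows why the profile matters more than the calendar age.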
Packaging and other unknowns
With Moore's Law slowing, more and more companies are turning to advanced packaging for better performance and design flexibility. It is not yet clear how to model advanced packaging for stress and aging. One difficulty is the sheer number of packaging options, with no one yet knowing which will win. Another is the relative newness of some of these technologies. Only time will show what happens inside the packages.
“The package layers may be too close to other components, or to stresses from the other side,” said Helic's Abadir. “All of this needs to be modeled. And aging has to be modeled well before the device wears out, because the number of factors affecting operation is growing. So placement becomes important. If you start moving components around the design, you change the resonant frequency. There are no simple rules for this. You have to analyze the whole design, and if you run into a problem, you may have to move something.”
Complex designs also show other anomalies that can eventually affect reliability. Some usage patterns switch circuits on and off more often than others, which stresses them more.
“If something is in standby for too long, it will age differently from other circuits,” said Jusan Syai, chief software architect at Cadence. “And the smaller the device, the stronger the aging effect. The stresses will be higher and aging will be faster.”
How to address all of these problems is not yet fully understood. Some of them will clearly require new materials and technologies.
“Power electronics is moving from silicon devices to SiC and GaN, which can operate at higher switching frequencies, more efficiently and at higher temperatures,” said John Perry, director of industrial marketing for electronics at Mentor. “In some cases, that will allow the power electronics to be placed closer to the motor, which means a higher-temperature environment. In other cases, using semiconductors that can withstand higher temperatures means less cooling. However, the semiconductors have to be packaged, and the package must also withstand high temperatures. A lot of money is being poured into new technologies, for example sintered silver, which is used as a die-attach material, and clips instead of wire bonds.”
What is changing is the attitude toward the fact that aging, stress and other effects cause more and more problems at advanced nodes, or when devices are used for extended periods in safety-critical markets.
“The starting point is that customers are asking us questions today,” Lang said. “Their starting points differ from customer to customer, but the questions come up often. Many are only just beginning to deal with this issue. They are facing higher voltages or temperatures, and they are running experiments to extrapolate the impact of overstress. But understanding exactly how degradation will affect the whole circuit is harder. Much more needs to be done for complex chips.”
But as attitudes change, so does the effort people put into solving these problems. Chip designers are only beginning to think about modeling degradation and aging. As with power electronics a decade ago, all of that will change soon.