The story of one Crash and NSLog who treated him

    I am treating Crashs and NSLogs. Inexpensive. Many years of experience. 100% guarantee.

    Something like this headline could describe what happened three and a half months ago at my project. Rather, it was not even my project, but I had to deal with the crash problem.

    It all started with the fact that on one of the relatively large projects, an exception began to fall out during user authorization. “Well, what is it? Everyone happens. They forgot to put a check on nil or messed up somewhere. "Also, to me, a big event - crash on the project", - will think aboutThe majority of programmers. In principle, I absolutely agree. Crash is not such a rare occurrence in programming for the iPhone, and you come across it ten times a day. But this one was special. He already began to smell of “magic” when he was told about him some parameters and features:

    • Simulator reproducibility : 100%
    • Playability on the device : 0%
    • Crash path (after crash localization): ~ 40 seconds
    • Compilation optimization settings (-O1, -O2 ...) do not affect reproducibility
    • XIBs are not used in the project


    Yes, he looked pretty harmless:

    // Code
    UITextView * textView = [ [UITextView alloc] initWithFrame:CGRectMake(0, 150, _width, _height)];
    // Exception
    *** Terminating app due to uncaught exception 'CALayerInvalidGeometry', 
        reason: 'CALayer bounds contains NaN: [0 0; nan 200]'


    "Well, right then, a hedgehog understands that width - after calculating - NaN!", I thought. Having a quick glance at where and how the width of the view is calculated, and not finding anything particularly dangerous, I set NSLog before creating the view to confirm it And in addition, a breakpoint on the line with the creation of the element.
    // Source:
    NSLog(@"width = %f", _width);
    //Output:
    width = 200
    


    “Um,” I thought to myself, and continued to run the program after the breakpoint. And the crash did not happen ...



    Test project in which this Crash is reproduced


    For those people who are interested in looking at this beast themselves - a test project lies with me .
    It is reproduced exclusively on iOS Simulator, SDK Version 5.1 (it doesn’t play on SDK 6+, but if there are good people who will reproduce - write in a personal email, post it)

    If you still think that something is trivial here, then read it calmly further. When we get to the fun, I'll stop and give a warning about the spoilers.

    What have I done?


    If you think logically, then everything becomes clear. Since crash disappeared, this is because I changed something (because before that, reproducibility was 100%). The question was only what exactly had changed.

    1. I added some code before creating the view
    2. I added NSLog
    3. I paused the stream
    4. I accessed the class variable _width , before passing it as a parameter
    5. ....

    If you still have ideas what I could change - you can write down. Then check if the guesses were correct. But, most likely - hardly;)


    All items were easily checked. I removed all the changes and started experimenting.

    I paused the stream

    The idea here is quite simple. Most likely - the error occurs due to multithreading.
    This was the easiest to check; I set a breakpoint without NSLog 'a. The code continued to crash, with the same 100%.

    I accessed the class variable _width or I added some code before creating the view

    The idea of ​​these points is in the “features” of code generation, which, miraculously, could lead to unforeseen consequences.
    No dances with a tambourine and additional, randomly generated, lines of code with access to the variable _width did not give a result. The application crashed steadily.

    I added NSLog

    “No, you’re not serious, are you?”
    It turned out to be serious. NSLog continued to "cure" the crash. Stable Regardless of the parameters passed to it. More remarkable features were discovered. The application continued to crash , even if you did not use _width , _height as parameters. Without NSLog, any value passed to - (id) initWithFrame: would crash the application with the same error. Among other things, it turned out that it was not necessary to install NSLogright before creating the view. It can be put at the very beginning of the method where the view was created ... or in the method that called this method ... or even in the method that called the method that called ..., you understand, in general. The main thing is that before creation.

    Success!


    That's it. The problem was solved". The blocker is eliminated, and nothing prevents you from calmly testing the application on a simulator. And also, there was a lot of work that needed to be done, and I had to switch to it.

    Is the problem solved?


    This question haunted me for several evenings, while I was finishing the main project. Is the problem resolved, or did the “cure” with NSLog simply remove the symptoms of some larger problem that lurks and waits for the project to go on release? And then, when everyone forgets about it, and for its fixation it will be necessary to wait a week for a submit in the AppStore, will it manifest itself and begin to break and destroy everything, but not on the simulator?

    Magic?


    On the one hand, I understood that the problems were in my code. And, most likely, this is not a UIKit library error . Familiar programmers said so - look for a problem in your code. And I agreed with them. But how to look for her? The project is rather big; attempts to localize the problem in several lines fail. I wanted to drop everything, say “magic”, and score. And live with a sense of magic in the project. And then, because, if you know how magic works, any interest is lost, everything becomes ordinary, according to the documentation.

    Nevertheless, the interest in the cause of the problem overpowered, and I began to dig. By this time, my ideas revolved around the desire to understand: so what is NSLog doing so interesting, and how does he "cure" the fall of the program? There were a lot of options both with me and in the local chat of developers. In the end, they agreed that the most likely cause of crash is stack stack problems . This move did not explain the reason for stable operation on devices, but it was necessary to start somewhere. But before I started looking towards the stack, I tried to figure out where, after all, does this NaN arise ? . Just at that time, the familiar darkproger was actively telling right and left how cool the dtrace thing is . And after a few hours, using this tool, I was able to slightly reduce the search for the cause of the fall.

    Dtrace magic


    At that moment, all I had was a stacktrace to the crash site. The problem was somewhere inside him, well, or something led to the fact that all this fell safely.
    #4	0x01d8b04a in -[CALayer setBounds:] () <---- NaN lives here
    #5	0x02d1d714 in WebCore::TileGrid::updateHostLayerSize() ()
    #6	0x02d1af26 in WebCore::TileCache::TileCache(WAKWindow*) ()
    #7	0x02d52507 in -[WAKWindow initWithLayer:] ()
    #8	0x002ee5e9 in -[UIWebTiledView initWithFrame:] ()
    #9	0x001af9b5 in -[UIWebDocumentView initWithWebView:frame:] ()
    #10	0x001af89e in -[UIWebDocumentView initSimpleHTMLDocumentWithStyle:frame:preferences:groupName:] ()
    #11	0x0012cb6a in -[UITextView commonInitWithWebDocumentView:isDecoding:] ()
    #12	0x0012bf0e in -[UITextView initWithFrame:] ()
    


    Using a relatively simple script for dtrace, I was able to find out where the ill-fated NaN appears . It turned out that the parameters are "lost" right at the very end:

    QuartzCore`-[CALayer setBounds :]                   <------ (0, 0, NaN, height)
    WebCore`WebCore::TileGrid::updateHostLayerSize()+0x140
    WebCore`WebCore::TileCache::TileCache(WAKWindow*)+0x1c6
    WebCore`-[WAKWindow initWithLayer:]+0xd7
    UIKit`-[UIWebTiledView initWithFrame:]+0xec "       <------ (0, 0, width, height)"
    UIKit`-[UIWebDocumentView initWithWebView:frame:]+0xb5
    UIKit`-[UIWebDocumentView initSimpleHTMLDocumentWithStyle:frame:preferences:groupName:]+0xfe
    UIKit`-[UITextView commonInitWithWebDocumentView:isDecoding:]+0x198
    UIKit`-[UITextView initWithFrame:]+0x70 "            <------ (0, 0, width, height)"
    


    What to do? Apparently the problem is not in our code! With a heightened sense of self-importance, I'm already preparing to send a bug report to Apple. And I climb to see how the stack is doing there .

    Spoiler alert!

    If you were still going to solve this problem yourself, then it's time to stop reading this article;)
    For those who are going to figure it out themselves, and think that the problems are working with the stack, or vice versa, they think that the problems are not connected at all with stack, small spoiler
    Do not read me!
    The problem is not stack related at all *


    Stack troubles?


    How in Xcode to see what is currently going on in the stack? What parameters are passed to the function? How are they transmitted and by what rules? And through the stack?
    This is just a small part of the questions that at that time I had no answers, but darkproger helped again (in general, Vova very diligently and patiently listened, or pretended to listen, all my delusional and not very ideas of cause and effect;) poking my nose into a very, very useful documentation . And to be more specific, then to a specific part of it . After a quick reading of the technotes, I managed to display the contents of the stack.

    However, these data did not help my understanding of what was happening. The fact is that the stack was identical in both cases. No difference. All values ​​in memory were exactly the same and stack registers referred to the same memory addresses. Checked several times, restarted the application, checked memory addresses ±. Everything coincided.

    What's next?


    As I said, all the values ​​in the memory addresses match. But something was still different. Something was different. True, the search for this "something" temporarily faded into the background - I found the ill-fated NaN .



    It was located in one of the st registers of the FPU processor. Since I did not know what these registers were, I had to climb into the documentation from Intel . According to the documentation, the register group st0-st7 is a cyclic stack for the FPU, through which the processor instructions responsible for floating point operations work. Everything began to clear up. Someone puts the NaN value on the stack in the FPU , and then removes it as a result of the operation. It remains only to find out who does this.

    About an hour I killed on a piece of magic that suddenly appeared, which was explained by a simple "read the documentation carefully." This magic manifested itself as follows. The “ fstp ” instruction worked incorrectly (page 190). The command should perform the following operations: Take the current value from the top of the FPU of the st0 stack , write it to the specified address (in my case, in $ ebp-0x64 ), and push the value from the FP stack.
    But only after the execution of this instruction, the address $ ebp + 0x64 turned out to be 0, instead of the desired value of 320.

    Before executing the fstp command


    After executing the fstp command


    Attentive experts probably already see the problem. Those who want to verify this behavior themselves can download the test project and put a breakpoint on the WebCore :: TileGrid :: updateHostLayerSize () function . Inside it is easy enough to find the right piece. Whoever is simply interested in the reason for my plugging - feel free to open the spoiler.
    And $ ebp-0x64, they say, is not real!
    The reason is data types. The fstp command writes a 40-bit floating-point number to memory, which is commonly referred to as long double in common people . In this example, in the debugger, I look at the correct address, but I only show not 40 bits, but 32 bits. Everything fell into place when I indicated the correct data type. An error that cost me an hour and a half to search might not have occurred if I had read the documentation more carefully. Moral: read the documentation. It is useful and saves time.


    Stack overflow?


    After I figured out fstp , and fld . I began to look for differences again. And since I already knew for sure that the reason was not related to the stack, I concentrated on the values ​​of the processor registers. And there were differences. And it even became clear where NaN comes from . First, I looked at the $ ftag register , the value of which immediately told me about the presence of something bad in the FPU stack registers (page 182).


    This could, of course, be observed simply by looking at $ st0- $ st7 , but $ ftag allows you to find out what state the registers are currently in - which are used, which are busy, and which are empty. The example shows that all the registers are busy (valid), and only one of them has something special.

    But finally, all questions regarding magic were dispelled when I saw the values ​​of the register $ fstat (x87 FPU Status Register, p. 178).



    Stack overflow


    Stack overflow in FPU processor registers. Actually stack overflow by itself does not cause any exceptions, and just sets a flag in the control word that an exception has occurred.
    If the invalid-operation exception is masked, the x87 FPU returns the floating point, integer, or packed decimal integer indefinite value to the destination operand, depending on the instruction being executed. This value over- writes the destination register or memory location specified by the instruction.

    But, if we use the value of the register with the invalid state, we get NaN , which, once in memory, comes as a parameter when creating CALayer .

    So, I went down to the very lowest level of the problem. I knew exactly where the magic NaNs come from , in the program, and about stack overflow in FPU stack registers. That's just the answer, where and when this happens I did not have. The idea was again born in my head that it messes with UIKit , but you can’t make a claim without proof. On the other hand, I had absolutely no idea how the code could call the FPU Stack Overflow (the existence of which I learned that day).

    Culprit


    In a relatively short time, a function was nevertheless localized, which was the culprit of everything that happened. Here he is, the culprit of everything that happens!
    // ACLabel : UILabel
    - (CGFloat)resizeToContents {
       CGSize size = [self.text sizeWithFont:self.font
                      forWidth:self.frame.size.width
                      lineBreakMode:self.lineBreakMode];
       CGRect oldFrame = self.frame;
       oldFrame.size.height = size.height;
       self.frame = oldFrame;
       return size.height;
    }
    


    Trying to find something scary? Try. I tried too. Long and tedious. I rewrote the method, renamed it, made it a category. And all in vain. This method left garbage in FPU Stack'e. Moreover, magic has taken on a new form. The same method left garbage in the stack every other time. That is, "sometimes" the method normally "took" the value from the top of the stack, and sometimes he completely forgot about it, and ignored it, thereby approximating the crash. So what is wrong with this code?

    On systems based on the IA-32 architectures, when an application calls a function that returns a floating-point value, the returned floating-point value is supposed to be on the top of the floating-point stack. If the return value is not used, the compiler must pop the value off of the floating-point stack in order to keep the floating-point stack in the correct state.


    “How I feel better about it!” (Sarcasm). Now the guilt vector has smoothly shifted from UIKit 'and again to the compiler. After all, he should remove the value from the FPU stack! A must, but does not. Rather, it does ... but not always. So when?

    One step to victory


    And I will answer this question for you. But first, I have a small question: “Are the lines shown in the example identical?”
    // Пример безобидного вызова метода
    // Вариант №1
    [obj someMethod];
    // Вариант №2
    [obj perfromSelector:@selector(someMethod)];
    

    A simple question like that. He is even asked at interviews. And you can’t even say right away that they can work in different ways. After all, in fact, this is one and the same thing, only a side view. True, the example is greatly simplified due to the lack of the fact that these methods return. It is implied that they return void . And if the methods returned id , or int or float . Would something change?
    While you are thinking about the answer to the question, I will give a small excerpt from the documentation:
    The performSelector: method is equivalent to sending an aSelector message directly to the receiver. For example, all three of the following messages do the same thing:

    id myClone = [anObject copy];
    id myClone = [anObject performSelector:@selector(copy)];
    id myClone = [anObject performSelector:sel_getUid("copy")];
    

    . For methods that return anything other than an object, use NSInvocation.


    When the lead is in bold, then, of course, they notice it. The documentation quite clearly hints that if suddenly a method should return something other than id , we need to use NSInvocation . But I will say that even if you return not an object, then everything works. It worked, anyway. Until that ill-fated day.

    Profit


    Probably, many have already guessed what is happening here. Everything is quite simple (especially after reading the article, well, or two and a half days of games with dtrace, debugger, digging in the documentation and so on).

    1. The compiler sees the method that the float should return and generates the code correctly, which, after calling the method, removes the returned value from the FPU stack.
    2. В случае с использованием метода performSelector, компилятор не знает, что возвращает метод, поэтому надеется на добросовестность программиста — считает, что метод будет возвращать id.
    3. В случае, если в коде используется performSelector для метода, который должен вернуть float (даже если не использовать возвращенный результат), то компилятор не знает о том, что надо снимать возвращенное значение со стэка FPU. Так что, при каждом вызове return-float-метода через performSelector, «забивается» один из регистров стэка FPU.
    4. After eight such calls (8 st-registers), the processor sets a flag that the stack is full, and some ordinary floating-point operations begin to return NaN as the result. This, in turn, leads to anything. Up to the point that a dinosaur can run behind your back.


    Moral : Read the documentation carefully. If something works contrary to the documentation - don’t really find that it will not cause big problems.

    NSLog


    “Hey, hey, so what's up with NSLog?” Why does NSLog “cure” this problem? ”
    NSLog uses floating point operations. Moreover, using the poke method, it was found that for its operation two registers are needed in the FPU stack. And the FPU stack, if fully populated, sets the appropriate overflow flag. And in an operation that removes the value from the stack, marks one of the stack registers as "free to use."
    // Tag word  (state of st0 - st7 registers)
    // 00 - Used
    // 10 - Invalid
    // 11 - Unused
    // before NSLog
    st0 st1 st2 st3 st4 st5 st6 st7
    10  00  00  00  00  00  00  00 
    // After NSLog
    st0 st1 st2 st3 st4 st5 st6 st7
    11  11  00  00  00  00  00  00 
    


    In this way , the Invalid state disappears from the register st0 , because of this, when trying to take a value from it, no one receives NaN , and does not pass it further as a parameter to the UITextView .

    So, NSLog does not cure the problem. He lets her live a little longer, delaying the death of the application until the time, codenamed "Suddenly."

    Here is such a “magic." Now you can safely write a line in your resume
    I am treating Crashs and NSLogs. Inexpensive. Many years of experience. 100% guarantee.


    I hope it was not only interesting, but also in some places - informative.

    PS Thanks again to darkproger and vixentael for helping and listening to my thoughts.

    Also popular now: