Notes from another engineering investigation

    We had the opportunity to conduct another small but extremely instructive tactical lesson

    This post was inspired by the Sherlock Ohms newsletter, which periodically publishes stories about non-trivial engineering problems that came up while diagnosing various electronic devices. So I thought, why not? I understand perfectly well that the subject is quite specific, requires narrow specialist knowledge and is unlikely to interest a wide audience, but it may give a few pleasant minutes to a small circle of connoisseurs of hardware riddles. So, for those who know what a data bus is and how it is built, here is a story in which there will be ships, and shoes, and sealing wax, and cabbage palms.

    While designing a device based on the 1986BE1T microcontroller (MCU), which I have already written about, we needed to talk to external FLASH memory chips over a fairly fast interface, preferably a parallel one. Fortunately, this MCU provides exactly that: to access memory-mapped devices that are not part of the MCU itself, you can use the full external bus (32 address bits, 32 data bits, 2 control signals, 4 tracking signals), and the exchange is completely transparent to the user program. As always, thanks to the developers for including this option and, as always, a complaint about the clearly insufficient documentation, although that is not what this post is about. Because of a number of design and controller constraints, we did not use the full width of the data bus but only 8 bits, starting from the 3rd, which is entirely irrelevant to the problem described here. The MCU and all of its output buffers are powered from +3.3 V; in addition, for the sake of other devices, the data bus lines were pulled up to +5 V through 2 kΩ resistors.

    After the prototype was assembled, debugging began, and the first test cases (naturally) did not work. So we took an oscilloscope and went to poke at the prototype (the plural is used from here on, since this work was done together with a young colleague who, for some reason, categorically refuses to write posts on Habr). And then an interesting phenomenon turned up. The expected waveform of the signals on the data bus should look like this (the red and green colors are not my choice):



    The fragment of the waveform marked with the number 2 is the expected behavior of the bus when no external device is addressed (this is achieved by driving the chip-select input to its inactive state). The MCU releases the data bus (black line in the upper diagram) and the voltage on it starts to be pulled up to the supply through the pull-up resistor. After a while, the MCU asserts the read signal (active low, the green signal in the bottom diagram), and at that moment the external device should drive its data onto the bus (since it is inactive, the pull-up simply continues). After a certain time the read signal is deasserted and the external device releases the bus; the state of the bus after that is undefined, and in our case the pull-up continues.

    All of this is logical and understandable, but what we originally observed was the waveform in fragment 1. In that case, before asserting the read signal, the MCU drove a high level onto the bus and kept holding it for the whole duration of the read and even longer; only after a considerable time (on the order of milliseconds) did the data bus go into the high-impedance state. This was somewhat unexpected, but at first I did not give the situation proper attention: I decided that there was an error somewhere in the pin settings and that it had to be found (since the program was written by my young colleague, it was easy for me to assume it contained errors; had I written it myself, the situation would not have been so clear-cut :)). The firm belief in a configuration error disappeared once we established that ALL 16 data lines (out of 32) were configured identically, yet only 4 of them misbehaved: bits 4, 5, 8 and 11.
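
    To keep track of which lines misbehave, it helps to express the observation as a bit mask. The following is only a minimal sketch, not code from our project: the names `FAULTY_BITS` and `stuck_high_lines` are hypothetical, and the helper simply reports the lines that read as 1 where 0 was expected.

```c
#include <assert.h>
#include <stdint.h>

/* Bits reported as faulty in the text: 4, 5, 8 and 11 (mask 0x0930). */
#define FAULTY_BITS ((1u << 4) | (1u << 5) | (1u << 8) | (1u << 11))

/* Hypothetical helper: returns a mask of the data lines that were
 * observed high although a low level was expected on them. */
static uint32_t stuck_high_lines(uint32_t expected, uint32_t observed)
{
    return observed & ~expected;
}
```

    For example, if 0x0000 was expected and 0x0930 was observed, the helper returns 0x0930, which is exactly the set of bits 4, 5, 8 and 11.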

    We think further and experiment. One idea is that a read must not follow immediately after a write (the documentation says nothing of the kind, but when working with Milandr parts we are already used to guessing), so we do 2 consecutive reads in the hope that the second one will come out right.

    data=*buffaddr;
    data=*buffaddr;
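
    For the two reads to really reach the bus, the pointer has to be declared `volatile`, otherwise the compiler is free to merge them into one. Below is a minimal sketch under that assumption; the helper name `read_twice` and its `first` parameter are hypothetical and are not part of our firmware, and the 16-bit width is taken from the halfword loads the compiler generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: performs two consecutive reads of the same
 * memory-mapped location and returns the result of the second one.
 * The result of the first read is stored in *first if first is not NULL. */
static uint16_t read_twice(volatile const uint16_t *buffaddr, uint16_t *first)
{
    uint16_t data;

    assert(buffaddr != NULL);

    data = *buffaddr;        /* first read cycle */
    if (first != NULL) {
        *first = data;
    }
    data = *buffaddr;        /* second read cycle; volatile keeps it separate */
    return data;
}
```

    On the target, `buffaddr` would point into the external bus address range; on a PC an ordinary `volatile uint16_t` variable can stand in for the device.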
    


    And here the most interesting part begins: the second read really does come out right, BUT the first one becomes correct too. A very interesting phenomenon, and I absolutely cannot imagine its mechanism; that is, I cannot imagine a reasonable way for the next instruction to influence the previous one. A quick look at the generated assembler code gives a hint: the address of the first read instruction has changed because of the way the linker lays things out. That is better, since it is easier to think of a mechanism by which the address affects the execution of an instruction. To investigate the MCU's behavior, we extract the fragment that deals with the external bus from the main program and remove everything unnecessary. And we get another surprise: the incorrect read is not observed even with a single read, although the address of the instruction has not changed. Putting the deleted fragments back one by one, we find that the incorrect read is observed when the CRC16 calculation function is linked in and is not observed when it is absent, even though this function obviously does not touch the external bus and cannot influence the read in any reasonable way. Further experiments showed that what matters is not the CRC16 function as such, but the block of intermediate sums it contains and, moreover, the size of that block. That is, with the code:

     static CRC16Buff[256]; /* the error is observed, whereas with */
     static CRC16Buff[215]; /* (215 and fewer) there is no access error */
    


    How can this fragment affect code executed in a completely different place? We find that the only thing that changes is the value of the stack pointer, since the amount of space required for global variables has changed. So it turns out that the incorrect access happens when the instruction is executed from certain addresses with certain stack pointer values, and the number of erroneous bits in the word is small? It is time to recall the engineer's first rule: "There are no miracles in the world." We can suppose that this is a leftover of some VHDL debugging feature that signaled certain situations and was never removed from the release. It looks like the idea of a developer who had smoked something strong, but so far there is no other hypothesis, since we reject divine intervention. Another thought, "so that is what you look like": we have found a hardware BACKDOOR, albeit a rather pointless one.
    We continue our research and are amazed to find that moving the instruction to different addresses (by adding NOPs) changes nothing: for the various stack pointer values the error neither appears nor, correspondingly, disappears, so the address hypothesis has to be rejected. But how, then, does adding the second read affect the first one? We look at the assembler code more closely and find further differences, namely, for a single read the compiler generates

    mov r0, sp
    ldrh r1,[r4]
    strh r1,[r0]
    


    And for two consecutive reads it performs an optimization:

    mov r2, sp
    ldrh r1,[r4]
    strh r1,[r2]
    ldrh r1,[r4]
    strh r1,[r2]
    


    It was hard to believe, but we go on to establish that the incorrect read takes place if and only if register r0 contains a very specific value, and it does not matter whether this register is used afterwards. Compared to the previous, absolutely crazy hypothesis about a relationship between the stack pointer and the program counter, this is clear progress. Further experiments show that the forced (erroneous) high level is observed on those data bits where ones were written in the last write cycle and where register r0 also holds ones. Moreover, the phenomenon clearly behaves like a latch: it occurs on the first read after a write and persists for a certain time, and this time is not related to the MCU clock frequency in any way (within the measurement error) but shows a pronounced dependence on the die temperature (the hold time increases as the temperature rises). One can assume that the control signal of the high-side stage of the data bus output buffer sits at an inactive level and that the signal from the corresponding register bit is induced on it until the capacitance is recharged by leakage currents. The hypothesis is good, but unfortunately it does not explain the latch-like behavior; if anyone comes up with a more suitable explanation, please comment.
    As for the practical part, the bottom line is this: before reading data we write zero into register r0, and the bus behaves as it should, which is confirmed by the oscillogram above, obtained with the following code

    mov r0, #0xFFFFFFFF ; this is pseudocode for the operation
    ldrh r1,[r4] ; the error is observed here: fragment 1
    strh r1,[r4] ; write with the ones set; the write signal is the blue line in the bottom diagram
    mov r0,#0x00000000
    ldrh r1,[r4] ; and here there is no error: fragment 2
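
    The rule we arrived at experimentally can be written down as a small model. This is only a sketch of our empirical observation, not a description of the actual silicon: the function names are hypothetical, and the assumption that a forced-high line simply reads back as 1 is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Empirical rule from the experiments: on the first read after a write,
 * a data line is forced high if its bit was 1 in the last written word
 * AND the same bit is 1 in register r0. */
static uint16_t predicted_forced_high(uint16_t last_written, uint32_t r0)
{
    return (uint16_t)(last_written & (uint16_t)r0);
}

/* Assumption: a line forced high reads back as 1 regardless of what the
 * external device (or the pull-up) puts on the bus. */
static uint16_t predicted_read(uint16_t bus_value, uint16_t last_written,
                               uint32_t r0)
{
    return (uint16_t)(bus_value | predicted_forced_high(last_written, r0));
}
```

    In this model, writing zero into r0 before the read makes the mask of forced-high lines empty for any previously written word, which matches the workaround above.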
    


    By the way, just as in O. Henry, there turned out to be neither kings nor cabbages.
