Debugging a bug that does not play
- Transfer
On October 10, 2018, our team released a new version of the application on React Native. We are pleased and proud of it.
But what a horror: after a few hours, the number of crashes under Android suddenly increases.
10,000 crashes under Android
Our Sentry crash monitoring tool is going crazy.
In all cases, we see a type error
In React Native, this usually happens if you set a property with the wrong type. But why didn't the error manifest itself during testing? With us, each developer thoroughly tests new releases on several devices.
Also, the errors seem quite random, as if they fall on any combination of properties and the type of shadow node. For example, here are the first three:
It seems that the error occurs on any device and in any version of Android, judging by the report Sentry.
Most crashes under Android 8.0.0 are falling, but this is consistent with our user base.
So, the first step before fixing a bug is to reproduce it, right? Fortunately, thanks to the Sentry logs, we can find out what users are doing before a crash occurs.
Ta-a-ak, let's see ...
Hmm, in the overwhelming majority of cases, users simply open the application and - boom, it crashes.
Well, try again. Install the application on six Android devices, open it and exit several times. No failure! Moreover, it is impossible to reproduce it locally in dev-mode.
Well, that seems pointless. Failures are still quite random and occur in 10% of cases. It seems that you have a chance of 1 out of 10, that the application will fall at startup.
To reproduce this failure, let's try to understand where it comes from ...
As mentioned earlier, we have several different bugs. And they all have similar, but slightly different traces.
Well, take the first one:
So the problem is
Hmm, we are very deep in the library of support for Android, it is unlikely that there can be any benefit.
Another way to find the root cause of the error is to check for new changes made to the latest release. Especially those that affect native Android code. Two hypotheses arise:
At the moment we can not reproduce the error, so the best strategy:
But what a horror: after a few hours, the number of crashes under Android suddenly increases.
10,000 crashes under Android
Our Sentry crash monitoring tool is going crazy.
In all cases, we see a type error
JSApplicationIllegalArgumentException Error while updating property 'left' in shadow node of type: RCTView"
. In React Native, this usually happens if you set a property with the wrong type. But why didn't the error manifest itself during testing? With us, each developer thoroughly tests new releases on several devices.
Also, the errors seem quite random, as if they fall on any combination of properties and the type of shadow node. For example, here are the first three:
Error while updating property 'paddingTop' in shadow node of type: RCTView
Error while updating property 'height' in shadow node of type: RCTImageView
Error while updating property 'fill' of a view managed by: RNSVGPath
It seems that the error occurs on any device and in any version of Android, judging by the report Sentry.
Most crashes under Android 8.0.0 are falling, but this is consistent with our user base.
Let's replay!
So, the first step before fixing a bug is to reproduce it, right? Fortunately, thanks to the Sentry logs, we can find out what users are doing before a crash occurs.
Ta-a-ak, let's see ...
Hmm, in the overwhelming majority of cases, users simply open the application and - boom, it crashes.
Well, try again. Install the application on six Android devices, open it and exit several times. No failure! Moreover, it is impossible to reproduce it locally in dev-mode.
Well, that seems pointless. Failures are still quite random and occur in 10% of cases. It seems that you have a chance of 1 out of 10, that the application will fall at startup.
Stack trace analysis
To reproduce this failure, let's try to understand where it comes from ...
As mentioned earlier, we have several different bugs. And they all have similar, but slightly different traces.
Well, take the first one:
java.lang.ArrayIndexOutOfBoundsException: length=10; index=-1
at android.support.v4.util.Pools$SimplePool.release(Pools.java:116)
at com.facebook.react.bridge.DynamicFromMap.recycle(DynamicFromMap.java:40)
at com.facebook.react.uimanager.LayoutShadowNode.setHeight(LayoutShadowNode.java:168)
at java.lang.reflect.Method.invoke(Method.java)
...
java.lang.reflect.InvocationTargetException: null
at java.lang.reflect.Method.invoke(Method.java)
...
com.facebook.react.bridge.JSApplicationIllegalArgumentException: Error while updating property 'height' in shadow node of type: RNSVGSvgView
at com.facebook.react.uimanager.ViewManagersPropertyCache$PropSetter.updateShadowNodeProp(ViewManagersPropertyCache.java:113)
...
So the problem is
android/support/v4/util/Pools.java
. Hmm, we are very deep in the library of support for Android, it is unlikely that there can be any benefit.
Find another way
Another way to find the root cause of the error is to check for new changes made to the latest release. Especially those that affect native Android code. Two hypotheses arise:
- We updated Native Navigation , where native snippets for Android are used for each screen.
- We updated react-native-svg . There were a few exceptions related to SVG components, but this is hardly the case.
At the moment we can not reproduce the error, so the best strategy:
- Roll back one of the two libraries. Roll it out for 10% of users, which is trivially done in the Play Store. Check with several users if the crash persists. Thus, we will confirm or disprove the hypothesis.
But how to choose a library for rollback? Of course, you can throw a coin, but is this the best option?Getting to the bottom
Let's analyze the previous trace more carefully. Perhaps this will help determine the library./** * Simple (non-synchronized) pool of objects. * * @param The pooled type. */publicstaticclassSimplePoolimplementsPool{ privatefinal Object[] mPool; privateint mPoolSize; ... @Overridepublicbooleanrelease(T instance){ if (isInPool(instance)) { thrownew IllegalStateException("Already in the pool!"); } if (mPoolSize < mPool.length) { mPool[mPoolSize] = instance; mPoolSize++; returntrue; } returnfalse; }
There was a failure. Errorjava.lang.ArrayIndexOutOfBoundsException: length=10; index=-1
means thatmPool
- an array of size 10, butmPoolSize=-1
.
Okay, how comemPoolSize=-1
? In addition to the methodrecycle
above, the only place to changemPoolSize
is theacquire
class methodSimplePool
:public T acquire(){ if (mPoolSize > 0) { finalint lastPooledIndex = mPoolSize - 1; T instance = (T) mPool[lastPooledIndex]; mPool[lastPooledIndex] = null; mPoolSize--; return instance; } returnnull; }
Therefore, the only way to get a negative valuemPoolSize
is to reduce it whenmPoolSize=0
. But how is this possible with the conditionmPoolSize > 0
?
Put a breakpoint in Android Studio and take a look at what happens when you start the application. I mean, here is the conditionif
, this code should work fine!Finally, a revelation!
See theDynamicFromMap
static link forSimplePool
.privatestaticfinal Pools.SimplePool<DynamicFromMap> sPool = new Pools.SimplePool<>(10);
After several dozen clicks of the Play button with carefully placed breakpoints, we see that the mqt_native_modules streams access functionsSimplePool.acquire
andSimplePool.release
use React Native to control the style properties of the React component (below thewidth
component property )
But they are also accessed by the main thread main !
Above, we see that they are used to update propertiesfill
in the main thread, usually for a componentreact-native-svg
! Indeed, the libraryreact-native-svg
began to be usedDynamicFromMap
only from the seventh version to improve the performance of native svg animations.
And-and-and ... the function can be called from two threads, butDynamicFromMap
does not useSimplePool
thread safe way. “Thread safe,” you say?Thread safety, some theory
In single-threaded JavaScript, developers usually do not need to deal with thread safety.
On the other hand, Java supports the concept of parallel or multi-threaded programs. Multiple threads can run within the same program and can potentially access the overall data structure, which sometimes leads to unexpected results.
Take a simple example: the image below shows that streams A and B are parallel:- read integer;
- increase its value;
- return it.
Stream B can potentially access the data value before Stream A updates it. We expected two separate steps to give a finite value19
. Instead we can get18
. This situation, where the final state of the data depends on the relative order of the flow operations, is called a race state. The problem is that this state does not necessarily occur all the time. Perhaps in the above case, stream B has another job before increasing the value, which gives sufficient time for stream A to update the value. This explains the chance and inability to reproduce the failure.
The data structure is considered thread-safe if operations can be performed simultaneously by multiple threads without the risk of a race condition.
When one thread reads for a particular data element, another thread should not have the right to modify or delete this element (this is called atomicity). In the previous example, if the update cycles were atomic, the race conditions could have been avoided. Flow B will wait for thread A to complete the operation, and then start itself.In our case, this can happen:
Because itDynamicFromMap
contains a static link toSimplePool
, several callsDynamicFromMap
come from different threads, simultaneously calling theacquire
c methodSimplePool
.
In the illustration above, flow A calls the method, evaluating the condition as true , but it has not yet managed to reduce the valuemPoolSize
(which is used in conjunction with flow B), while flow B also calls this method and also evaluates the condition as true . Subsequently, each call will decrease the valuemPoolSize
, and as a result the “impossible” value is obtained.Correction
Studying the options for correction, we found a pull-request for react-native , which has not yet joined the branch - and it provides thread safety in this case.
Then we rolled out the corrected version of React Native for users. Crash finally fixed, hurray!
So, thanks to the help of Jenik Duplessi (contributor to the core of React Native) and Michael Sand (maintenerreact-native-svg
), the patch is included in the next minor version of React Native 0.57 .
Fixing this bug required some effort, but it was a great opportunity to dig deeper into react-native and react-native-svg. A good debugger and several well-located breakpoints are important. I hope you learned something from this story too!