eyeofhell April 3, 2018 at 13:39

Hijacking in 1100 seconds is the strangest bug I've seen

Transfer

Two days ago, I received a strange message from a client: a video call mysteriously ended in exactly 18 minutes. And then it happened again, also after the 18th minute. Coincidence?

This bug was not only strange, but also terribly awkward. Our goal is to make such a simple tool for video calls so that a conditional doctor or psychologist would like to use our service. Needless to say, a constant break after 18 minutes is not consistent with this goal?

WebRTC is when you always deal with bugs

Bugs are not a phenomenon for us. Our product is based on WebRTC, a fairly new web standard that allows two browsers to communicate directly (for example, video and audio in real time).

In September 2017, WebRTC finally became supported in all major browsers . Over the past months, we have snatched our fair share of bugs , while forcing WebRTC to work for everyone. (Working with advanced technologies is a good way to get to know how to write bug reports to browser makers).

However, this last bug was the strangest in my 5 years of working with WebRTC.

WebRTC Bug Tools

Fortunately, we have many tools for catching bugs (thanks to some cool contributors). It is naive to think that someone will dump chrome: // webrtc-internals and send it to you along with a script that is easy to reproduce.

WebRTC Dump Output Example . Green lines indicate a successful connection, a red line indicates a connection failure.

One of the best tools is rtcstats , which provides output from getStats and stores it in regular files (along with other useful tricks).

Bug confirmation

We quickly found the desired files from two failed calls. These were video calls between Chrome 64 and Edge 16. Chrome dropped 18 minutes 20 seconds. Subsequent restart of ICE also failed, which automatically closed RTCPeerConnection.

The mystery is that Edge did not agree with Chrome - on the Edge side, iceConnectionState remained connected until the ICE restart, the browser broke the connection, only the ICE restart failed. What was it?

Screen from Edge, look at the timestamps connected and get an description of iceRestart, between them about 18 minutes 20 seconds. And iceConnectionStateChange has not even changed to failed!

Play bug: magic number “18 minutes 20 seconds”

A couple of months earlier, we bought a cheap laptop with Windows to run Edge 16 and see bugs in action. We made a video call between this laptop and MacBook with Chrome 64.

And then we waited. And of course, exactly after 18 minutes 20 seconds, the ICE status changes to failed, the ICE restart fails and the connection breaks. We have never seen anything like it, it was a shock.

What to do next? Where to begin? Our code did not have a magic timer that would break the connection after 18 minutes 20 seconds in some browsers. All this seemed insurmountable, but we could only do one thing.

The team is blissfully unaware of the bugs that await it. Photo taken in Aalesund, Norway during a working trip.

One step at a time - methodical bug localization

We found a reproducible script, but still had to reduce the search area. A series of tests was supposed to test our assumptions.

First, will this happen between two Chrome browsers, which usually have the least problems with WebRTC? On the same laptops, with the same initial conditions, we checked the call from Chrome to Chrome. After 18 minutes 20 seconds, the call was still active, the ICE connection status remained stable. That is, the bug did not happen.

However, when testing from Firefox to Edge, the call again fell through.

The main question is: is it a bug in Edge or a bug in our code? Service appear.inalso built on WebRTC and much has been done in our product in the same way, because half of our team used to work in appear.in. We decided to test the call from Chrome to Edge through this platform in peer-to-peer mode (free), all on the same machines. Not a single break!

We did not know to cry or laugh. The search area narrowed and indicated an error in our code. Confrere could not work for Chrome-Edge bundle.

When expectations fall apart

Since each test took 20 minutes, we decided that we should pause and continue from home.

For the sake of verification, we tested Chrome-Edge from home. Imagine surprise when, after 40 minutes of the test, the call was still active. What?!

We checked everything again and again. Chrome logged that the connection in the office did this: relay <-> stun. This means that one shoulder of the call must go through our servers (relayed), but the second shoulder can send data directly (STUN). This is not uncommon for modern network configurations.

The RTCStats dump shows the connection status of each pair of ICE candidates. Look at the very top: the local address is in relay status, and the remote address is in stun status.

Due to the specific network configuration in our office, my machine running Chrome was connected via cable to allow direct connections. The Edge laptop clung to very unstable Wi-Fi, and used a relay server. However, in the home network, the parties to the call were connected through local candidates, that is, they did not leave the home network.

IceTransportPolicy to the rescue!

The next step is to determine why the test was unsuccessful in the office, but now it suddenly worked. Could the reason be in the stun or relay candidates? Fortunately, we can force PeerConnection to use only relay candidates - we must put the appropriate value for iceTransportPolicy when the PeerConnection is created. And of course, after 18 minutes 20 seconds the call ends.

Was it a coincidence that the Chrome-Edge bundle worked in appear.in? Maybe our connection through appear.in used STUN, but TURN won in our service? A quick test showed that it was our mistake, because the forced relay ( TURN ) for appear.in did not cause bugs.

Finally, we tried something else. We built a special version of Confrere and, with permission from appear.in, used their TURN servers for a single test. After 25 minutes, we concluded that the code in our client application was fine (because this code did not provoke a bug when using other TURN servers).

When each test takes at least 18 minutes, this requires patience. On the left, on the right - Svein (CEO)

Logs rush to the rescue!

When we localized the problem - our configuration of TURN servers is to blame - could we fix everything? Not. We still did not know what exactly caused this bug. Our TURN infrastructure is mostly based on what we learned from 4 years of work in appear.in, it uses the same open source components, works in the same cloud and with the same configuration.

When you are completely at a loss and do not know where to move on, you can always turn to the logs . Logging was for us the main tool in finding a bug, because it allows you to monitor the working system in real time. By launching a new video call, we closely monitored the logs.

Somewhere at the 10-minute mark, messages like:

turnserver: 1054: session 000000000000000061: realm  user : incoming packet message processed, error 438: Stale nonce

When the call ended after 18 minutes 20 seconds, we again looked at the logs and found the first message mentioning Stale nonce :

turnserver: 1053: session 000000000000000034: refreshed, realm=, username=, lifetime=600
turnserver: 1053: session 000000000000000034: realm  user : incoming packet REFRESH processed, success
turnserver: 1053: handle_udp_packet: New UDP endpoint: local addr :443, remote addr :56031
turnserver: 1053: session 000000000000000061: realm  user : incoming packet message processed, error 438: Wrong nonce
turnserver: 1053: session 000000000000000061: realm  user : incoming packet message processed, error 438: Stale nonce

Why are there so many posts about Stale nonce ? What kind of one-time code (nonce) is this and how can it be expired (stale)?

Understand the bug: Hello, "stale nonce"

If you, like me, did not know what a one-time code (nonce) is:

Nonce .

- Wikipedia

And if this number is overdue:

Stale nonce is more a warning than an error. For SIP, your credentials are encrypted in SIP headers. To prevent other people from intercepting this data and making calls at your expense, a one-time code (nonce) is used.
The SIP RFC standard requires nonce to change periodically. If the client uses the old nonce, then this is “stale nonce”. In this case, the client must use the current nonce instead of the old one. Such a message means that the client is trying to use stale nonce, that is, either a replay attack occurs, or the client could not receive a new nonce

- from the Internet

. Confrere uses open source coturn software , it manages TURN servers. If you read the documentation carefully, you will see that there is a parameter for stale nonce in the configuration:

# Uncomment if extra security is desired,
# with nonce value having limited lifetime.
# By default, the nonce value is unique for a session,
# and has unlimited lifetime. 
# Set this option to limit the nonce lifetime. 
# It defaults to 600 secs (10 min) if no value is provided. After that delay, 
# the client will get 438 error and will have to re-authenticate itself.
#
#stale-nonce=600

In our configuration, we turned on stale-nonce and everything worked flawlessly until the client base with the Edge browser began to grow.

The logs of stale nonce were familiar to Philip Hank ; he also helped us understand what was what. Shijun Song reported a similar problem in appear.in in May 2017, which allowed appear.in to avoid problems.

We had the opportunity to change the configuration of the TURN server and remove the stale-nonce flag so that nonce had no time limits.

50 minutes in a test environment - and we were finally able to say: “We found a bug!”

And yet, why 18 minutes 20 seconds?

You must have noticed the title of this post. Gone in 1100 seconds. I wrote “18 minutes 20 seconds” in this text, but do you know how many seconds it is? 1100 .

We know that after 600 seconds from the start of a conversation, the initial nonce becomes invalid. A stale nonce message appears and after another 500 seconds the connection is broken. Where do another 500 seconds come from?

If you look for the number 500 in the sorts of coturn , then there is something interesting .

It looks like a built-in sleep timer 500 seconds after some kind of connection check. I am not very familiar with the internal coturn device, but it seems that if nonce becomes invalid, then 500 seconds after that coturn will stop sending packets to another participant. Of course, another participant will see that the connection is disconnected (because he no longer receives packets).

This explains what is happening on the part of Chrome, since it stops receiving packets. But why doesn't Edge notice this?

The story of two packages

Connections in WebRTC can be of two types: UDP and TCP. UDP packets do not require acknowledgment of receipt, delivery is not guaranteed, and they may arrive in the wrong order. In real-time communications, this is not a big problem, since the codecs do pretty well with packet loss.

TCP is different. TCP is very useful when you need to deliver all the data to the other side, in the correct order and with delivery confirmation. Most WebRTC traffic is UDP .

In our story, Edge sent UDP packets through a TURN server to the other side . When the WebRTC server stopped sending packets, Edge could n’t know if Chrome received packets or not., so Edge continued to joyfully send data. Chrome, on the other hand, behaved correctly and sent packets through STUN. So when iceRestart took place, Edge did not know what to do and simply fell silently .

Summary

The conclusions are as follows: we will send a bug report to the Edge bug tracker and for a while we will deploy new TURN servers without the stale-nonce flag. New servers should ensure that in the future there will be no failures, and our users will receive long and high-quality video calls in our service .

I am also incredibly happy to be part of a community of people who willingly share knowledge and help each other. From browser developers who answered our bug reports and fixed bugs, to individual contributors who make excellent and free tools for the WebRTC community.

Special thanks to Philip Hankefor pointing us to the stale-nonce flag and enduring my endless lamentations when all this seemed nonsense. In addition, I express my gratitude to Shijun Sun for discovering this problem in May 2017.

We hope this story has been helpful to you. The author of the original article can be contacted via Twitter or mail , and I am ready to share our experience in using WebRTC in the comments!

Tags: