WebRTC Signaling: Here be Dragons

[Image: a "Here be dragons" warning sign]

It’s the sign I wish I’d seen before I agreed to produce the WebRTC prototype you’ll find at the bottom of this post. If there’s one thing I learned from the exercise, it’s that WebRTC signaling is difficult. Really, disturbingly complex. And despite their best efforts, most of the reference materials made available by Google, Mozilla, and others are misleading: they recommend using technologies that don’t easily scale beyond toy applications, or they cast powerful “Not my problem” spells that inflict unneeded complexity upon developers. When you venture into the land of WebRTC, beware. Here be dragons.

Because the wizards at the W3C saw fit to leave signaling undefined in their scrolls, you have to come equipped with a scalable server solution already in hand, or vet and decide upon one before going into battle. To allow our prototype to stand on its own, we listened to the advice of bloggers and browser evangelists and just used Socket.IO. Imagine our regret when we deployed our stateful application in a clustered environment and several issues cropped up that threatened to block completion of our prototype. Learn from our mistakes, and make decisions about your signaling solution that won’t come back to bite you when your application needs to scale.

Say no to Socket.IO

The first beast we encountered on our journey into the land of WebRTC was the Signaling dragon, “Blarg the Undefined.” We were confident that we would emerge victorious over the foe, wielding the Socket.IO spell we had purchased before leaving town. We had heard stories of Socket.IO’s efficacy in dealing with Blarg, but they ended up being just that: stories. And when we were slain shortly after our encounter began, we had to start our journey again. This time we did our research, and found evidence that Socket.IO has an imperfect concurrency model that will cause you pain at scale unless you actively work around it.

To illustrate a concurrency problem with Socket.IO, consider the following example: in a Node.js cluster environment, client connections become susceptible to race conditions that can result in failed handshakes and disconnects. This is because Socket.IO uses pub/sub to synchronize data across cluster workers, but lacks a mechanism for notifying publishers when their data has been delivered. So when Socket.IO writes data that other workers will access, it cannot guarantee that they will always have up-to-date information. You can readily reproduce this situation by setting up a local clustered Socket.IO server and connecting it to a Redis store with some latency. This is what it looks like:

[Image: the Socket.IO race condition in action (failed handshakes and disconnects)]
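
If you want to try this yourself, the sketch below shows the kind of setup we mean. It assumes the pre-1.0 Socket.IO RedisStore API, and you still need to introduce latency on the Redis connection (a slow TCP proxy works) before the race becomes easy to observe.

```js
// repro.js - a minimal clustered Socket.IO server backed by Redis
// (pre-1.0 store API). Run it, add latency between the workers and
// Redis, and repeatedly connect clients: a handshake written by one
// worker may not yet be visible to the worker that services the
// client's next request, producing failed handshakes and disconnects.
var cluster = require('cluster');
var http = require('http');
var sio = require('socket.io');
var RedisStore = require('socket.io/lib/stores/redis');
var redis = require('redis');

if (cluster.isMaster) {
  // Two workers are enough to hit the race.
  cluster.fork();
  cluster.fork();
} else {
  var server = http.createServer().listen(3000); // workers share the port
  var io = sio.listen(server);

  // Handshake and session data is synchronized between workers through
  // this store, but writes are not acknowledged before other workers
  // may try to read them.
  io.set('store', new RedisStore({
    redisPub: redis.createClient(6379, '127.0.0.1'),
    redisSub: redis.createClient(6379, '127.0.0.1'),
    redisClient: redis.createClient(6379, '127.0.0.1')
  }));

  io.sockets.on('connection', function (socket) {
    socket.emit('hello', { worker: cluster.worker.id });
  });
}
```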

For interested parties, you can see where the handshake data is written here, and where it is unsafely retrieved here. While the upcoming 1.0 release of Socket.IO has a different architecture, the problem still appears to be present; it looks like the team has decided to support scalability through load balancers instead.

We experimented with fixing Socket.IO’s concurrency problems, but because all of its stateful data is pumped through the same pub/sub storage interface, we decided to choose a different messaging system for our demo: Faye. Faye is a popular implementation of Bayeux, which provides a rich pub/sub interface, and it is not dependent on Socket.IO. Here is a nice explanation from the author about why he doesn’t use Socket.IO under the covers. Faye held up well in our testing, and provided us with reliable behavior even when clustered with latency.
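
Standing up a Faye server takes only a few lines. Here is a minimal sketch using Faye’s NodeAdapter; the mount point, timeout, and port are our own choices, not requirements.

```js
// faye-server.js - a minimal Bayeux pub/sub server using Faye.
var http = require('http');
var faye = require('faye');

// Serve the Bayeux protocol under /faye; clients are considered
// dead if they go quiet for 45 seconds.
var bayeux = new faye.NodeAdapter({ mount: '/faye', timeout: 45 });

var server = http.createServer();
bayeux.attach(server); // Faye intercepts requests under /faye
server.listen(8000);
```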

Stateless Scalable Signaling Servers

Once we ensured that our transport layer would not succumb to the attacks of the Signaling dragon, we had to ensure that our peer state would distribute properly across a cluster. Rather than construct a stateful server, we chose to use pub/sub to communicate presence information solely between clients. By communicating this data between peers, we were able to scale our server by simply relying on our pub/sub dispatcher to scale. This decision also more loosely couples our system to the server-side signaling implementation, making it easier to replace with something more performant, should our application need to scale immensely.

Let’s consider an example of how this is accomplished with pub/sub:

  • Jill walks into a dark room and says “Hi, this is Jill” (Subscribe, Publish)
  • No one responds, so Jill knows that there are no other people in the room.
  • Jack comes into the same room and says “Hi, this is Jack” (Subscribe, Publish)
  • Jill responds “Hi Jack, this is Jill.” (Receive, Publish)
  • Both Jack and Jill now know who is in the room. (Receive)

This approach does require a bit more logic on the client side, but we believe the trade is worth it because it leaves our servers with less work to do. Translating the previous example into pub/sub messages looks like this:

[Image: the Jack and Jill presence exchange as pub/sub messages]
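
In client code, the same handshake might look roughly like the sketch below, written here against a Faye client. The channel, user names, and message shapes are ours for illustration and not prescribed by Faye.

```js
// presence.js - client-side presence over pub/sub, mirroring Jack and Jill.
// Assumes the Faye client script has been loaded in the page.
var ME = 'jill';
var ROOM = '/room';

var client = new Faye.Client('http://localhost:8000/faye');

// Subscribe first, so we hear anyone who announces after we do...
var subscription = client.subscribe(ROOM, function (message) {
  if (message.from === ME) return; // ignore our own messages

  if (message.type === 'announce') {
    // A newcomer spoke up ("Hi, this is Jack"): respond so they learn
    // that we are already here ("Hi Jack, this is Jill").
    client.publish(ROOM, { type: 'welcome', from: ME, to: message.from });
  }
  // Either way, we now know this peer is in the room.
  console.log('peer present:', message.from);
});

// ...then announce ourselves ("Hi, this is Jill").
subscription.callback(function () {
  client.publish(ROOM, { type: 'announce', from: ME });
});
```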

In the Jack and Jill example we have a room and two peers, “jill” and “jack.” From these we can generate two pub/sub channel names that allow for broadcast and peer-to-peer communication in a virtual “room”:

  • /room – This channel represents the room; all peers use it to communicate initial presence and to determine P2P channel names.
  • /room/jack/jill – This channel is used for publishing messages between two peers. Its name is formed by joining the room id and both user ids with slashes; to ensure that both peers generate the same channel name, the user ids are sorted alphabetically first, as in the helper sketch below.
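
A small helper makes the naming rule concrete (the function name is our own):

```js
// Derive the peer-to-peer channel name for two users in a room.
// Sorting the user ids alphabetically guarantees that both peers
// compute the same name regardless of who initiates.
function p2pChannel(roomId, userA, userB) {
  var users = [userA, userB].sort();
  return ['', roomId, users[0], users[1]].join('/');
}

p2pChannel('room', 'jill', 'jack'); // => '/room/jack/jill'
```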

By choosing a stateless server, we were able to scale out our pub/sub dispatcher without having to deal with the complexities of mutable signaling state in our application server code. In a shipping application you will almost certainly need some state, for example to deal with user authentication or channel permissions, but with a little ingenuity that shouldn’t be a problem with this setup, as the sketch below suggests.
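
With Faye, for instance, that kind of state can live in a small server-side extension while the signaling path itself stays stateless. Here is a sketch; isValidToken is a hypothetical stand-in for whatever authentication scheme you actually use.

```js
// Reject subscriptions that do not carry a valid auth token. This runs
// on the server, so clients cannot bypass the check; isValidToken is a
// hypothetical placeholder for a real authentication scheme.
bayeux.addExtension({
  incoming: function (message, callback) {
    if (message.channel === '/meta/subscribe') {
      var token = message.ext && message.ext.authToken;
      if (!isValidToken(token)) {
        message.error = '403::Authentication required';
      }
    }
    callback(message); // pass the (possibly rejected) message along
  }
});
```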

Now You’re Ready

By now you should have an apprentice-level understanding of the tactics you’ll need to employ before facing the Signaling dragon on your adventures in the land of WebRTC. Think carefully about which transport you choose: Socket.IO can be a powerful spell, but it requires some extra effort to scale. And consider using a stateless server implementation to avoid concurrency concerns in clustered environments. While neither of these things is strictly required, they may very well save you headaches when your application becomes wildly successful.

At Sococo we believe that if WebRTC is to succeed, it must be accessible to developers of all levels, not just to engineers with solid computer science backgrounds. That’s why we’ve decided to open-source our prototype implementation, to help demonstrate the techniques we’ve talked about today.

There are other formidable dragons we had to defeat along the way, such as application-layer session renegotiation (Mozilla bug tracker) and interaction with the many state machines of RTCPeerConnection (W3C spec), but those are tales for another day.