Thursday, July 20, 2023

Some history about Greenfield Part 2

Part 1

This is the second part of some history about Greenfield. Unlike part 1, it was not posted on Twitter, but published some time later as a blog post in the summer of 2023.

Some time after I got Greenfield working on Kubernetes, I wasn't really satisfied with the result. Although Kubernetes offers a lot of niceties in the form of abstracted distributed file systems and networking, there were some hard blockers.

Starting an application takes at least 10 seconds (often more), which is way too long. This is not surprising, considering that Kubernetes was never meant to run user-facing applications. Other big blockers were the complexity and cost of running the system.

To get some kind of decent performance, video encoding needs to be done on the GPU, which rules out hosting the system in the cloud because of absolutely exorbitant GPU rental prices. There is also the issue of GPU partitioning for containers, namely that it doesn't exist, and I wasn't crazy enough to implement it myself... (for now?)

So, exit Kubernetes and focus back on the core product.

During the Kubernetes experiments some other core issues came to light. The streaming performance was quite bad, and even a short network disconnect could bring the whole streaming pipeline down.

There were multiple reasons for this.

Applications were forced to go through system memory when handing over a frame to the compositor. This was promptly fixed by implementing the wl_drm and zwp_linux_dmabuf_v1 protocols. Application frames could now stay on the GPU until the video encoder was done, but things were still slow.

The streaming performance was bad simply because the whole pipeline was implemented exactly how a Wayland compositor ought to function: an application hands over a frame and waits for a presentation acknowledgement from the compositor before drawing the next frame.

Since Greenfield talks over the network, a whole network round trip is involved in this present-acknowledge dance, effectively syncing an application's time between frames to the network round-trip time.

The fix was easy: the remote side predicts when the frame would be presented to the user if no network were involved. It does this using a bunch of measurements from both itself and the browser. Awesome! No more presentation round trips over the network, and everything should run smoothly now.
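For intuition, here is a minimal sketch of what such a prediction could look like. The function name, parameters, and the exact formula are all assumptions for illustration; the real implementation uses its own set of measurements.

```typescript
// Hypothetical sketch: estimate when a committed frame would reach the user's
// eyes if there were no network in between. All names here are made up.
function predictPresentationTime(
  commitTimeMs: number, // when the application committed the frame
  avgEncodeMs: number, // video encode time, measured on the remote side
  avgDecodeRenderMs: number, // decode + render time, reported by the browser
  refreshIntervalMs = 1000 / 60, // assume a 60Hz display
): number {
  const earliest = commitTimeMs + avgEncodeMs + avgDecodeRenderMs
  // Snap to the next vsync boundary, since frames are only shown on vsync.
  return Math.ceil(earliest / refreshIntervalMs) * refreshIntervalMs
}
```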

Except it didn't.

Applications were still inexplicably slow, their fps still synced to the network. What was going on?!

Wayland is an async protocol, meaning requests don't wait for the previous request to be processed before the next one is sent. So why the synchronous behaviour? Turns out there is an explicit sync request (wl_display.sync) in the core protocol that effectively allows a client to wait until all its previous requests have been processed before sending the next ones.

A common pattern for pretty much all Wayland applications is to attach a buffer, do some other state changes, send a commit request to atomically apply all those changes, and then also do a sync request before starting the next presentation cycle. Ugh.
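In code, that client-side cycle looks roughly like this. The snippet is written against a hypothetical TypeScript Wayland binding (real clients typically do this in C via libwayland), so the interfaces below exist purely for illustration:

```typescript
// Minimal hypothetical interfaces, for illustration only.
interface WlBuffer { width: number; height: number }
interface WlSurface {
  attach(buffer: WlBuffer, x: number, y: number): void
  damage(x: number, y: number, width: number, height: number): void
  commit(): void
}
interface WlDisplay { sync(): Promise<void> }

async function presentFrame(display: WlDisplay, surface: WlSurface, buffer: WlBuffer) {
  surface.attach(buffer, 0, 0)
  surface.damage(0, 0, buffer.width, buffer.height)
  surface.commit() // atomically apply all pending surface state
  // The killer: wait for the compositor's sync 'done' before drawing again.
  // With the compositor on the other side of a network, that's a round trip.
  await display.sync()
}
```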

This was bad. Really, really bad. We can't really predict or eagerly send out a 'sync done' reply, as it needs to be sent after all other replies to requests made before the sync request. Since those other replies are generated by the browser compositor, the sync done reply is always implicitly tied to the network.

In other words: It's impossible to make the whole presentation pipeline detached from the network.

Cue existential project crisis. Was this the end? For a brief moment it seemed that way...

However, it quickly dawned on me: there is a nuance in the sync reply requirement that does allow eagerly sending a sync done reply. Only if a preceding request will itself generate a reply do we need to wait for that reply before sending the sync done. In all other cases, we can send the sync done reply immediately.
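A hedged sketch of what such an eager sync mechanism might look like; the class and method names here are assumptions, not Greenfield's actual code:

```typescript
// Track whether any in-flight request will still produce a reply from the
// browser-side compositor. If nothing is pending, a wl_display.sync can be
// answered locally, without waiting on the network.
class EagerSync {
  private pendingReplies = 0
  private queuedSyncDones: Array<() => void> = []

  onRequestSent(generatesReply: boolean): void {
    if (generatesReply) this.pendingReplies++
  }

  onReplyReceived(): void {
    this.pendingReplies--
    if (this.pendingReplies === 0) {
      // All ordering constraints satisfied; release the queued sync dones.
      this.queuedSyncDones.forEach(sendDone => sendDone())
      this.queuedSyncDones = []
    }
  }

  sync(sendDone: () => void): void {
    if (this.pendingReplies === 0) {
      sendDone() // eager path: nothing to order the sync done after
    } else {
      this.queuedSyncDones.push(sendDone) // must wait to preserve ordering
    }
  }
}
```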

Rejoice! An eager sync done mechanism was implemented and things started to run smoothly!


[Video: Greenfield running DOOM3 remotely at 1920x1080@60FPS. The demonstration was performed on a remote server connected to the internet with around 25ms of network latency; total input latency is around 40ms.]

With the rendering pipeline going full throttle, it was time to look at the second big issue: network reliability.

Up until now all network communication was done through WebSockets, after a bad experience with WebRTC and WebRTC data channels six years earlier. Perhaps it was time to revisit that decision and see if we could get some proper UDP-like, low-latency communication going instead of TCP-based WebSockets.

WebSockets were replaced with WebRTC data channels in unordered and unreliable mode, which (according to all sources found on the interwebz) effectively gives you a UDP socket. Great!
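For reference, this is how such a channel is configured in the browser using the standard WebRTC API (the channel label here is made up):

```typescript
// An unordered, unreliable data channel: out-of-order delivery is allowed
// and lost messages are never retransmitted (maxRetransmits: 0).
const peerConnection = new RTCPeerConnection()
const channel = peerConnection.createDataChannel('greenfield', {
  ordered: false,
  maxRetransmits: 0,
})
channel.binaryType = 'arraybuffer'
channel.onmessage = (event: MessageEvent) => {
  const message = event.data as ArrayBuffer
  // messages may arrive out of order, or not at all
}
```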

We still require video frames to be presented to the end user in order, and we can't really deal with missing pixels because we don't continuously stream application frames, so we still need some kind of ARQ protocol like KCP on top.

KCP basically allows you to sacrifice bandwidth for lower latency, something you can't really control at the application level with TCP.
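To give an idea of that trade-off, here is roughly what the classic KCP low-latency tuning looks like. The snippet is written against a hypothetical TypeScript port; the parameters mirror ikcp_nodelay from the original C implementation, but the class and method names are assumptions:

```typescript
// Hypothetical TypeScript KCP binding, for illustration only.
declare class Kcp {
  constructor(conversationId: number, output: (segment: Uint8Array) => void)
  // nodelay, interval (ms), fast-resend threshold, disable congestion control
  setNoDelay(nodelay: number, interval: number, resend: number, nc: number): void
  send(data: Uint8Array): void
  update(nowMs: number): void
}

declare const transport: { send(data: Uint8Array): void } // e.g. the data channel above

const kcp = new Kcp(1, segment => {
  // Output callback: hand each KCP segment to the underlying transport.
  transport.send(segment)
})
// Low-latency mode: 10ms internal ticks, fast retransmit after 2 duplicate
// ACKs, congestion control off. More (re)transmissions, lower latency.
kcp.setNoDelay(1, 10, 2, 1)
```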

After some fiddling around and getting KCP working in TypeScript, the whole thing ran smoothly. Switching between WiFi and ethernet, network disconnects: everything worked flawlessly.

Awesome. Time for a last real-life test with a remote server.

Stuttering. My 300Mbit connection couldn't handle the video load. What?! Turns out WebRTC data channels are actually just a very thin layer over SCTP, and SCTP doesn't really care if you configure it in unordered and unreliable mode.

It's effectively just a really shitty implementation of TCP over UDP, but much worse when it comes to round-trip blocking. My 300Mbit connection could barely do 10Mbit using SCTP with a 25ms round-trip latency!

The whole setup was working great aside from the round-trip-bound SCTP throughput. Perhaps we could run a custom WebRTC data channel/SCTP library on the server that doesn't wait for browser SCTP acks? Most of the bandwidth is unidirectional server-to-browser data anyway!

Werift-webrtc was forked and all required changes were made to strip out any SCTP acks and round-trip bottlenecks. Rejoice! It worked! The world's first true UDP-like library for the browser was created!

...but only for the first 100,000 messages or so. You see, the browser doesn't really like it when it doesn't get any acks from an SCTP-protocol-violating library, and eventually it just gives up. Since we can't really change the browser's SCTP implementation, it was game over for WebRTC data channels.

Back to WebSockets.

Eventually only KCP was kept, together with some simple WebSocket reconnect-on-disconnect logic. So Greenfield is stuck using TCP until HTTP/3 WebTransport comes along.
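That reconnect logic can be as simple as something like the following sketch (the URL and retry delay are made up for illustration):

```typescript
// Minimal reconnect-on-disconnect WebSocket loop.
function connect(
  url: string,
  onMessage: (data: ArrayBuffer) => void,
  retryDelayMs = 1000,
): void {
  const webSocket = new WebSocket(url)
  webSocket.binaryType = 'arraybuffer'
  webSocket.onmessage = event => onMessage(event.data as ArrayBuffer)
  webSocket.onclose = () => {
    // Connection dropped (network switch, disconnect, ...): try again shortly.
    setTimeout(() => connect(url, onMessage, retryDelayMs), retryDelayMs)
  }
}

connect('wss://greenfield.example/compositor', data => {
  // feed the received bytes into KCP / the rest of the pipeline
})
```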

