Thursday, November 23, 2023

Porting the Linux kernel to WebAssembly

Ok, but why?


WebAssembly is an execution sandbox. It has no intrinsic functionality whatsoever. Any kind of interaction with the outside (functions, memory, global variables) has to be imported. There are no syscalls and there is no standard library. In short, there is an entire OS missing.
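As a tiny illustration of the "everything has to be imported" point: even printing a number only works if the host passes a function into the sandbox. A minimal sketch using the WebAssembly JS API (the module file, the `print_i32` import and the `main` export are made up for illustration):

```ts
// The wasm module can only call what we explicitly pass in here.
const imports = {
  env: {
    // A stand-in for a "syscall": the module imports env.print_i32.
    print_i32: (value: number) => console.log(value),
  },
};

const { instance } = await WebAssembly.instantiateStreaming(
  fetch("some-module.wasm"),
  imports,
);

// Anything not imported simply does not exist inside the sandbox.
(instance.exports.main as () => void)();
```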

Existing solutions try to solve this by providing an application runtime. Nearly all of them focus on the POSIX runtime, either by hacking and extending an existing libc implementation (e.g. Emscripten) or by writing a new one entirely from scratch (e.g. Wasix). This generally works OK-ish but has some inherent drawbacks, the most prevalent being the limited portability of existing applications.

Linux applications are the primary source of potential wasm applications because of their open-source nature. However, the POSIX runtime is too limited compared to what is available on Linux. The POSIX runtimes that are currently available are also not suited for multi-application interaction: Wasix is currently Rust & backend focused, while Emscripten lacks any kind of multi-wasm-application interaction.


Solution


What if we look at the WebAssembly sandbox as a special virtualised architecture? Because that's almost exactly what it is. We could port the Linux kernel to this new architecture and use the VirtIO framework to make the kernel talk to the outside world. The outside would be your browser, or whatever wasm engine you would run on the server-side.

It would look something like this:
Each application runs in its own isolated wasm sandbox and basically consists of an application wasm module and a kernel wasm module. The application module imports its own application memory, while the kernel module imports both the application memory and shared kernel memory.

[For the uninitiated: a wasm module contains the executable wasm code and has nothing to do with kernel modules in the native world.]

This makes application memory accessible to both the application and the kernel, while kernel memory is only accessible to the kernel module. The kernel memory is shared with the kernel wasm modules inside other wasm sandboxes. This makes it possible for application interaction to happen through the kernel module, which shares its memory and state with other kernel modules.
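A minimal sketch of this memory layout using the WebAssembly JS API (module names, import namespaces and memory sizes are made up for illustration): the application module only sees its own memory, while the kernel module imports that same memory plus a shared kernel memory.

```ts
// Application memory: private to this sandbox.
const appMemory = new WebAssembly.Memory({ initial: 256, maximum: 256 });

// Kernel memory: shared (backed by a SharedArrayBuffer) so every
// kernel module instance in every sandbox sees the same bytes.
const kernelMemory = new WebAssembly.Memory({
  initial: 1024,
  maximum: 1024,
  shared: true,
});

// The application module imports only its own memory.
const app = await WebAssembly.instantiateStreaming(fetch("app.wasm"), {
  env: { memory: appMemory },
});

// The kernel module imports both the application memory and the shared
// kernel memory, so it can read/write application buffers while keeping
// kernel state invisible to the application.
const kernel = await WebAssembly.instantiateStreaming(fetch("kernel.wasm"), {
  app: { memory: appMemory },
  kernel: { memory: kernelMemory },
});
```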


Leroy Jenkin' it


Somebody already managed to compile the Linux kernel (or rather lkl) to asm.js using Emscripten. Porting it to wasm should be easy and totally doable! asm.js is the predecessor of wasm, so all that was left to do was change some compilation settings so we get wasm instead of asm.js.

Sadly, things aren't that easy. The original asm.js port of the Linux kernel is based on an older lkl version (4.x), so patches had to be ported to the newer lkl version (6.x). Next was the switch to wasm. Turns out wasm doesn't like the Linux spinlock implementation because of computed gotos, which wasm does not support. This can be fixed by using wasm's own built-in locking mechanism. After some trial-and-error, it was time to take a step back and look at the bigger picture. As it turns out, porting the Linux kernel to wasm requires a lot more work than some minimal patches on top of lkl.
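Wasm's built-in locking comes from its threads/atomics support: atomic wait/notify on shared memory. The same primitives are exposed to JavaScript/TypeScript as Atomics.wait and Atomics.notify, so the idea roughly looks like the futex-style lock below (a sketch of the concept only, not the actual kernel patch):

```ts
// Futex-style lock over one i32 slot in shared memory.
// 0 = unlocked, 1 = locked. Atomics.wait only blocks inside workers.
const shared = new Int32Array(new SharedArrayBuffer(4));

function lock(state: Int32Array, idx = 0): void {
  while (Atomics.compareExchange(state, idx, 0, 1) !== 0) {
    // Lost the race: sleep until another thread notifies us.
    Atomics.wait(state, idx, 1);
  }
}

function unlock(state: Int32Array, idx = 0): void {
  Atomics.store(state, idx, 0);
  Atomics.notify(state, idx, 1); // wake one waiter
}
```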

The wasm standard deviates quite a lot from existing architectures. Wasm modules have no relation to the ELF format, so loading & linking are going to be problematic. There is no standard way to dynamically link wasm modules. Computed gotos are not supported and a lot of more advanced instructions are not available. This effectively calls for wasm to be treated as an entirely new architecture inside the Linux kernel.

The first step is to investigate whether there is any way we can even link wasm modules. There is an unofficial standard implemented by LLVM, but sadly there is no stand-alone mapper & loader available. Cool, let's just implement it ourselves.

This experiment made some things clear about what is needed to link wasm modules, especially when dealing with multiple threads/processes in the form of web workers. The positions at which wasm modules are mapped into memory need to be cached per process, while updating and retrieving this information must be thread-safe.
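For illustration, a minimal loader along these lines might look as follows, assuming the unofficial LLVM/Emscripten dynamic-linking convention in which a position-independent module imports its memory, its indirect function table and `__memory_base`/`__table_base` offsets (the exact import names vary between toolchain versions, and the per-process cache here is purely illustrative):

```ts
// Hypothetical per-process cache of where each module was mapped.
const mappings = new Map<string, { memoryBase: number; tableBase: number }>();

const memory = new WebAssembly.Memory({ initial: 1024, maximum: 1024, shared: true });
const table = new WebAssembly.Table({ element: "anyfunc", initial: 1024 });

let nextMemoryBase = 0x10000; // simplistic bump allocator, for illustration only
let nextTableBase = 0;

async function loadModule(name: string, bytes: BufferSource) {
  const memoryBase = nextMemoryBase;
  const tableBase = nextTableBase;

  const { instance } = await WebAssembly.instantiate(bytes, {
    env: {
      memory,
      __indirect_function_table: table,
      __memory_base: memoryBase,
      __table_base: tableBase,
    },
  });

  // Remember where this module lives so other threads in the same
  // "process" can resolve its symbols consistently.
  mappings.set(name, { memoryBase, tableBase });

  // Advance the bump allocator; a real loader would read the required
  // data/table sizes from the module's dylink custom section.
  nextMemoryBase += 0x100000;
  nextTableBase += 1024;

  return instance;
}
```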

All of this is already done in some form by the Linux kernel for the ELF executable format: the kernel first maps the file into memory and then hands over execution to the ld.so defined in the ELF file itself. The Linux kernel gives us some handles in the form of binfmt_misc to load arbitrary executable formats, but more investigation is required to see if this is enough to load wasm modules.

This exploration barely scratches the surface but already reveals some of the things that will be needed to port the Linux kernel to wasm.

edit 3 September 2024: an exciting and ongoing Linux kernel port to wasm can be found here: https://github.com/tombl/linux

To be continued.

Thursday, July 20, 2023

Some history about Greenfield Part 2

Part 1

This is the second part of some history about Greenfield. It was not posted on Twitter, but instead published some time later as a blog post in the summer of 2023.

Some time after I got Greenfield working on kubernetes, I wasn't really satisfied with the result. Although kubernetes offers a lot of niceties in the form of abstracted distributed file systems and networking, there were some hard blockers.

Starting an application takes at least 10 seconds (often more), which is way too long. This is not surprising, considering that kubernetes was never meant to run user-facing applications. Other big blockers were the complexity and cost of running the system.

To get any kind of good performance, video encoding needs to be done on the GPU, which rules out hosting the system in the cloud because of absolutely exorbitant GPU rental prices. There is also the issue of GPU partitioning for containers, namely that it doesn't exist, and I wasn't crazy enough to implement it myself... (for now?)

So exit kubernetes and focus back on the core product.

During the kubernetes experiments some other core issues came to light. The streaming performance was quite bad and even a short network disconnect could bring the whole streaming pipeline down.

There were multiple reasons for this.

Applications were forced to go through system memory when handing over a frame to the compositor. This was promptly fixed by implementing the wl_drm and zwp_linux_dmabuf_v1 protocols. Application frames could now stay on the GPU until the video encoder was done but things were still slow.

The streaming performance was bad simply because the whole pipeline was implemented exactly how a Wayland compositor ought to function. An application hands over a frame and waits for presentation acknowledgement from the compositor before drawing the next frame. 

Since Greenfield talks over the network, a whole network round trip is involved in the present-acknowledge dance, effectively syncing the time between frames to a network round trip.

The fix was easy: the remote side predicts when the frame would be presented to the user if no network were involved. It does this by using a bunch of measurements from both itself and the browser. Awesome! No more application presentation network round trips, and everything should work smoothly now.

Except it didn't.

Applications were still inexplicably slow, their fps still synced to the network. What was going on?!

Wayland is an async protocol, meaning clients don't wait for the previous request to be processed before sending the next. So why the synchronous behaviour? Turns out there is an explicit sync call in the core protocol which effectively allows clients to wait for all previous requests to be processed before sending the next one.

A common pattern for pretty much all Wayland applications is to attach a buffer, do some other stuff, send a commit request to atomically apply all these changes, and then also do a sync request before starting the next presentation cycle. Ugh.

This was bad. Really, really bad. We can't really predict or eagerly send out a 'sync done' reply, as it needs to be sent after all other replies from requests that were made before the sync request. Since these other replies are generated by the browser compositor, the sync done reply is always implicitly tied to the network.

In other words: It's impossible to make the whole presentation pipeline detached from the network.

Cue existential project crisis. Was this the end? For a brief moment it seemed that way...

However, it quickly dawned on me: there is a nuance in the sync reply requirement that does allow eagerly sending a sync done reply. Only if there is a previous request that always immediately sends a reply do we need to wait before sending the sync done reply. In all other cases, we can immediately send a sync done reply.
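In pseudo-TypeScript the eager path might look roughly like this (a sketch of the idea only; the bookkeeping and names are made up and not Greenfield's actual code):

```ts
// Hypothetical bookkeeping: how many forwarded requests still owe the
// client a reply from the browser-side compositor.
let outstandingReplies = 0;

function onClientRequest(req: { isSync: boolean; producesReply: boolean; callbackId?: number }) {
  if (req.isSync && outstandingReplies === 0) {
    // Nothing sent before this sync will ever reply, so the "done"
    // event can be sent immediately, skipping the network round trip.
    sendDoneEventLocally(req.callbackId!);
    return;
  }
  if (req.producesReply) outstandingReplies++;
  forwardToBrowserCompositor(req);
}

function onBrowserReply(reply: unknown) {
  outstandingReplies--;
  forwardToClient(reply);
}

// Stubs so the sketch is self-contained.
function sendDoneEventLocally(callbackId: number): void {}
function forwardToBrowserCompositor(req: unknown): void {}
function forwardToClient(reply: unknown): void {}
```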

Rejoice! An eager sync done mechanism was implemented and things started to run smoothly!


Greenfield running DOOM3 remotely 1920x1080@60FPS. Demonstration was performed on a remote server connected to the internet with around 25ms of network latency. Total input latency is around 40ms.

With the rendering pipeline going full throttle, it was time to look at the second big issue: network reliability.

Up until now all network communication was done through WebSockets, after some bad experience with WebRTC & WebRTC data channels 6 years earlier. Perhaps it was time to revisit that and see if we could get some proper UDP-like, low-latency, stateless communication going instead of TCP-based WebSockets.

WebSockets were replaced with WebRTC data channels in unordered and unreliable mode which, according to all sources found on the interwebz, effectively gives you a UDP socket. Great!
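For reference, this is the standard way to ask for that mode through the browser's WebRTC API: ordered delivery off and zero retransmits (the channel label and handler below are just illustrative, and signalling is omitted):

```ts
const peer = new RTCPeerConnection();

// ordered: false + maxRetransmits: 0 is the "fire and forget" setup:
// messages may arrive out of order or not at all, like UDP datagrams.
const channel = peer.createDataChannel("frames", {
  ordered: false,
  maxRetransmits: 0,
});

channel.binaryType = "arraybuffer";
channel.onmessage = (ev) => {
  // Handle a (possibly out-of-order) frame chunk.
  const chunk = new Uint8Array(ev.data as ArrayBuffer);
  console.log("received", chunk.byteLength, "bytes");
};
```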

We still require video frames to be presented to the end user in order, and we can't really deal with missing pixels because we don't continuously stream application frames, so we still need some kind of ARQ protocol like KCP.

KCP allows you to basically sacrifice bandwidth for lower latency, something which you can't really control at the application level with TCP.

After some fiddling around and getting KCP working in TypeScript, the whole thing was working smoothly. Network switching between WiFi and ethernet, network disconnects. Everything was working flawlessly.

Awesome. Time for a last real-life test with a remote server.

Stuttering. My 300Mbit connection couldn't handle the video load. What?! Turns out WebRTC data channels are actually just a very thin layer over SCTP, which doesn't really care if you configure it in unordered and unreliable mode.

It's effectively just a really shitty implementation of TCP over UDP, but much worse when it comes to round-trip blocking. My 300Mbit connection could barely do 10Mbit using SCTP with 25ms of round-trip latency!

The whole setup was working great aside from the round-trip-bound SCTP throughput. Perhaps we could run a custom WebRTC data channel/SCTP library on the server that doesn't wait for browser SCTP acks? Most of the bandwidth is unidirectional server-to-browser data anyway!

Werift-webrtc was forked and all required changes were made to strip any SCTP acks or round-trip bottlenecks. Rejoice! It worked! The world's first true UDP-like library for the browser was created!

...but only for the first 100 000 messages or so. You see, the browser doesn't really like it when it doesn't get any acks from an SCTP-protocol-violating library, and eventually it just gives up. Since we can't really change the browser's SCTP implementation, it was game over for WebRTC data channels.

Back to WebSockets.

Eventually only KCP was kept, plus some simple reconnecting WebSocket logic on disconnect. So Greenfield is stuck using TCP until HTTP/3 WebTransport comes along.
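A reconnecting WebSocket in its simplest form looks something like the sketch below (a generic example, not Greenfield's actual implementation; the URL and retry delay are placeholders):

```ts
function connect(url: string, onMessage: (data: ArrayBuffer) => void, retryDelayMs = 1000): void {
  const ws = new WebSocket(url);
  ws.binaryType = "arraybuffer";

  ws.onmessage = (ev) => onMessage(ev.data as ArrayBuffer);

  // On any disconnect, simply try again after a short delay.
  ws.onclose = () => setTimeout(() => connect(url, onMessage, retryDelayMs), retryDelayMs);
}

connect("wss://example.invalid/greenfield", (data) => {
  console.log("received", data.byteLength, "bytes");
});
```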

Wednesday, July 19, 2023

Some history about Greenfield Part 1


This post is part 1 about Greenfield and was originally posted on Twitter (2021).

Early 2017 I started implementing an entire Linux Wayland display server in the browser because "wouldn't it be cool if ...", but I never really shared the experiences that eventually led me to implement a kubernetes-powered cloud desktop computer.

It basically started with a discussion in #wayland on IRC where it was suggested that one should use (S)RTP for real-time video streaming. The browser, lacking such things, only offers WebRTC, so the first thing was to check if that could be utilized.

Turns out it's nearly impossible to attach metadata to a video frame coming from WebRTC, something you really need in the Wayland protocol as you need to link a buffer with a commit request. (Ironically, only the MS, err, Edge? browser supported this at the time.)

So exit WebRTC video, hello GStreamer. GStreamer was surprisingly easy to implement as you can basically set up an entire pipeline using a string, an input file and an output file. WebRTC was still used for data transport using a data channel.

Note that at this point there was still zero Wayland code. The next step was to see if we could decode a video frame and control the exact moment it is presented on screen. HTML5 video offers something called MSE which allows you to feed it chunks of video.

Great, all we need to do is feed it single-frame chunks and we should be set. This works great, except that it doesn't. MSE doesn't tell you when it's showing the frame, and it also expects extra (useless overhead for us) mp4 container data and a playback time. More complexity we don't want or need. Exit MSE.

Looks like we need to handle the video frame decoding ourselves. After a trial-and-error journey into H264 decoders in asm.js and wasm, there was finally something that ticked all the checkboxes: TinyH264.

Great. All functionality was here. But will it be fast enough? Luckily the whole pipeline could do 1080p@30fps. Good enough for now. Time to port the Wayland protocol to JavaScript!

If you want to support a Wayland compositor in your browser, you have to:
  • make sure you can handle native file descriptors over network
  • make sure you can deal with native wayland server protocol libraries (drm & shm implementations)... over network
  • deal with slow clients without blocking other clients
But there's a solution for all these requirements:
  • native file descriptors in the protocol are represented as URLs (strings)
  • native protocol libraries require a proxy compositor and a libwayland fork
  • slow clients are handled by an async compositor implementation.
After some tinkering a first version was working and was promptly featured on Phoronix.



Great success! But the work was far from over. Things were still too slow, too complex, not really usable. 😿

Turns out writing an async Wayland compositor is really hard. Eventually WebRTC was completely dropped and replaced by WebSockets, each application surface is now a WebGL texture instead of its own HTML5 canvas (canvas doesn't do double buffering), and more.

Meanwhile another idea was brewing. Since we now have a compositor in the browser that talks vanilla Wayland, we don't *have* to run the apps remotely. They can run directly in your browser as well, using web workers!

A PoC was implemented and promptly featured again on Phoronix.





Building on this idea, there was one big blocker. There are literally no good widget libraries that allow you to render directly to an HTML5 SharedArrayBuffer, let alone to a WebGL canvas. So to prove this idea was feasible, I experimented to see if I could get something going...

First there was the need for a good drawing library. Since there were none that fitted all the needs, I resorted to compiling Skia to wasm. This was a challenge, not having any C++ experience, but I got something working eventually... and Google noticed!





Turns out they wanted to do the same for their upcoming port of Flutter to the browser (although they never mentioned that at the time, but 1+1=...). Pretty cool; it meant that I didn't have to put my spare time into it anymore and could eventually start with the next phase.

Write a custom React renderer that outputs to an offscreen WebGL canvas, which talks the Wayland protocol to a compositor in your browser, while running in a web worker. No biggie.

Writing a custom React renderer is hard. Not because it's hard (it's quite easy actually), but because there is basically no documentation about it. It doesn't help that it has two render modes and all documentation is about the first one... I wanted the second one.

Eventually something was working, and the basis for a browser widget toolkit that can output to offscreen (and onscreen) WebGL was there.





Great. So now the pure-browser Wayland app use case had been proven. But there were still other things lacking. Not all remote Linux apps run on Wayland; in fact most still use X11. So if we want to support all Linux apps, we need to support Xwayland... in the browser.

The way Xwayland works is that the X11 server presents itself as a Wayland client, but it still requires the Wayland compositor to act as an X11 window manager. In our case, the compositor is running in the browser, so that means the browser has to function as an X11 client.

Fun fact: there are no libraries that allow a browser/webpage to act as an X11 client. So... let's implement it ourselves! Looking at xcb and how xpyb works, xtsb (X TypeScript bindings) was finally born.

Implementing the X window manager was quite a challenge, but luckily Weston, the Wayland reference compositor, had lit what would otherwise be a dark path: I could basically rewrite their C code in TypeScript and eventually got something working!





Adding PoC Xwayland support was nice (and it still hasn't been developed further for now), but there was still another fundamental issue left unaddressed: the whole setup and way of running applications was extremely clumsy and messy... enter kubernetes.

Btw up until now all development was done during my spare time while working full-time. I also had the privilege of welcoming 2 wonderful kids into the world. So if you tell me you don't have time for x or y, I just think you must have a very healthy sleep schedule.

Back to kubernetes. I noticed people had trouble setting everything up. So with a little help from a friend we started implementing a custom kubernetes operator that can manage browser compositor sessions and the applications they display.

A Wayland compositor proxy runs as a sidecar inside the pod that hosts the desktop application container. This means you can distribute all your desktop apps over your entire k8s cluster and finely tune the resources, filesystem and caps the app has access to.




Part 2
