2023-01 Station Runtime Research

Introduction & preparations

In Q1’23, we want to decide on the architecture for Station Runtime. Should we build it on top of Deno, wasmtime, or something else? (See for a list of different runtimes and projects that can run WASM.)

We need to answer two sets of questions before making a good decision.

What do we want to build in 2023?
- What is the scope of the initial Station Runtime release?
- What features do we need to implement?
- How can we tell whether our work on the implementation is done?
- How are we going to validate the feasibility of our architecture and test the correctness of the implementation?

Based on the requirements coming from the answers to the above questions:
- What criteria should we use when deciding which WASM runtime to use as the foundation for Station Runtime?
- What is the smallest but sufficient scope for a proof-of-concept implementation that will allow us to make an informed decision about the suitability of different WASM runtimes?

What do we want to build?

This brings us back to the long-standing problem that we don’t know for sure which module is the most important one to support, i.e. which one will be the first to “go live”.

What modules are we aware of?

Saturn L2 nodes. We know L2s in the current architecture are unlikely to improve the performance of the Saturn network. We need to find a different architecture that would make L2s running in Stations actually useful. However, without deeper knowledge of the new architecture, we don’t know what primitives must be provided by Station Runtime to allow Saturn L2 nodes to run as Station-native modules, as opposed to the current approach of running as a native OS process (a Go/Rust binary).
🚨 Plan as of January 2023: The initial L2s are going to be built primarily for server-side usage. They can still become a Station module later, but we shouldn’t rely on Saturn L2 being the first module.

Bacalhau workers. We have a very limited understanding of Bacalhau architecture. Miroslav expects workers to have two major components - job management (scheduling, reporting) and job execution (the engine performing computations by running WASM code).

Data retrievability. A module envisioned by the Bedrock team. The idea is to periodically fetch the given CID from the IPFS network and/or a given Filecoin Storage Provider, observe the performance and report statistics for future analysis. This module shares some concerns with Saturn L2 (fetching CID data) and some concerns with Bacalhau (job management).

Service availability monitoring - a decentralized https://www.thousandeyes.com alternative. This is very similar to data retrievability, but we make an HTTP call instead of fetching a CID.

Maybe we shouldn’t focus too much on any single module. Instead, let’s focus on building a platform that will be (eventually) usable for many different modules. We can extract generalised requirements from the above list of modules, giving us the following feature requirements:

HTTP client, e.g. Fetch API

IPFS & Filecoin retrieval client, e.g. await fetch('ipfs://bafysomecid'). The initial implementation can fall back to the IPFS gateway. Later we can swap it for the retrieval client built by the Bedrock team.

A built-in libp2p node allowing modules to share the same DHT registration, node registry, and peer connections.

API for interacting with the built-in libp2p, including the ability to:
- Define custom protocols (e.g. Saturn L1-L2 link).
- Create nodes (peers) participating in different DHT networks or not dial-able at all.

Access to the local filesystem for storing temporary data, e.g. content cached by Saturn L2.

API to execute arbitrary WASM code, pass data (inputs) from the Station module to the executed code, and receive the result (outputs) of the execution. (Think of Bacalhau.)

A built-in IPFS node that can serve content over Bitswap and possibly other protocols like Graphsync.

API allowing modules to add content to the built-in IPFS node. E.g. the Bacalhau module can add job output to the IPFS node so that the Bacalhau orchestrator can fetch job outputs over IPFS.

The IPFS node API should allow modules to tell the Station what algorithm to use for picking content for eviction when the cache outgrows the storage limit imposed by the Station. E.g. first-in-first-out, least-recently-used, least-frequently-used, explicit expiration time, etc. I think that for the initial version, it’s enough to implement LRU (for Saturn) and FIFO (for Bacalhau?).

A DAG-based storage layer, possibly shared with the built-in IPFS node, allowing e.g. Saturn L2 to cache CAR files with block de-duplication.
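The eviction policies mentioned above can be illustrated with a small in-memory model. This is only a sketch, not a proposed Station API: the `BoundedCache` class, its method names, and the byte-based limit are all hypothetical, chosen to show how LRU and FIFO differ once the cache outgrows its limit.

```typescript
// Hypothetical sketch: a size-bounded cache with a selectable eviction policy.
type EvictionPolicy = "lru" | "fifo";

class BoundedCache {
  // A Map iterates in insertion order, so its first key is always the next eviction candidate.
  private entries = new Map<string, Uint8Array>();
  private usedBytes = 0;
  private maxBytes: number;
  private policy: EvictionPolicy;

  constructor(maxBytes: number, policy: EvictionPolicy) {
    this.maxBytes = maxBytes;
    this.policy = policy;
  }

  get(key: string): Uint8Array | undefined {
    const value = this.entries.get(key);
    if (value !== undefined && this.policy === "lru") {
      // LRU only: a read moves the entry to the back of the eviction queue.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  put(key: string, value: Uint8Array): void {
    const previous = this.entries.get(key);
    if (previous !== undefined) {
      // In this sketch, overwriting counts as a fresh insertion under both policies.
      this.usedBytes -= previous.byteLength;
      this.entries.delete(key);
    }
    this.entries.set(key, value);
    this.usedBytes += value.byteLength;

    // Evict from the front until we fit. A value larger than the limit evicts itself too.
    while (this.usedBytes > this.maxBytes && this.entries.size > 0) {
      const oldest = this.entries.keys().next().value as string;
      const evicted = this.entries.get(oldest) as Uint8Array;
      this.usedBytes -= evicted.byteLength;
      this.entries.delete(oldest);
    }
  }

  keys(): string[] {
    return Array.from(this.entries.keys());
  }
}

const demo = new BoundedCache(4, "lru");
demo.put("a", new Uint8Array(2));
demo.put("b", new Uint8Array(2));
demo.get("a"); // "a" becomes the most recently used entry
demo.put("c", new Uint8Array(2)); // over the limit: "b" is evicted, "a" and "c" remain
console.log(demo.keys());
```

With the `"fifo"` policy the same sequence evicts "a" instead, because reads do not change the eviction order.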

That’s too many features for a PoC. Remember, our goal is NOT to build a full platform. We want to define a PoC that will help us evaluate different JS/WASM runtimes and pick the one most suitable for our needs. Many of the features above can be incrementally added once we figure out the foundation. For example, a built-in IPFS node can be added using the same mechanics we need to implement for the built-in libp2p node.

PoC scope

With that in mind, I propose using the following demo module to drive our exploration:

A naive Saturn L2-like node based on my L1-L2 link PoC.

The module registers a new libp2p protocol that works the following way:
- The requestor (e.g. L1) sends a CID.
- The module responds with the CAR file.

The module creates a new libp2p node (peer) that’s dialable but does not advertise its address to DHT. We will not ship this module to Station, but we need a way to invoke the Saturn protocol for testing and demonstration purposes.

The module maintains a cache of CAR files using a simple file-based layout: encode CID in the filename, and store the entire CAR as the file content.

The module fetches the content from the IPFS Gateway on cache-miss using HTTP API.
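To make the cache layout and the cache-miss path more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the `CarStore` and `CarFetcher` types, the `.car` file extension, and the use of `https://ipfs.io` with the `?format=car` query parameter as the gateway URL. Filesystem and HTTP access are injected, so the same logic could be exercised on any runtime we evaluate.

```typescript
// Hypothetical sketch: file-based CAR cache with an HTTP gateway fallback.

// Assumption: a public gateway that serves CAR files via the `?format=car` query parameter.
const DEFAULT_GATEWAY = "https://ipfs.io";

// Accept only alphanumeric CID strings so that a CID can never escape the cache directory.
function assertSafeCid(cid: string): void {
  if (!/^[A-Za-z0-9]+$/.test(cid)) {
    throw new Error(`unexpected characters in CID: ${cid}`);
  }
}

// Encode the CID in the filename; the entire CAR is stored as the file content.
function carCachePath(cacheDir: string, cid: string): string {
  assertSafeCid(cid);
  return `${cacheDir}/${cid}.car`;
}

function gatewayCarUrl(cid: string, gateway: string = DEFAULT_GATEWAY): string {
  assertSafeCid(cid);
  return `${gateway}/ipfs/${cid}?format=car`;
}

// Runtime-specific I/O is injected by the caller.
interface CarStore {
  read(path: string): Promise<Uint8Array | undefined>;
  write(path: string, data: Uint8Array): Promise<void>;
}
type CarFetcher = (url: string) => Promise<Uint8Array>;

// Serve from the cache when possible; otherwise fetch from the gateway and remember the result.
async function getCar(
  cid: string,
  cacheDir: string,
  store: CarStore,
  fetchCar: CarFetcher,
): Promise<Uint8Array> {
  const path = carCachePath(cacheDir, cid);
  const cached = await store.read(path);
  if (cached !== undefined) return cached;

  const car = await fetchCar(gatewayCarUrl(cid));
  await store.write(path, car);
  return car;
}
```

One caveat of this layout: base58-encoded CIDv0 strings are case-sensitive, so on case-insensitive filesystems it is probably safer to normalise CIDs to a lowercase encoding (e.g. CIDv1 base32) before using them as filenames.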

As part of PoC, the runtime should provide visibility into the module’s network & filesystem access. E.g. report outgoing network connections and creation of new files. Ideally, this should be implemented using a mechanism that will make it easy to expand it later to limit resource usage.

As part of PoC, the runtime should either provide visibility into module CPU usage (e.g. op count or CPU time) or limit the module to run on a single CPU core only, leaving all remaining cores available for other applications running on the machine.
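One way to think about the visibility requirement for network access is a single interception point that every outgoing request must pass through. The sketch below is a toy model under that assumption; `NetworkMonitor` and `monitored` are hypothetical names, and a real runtime would hook in at a lower level than a wrapper function. The same choke point could later enforce limits (hostnames, bandwidth) instead of only recording events.

```typescript
// Hypothetical sketch: a single interception point for outgoing network requests.
interface NetworkAccessEvent {
  host: string;
  allowed: boolean;
}

class NetworkMonitor {
  readonly events: NetworkAccessEvent[] = [];
  private allowedHosts: Set<string>;

  constructor(allowedHosts: string[]) {
    this.allowedHosts = new Set(allowedHosts);
  }

  // Record every attempt, including denied ones, and report whether it may proceed.
  check(url: string): boolean {
    const host = new URL(url).hostname;
    const allowed = this.allowedHosts.has(host);
    this.events.push({ host, allowed });
    return allowed;
  }
}

// Wrap any fetch-like function so that all requests go through the monitor first.
function monitored<T>(
  monitor: NetworkMonitor,
  doFetch: (url: string) => Promise<T>,
): (url: string) => Promise<T> {
  return async (url: string) => {
    if (!monitor.check(url)) {
      throw new Error(`network access denied: ${url}`);
    }
    return doFetch(url);
  };
}
```

A module would then receive only the wrapped function, e.g. `monitored(new NetworkMonitor(["ipfs.io"]), (url) => fetch(url))`, and the Station could read `events` to report what the module connected to.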

This way, we can explore all essential parts of our runtime:

Network access that we can observe & intercept

Filesystem access that we can observe & intercept

Restricted CPU usage

Adding a new built-in API (libp2p, IPFS) on top of APIs provided by the lower-level platform (Web API, WASI)

Deployment infrastructure - package the module for distribution, load it into Runtime, etc.

Again, our goal is to evaluate different JS/WASM runtimes. We DO NOT need to build a useful module at this time. It’s entirely expected the module we build for PoC will be thrown away afterwards.

Non-technical criteria

Besides the technical criteria based on building a PoC, we should also consider non-technical criteria for choosing the project we are going to depend on.

Is the project healthy and sustainable?

Can we build the Station Runtime on top of the project as it is, or do we need to implement modifications (e.g. new extension points we can hook into)?

If we need to modify the project, how likely are our contributions to be accepted? Are our requirements compatible with the project’s vision, direction and roadmap?

Is there a diverse community of contributors and maintainers, or is it a project run by a single company? Projects run by a single company often start stagnating when the company’s priorities shift. This happens often for VC-backed companies that need to focus on hyper-growth and monetization.

What programming languages (JS, WASM, Rust) are supported? There are many more JS developers than Rust developers. If we enable JS/TS developers to contribute Station Modules, then we can reach a much wider audience.

MicroPoC

In , we collected over a dozen different WASM runtimes for our consideration. Building the complex PoC described above for each runtime would take too long and would not be a good investment of our time. We need a much smaller example that gives us a basic feel for working with each runtime. Ideally, this hands-on evaluation should take around one hour per runtime.

I am thinking about the following process:

Follow the project’s Getting Started guide to set up the dev stack.

Write a script that makes an HTTP call and stores the result in a file.

Build our code and run it on the runtime being examined.

Study the docs to find how to extend built-in APIs with custom new APIs backed by native modules (libp2p, IPFS node). If it’s easy and there is still some time left, then build a custom API returning the location where the module can store files.
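For reference, the script from the second step can be very small when the runtime exposes Web APIs. The sketch below assumes a runtime with the Fetch API and Web Streams; the `download` function and its parameters are hypothetical. The file sink is passed in by the caller because creating a file is the runtime-specific part.

```typescript
// Hypothetical sketch: stream an HTTP response into a caller-provided sink.
async function download(
  url: string,
  sink: WritableStream<Uint8Array>,
  fetchImpl: (url: string) => Promise<Response> = fetch,
): Promise<void> {
  const res = await fetchImpl(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  if (res.body === null) throw new Error(`empty response body for ${url}`);
  // Stream the body chunk by chunk instead of buffering the whole response in memory.
  await res.body.pipeTo(sink);
}

// Deno-specific usage (assumes the process was granted network and write permissions):
//   const file = await Deno.create("output.bin");
//   await download("https://example.com/", file.writable);
```

On a WASI-only runtime there is no equivalent of `fetch` to pass in, which is exactly the kind of gap this exercise is meant to surface.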

Runtime options

We have ruled out Docker-based options because virtualisation is not available on cheaper consumer computers.

We are also excluding runtimes written in languages other than Rust (Go, C/C++, Zig), because we want to leverage Rust’s momentum in the Web3 community.

Colour coding:

Green - the project fully meets this criterion

Orange - the project meets this criterion partially, or there are concerns/caveats

Red - the project does not meet this criterion

Project	Deno TypeScript	Deno Rust-WASM	Wasmtime Rust-WASM	WASM Edge	Wasmer	Wasm3	WebAssembly Micro Runtime	v8 JavaScript	v8 Rust-WASM
Total score
Homepage	https://deno.land	https://deno.land	https://wasmtime.dev	https://wasmedge.org	https://wasmer.io	https://github.com/wasm3/wasm3	https://github.com/bytecodealliance/wasm-micro-runtime
Developer Experience (based on MicroPOC)	🤩 Writing a simple streaming wget-like downloader takes fewer than 11 lines of code.	🤨 Writing a streaming wget-like downloader required a lot of googling and reading through dozens of doc pages. This was partially caused by my inexperience with running Rust in the browser environment, but also by the lack of documentation for our particular use case, which involves making network and filesystem calls.	🤨 Similar story as for Deno/Rust/WASM. It’s not very clear what crates/APIs to use to make HTTPS requests.	😢 Embedding WasmEdge into a Rust app is difficult. One has to download `libwasmedge.0.dylib`. The versions published to GitHub Releases are not signed, therefore this library cannot be loaded on M1 Macs.	🤨 The documentation is very short, I had to do a lot of digging on GitHub to find answers to my questions. Other than that, the DX was similar to Wasmtime.
Supports JS/TS	YES	YES	NO	YES	NO
Rust (blocking/std)	n/a, see —>	NOT OOTB. The ability to combine wasm-bindgen and WASI is experimental.	YES for Filesystem. NO for TCP networking.		YES for Filesystem. NO for TCP networking.
Rust (non-blocking/tokio)	n/a, see —>	NOT OOTB. Tokio does not support network calls in WASI environments yet. Plus there is the issue with WASI in the browser (see the line above).	I could not figure out how to export `async fn` from a WASM/WASI module. ❗️HELP WANTED		NO
Filesystem access	YES, using Deno’s filesystem APIs like `Deno.create()`	It’s not possible to use either std or Tokio to work with files. It is possible to create files by calling Deno’s filesystem APIs for JS/TS, e.g. `Deno.create()`	Straightforward using blocking I/O (std::fs::File). Probably easy using non-blocking I/O (Tokio) as well, once we figure out how to integrate Tokio’s event loop with async WASM host.		Straightforward using blocking I/O (std::fs::File). Probably not possible to use non-blocking I/O (Tokio).
⌙ Restrict access to certain directories only	Deno offers fine-grained permissions defining which files can be read and written. https://deno.land/manual@v1.29.2/basics/permissions#permissions-list https://deno.land/manual@v1.29.2/runtime/permission_apis	Same as in the TS version (see the cell on the left.)	YES		Wasmer’s WASI config allows us to configure read/write access to different directories or even set up a virtual FS. However, access to FS is not limited by default, and I could not find an easy way to restrict it.
⌙ Limit storage space	Deno does not allow control over storage used and does not provide (documented) extension points to achieve that either.	Same as in the TS version (see the cell on the left.)	??		??
Network access	YES, using Fetch API.	It’s not possible to use either std or Tokio to open TCP network connections. However, it’s possible to make HTTP/HTTPS requests using the Fetch API and web_sys crate.	Using custom APIs provided by the WASM host (e.g. https://github.com/deislabs/wasi-experimental-http). The WASI networking API is not stable yet, therefore it’s not possible to open a new outgoing TCP connection using the standard Rust API. There is also no support for TLS in WASM/WASI yet.		Using custom APIs provided by the WASM host. I was not able to find any existing crate I could use. There is an ongoing effort to implement support for WASIX networking, including an HTTP client. Unfortunately, the timeline is not clear.
⌙ Restrict access to certain hostnames only	Deno offers fine-grained permissions defining which hosts can be accessed over the network. https://deno.land/manual@v1.29.2/basics/permissions#permissions-list https://deno.land/manual@v1.29.2/runtime/permission_apis	Same as in the TS version (see the cell on the left.)	Yes, using wasi-experimental-http.		?? This depends on how we implement network access, see above.
⌙ Limit bandwidth usage	Deno does not allow control over network bandwidth used and does not provide (documented) extension points to achieve that either.	Same as in the TS version (see the cell on the left.)	??		??
Gas metering	Can we measure event-loop latency to calculate relative CPU usage? Deno Deploy offers up to 50ms per request, so there must be a way they are cutting off scripts that take too long to finish.	Via instrumentation of WASM opcodes?	YES		YES			not available	Via instrumentation of WASM opcodes?
Limit CPU usage			This should be possible according to docs. Related: https://github.com/bytecodealliance/wasmtime/issues/5306		I am not sure. Wasmer is executing WASM code in a blocking way, and it allows only interrupting the code after the gas limit is exhausted. Since we want Station Modules to be long-running processes, a gas limit is probably not the solution for limiting CPU usage.
Add APIs backed by libraries written in native languages, preferably Rust	User land modules can call libraries written in native languages using FFI. They offer tooling to simplify integration with Rust libraries. https://deno.land/manual@v1.29.2/runtime/ffi_api Example module: sqlite https://github.com/andykais/sqlite-native/blob/main/src/ffi.ts This is actually creating a backdoor allowing Station Modules to execute arbitrary OS-native code. We would have to find a way how to disable this functionality for Station Modules.	Same as in the TS version (see the cell on the left.)	Yes, but the integration is via FFI, using C types only. A better solution is to use WebAssembly Interface Types, but that’s been deprecated in favour of WebAssembly Components that haven’t been finalised yet.		Yes, although it's unclear how to support complex types and async host functions.
Support for macOS, Linux and Windows	Yes	Yes	Yes		Yes
Support for ARM CPUs (e.g. Apple Silicon)	Yes	Yes	Tier 3 - Not Production Ready		Yes
Do we need to contribute any extension points, or can we build directly on top of the project as it is now?	We will likely need to make changes to Deno runtime to support resource usage limiting and limit which modules can load native (FFI-based) code.	Same as in the TS version (see the cell on the left.)	No extensions are needed for sandboxing. However, I think we will likely need extensions to implement resource usage limiting.		No extensions are needed for sandboxing FS access. However, I think we will likely need extensions to implement resource usage limiting for FS access. Since there is no HTTPS client available, we will need to build one ourselves, and thus it’s up to us how we implement sandboxing & resource usage limiting.
Project popularity
Project health and sustainability
How many active committers & reviewers are there?
How many companies are backing the project?	Single one, VC funded.		Bytecode Alliance
Interactive debugger
Performance profiler
Testing framework
Notes & references	@Julian Gruber mentioned that in his experience, Deno often makes breaking changes in new releases. https://choubey.gitbook.io/internals-of-deno/		Used by FVM. Related projects: https://wasmlabs.dev/articles/run-workers-anywhere/ Wasmtime can execute WASM code in an async fashion. I think it means we can implement host functions as async, and the WASM executor will automatically yield when an async host function is called from a module. However, modules must be written in a synchronous style. (Under the hood, this is implemented using fibers.)	WasmEdge supports JavaScript and a subset of Web APIs like Fetch API. https://wasmedge.org/book/en/write_wasm/js/networking.html Built using C++, are we concerned about security issues?	Popular in Web3 space (ChainSafe, Fluence, Hyperledger, and so on). They claim to have better performance, smaller footprint and more language integrations than wasmtime. The runtime is synchronous, e.g. it’s not possible to instantiate WASM modules asynchronously. See https://github.com/wasmerio/wasmer/issues/1127

Outcome

👉🏻 https://github.com/filecoin-station/zinnia/pull/1

Learnings & takeaways

Building Rust code for WASI and running it inside a browser-like environment like Deno is not yet stable enough.

Interacting with the outside world (HTTP, Filesystem) requires different APIs in TypeScript and Rust.
- In TypeScript, we prefer to get as close to the official Web APIs as possible.
- In Rust, we prefer to write idiomatic Rust code using Tokio or stdlib.

WASM and WASI have a very limited set of features when it comes to networking.
- There is no TLS implementation yet that can run in WASM/WASI. For WasmEdge, see https://github.com/WasmEdge/WasmEdge/issues/1430
- There is no good solution for opening TCP sockets from WASM/WASI yet.
- Running async Rust functions on WASI runtime is difficult or impossible.

WASI is modelled after the POSIX standard. Implementing WASI using Windows API is not trivial. We should run our PoCs on Windows to verify our assumptions about supported platforms.

The WASM host-module interface supports only primitive/scalar types like integers. In order to support complex types (arrays, structs), we need to build a custom bridge with extra bits both in the host and in the modules. There are a few alternatives available:
- wasm-bindgen is a stable project supporting JS runtimes only (e.g. Deno). See e.g. https://github.com/rustwasm/wasm-bindgen/issues/3013. However, according to https://github.com/rustwasm/wasm-bindgen/issues/1862, it may be possible to run WASM modules exporting functions decorated with #[wasm_bindgen].
- wit-bindgen is a new kid on the block from the Bytecode Alliance (the org behind wasmtime). It’s designed to support the upcoming WASM Component Model. It should work flawlessly with wasmtime. However, it’s a young project.
- fp-bindgen is an alternative for Wasmer.
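To illustrate what such a bridge has to do under the hood, here is a minimal host-side sketch of passing a string as a (pointer, length) pair through linear memory. The helper names are hypothetical, and the fixed offset is a simplification: a real bridge, such as the code generated by the tools above, would ask the module to allocate the buffer.

```typescript
// Hypothetical sketch: passing a string across the host-module boundary as (pointer, length).

// Copy the UTF-8 bytes of `text` into linear memory at `ptr` and return the byte length.
function writeString(memory: WebAssembly.Memory, ptr: number, text: string): number {
  const bytes = new TextEncoder().encode(text);
  new Uint8Array(memory.buffer, ptr, bytes.length).set(bytes);
  return bytes.length;
}

// Decode `len` bytes of UTF-8 starting at `ptr`.
function readString(memory: WebAssembly.Memory, ptr: number, len: number): string {
  return new TextDecoder().decode(new Uint8Array(memory.buffer, ptr, len));
}

// One 64 KiB page stands in for the memory exported by a module.
const memory = new WebAssembly.Memory({ initial: 1 });
const ptr = 16; // assumption: a real bridge obtains this from an allocator exported by the module
const len = writeString(memory, ptr, "bafysomecid");
console.log(readString(memory, ptr, len)); // prints "bafysomecid"
```

Only the two integers `ptr` and `len` cross the function-call boundary; both sides must agree on the encoding and on who owns the buffer, which is the contract the bindgen tools automate.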

Links & references

Comparison of WASM runtimes in 2023:
https://00f.net/2023/01/04/webassembly-benchmark-2023/

What happened since the last benchmark
• SSVM became WasmEdge, the runtime from the Cloud Native Computing Foundation. The project is very active, and has been focused on performance since day one. It comes with a lot of features, including the ability to run plug-ins.
• Wasm3’s development pace seems to have slowed down. However, it remains the only WebAssembly runtime that can easily be embedded into any project, with minimal footprint, and amazing performance for an interpreter. It still doesn’t have any competition in that category.
• Wasmtime quickly went from version 0.40 to version 3.0.1, with version 4 being round the corner. Every release is an opportunity to update Cranelift, the code generator it is based on. A lot of improvements were recently made to Cranelift, so it’s high time to see how they reflect in benchmarks.
• Wasmer kept releasing unique tools and features, such as the ability to generate standalone binaries. Their single-pass compiler also got updated. Let’s put it to the test!
(…)

Keygen benchmark
Calling external functions in wasmedge may have more overhead than with other runtimes.

LLVM/Cranelift/V8-based runtimes
node, wasmtime, wasmedge and wasmer are in the same ballpark.
(…)
For most users, there are no significant differences between these three runtimes. They share similar features (such as AOT compilation) and run code the same way, roughly at the same speed.
(…)
If the intent is to run arbitrary, untrusted code outside a browser environment, wasmtime feels like the most secure option.

Async I/O in Rust and WASM is tricky. WASI does not support network sockets yet.

Tokio, a popular library for async I/O, used e.g. by Deno, does not support WASM/WASI, but there is work done by WasmEdge to support network communication. See https://github.com/tokio-rs/tokio/issues/5331 and https://github.com/tokio-rs/tokio/issues/4827

https://github.com/fiberplane/fp-bindgen - Bindings generator for full-stack WASM plugins.