SPARK roadmap to LabWeek’23
In and , we are discussing a fully decentralised design for SPARK. However, such a design cannot be implemented by LabWeek’23 (mid-Nov), where we want to present concrete SPARK improvements and future plans (see ).
This page outlines a set of incremental improvements we can deliver in the next ~10 weeks.
Context
At the moment, we have a centralised SPARK orchestrator service hosted at Fly.io with a hard-coded list of 200 (CID, SP, proto) job templates using bitswap or graphsync protocols. SPARK’s Filecoin Station module (checker node) is running on 40+ Station instances. It periodically asks the orchestrator for a random job, performs the retrieval and reports the stats to the orchestrator.
There is no verification or fraud detection. Checker nodes can easily cheat, and we won’t know.
The sections below outline the next few features to implement by LabWeek. These features will give us minimal fraud detection that we can incrementally improve later.
I am versioning the milestones as `2.X` since this is the second iteration of the SPARK roadmap. We already released SPARK 1.0 earlier this year.
M2.1 Faster deployments of SPARK updates [✅ DONE]
While Station Desktop provides a super easy way to update to a newer version, it still requires manual intervention from the user - they have to restart the app.
As a result, less than 50% of users are running the latest Station version, and about 25% of users are running an older SPARK module version. This makes it difficult for us to quickly iterate on the SPARK protocol because we have to support older client versions for a long time.
I am proposing two small changes:
- Introduce a new Orchestrator API error response with HTTP status code `400` and the response body equal to `OUTDATED CLIENT`.
- In the SPARK module, when it receives this new error for the first time, it will log an error activity to the Station UI and stop doing any work.
Additionally, Station Desktop could automatically restart after it downloads the installer for the new version. To avoid interfering with the user’s actions, we can trigger the restart only when the Station window is not shown (i.e. we are running in the tray).
EDIT: This is already happening → https://github.com/filecoin-station/desktop/pull/931
Auto-restart does not solve the problem for Station Core (headless), so we should probably still implement some way for the server to tell the SPARK module that it should not do any more work because it’s outdated.
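To make the server-side signal concrete, here is a minimal sketch (plain JavaScript with illustrative names, not the actual SPARK source) of how the module could detect the proposed error and stop doing work:

```javascript
// Values proposed in this milestone: HTTP 400 + body `OUTDATED CLIENT`.
function isOutdatedClientResponse (status, body) {
  return status === 400 && body === 'OUTDATED CLIENT'
}

// Hypothetical job loop: `fetchJob` and `reportActivity` stand in for the
// real orchestrator client and Station activity log.
async function jobLoop (fetchJob, reportActivity) {
  for (;;) {
    const res = await fetchJob()
    const body = await res.text()
    if (isOutdatedClientResponse(res.status, body)) {
      // Log an error activity to the Station UI once, then stop doing work.
      reportActivity('error', 'SPARK module is outdated, please update Station')
      return
    }
    // ... perform the retrieval job described by `body` ...
  }
}
```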
M2.2: SPARK rounds [Sept 30th]
→ https://github.com/filecoin-station/spark/issues/13 and https://github.com/filecoin-station/roadmap/issues/41
In order to verify that checker nodes retrieved all data, we need several checkers to redundantly perform the same retrieval so that we can use the “honest majority” approach to determine the expected values, like the Blake3 hash of the retrieved CAR file.
To enable the formation of majorities, we need a concept of a time-limited period - a round - during which we collect redundant measurements and then compare the results after the round is over. So, the first logical step is to introduce the concept of a SPARK round.
- We need to define the round length. First, we must discuss the tradeoffs involved and how easy it will be to change the round length. Example length: 30 minutes.
- 2023-08-28 UPDATE: SPARK rounds will be aligned with MERidian rounds. MERidian rounds have a variable length based on the rate at which new FIL blocks are mined. See
- We want checkers to perform multiple retrievals in each round. There are two parameters involved:
    - TC = how many `(cid, sp)` tasks we define per round, e.g. 100 jobs/round
    - TN = how many tasks each checker executes, e.g. 20 jobs/checker
    See
- Introduce a new Orchestrator API that returns the current SPARK round number and the list of tasks selected for that round.
- IE rounds are reset when a new version of MERidian contract is deployed.
- We want SPARK round numbers to be always increasing.
    - Solution: map the SPARK round number (e.g. `300`) to a tuple (smart contract address, IE round), e.g. `(0xasdwe, 10)`.
- Rework the way the SPARK module schedules jobs. It should get the list of tasks at the beginning of each round, randomly sort them, and then perform the first TN tasks sequentially.
- Update the measurements reported by the SPARK module to include the SPARK round.
- Add the first fraud-detection step to the SPARK/MERidian evaluation service. When evaluating a measurement for `(round, cid, sp)`, where `round` is a SPARK round, verify that:
    - the measurement was submitted within the time period of the SPARK round specified in the measurement,
    - `(cid, sp)` is a valid job for that round,
    - there are at most TN measurements submitted by that node for that round,
    - all measurements are for a unique set of `(cid, sp)` pairs per `(round, checker)`.
We should add an API to spark-api allowing the evaluation service to get the data it needs.
Let’s run all of this only once per SPARK round, as part of the final evaluation. We will improve performance when there is a need for that.
- ✅ Create a follow-up task to change the old API endpoint for task scheduling to start returning “outdated client” error.
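The reworked scheduling (fetch the round’s task list once, shuffle it, run the first TN tasks sequentially) could be sketched as follows; the function name is illustrative, not the real module API:

```javascript
// Pick the TN tasks this checker will execute in the current round.
function pickTasks (roundTasks, tn) {
  // Fisher-Yates shuffle of a copy, so the input list is not mutated.
  const tasks = roundTasks.slice()
  for (let i = tasks.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[tasks[i], tasks[j]] = [tasks[j], tasks[i]]
  }
  // Execute the first TN tasks sequentially.
  return tasks.slice(0, tn)
}
```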
This milestone enables the formation of committees where a majority of members report the same measurement, which allows us to start working on steps that require these committees.
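The per-measurement validity checks listed above could look roughly like this; the field names (`submittedAt`, `startsAt`, `endsAt`, `tn`, `tasks`) are assumptions for illustration:

```javascript
// Evaluate one measurement against the round's metadata. `priorCountForChecker`
// is how many measurements this checker already submitted for this round, and
// `seenPairs` is the set of (cid, sp) pairs it already reported.
function isValidMeasurement (m, roundInfo, priorCountForChecker, seenPairs) {
  // 1. Submitted within the round's time window.
  if (m.submittedAt < roundInfo.startsAt || m.submittedAt > roundInfo.endsAt) return false
  // 2. (cid, sp) is a valid task for this round.
  if (!roundInfo.tasks.some(t => t.cid === m.cid && t.sp === m.sp)) return false
  // 3. At most TN measurements per checker per round.
  if (priorCountForChecker >= roundInfo.tn) return false
  // 4. Unique (cid, sp) pairs per (round, checker).
  if (seenPairs.has(`${m.cid}/${m.sp}`)) return false
  return true
}
```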
M2.3: CAR checksums [Oct 15th]
- Enhance the SPARK module to compute and submit a ~~Blake3~~ ~~SHA-512~~ SHA-256 hash of the downloaded CAR file. This may require changes in Zinnia because Blake3 is not available via the WebCrypto API, and the official implementation ships WebAssembly for JavaScript users.
- We can use any hash function provided by WebCrypto API in this iteration because we are not building Blake3 inclusion proofs yet.
- Add another fraud-detection step to the evaluation service: after each SPARK round is over, process all measurements reported during that round (excluding measurements already filtered out as invalid).
    - (Conceptually,) group all measurements by `(cid, sp)` pair.
    - For each `(cid, sp)` pair, compare the checksums reported by measurements in this round.
        - If the majority of measurements agree on the same value, then assume it’s the correct one and flag all disagreeing measurements as invalid.
        - If there is no majority, then flag all measurements as invalid.
    - Submit a telemetry point: cid, sp, how many measurements were received, the size of the largest group agreeing on a single hash, the value of that hash, and whether a majority was formed.
This milestone implements the groundwork needed for inspecting a group of measurements and searching for the expected value using the “honest majority” approach.
It also adds a bit of new friction, making cheating slightly more difficult. One baby step at a time.
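A minimal sketch of the honest-majority evaluation for one `(cid, sp)` group of measurements (the `carHash` field name is an assumption):

```javascript
// Given all measurements for one (cid, sp) pair in a round, find the hash
// reported by a strict majority (if any) and flag the rest as invalid.
function evaluateGroup (measurements) {
  // Count how many measurements reported each hash value.
  const counts = new Map()
  for (const m of measurements) {
    counts.set(m.carHash, (counts.get(m.carHash) ?? 0) + 1)
  }
  // Find the largest group agreeing on a single hash.
  let winner
  let winnerCount = 0
  for (const [hash, count] of counts) {
    if (count > winnerCount) { winner = hash; winnerCount = count }
  }
  const hasMajority = winnerCount > measurements.length / 2
  return {
    hasMajority,
    acceptedHash: hasMajority ? winner : undefined,
    // With no majority, every measurement in the group is flagged invalid.
    invalid: measurements.filter(m => !hasMajority || m.carHash !== winner)
  }
}
```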
M2.4: CID sampling alpha [stretch goal]
Remove the static list of job templates and replace it with dynamic (CID, SP) selection, sampling data stored in FIL+ deals. Depending on the complexity of the “proper” CID sampling we envision, this milestone can implement a simplified version or a part of the grand solution.
The goal is to make it a bit more difficult for SPs to cheat, as they won’t be able to use our hard-coded list of job templates to prioritise certain retrievals over others.
Sub-task: HTTP retrievals only
- Rework our job templates to use only `(CID, SP)` pairs supporting retrievals using the HTTP protocol.
    - Enlarge our list of job templates to the top 1000 CIDs (or more). We need this for the next milestone, where we want to choose ~100 tasks per round. Selecting 100 random tasks from 200 templates wouldn’t work that well.
    - Keep the `protocol` field in our data model to support existing SPARK clients.
- Rework the SPARK module and hard-code retrievals to use the HTTP protocol.
Earlier discussion about IPNI protocols: https://filecoinproject.slack.com/archives/C03PQG6UT2B/p1690560331187479
- HTTP Gateway transport has code 2336 (0x0920):

  | name | tag | code | status | description |
  |------|-----|------|--------|-------------|
  | transport-ipfs-gateway-http | transport | 0x0920 | draft | HTTP IPFS Gateway trustless datatransfer |
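A simplified version of dynamic task selection could derive each round’s TC tasks from a larger deal list using a per-round seed; the seed derivation, PRNG choice (mulberry32), and deal-list source below are assumptions for illustration:

```javascript
// Deterministically sample TC tasks from a list of retrievable deals.
// A real implementation would likely derive the seed from DRAND randomness
// (see "Post LabWeek") rather than accept an arbitrary integer.
function sampleTasks (deals, tc, roundSeed) {
  // mulberry32: a tiny deterministic PRNG keyed by the round seed.
  let s = roundSeed >>> 0
  const rand = () => {
    s = (s + 0x6d2b79f5) >>> 0
    let t = Math.imul(s ^ (s >>> 15), 1 | s)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
  // Draw without replacement until we have TC tasks (or run out of deals).
  const pool = deals.slice()
  const tasks = []
  while (tasks.length < tc && pool.length > 0) {
    tasks.push(pool.splice(Math.floor(rand() * pool.length), 1)[0])
  }
  return tasks
}
```

Because the sampling is keyed by the round, every checker computes (or receives) the same task set for a given round, while SPs cannot prioritise retrievals from a fixed hard-coded list.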
M2.5 LabWeek preparations
- More-detailed post-LabWeek roadmap
→ https://github.com/filecoin-station/proj-mgmt/issues/54
- Prepare the talk(s)
→ https://github.com/filecoin-station/proj-mgmt/issues/35
M2.X Longer-term work in the background
While working on the milestones above, we should also keep making progress in the following work streams:
- Retrieval Attestations in Boost
- Design for CID Sampling
- Fully decentralised design for SPARK
Post LabWeek
- Retrieval Attestations rollout to SPs
- Retrieval Attestations in SPARK
- Introduce DRAND randomness, link SPARK rounds to DRAND rounds
- Decentralised tasking with multiple committees each round
- Proof of Data Possession (Blake3 inclusion proofs)
- Join the Reputation WG and start pushing SPARK measurements to the Reputation DB
    - This requires us to have the provider ID `f0xxxxxx` for each measurement.
- Proof of IPv4 address.
We need this to allow SPARK checker nodes to submit MERidian measurements directly on the chain, bypassing our measurement service.
- Tweak fraud detection based on what we observe