SPARK Roadmap 2023

We decided to build SP Retrieval Checker as the first Zinnia module and the first module paying rewards to Station operators. You can find the design discussions in

The milestones below outline the engineering work. In parallel to the technical work, we need to answer the business & product questions, especially the most important one: Who is going to fund the FIL pool for paying out rewards?

We need to start working on rewards early on and parallel to the technical work, see

M1 June 13th: Reward-less Walking Skeleton

[19 days ≈ 4 weeks of work]

Build a walking skeleton covering several functional areas. Implement as little functionality as possible while still delivering a meaningful system.

  1. Spark API - job scheduling (web2-style) [5 days of work]
    • Web2-style, let’s get to rewards as quickly as possible.
    • A cloud-based Orchestrator service that assigns checker jobs to Station instances. In the future, we should replace this with a smart contract.
    • The simplest job selection possible: embed a hard-coded list of (CID, providerEndpoint, protocol) into the deployed Orchestrator service and pick a random item whenever we need to schedule a job. (We will improve this in the following iterations.)
      • We can get some CIDs from Saturn folks or Rhea/Lassie logs.
    • The API implementation writes the randomly selected job into our internal database, together with a timestamp. It generates a unique JobId field, stores it in the DB and includes it in the response.
    • Let’s use Fly.io because it offers both Server hosting and DB hosting.
    • We should implement schema migrations using an automated tool (e.g. postgrator or something else).
    • The Station/Zinnia team will operate this orchestrator.
    • Setup CI, tests, linting. Setup automatic deployments (CD) on push or tag.
  1. A Zinnia module to perform the retrieval checks. [5 days of work]
    • Use the HTTP API to ask the Orchestrator which job to perform.
    • Retrieve the CID using the IPFS Gateway or Saturn CDN, making HTTP requests via the Fetch API. (We will rework this in the following iterations.)
    • Submit retrieval logs & metrics to Ingester
      • Look at what Saturn and Rhea/Lassie are including in their logs.
      • TTFB, TTLB, error rates, retry count, download speed, etc.
      • Talk to Will and Lauren about what data will be useful to them. Maybe we can make this dynamically configurable?
      • Let’s start with the easy metrics we can get from the Fetch API only.
      • When Station reports the job outcome to the Ingester, it includes JobId and WalletAddress in the payload.
      • We will need to mock/stub Orchestrator and Ingester for testing.
    • No verification of the retrieval content, we will add that via Lassie later.
    • The module will be deployed to Filecoin Stations, where it will replace the placeholder peer-checker module.
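A minimal sketch of the checker flow described above: fetch the CID over an HTTP gateway, measure the timing, and build the report payload. The gateway URL, field names, and the injected `fetchFn` parameter are assumptions for illustration (injecting the fetch implementation also makes the Orchestrator/Ingester easy to stub in tests, as noted above).

```javascript
// Hypothetical sketch of a single retrieval check. The gateway URL,
// report field names, and `fetchFn` injection are assumptions.
async function runCheck(job, walletAddress, fetchFn) {
  const started = Date.now();
  const report = { jobId: job.jobId, walletAddress };
  try {
    const res = await fetchFn(`https://ipfs.io/ipfs/${job.cid}`);
    report.ttfb_ms = Date.now() - started; // time to first byte (approx.)
    const body = await res.arrayBuffer();
    report.ttlb_ms = Date.now() - started; // time to last byte
    report.byteLength = body.byteLength;
    report.statusCode = res.status;
  } catch (err) {
    report.error = String(err);
  }
  return report;
}

module.exports = { runCheck };
```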
  1. Improve Zinnia DX for building Station modules [6 days]
  1. Spark API - Ingester [2 days]
    • Add a new HTTP route to our backend monolith app to receive reports about completed jobs. This will share the same DB with Orchestrator/Job Scheduler.
    • When Ingester receives a job report from a Station, it attaches the current timestamp to the record stored. This will help us troubleshoot and may be useful later for fraud detection.
      • The JobId in the report allows us to verify that Stations performed jobs that our Orchestrator scheduled, as the job report must provide a valid JobId.
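The Ingester's report handling can be sketched like this: reject reports with unknown JobIds and attach a server-side timestamp to accepted ones. The function and field names are illustrative, not the final API.

```javascript
// Hypothetical sketch of the Ingester's report handling. The `knownJobIds`
// set stands in for a DB lookup; field names are illustrative.
function ingestReport(report, knownJobIds, now = () => new Date().toISOString()) {
  // Only accept reports for jobs our Orchestrator actually scheduled.
  if (!knownJobIds.has(report.jobId)) {
    return { accepted: false, reason: "unknown JobId" };
  }
  // Attach the server-side timestamp for troubleshooting / fraud detection.
  return { accepted: true, record: { ...report, receivedAt: now() } };
}

module.exports = { ingestReport };
```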
  1. DevOps & Monitoring [1 day]
    • We should build observability into our systems from the beginning. Make sure we are collecting the right data needed to understand what’s happening in the system, and that we are visualising this data in a way that enables that understanding.
    • I have heard good things about https://www.honeycomb.io, but since PL is heavily invested in Grafana, we can use Grafana too.
    • Error monitoring
      → Integrate Sentry
    • Do we have a strong enough Fly.io instance to handle the load?
      → We can use Fly.io dashboard to see the stats
  1. Security [already covered]

    The trouble: Station installations are permissionless and anonymous. Anybody can run a Station and it’s easy to run thousands to millions of Station instances concurrently. The code of Station Modules is open source, attackers can inspect which HTTP APIs we are calling and call them in an automated way. It’s easy to flood our backend services.

    • We need to set up reasonable spending limits on the infrastructure running Orchestrator and Ingester, so that a flood of requests causes a denial of service at worst, not an astronomical bill to pay.
    • This is already covered: we will rely on the spending limits provided by Fly.io.

M2 Jun 30th (+/- 1 week): Lassie Retrievals

[12 days + 6 days for unknown unknowns = 3-5 weeks]

Replace the code making HTTP requests to IPFS Gateway with a retrieval client like Lassie.

Important: Retrieval requests from this module should be indistinguishable from “legit” requests made by other actors in the network (e.g. Saturn). Otherwise SPs can prioritise checker requests over regular traffic.

Subtasks:

  • Write a Go/FFI shim to start & stop the Lassie HTTP server. Write Rust bindings for this shim and make sure we can use this from Rust! We will need to configure tooling for CI/CD and also write tests! [5 days]
  • Integrate Lassie into Zinnia (both zinnia and zinniad). (Start Lassie on start, stop it on exit.) [2 days]
    • We want to configure Lassie to use Module’s CACHE_DIR as the temp dir. Fortunately, there is already such an option exposed.
    • On process start, delete all files in Lassie’s temp dir. These files may be left behind if Station exits in the middle of a CID retrieval request.
  • Implement Zinnia JS API for retrieving content for a given CID, optionally allowing the caller to specify the address and protocol to use. [3 days]
    • E.g. we can add a small fetch wrapper that will translate URLs like ipfs://bafy... into Lassie HTTP GET requests.
    • Maybe it’s better to create our own API that will allow us to support additional request parameters like the peer address and the protocol to use. This new API can still return a Fetch Response object under the hood.
  • Lassie improvements [2 days]
    • Implement authentication - only Zinnia requests should be handled by Lassie. Requests from other clients must be rejected.
    • We want to configure Lassie HTTP mode to not cache blocks between retrieval requests.
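The URL-translation idea above can be sketched as a small helper that maps `ipfs://` URLs to requests against the local Lassie HTTP server. The port and the `providers`/`protocols` query parameters are assumptions about Lassie's HTTP interface; verify against the actual Lassie daemon before relying on them.

```javascript
// Hypothetical sketch of the fetch-wrapper URL translation. The local port
// and the query-parameter names are assumptions, not confirmed Lassie API.
function toLassieUrl(url, { port = 7766, providerEndpoint, protocol } = {}) {
  const m = /^ipfs:\/\/([^/?]+)(\/.*)?$/.exec(url);
  if (!m) throw new Error(`not an ipfs:// URL: ${url}`);
  const [, cid, path = ""] = m;
  const target = new URL(`http://127.0.0.1:${port}/ipfs/${cid}${path}`);
  // Optional extra parameters for directing the retrieval (assumed names).
  if (providerEndpoint) target.searchParams.set("providers", providerEndpoint);
  if (protocol) target.searchParams.set("protocols", protocol);
  return target.toString();
}

module.exports = { toLassieUrl };
```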

M3 Oct 5th: Rewards Alpha

[Very roughly: 2-3 months]

Use the Generalized Impact Evaluator framework (see the whitepaper).

Note: We need to figure out the details. We can research this area in parallel to M1-M4.

M3.1 Measure

We are already collecting data about jobs completed; see . As part of the current milestone, we may need to add more fields to the data collected.

M3.2 Evaluate

Based on the measured data, calculate the impact of individual Station instances. As part of this evaluation, we must detect and filter out fraudulent reports.

M3.3 Reward

Map the impact of individual Stations calculated in the previous step to FIL rewards.

Have a pot of FIL for funding the rewards and a process for refilling it.

Finally, pay out the rewards from the funding pot.
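As a strawman for the impact-to-FIL mapping, a simple proportional split of the pot could look like this. The proportional scheme is an assumption; the real reward function is part of the M3 design work.

```javascript
// Hypothetical sketch: split a FIL pot across Stations in proportion to
// their evaluated impact scores. The proportional scheme is an assumption.
function computeRewards(impactByStation, potFil) {
  const total = Object.values(impactByStation).reduce((a, b) => a + b, 0);
  if (total === 0) return {}; // nothing to reward
  const rewards = {};
  for (const [station, impact] of Object.entries(impactByStation)) {
    rewards[station] = (impact / total) * potFil;
  }
  return rewards;
}

module.exports = { computeRewards };
```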

M4 Nov 3rd: Station Beta Launch

Details to be determined. We want to do a soft launch of Station with rewards, to do more testing before the big public GA announcement.

The tasks in this milestone may include things like GTM messaging, marketing material, etc.

We will need a support person to help early adopters and make sure we answer their questions in a timely manner.

Next - after we ship Station Beta/GA

  • Decentralized Job Scheduling. No Orchestrator/central database, Stations self-select jobs.
  • Decentralized Log Ingester.
  • DoS Prevention - prevent malicious actors from (D)DoSing our Orchestrator and Ingester services by sending too many requests to them.
  • Hardened fraud detection and filtering.
  • And more based on feedback from real-world usage.

NextGen Verifiable Job Scheduling

[Very rough estimate: 1 month]

More web3-like, less centralised, moving towards trustless verification by 3rd parties

Replace the naive algorithm with a more robust solution that will make it harder for Stations to cheat or collude with Storage Providers. Ideally, the new solution should allow third parties (like our orchestrator and ingester services) to verify that each Station is following the rules when picking retrieval checks to perform.

For example:

  • We can use DRAND as a source of randomness.
  • While still using a hard-coded list of jobs (CID, address, protocol), we combine Station ID with DRAND random number to pick a job to perform.
  • We should design the algorithm to be implementable as a smart contract in the future.
  • Using live data from indexers is out of the scope for this milestone.

Notes:

  1. We may need to know how many Stations are running to be able to distribute jobs to them. However, Stations can come and go frequently, and we cannot report when a Station goes offline (e.g. because the user closed their laptop). The registry of running Stations must accommodate this limitation. For example, Stations can report a heartbeat every B minutes (B=1) and we can consider a Station as running if it reported a heartbeat within the last W minutes (W=60?).
  1. How long should the job selection remain verifiable? For how long must the input data stay accessible?
    1. DRAND provides API to obtain the random value generated at the given round in the past.
    1. The orchestrator should advertise the triple (chain hash, round, random number).
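The heartbeat-based registry from note 1 can be sketched as follows, using the suggested B/W placeholders. The class and method names are hypothetical.

```javascript
// Hypothetical sketch of the heartbeat registry from note 1: a Station counts
// as running if it sent a heartbeat within the last W minutes. Names are
// illustrative; W=60 is the placeholder value suggested above.
class StationRegistry {
  constructor(windowMinutes = 60) {
    this.windowMs = windowMinutes * 60 * 1000;
    this.lastSeen = new Map(); // stationId -> epoch ms of last heartbeat
  }
  heartbeat(stationId, now = Date.now()) {
    this.lastSeen.set(stationId, now);
  }
  isRunning(stationId, now = Date.now()) {
    const seen = this.lastSeen.get(stationId);
    return seen !== undefined && now - seen <= this.windowMs;
  }
  countRunning(now = Date.now()) {
    let n = 0;
    for (const seen of this.lastSeen.values()) {
      if (now - seen <= this.windowMs) n++;
    }
    return n;
  }
}

module.exports = { StationRegistry };
```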