Station Module: SP Retrieval Checker (Spark)

Background

The Station team has recently launched Station core and the Zinnia runtime. The team is now looking for a good first module with the following requirements:

  1. It will lead to lots of Station downloads (it helps if it makes sense on people’s desktops)
  1. Station operators will be able to earn FIL sustainably (i.e. clear business model)
  1. Not too complicated to engineer
  1. (Bonus) Alignment with the rest of the RM Lab
| Module | Works on Desktop (needed to get maximum # of early stations) | Clear Business Model | Engineering Complexity (relative) | Aligned with RM Lab efforts |
| --- | --- | --- | --- | --- |
| Bacalhau | Maybe | Unclear | Unclear | No |
| Punchr | Yes | No | Medium | No |
| L1 uptime checker | Yes | Saturn needs this, but not at high priority | Lowest | Yes |
| L1 hardware requirement checker | Maybe - but how do you get to a 10gig check? | Saturn needs this, but not at high priority | Medium | Yes |
| SP retrieval checker | Yes | Filecoin needs this | Low | Yes |
| L2s for pinning | No | Too early to say | High | Yes |

The clear winners are the L1 uptime checker and the SP retrieval checker. We have chosen to go for the L1 uptime checker first because it is more straightforward than the SP retrieval checker and can be viewed as a first milestone towards the SP retrieval checker.

Building blocks

At a very high level, we need to figure out the following building blocks:

  1. How the Station Module decides which (cid, address, protocol) to check (measure)
  1. How to execute the (cid, address, protocol) check and what metrics to collect (measure)
  1. How to report check results (measure)
  1. Verify reported data and detect fraud (evaluate)
  1. Evaluate the impact of individual Stations (evaluate)
  1. Pay out rewards (reward)

Non-technical:

  • Who is going to pay for this service to fund the pool of FIL for rewards?
    • Clients who want their content to be retrievable
    • SPs that want proof that they provide good retrievability of stored content
    • Validation Retrieval bot used by Spade (contact person:
      • As of Apr 25th, we don’t think this will work. Validation bot needs a major technical revision, so SPARK shouldn’t aim to replace the current design
    • Filecoin+ includes guarantees for fast retrievability
    • Somebody else?
    • Feed our output into validation systems used by Spade, Estuary, etc.

Simple PoC

Protocol

  1. The module will periodically pick a (cid, address, protocol) triple to test, where cid is the ID of the content stored by the SP, address is the address of the SP’s content provider node, and protocol is “bitswap” or “graphsync”.
  1. The module attempts to fetch the CID content from the address using the selected protocol.
  1. The module reports the outcome.
  1. The system pays the Stations for running the checks and the SPs for serving the retrievals.

Implementation details:

  • The first step can be initially implemented using a centralised service operated by PL or Filecoin Foundation. This service will provide an API endpoint to obtain the job definition triple.
  • Reports can be initially submitted to the same service too.
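The PoC loop above can be sketched in JavaScript. The endpoint URLs and the shape of the job and report payloads are assumptions for illustration, not a final API:

```javascript
// Sketch of the PoC measurement loop. The API base URL and the
// job/report payload shapes are hypothetical placeholders.
const API_BASE = 'https://spark-api.example.com';

// Fetch the next (cid, address, protocol) job triple from the
// centralised service.
async function getJob() {
  const res = await fetch(`${API_BASE}/jobs/next`);
  return res.json(); // e.g. { cid, address, protocol }
}

// Build the report payload for a completed check (pure, so it is
// easy to test without touching the network).
function buildReport(job, outcome) {
  return {
    cid: job.cid,
    address: job.address,
    protocol: job.protocol,
    success: outcome.success,
    ttfbMs: outcome.ttfbMs ?? null, // time to first byte, if measured
    bytes: outcome.bytes ?? 0,      // bytes actually received
    finishedAt: new Date().toISOString(),
  };
}

// Submit the report back to the same service.
async function submitReport(report) {
  await fetch(`${API_BASE}/reports`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(report),
  });
}
```

Keeping report construction pure and separate from the network calls makes the measurement step straightforward to unit-test while the centralised service is still in flux.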

Caveats & attack vectors

  • One can spin up thousands of Station instances on the same machine. They will then receive many jobs to perform. Since retrieval checks are relatively fast and cheap, they will be able to complete most of them and capture a large portion of the FIL reward for that epoch.
  • A malicious node can skip the retrieval step and always report the content as not retrievable.
  • A malicious node can skip the retrieval step and report the content as retrievable, using fake data for the submitted metrics.

Roadmap

See

Advanced Trustless Protocol Proposal

SP Retrieval Checker Proposal

  1. All Stations are registered in a smart contract. The smart contract will periodically pick a (checker, cid, address, protocol) tuple to test, where checker is the Station, cid is the ID of the content stored by the SP, address is the address of the SP’s content provider node, and protocol is “HTTP”, “bitswap” or “graphsync”. The randomness can be obtained from Drand’s randomness smart contract.
  1. The Stations listen to chain events to see if their node ID is emitted in an event.
  1. The module attempts to fetch the CID content from the address using the selected protocol, including a signature in its request headers.
  1. The SP wraps the Station’s signature in its own signature.
  1. The SP responds, and the Station signs the SP’s signature.
  1. The Station collects results in a Merkle tree and periodically submits the root on-chain.

Phase 3: Proof

  1. The checker creates a Merkle tree root (equivalently KZG commitment) of all checker requests and submits on-chain. This commitment transparently includes the number of leaves (requests) that the checker has made.

Phase 4: Verify

  1. Every so often, the smart contract will select a proof and a leaf number to verify from across the set of submitted proofs. (It could do this for every proof.)
  1. The checker listens for this on-chain event. If they are selected to provide a proof, they must provide a Merkle proof (equivalently, a KZG proof) of inclusion in the tree for that leaf node.

Phase 5: Payment

  1. The checker is rewarded with FIL equal to the number of checks multiplied by the price per check.

This flow reduces the amount of on-chain traffic and storage while using randomness to ensure that nodes perform every check and include every proof in the commitment, since any proof may be the one selected in the verify flow.

Inspiration from Pocket Network Flow

Source of funds

This may have a relatively straightforward way towards incentivisation:

  • The money for rewards can come from the person making the storage deal or maybe from the storage fee itself.
  • I think we can get reliable proof of a job performed if we combine a random id associated with the job definition triple plus Drand for time-based randomness plus the content being retrieved and calculate a unique hash from that. We can also record the time when the retrieval check result was submitted to ensure the node performed the check around the time when Drand generated the random value.

Meeting Notes

Meeting with Bedrock on Apr 19th, 2023

  1. Who can get value from the checker?
    1. Make cid.contact data more accurate. ATM, they check only connectivity, not retrievability. The indexer is run as a public good; there are no incentives now. This requires work on the indexer side to convert data reported by Stations into a reputation system. That will take a long time to build and deploy, so it is not the fastest way to get rewards.
  1. How the Station Module decides which (cid, address, protocol) to check
    • Let’s pick one protocol for the initial PoC. Which one should it be?
    • cid.contact can differentiate between older and newer CIDs.
    • Maybe we can prefer more popular content over less popular? OTOH, more popular CIDs get more organic traffic, so we may not need to check popular CIDs so frequently.
    • We want to get to paying module as fast as possible. Let’s start with a fixed list of CIDs and a fixed list of SPs.
      • We can ask @Masih Derkani to give us a fixed list of CIDs to begin with. Or we can take the most popular CIDs from Saturn and map them to SP addresses using cid.contact’s public API. Also IPFS Awesome Datasets.
  1. How to report check result
    1. Lassie provides API to record events about retrievals. We may want to report data from our system to Lassie, but we must filter out fraudulent records first.
    1. We need to check whether the reported event is valid or fraudulent. This filtering will be closely tied to how Station pays rewards.
    1. Then we can report these events via Lassie event recorder.
    1. We can also publish it to Pando from Ken Labs.
    1. We can talk to Validation Bot people to find out where they are submitting data to.
    1. There is also a SP reputation WG & DAO - https://reputationdao.super.site
    1. Let’s not depend on 3rd-party services to store our data. Let’s setup our own centralised storage service for the start.
  1. Who is going to pay for this service to fund the pool of FIL for rewards?
    • Clients who want their content to be retrievable
    • SPs that want proof that they provide good retrievability of stored content
    • Somebody else?
  • Impact evaluation:
    • Don’t consider checker retrievals as more valuable than other retrievals made by other parties in the network.
    • Reach out to to learn more about the support they can give us as the early implementers of Impact Evaluators.
    • Join SP reputation DAO or Hypercerts initiatives to get more funding & liquidity.

Comments from the April 11th meeting with Juan:

  • Juan thinks we should avoid being too Web2-like and start with a more decentralised architecture from the beginning. In particular, we should not build a centralised service to give stations the (cid, address, protocol) triple, but find a way to allow stations to choose the triple from the list of known CIDs.
    • For the initial PoC, we can start with a short fixed list of SPs and CIDs that reduces the # of unknowns.
    • Later, getting a list of CIDs could be its own module / smart contract.
    • Land on the format of the list that we need to feed into the system. Build metrics around this system.
    • Work with Bedrock to figure out what should be on these lists.
  • More thoughts on CID selection:
    • A station should be able to sample multiple SPs in one round
    • IPNI breaks everything up by payload CIDs - we don’t want to request piece CIDs.
    • What if the CIDs correspond to something huge?
    • Model of # of SPs, stations and requests to back into a business model
    • How do we cut down the size of # of CIDs based on previous tests?
    • Certain SPs reject unknown connections. What should we do with these nodes? The indexer can flag this. This module has to know which SPs are happy to play the game.
  • Traffic created by Stations has to be hard to distinguish from real traffic
  • We should respond with lots of metrics rather than just a proof of availability
  • We need to know how much bandwidth this will hit onto SPs.

Sync Notes 2023-04-11

  • A station should be able to sample multiple SPs in one round
  • We should respond with lots of metrics rather than just a proof of availability
  • Look into existing dealbots
  • Model of # of SPs, stations and requests to back into a business model
  • How do we cut down the size of # of CIDs based on previous tests?
  • We will likely need to have a log system like Saturn’s
  • Matt Freilich is interested in IEs in smart contracts
  • Station UI looks good when there are lots of jobs happening. Does this module give us this?
  • Can we give a predictive earnings graph?
  • Certain SPs reject unknown connections. What should we do with these nodes? The indexer can flag this. This module has to know which SPs are happy to play the game.
  • What if the CIDs correspond to something huge?
  • Simple PoC: short fixed list of SPs and CIDs that reduce the # of unknowns.
  • Land on the format of the list that we need to feed into the system. Build metrics around this system.
  • Work with Bedrock to figure out what should be on these lists.
  • We need to know how much bandwidth this will hit onto SPs.
  • IPNI breaks everything up by payload CIDs - we don’t want to request piece CIDs.
  • Getting a list of CIDs could be its own module / smart contract
  • Should payouts differ for an error vs. serving lots of bandwidth?
  • Traffic created by Stations has to be hard to distinguish from real traffic